modelstop.top

AI Model Catalogue

Browse 449 models across providers, modalities, and use cases.

👁️ Vision & Multimodal

449 models · Page 9 of 13

wan-2.7-image

wan-video

Generate and edit images with Alibaba's Wan 2.7

vision · image · free
Free in

siglip-large-patch16-384

devarsh-mavani-19

Get image embeddings using siglip-large-patch16-384

vision · free
Free in

metric3dv2

visionaix

Metric3D v2 (TPAMI 2024): Monocular metric depth and surface normals from a single image. Predicts real-world depth in meters. Works indoor and outdoor.

vision · free
Free in

flux-2-pro

black-forest-labs

High-quality image generation and editing with support for eight reference images

vision · image · free
Free in

grok-imagine-r2v

xai

Generate videos guided by reference images using xAI's Grok Imagine Video model

vision · image · free
Free in

seedvr2

papina

🔥 SeedVR2: one-step video & image restoration with 7B and Adjustable Resolution

vision · free
Free in

sam3-video

lucataco

A unified foundation model for prompt-based segmentation in images and videos

vision · free
Free in

wan-2.7-i2v

wan-video

Generate videos from images, with support for first-and-last-frame control, clip continuation, and audio synchronization using Alibaba's Wan 2.7 model

vision · image · audio
Free in

p-video

prunaai

Fast video generation with built-in draft mode for rapid creative iteration. Text-to-video, image-to-video, and audio-to-video in a single endpoint.

vision · image · audio
Free in

depth-anything-v3-metric-pano

vufinder

Monocular metric depth estimation for panoramic images

vision · free
Free in

wan-2.7-image-pro

wan-video

Generate and edit high-quality images with Alibaba's Wan 2.7 Pro, featuring 4K output, thinking mode, text-to-image, multi-image editing, and image set generation

vision · image · reasoning
Free in

lyria-3-pro

google

Generate full-length songs up to 3 minutes from text prompts or images with Lyria 3 Pro, Google's most capable music generation model

vision · image · free
Free in

stems-separator

triadmusic

Separate stems from a song using Demucs and Spleeter

vision · free
Free in

seedance-2.0

bytedance

ByteDance's multimodal video generation model with native audio, multimodal reference inputs, and intelligent duration control.

vision · audio · free
Free in

microsoft/Phi-3.5-vision-instruct

microsoft

microsoft/Phi-3.5-vision-instruct is an image-text-to-text model on Hugging Face with ~1,482,472 monthly downloads. Open access.

vision · instruct · open-source
$0.00/1M in

Stable Diffusion 3.5 Large

Stability AI

Stable Diffusion 3.5 Large is Stability AI's most capable text-to-image model, delivering photorealistic and creative imagery with excellent prompt adherence and detail. It features a multimodal diffusion transformer architecture.

vision · open-source
$0.00/1M in

Amazon Nova Pro

amazon

Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost across a wide range of tasks. Supports text, image, and video inputs.

vision · multimodal · long-context
300,000 ctx · $0.80/1M in

Amazon Nova Lite

amazon

Amazon Nova Lite is a very low-cost multimodal model that can process image, video, and text inputs. Fast and accurate for a wide range of tasks requiring visual and language understanding.

vision · multimodal · cheap
300,000 ctx · $0.06/1M in

OpenAI: GPT-4

openai

OpenAI's flagship model, GPT-4 is a large-scale multimodal language model capable of solving difficult problems with greater accuracy than previous models due to its broader general knowledge and advanced reasoning...

text · vision · reasoning
8,191 ctx · $30.00/1M in

OpenAI: GPT-4 Turbo (older v1106)

openai

The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mode and function calling. Training data: up to April 2023.

text · vision · long-context
128,000 ctx · $10.00/1M in

Auto Router

openrouter

Your prompt will be processed by a meta-model and routed to one of dozens of models (see below), optimizing for the best possible output. To see which model was used,...

text · vision · multimodal
2,000,000 ctx · Free in

Anthropic: Claude 3 Haiku

anthropic

Claude 3 Haiku is Anthropic's fastest and most compact model for near-instant responsiveness. Quick and accurate targeted performance. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-haiku) #multimodal

text · vision · multimodal
200,000 ctx · $0.25/1M in

OpenAI: GPT-4 Turbo

openai

The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mode and function calling. Training data: up to December 2023.

text · vision · multimodal
128,000 ctx · $10.00/1M in

OpenAI: GPT-4o

openai

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as...

text · vision · multimodal
128,000 ctx · $2.50/1M in

OpenAI: GPT-4o (2024-05-13)

openai

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as...

text · vision · multimodal
128,000 ctx · $5.00/1M in

OpenAI: GPT-4o-mini

openai

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting both text and image inputs with text outputs. As their most advanced small model, it is many multiples more affordable...

text · vision · multimodal
128,000 ctx · $0.15/1M in

OpenAI: GPT-4o-mini (2024-07-18)

openai

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting both text and image inputs with text outputs. As their most advanced small model, it is many multiples more affordable...

text · vision · multimodal
128,000 ctx · $0.15/1M in

OpenAI: GPT-4o (2024-08-06)

openai

The 2024-08-06 version of GPT-4o offers improved performance in structured outputs, with the ability to supply a JSON schema in the response_format. Read more [here](https://openai.com/index/introducing-structured-outputs-in-the-api/). GPT-4o ("o" for "omni") is...

text · vision · multimodal
128,000 ctx · $2.50/1M in

Meta: Llama 3.2 11B Vision Instruct

meta-llama

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and...

text · vision · multimodal
131,072 ctx · $0.24/1M in

Anthropic: Claude 3.5 Haiku

anthropic

Claude 3.5 Haiku offers enhanced capabilities in speed, coding accuracy, and tool use. Engineered to excel in real-time applications, it delivers quick response times that are essential for dynamic...

text · vision · multimodal
200,000 ctx · $0.80/1M in

Mistral: Pixtral Large 2411

mistralai

Pixtral Large is a 124B parameter, open-weight, multimodal model built on top of [Mistral Large 2](/mistralai/mistral-large-2411). The model is able to understand documents, charts and natural images. The model is...

text · vision · multimodal
131,072 ctx · $2.00/1M in

OpenAI: GPT-4o (2024-11-20)

openai

The 2024-11-20 version of GPT-4o offers a leveled-up creative writing ability with more natural, engaging, and tailored writing to improve relevance & readability. It's also better at working with uploaded...

text · vision · multimodal
128,000 ctx · $2.50/1M in

Amazon: Nova Pro 1.0

amazon

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December...

text · vision · multimodal
300,000 ctx · $0.80/1M in

Amazon: Nova Lite 1.0

amazon

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite...

text · vision · image
300,000 ctx · $0.06/1M in

OpenAI: o1

openai

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 model series is trained with large-scale reinforcement learning to reason...

text · vision · multimodal
200,000 ctx · $15.00/1M in

MiniMax: MiniMax-01

minimax

MiniMax-01 combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding. It has 456 billion parameters, with 45.9 billion parameters activated per inference, and can handle a context...

text · vision · image
1,000,192 ctx · $0.20/1M in