modelstop.top

AI Model Catalogue

Browse 267 models across providers, modalities, and use cases.

🌐 All Models

267 models · Page 4 of 8

veo-3.1

google

New and improved version of Veo 3, with higher-fidelity video, context-aware audio, and support for reference images and last frames.

vision · audio · free
Free in

lyria-3-pro

google

Generate full-length songs up to 3 minutes from text prompts or images with Lyria 3 Pro, Google's most capable music generation model

vision · image · free
Free in

seedream-4.5

bytedance

Seedream 4.5: Upgraded Bytedance image model with stronger spatial understanding and world knowledge

vision · free
Free in

depth-anything-v3-metric-pano

vufinder

Monocular metric depth estimation for panoramic images

vision · free
Free in

q3-turbo

vidu

Fast video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.

vision · image · audio
Free in

siglip-large-patch16-384

devarsh-mavani-19

Get image embeddings using siglip-large-patch16-384

vision · free
Free in

kling-v2.6-motion-control

kwaivgi

Enables precise control of character actions and expressions from a reference image.

vision · free
Free in

kling-v2.5-turbo-pro

kwaivgi

Kling 2.5 Turbo Pro: Unlock pro-level text-to-video and image-to-video creation with smooth motion, cinematic depth, and remarkable prompt adherence.

vision · free
Free in

metric3dv2

visionaix

Metric3D v2 (TPAMI 2024): Monocular metric depth and surface normals from a single image. Predicts real-world depth in meters. Works indoors and outdoors.

vision · free
Free in

ernie-image

prunaai

ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu

vision · image · free
Free in

wan-2.7-image

wan-video

Generate and edit images with Alibaba's Wan 2.7

vision · image · free
Free in

q3-pro

vidu

High-fidelity video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.

vision · image · audio
Free in

microsoft/Phi-3.5-vision-instruct

microsoft

microsoft/Phi-3.5-vision-instruct is an image-text-to-text model on Hugging Face with ~1,482,472 monthly downloads. Open access.

vision · instruct · open-source
$0.00/1M in

Stable Diffusion 3.5 Large

Stability AI

Stable Diffusion 3.5 Large is Stability AI's most capable text-to-image model, delivering photorealistic and creative imagery with excellent prompt adherence and detail. Features multimodal diffusion transformer architecture.

vision · open-source
$0.00/1M in

Amazon Nova Pro

amazon

Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost across a wide range of tasks. Supports text, image, and video inputs.

vision · multimodal · long-context
300,000 ctx · $0.80/1M in

Amazon Nova Lite

amazon

Amazon Nova Lite is a very low-cost multimodal model that can process image, video, and text inputs. Fast and accurate for a wide range of tasks requiring visual and language understanding.

vision · multimodal · cheap
300,000 ctx · $0.06/1M in

OpenAI: GPT-4

openai

OpenAI's flagship model, GPT-4 is a large-scale multimodal language model capable of solving difficult problems with greater accuracy than previous models due to its broader general knowledge and advanced reasoning...

text · vision · reasoning
8,191 ctx · $30.00/1M in

OpenAI: GPT-4 Turbo (older v1106)

openai

The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mode and function calling. Training data: up to April 2023.

text · vision · long-context
128,000 ctx · $10.00/1M in

Auto Router

openrouter

Your prompt will be processed by a meta-model and routed to one of dozens of models (see below), optimizing for the best possible output. To see which model was used,...

text · vision · multimodal
2,000,000 ctx · Free in

Anthropic: Claude 3 Haiku

anthropic

Claude 3 Haiku is Anthropic's fastest and most compact model for near-instant responsiveness. Quick and accurate targeted performance. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-haiku) #multimodal

text · vision · multimodal
200,000 ctx · $0.25/1M in

OpenAI: GPT-4 Turbo

openai

The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mode and function calling. Training data: up to December 2023.

text · vision · multimodal
128,000 ctx · $10.00/1M in

OpenAI: GPT-4o

openai

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as...

text · vision · multimodal
128,000 ctx · $2.50/1M in

OpenAI: GPT-4o (2024-05-13)

openai

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as...

text · vision · multimodal
128,000 ctx · $5.00/1M in

OpenAI: GPT-4o-mini

openai

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting both text and image inputs with text outputs. As their most advanced small model, it is many multiples more affordable...

text · vision · multimodal
128,000 ctx · $0.15/1M in

OpenAI: GPT-4o-mini (2024-07-18)

openai

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting both text and image inputs with text outputs. As their most advanced small model, it is many multiples more affordable...

text · vision · multimodal
128,000 ctx · $0.15/1M in

OpenAI: GPT-4o (2024-08-06)

openai

The 2024-08-06 version of GPT-4o offers improved performance in structured outputs, with the ability to supply a JSON schema in the response_format. Read more [here](https://openai.com/index/introducing-structured-outputs-in-the-api/). GPT-4o ("o" for "omni") is...

text · vision · multimodal
128,000 ctx · $2.50/1M in

Meta: Llama 3.2 11B Vision Instruct

meta-llama

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and...

text · vision · multimodal
131,072 ctx · $0.24/1M in

Anthropic: Claude 3.5 Haiku

anthropic

Claude 3.5 Haiku offers enhanced capabilities in speed, coding accuracy, and tool use. Engineered to excel in real-time applications, it delivers quick response times that are essential for dynamic...

text · vision · multimodal
200,000 ctx · $0.80/1M in

Mistral: Pixtral Large 2411

mistralai

Pixtral Large is a 124B parameter, open-weight, multimodal model built on top of [Mistral Large 2](/mistralai/mistral-large-2411). The model is able to understand documents, charts and natural images. The model is...

text · vision · multimodal
131,072 ctx · $2.00/1M in

OpenAI: GPT-4o (2024-11-20)

openai

The 2024-11-20 version of GPT-4o offers a leveled-up creative writing ability with more natural, engaging, and tailored writing to improve relevance & readability. It's also better at working with uploaded...

text · vision · multimodal
128,000 ctx · $2.50/1M in

Amazon: Nova Pro 1.0

amazon

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December...

text · vision · multimodal
300,000 ctx · $0.80/1M in

Amazon: Nova Lite 1.0

amazon

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite...

text · vision · image
300,000 ctx · $0.06/1M in

OpenAI: o1

openai

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 model series is trained with large-scale reinforcement learning to reason...

text · vision · multimodal
200,000 ctx · $15.00/1M in

MiniMax: MiniMax-01

minimax

MiniMax-01 combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding. It has 456 billion parameters, with 45.9 billion parameters activated per inference, and can handle a context...

text · vision · image
1,000,192 ctx · $0.20/1M in

Perplexity: Sonar

perplexity

Sonar is lightweight, affordable, fast, and simple to use, now featuring citations and the ability to customize sources. It is designed for companies seeking to integrate lightweight question-and-answer features...

text · vision · multimodal
127,072 ctx · $1.00/1M in

Qwen: Qwen2.5 VL 72B Instruct

qwen

Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and insects. It is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

text · vision · multimodal
32,000 ctx · $0.80/1M in