modelstop.top

AI Model Catalogue

Browse 267 models across providers, modalities, and use cases.

👁️ Vision & Multimodal

267 models · Page 8 of 8

ByteDance Seed: Seed-2.0-Lite

bytedance-seed

Seed-2.0-Lite is a versatile, cost‑efficient enterprise workhorse that delivers strong multimodal and agent capabilities while offering noticeably lower latency, making it a practical default choice for most production workloads across...

text · vision · multimodal
262,144 ctx · $0.25/1M in

Mistral: Mistral Small 4

mistralai

Mistral Small 4 is the next major release in the Mistral Small family, unifying the capabilities of several flagship Mistral models into a single system. It combines strong reasoning from...

text · vision · multimodal
262,144 ctx · $0.15/1M in

OpenAI: GPT-5.4 Mini

openai

GPT-5.4 mini brings the core capabilities of GPT-5.4 to a faster, more efficient model optimized for high-throughput workloads. It supports text and image inputs with strong performance across reasoning, coding,...

text · vision · multimodal
400,000 ctx · $0.75/1M in

OpenAI: GPT-5.4 Nano

openai

GPT-5.4 nano is the most lightweight and cost-efficient variant of the GPT-5.4 family, optimized for speed-critical and high-volume tasks. It supports text and image inputs and is designed for low-latency...

text · vision · multimodal
400,000 ctx · $0.20/1M in

Xiaomi: MiMo-V2-Omni

xiaomi

MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capability - visual grounding, multi-step...

text · vision · multimodal
262,144 ctx · $0.40/1M in

Reka Edge

rekaai

Reka Edge is an extremely efficient 7B multimodal vision-language model that accepts image/video+text inputs and generates text outputs. This model is optimized specifically to deliver industry-leading performance in image understanding,...

text · vision · image
16,384 ctx · $0.10/1M in

Google: Lyria 3 Clip Preview

google

30 second duration clips are priced at $0.04 per clip. Lyria 3 is Google's family of music generation models, available through the Gemini API. With Lyria 3, you can generate...

text · vision · image
1,048,576 ctx · $0.00/1M in

Google: Lyria 3 Pro Preview

google

Full-length songs are priced at $0.08 per song. Lyria 3 is Google's family of music generation models, available through the Gemini API. With Lyria 3, you can generate high-quality, 48kHz...

text · vision · image
1,048,576 ctx · $0.00/1M in

xAI: Grok 4.20

x-ai

Grok 4.20 is xAI's newest flagship model with industry-leading speed and agentic tool calling capabilities. It combines the lowest hallucination rate on the market with strict prompt adherence, delivering consistently...

text · vision · multimodal
2,000,000 ctx · $2.00/1M in

xAI: Grok 4.20 Multi-Agent

x-ai

Grok 4.20 Multi-Agent is a variant of xAI’s Grok 4.20 designed for collaborative, agent-based workflows. Multiple agents operate in parallel to conduct deep research, coordinate tool use, and synthesize information...

text · vision · multimodal
2,000,000 ctx · $2.00/1M in

Z.ai: GLM 5V Turbo

z-ai

GLM-5V-Turbo is Z.ai’s first native multimodal agent foundation model, built for vision-based coding and agent-driven tasks. It natively handles image, video, and text inputs, excels at long-horizon planning, complex coding,...

text · vision · multimodal
202,752 ctx · $1.20/1M in

Qwen: Qwen3.6 Plus

qwen

Qwen 3.6 Plus builds on a hybrid architecture that combines efficient linear attention with sparse mixture-of-experts routing, enabling strong scalability and high-performance inference. Compared to the 3.5 series, it delivers...

text · vision · multimodal
1,000,000 ctx · $0.33/1M in

Google: Gemma 4 31B (free)

google

Gemma 4 31B Instruct is Google DeepMind's 30.7B dense multimodal model supporting text and image input with text output. Features a 256K token context window, configurable thinking/reasoning mode, native function...

text · vision · multimodal
262,144 ctx · $0.14/1M in

Google: Gemma 4 26B A4B (free)

google

Gemma 4 26B A4B IT is an instruction-tuned Mixture-of-Experts (MoE) model from Google DeepMind. Despite 25.2B total parameters, only 3.8B activate per token during inference, delivering near-31B quality at...

text · vision · multimodal
262,144 ctx · $0.12/1M in

Anthropic: Claude Opus 4.6 (Fast)

anthropic

Fast-mode variant of [Opus 4.6](/anthropic/claude-opus-4.6) - identical capabilities with higher output speed at premium 6x pricing. Learn more in Anthropic's docs: https://platform.claude.com/docs/en/build-with-claude/fast-mode

text · vision · multimodal
1,000,000 ctx · $30.00/1M in