modelstop.top

AI Model Catalogue

Browse 267 models across providers, modalities, and use cases.

๐ŸŒ All Models

267 models · Page 1 of 8

nim/meta/llama-3.2-11b-vision-instruct

nim

text · vision · instruct
16,384 ctx · Free in

nim/meta/llama-3.2-90b-vision-instruct

nim

text · vision · instruct
16,384 ctx · Free in

Qwen3-VL-8B-Instruct

qwen

Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built for high-fidelity understanding and reasoning across text, images, and video. It features improved multimodal fusion with Interleaved-MRoPE for long-horizon...

text · vision · reasoning
262,144 ctx · $0.18/1M in

Gemma 3 4b it

google

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...

text · vision · reasoning
65,536 ctx · Free in

Qwen3-VL-32B-Instruct

qwen

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text...

text · vision · reasoning
262,144 ctx · $0.50/1M in

Llama Guard 4 12B

meta-llama

Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM...

text · vision · cheap
1,048,576 ctx · $0.20/1M in

meta-llama/Llama-Guard-4-12B

deepinfra

Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM...

text · vision · cheap
163,840 ctx · Free in

meta-llama/Llama-3.2-11B-Vision-Instruct

deepinfra

text · vision · instruct
Free in

google/gemma-3-4b-it

deepinfra

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...

text · vision · reasoning
32,768 ctx · Free in

google/gemma-3-12b-it

deepinfra

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...

text · vision · reasoning
32,768 ctx · Free in

Qwen/Qwen3-VL-235B-A22B-Instruct

deepinfra

Qwen3-VL-235B-A22B Instruct is an open-weight multimodal model that unifies strong text generation with visual understanding across images and video. The Instruct model targets general vision-language use (VQA, document parsing, chart/table...

text · vision · instruct
262,144 ctx · Free in

embed-v4.0

cohere

Cohere's latest multimodal embedding model supporting text and images for advanced semantic search.

text · vision · free
8,192 ctx · Free in

Qwen/Qwen3-VL-30B-A3B-Instruct

deepinfra

Qwen3-VL-30B-A3B-Instruct is a multimodal model that unifies strong text generation with visual understanding for images and videos. Its Instruct variant optimizes instruction-following for general multimodal tasks. It excels in perception...

text · vision · instruct
131,072 ctx · Free in

google/gemma-4-31B-it

deepinfra

Gemma 4 31B Instruct is Google DeepMind's 30.7B dense multimodal model supporting text and image input with text output. Features a 256K token context window, configurable thinking/reasoning mode, native function...

text · vision · reasoning
262,144 ctx · Free in

google/gemma-3-27b-it

deepinfra

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...

text · vision · reasoning
131,072 ctx · Free in

c4ai-aya-vision-32b

cohere

text · vision · free
16,384 ctx · Free in

command-a-vision-07-2025

cohere

text · vision · free
128,000 ctx · Free in

mistral-medium-2505

mistralai

Our frontier-class multimodal model released May 2025.

text · vision · free
131,072 ctx · Free in

increase-resolution

bria

Bria Increase resolution upscales the resolution of any image. It increases resolution using a dedicated upscaling method that preserves the original image content without regeneration.

vision · image · free
Free in

image-colorization

topazlabs

Image colorization model from Topaz Labs

vision · free
Free in

generate-background

bria

Bria Background Generation swaps image backgrounds efficiently via a text prompt or a reference image, delivering realistic, polished results. Trained exclusively on licensed data for safe and risk-free commercial use.

vision · image · free
Free in

firered-image-edit

prunaai

FireRed-Image-Edit is a general-purpose image editing model that delivers high-fidelity and consistent editing across a wide range of scenarios.

vision · free
Free in

imagen-3

google

Google's highest quality text-to-image model, capable of generating images with detail, rich lighting and beauty

vision · image · free
Free in

eraser

bria

State-of-the-art object removal that precisely removes unwanted objects from images while maintaining high-quality output. Trained exclusively on licensed data for safe and risk-free commercial use.

vision · free
Free in

nano-banana

google

Google's latest image editing model in Gemini 2.5

vision · free
Free in

riverflow-2.0-pro

sourceful

Agentic image model optimized for robust, high-precision generations supporting font control

vision · image · agents
Free in

fibo

bria

State-of-the-art open-source model trained on licensed data, transforming intent into structured control for precise, high-quality AI image generation in enterprise and agentic workflows.

vision · image · agents
Free in

dreamactor-m2.0

bytedance

Animate any character (humans, cartoons, animals, even non-humans) from a single image plus a driving video.

vision · free
Free in

p-image-edit-lora

prunaai

Use trained LoRAs from https://replicate.com/prunaai/p-image-edit-trainer. Find or contribute LoRAs here: https://huggingface.co/collections/PrunaAI/p-image-edit-loras.

vision · free
Free in

imagen-4-fast

google

Use this fast version of Imagen 4 when speed and cost are more important than quality

vision · free
Free in

fabric-1.0

veed

VEED Fabric 1.0 is an image-to-video API that turns any image into a talking video

vision · free
Free in

image-3.2

bria

A commercial-ready text-to-image model trained entirely on licensed data. With only 4B parameters, it provides exceptional aesthetics and text rendering, and is evaluated to be on par with other leading models on the market.

vision · image · free
Free in

upscaler

google

Upscale images 2x or 4x

vision · free
Free in

wan2.6-i2v-flash

wan-video

Image-to-video generation with optional audio, multi-shot narrative support, and faster inference

vision · image · audio
Free in

fibo-edit

bria

FIBO-Edit brings the power of structured prompt generation to image editing

vision · image · free
Free in

imagen-3-fast

google

A faster and cheaper Imagen 3 model, for when price or speed are more important than final image quality

vision · free
Free in