Vision & Multimodal
267 models · Page 1 of 8
nim/meta/llama-3.2-11b-vision-instruct
nim/meta/llama-3.2-90b-vision-instruct
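These two NIM entries ship without blurbs. As a usage sketch only, the snippet below assumes they sit behind NVIDIA's OpenAI-compatible endpoint; the base URL, environment variable, and exact served model ID are assumptions to verify against the catalog:

```python
# Hedged sketch: querying a hosted vision-instruct model through an
# OpenAI-compatible endpoint. Base URL, env var, and model ID are assumptions.
import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NIM endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # assumed credential name
)

# Vision chat endpoints typically take images as base64 data URIs.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",  # assumed served model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```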
Qwen3-VL-8B-Instruct
Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built for high-fidelity understanding and reasoning across text, images, and video. It features improved multimodal fusion with Interleaved-MRoPE for long-horizon...
Gemma 3 4b it
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...
Qwen3-VL-32B-Instruct
Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text...
Llama Guard 4 12B
Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM...
meta-llama/Llama-Guard-4-12B
Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM...
meta-llama/Llama-3.2-11B-Vision-Instruct
google/gemma-3-4b-it
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...
google/gemma-3-12b-it
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...
Qwen/Qwen3-VL-235B-A22B-Instruct
Qwen3-VL-235B-A22B Instruct is an open-weight multimodal model that unifies strong text generation with visual understanding across images and video. The Instruct model targets general vision-language use (VQA, document parsing, chart/table...
embed-v4.0
Cohere's latest multimodal embedding model supporting text and images for advanced semantic search.
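For the text-plus-image search this entry describes, a minimal sketch follows, assuming the Cohere Python SDK's `embed` call accepts `texts` and base64 `images` for embed-v4.0 the way it does for earlier multimodal embed models; the file name and response handling are illustrative:

```python
# Hedged sketch: pairing a text query with an image document via Cohere's
# embed endpoint. The `images` parameter and response shape are assumed to
# match the SDK's documented behavior for earlier multimodal embed models.
import base64
import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

# Text side of a semantic-search pair.
query = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    texts=["photo of a red bicycle"],
)
print(len(query.embeddings[0]))  # embedding dimensionality

# Image side: the endpoint takes base64 data URIs rather than raw bytes.
with open("bike.jpg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

doc = co.embed(
    model="embed-v4.0",
    input_type="image",
    images=[data_uri],
)
print(len(doc.embeddings[0]))
```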
Qwen/Qwen3-VL-30B-A3B-Instruct
Qwen3-VL-30B-A3B-Instruct is a multimodal model that unifies strong text generation with visual understanding for images and videos. Its Instruct variant optimizes instruction-following for general multimodal tasks. It excels in perception...
google/gemma-4-31B-it
Gemma 4 31B Instruct is Google DeepMind's 30.7B dense multimodal model supporting text and image input with text output. Features a 256K token context window, configurable thinking/reasoning mode, native function...
google/gemma-3-27b-it
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...
c4ai-aya-vision-32b
command-a-vision-07-2025
mistral-medium-2505
Our frontier-class multimodal model, released in May 2025.
increase-resolution
Bria Increase Resolution upscales any image using a dedicated upscaling method that preserves the original image content without regeneration.
image-colorization
Image colorization model from Topaz Labs.
generate-background
Bria Background Generation allows for efficient swapping of backgrounds in images via text prompts or a reference image, delivering realistic and polished results. Trained exclusively on licensed data for safe and risk-free commercial use.
firered-image-edit
FireRed-Image-Edit is a general-purpose image editing model that delivers high-fidelity and consistent editing across a wide range of scenarios.
imagen-3
Google's highest-quality text-to-image model, capable of generating images with fine detail, rich lighting, and beauty.
eraser
SOTA object-removal model that enables precise removal of unwanted objects from images while maintaining high-quality outputs. Trained exclusively on licensed data for safe and risk-free commercial use.
nano-banana
Google's latest image editing model in Gemini 2.5.
riverflow-2.0-pro
Agentic image model optimized for robust, high-precision generation, with support for font control.
fibo
SOTA open-source model trained on licensed data, transforming intent into structured control for precise, high-quality AI image generation in enterprise and agentic workflows.
dreamactor-m2.0
Animate any character (humans, cartoons, animals, even non-humans) from a single image plus a driving video.
p-image-edit-lora
Use LoRAs trained with https://replicate.com/prunaai/p-image-edit-trainer. Find or contribute LoRAs at https://huggingface.co/collections/PrunaAI/p-image-edit-loras.
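Since this entry is an explicit how-to, a minimal invocation sketch follows, assuming the standard Replicate Python client; every input field name here is hypothetical and should be checked against the model's schema on its Replicate page:

```python
# Hedged sketch: applying a trained LoRA with the Replicate Python client
# (reads REPLICATE_API_TOKEN from the environment). The input field names
# below are hypothetical; check the model's schema on its Replicate page.
import replicate

output = replicate.run(
    "prunaai/p-image-edit-lora",                 # model ref from this entry
    input={
        "image": open("room.jpg", "rb"),         # source image to edit
        "prompt": "make the walls sage green",   # edit instruction
        "lora_weights": "PrunaAI/example-lora",  # hypothetical LoRA reference
    },
)
print(output)
```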
imagen-4-fast
Use this fast version of Imagen 4 when speed and cost are more important than quality.
fabric-1.0
VEED Fabric 1.0 is an image-to-video API that turns any image into a talking video.
image-3.2
Commercial-ready text-to-image model trained entirely on licensed data. With only 4B parameters, it delivers exceptional aesthetics and text rendering, and has been evaluated as on par with other leading models on the market.
upscaler
Upscale images by 2x or 4x.
wan2.6-i2v-flash
Image-to-video generation with optional audio, multi-shot narrative support, and faster inference.
fibo-edit
FIBO-Edit brings the power of structured prompt generation to image editing.
imagen-3-fast
A faster, cheaper Imagen 3 model for when price or speed are more important than final image quality.
