
The most comprehensive directory of AI models, providers, and agents. Updated daily.


© 2026 modelstop.top. All rights reserved. Updated daily · 4695+ models indexed


AI Model Catalogue

Browse 129 models across providers, modalities, and use cases.


🎙️ Audio & Speech

129 models · Page 3 of 4

voxtral-small-2507

A small audio understanding model released in July 2025


gpt-audio-mini-2025-12-15


gpt-audio-2025-08-28


gpt-4o-mini-audio-preview

⭐1270.0%score


sam-audio-large

SAM-Audio is a foundation model for isolating any sound in audio using text


gpt-4o-mini-audio-preview-2024-12-17


wan2.6-i2v-flash

Image-to-video generation with optional audio, multi-shot narrative support, and faster inference

vision · image · audio

sam-audio-base

A foundation model for isolating any sound in audio using text, visual, or temporal prompts


speech-2.8-hd

Minimax Speech 2.8 HD focuses on high-fidelity audio generation with features like studio-grade quality, flexible emotion control, multilingual support, and voice cloning capabilities

audio · multilingual · free

gpt-4o-audio-preview-2024-12-17

⭐1270.0%score


ace-step-1.5

Music generation


gpt-audio-mini-2025-10-06


gpt-4o-audio-preview-2025-06-03


kling-v3-video

Kling Video 3.0: Generate cinematic videos up to 15 seconds with multi-shot control, native audio, and improved consistency


ace-step-1.5

Ace Step 1.5 open source music generation model


ultimate_rvc

meta-innovation

An extension of AiCoverGen, which provides several new features and improvements, enabling users to generate audio-related content using RVC with ease. Ideal for people who want to incorporate singing functionality into their AI assistant/chatbot/vtuber,


ltx-2.3-pro

High-fidelity video generation with portrait support, audio-to-video, retake, and extend. Text, image, and audio-driven creation up to 4K at 50 FPS.

vision · image · audio

ltx-2.3-fast

Lightning-fast video generation with portrait support, camera controls, and synchronized audio. Up to 20 seconds at 1080p, 4K at 50 FPS.


kling-v3-omni-video

Kling Video 3.0 Omni: Unified multimodal video generation with reference images, video editing, native audio, and multi-shot control

vision · image · audio

heart_mula

meta-innovation

HeartMuLa: A Family of Open Sourced Music Foundation Models


wan-2.7-i2v

Generate videos from images, with support for first-and-last-frame control, clip continuation, and audio synchronization using Alibaba's Wan 2.7 model

vision · image · audio

seedance-2.0

ByteDance's multimodal video generation model with native audio, multimodal reference inputs, and intelligent duration control.

vision · audio · free

p-video

Fast video generation with built-in draft mode for rapid creative iteration. Text-to-video, image-to-video, and audio-to-video in a single endpoint.

vision · image · audio

q3-turbo

Fast video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.

vision · image · audio

lofi

Lo-fi hip-hop music generation with ACE-Step 1.5 + LoRA


music-2.6

Generate full-length songs or instrumentals from a text prompt, with optional auto-generated lyrics


veo-3.1-lite

Google's cost-efficient video generation model with native audio, optimized for high-volume applications


music-cover

Reimagine any song in a different style — change voice, instruments, genre, and arrangement while keeping the original melody


q3-pro

High-fidelity video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.

vision · image · audio

seedance-2.0-fast

A faster variant of Seedance 2.0 for quicker video generation with multimodal inputs and native audio.

vision · audio · free

dotted-waveform-visualizer

Create a dotted waveform video from an audio file


veo-3.1-fast

New and improved version of Veo 3 Fast, with higher-fidelity video, context-aware audio and last frame support


veo-3.1

New and improved version of Veo 3, with higher-fidelity video, context-aware audio, reference image and last frame support

vision · audio · free

wan-2.7-t2v

Generate videos with audio from text prompts using Alibaba's Wan 2.7 model. 1080p, up to 15 seconds, with audio synchronization.


music-2.5

Generate full-length songs with vocals, lyrics, and rich instrumentation from a text prompt


zai-org/GLM-ASR-Nano-2512

zai-org/GLM-ASR-Nano-2512 is an automatic speech recognition model on Hugging Face with ~160,973 monthly downloads. Open access.

audio · open-source

Output: $0.0000/1M
