modelstop.top

AI Model Catalogue

Browse 125 models across providers, modalities, and use cases.

πŸŽ™οΈ Audio & Speech

125 models · Page 3 of 4

sam-audio-base

geopti

A foundation model for isolating any sound in audio using text, visual, or temporal prompts

audio · free
Input: free

gpt-4o-audio-preview-2025-06-03

openai

text · audio · free
Input: free

speech-2.8-hd

minimax

Minimax Speech 2.8 HD focuses on high-fidelity audio generation with features like studio-grade quality, flexible emotion control, multilingual support, and voice cloning capabilities

audio · multilingual · free
Input: free

sam-audio-large

geopti

SAM-Audio is a foundation model for isolating any sound in audio using text

audio · free
Input: free

gpt-audio-mini-2025-10-06

openai

text · audio · free
Input: free

gpt-4o-mini-audio-preview

openai

text · audio · free
Input: free

gpt-4o-audio-preview-2024-12-17

openai

text · audio · free
Input: free

ace-step-1.5

visoar

Music generation

audio · free
Input: free

gpt-audio-2025-08-28

openai

text · audio · free
Input: free

wan2.6-i2v-flash

wan-video

Image-to-video generation with optional audio, multi-shot narrative support, and faster inference

vision · image · audio
Input: free

ultimate_rvc

meta-innovation

An extension of AiCoverGen that adds several new features and improvements, making it easy to generate audio content with RVC. Ideal for people who want to add singing functionality to their AI assistant, chatbot, or VTuber...

audio · free
Input: free

heart_mula

meta-innovation

HeartMuLa: a family of open-source music foundation models

audio · free
Input: free

ltx-2.3-pro

lightricks

High-fidelity video generation with portrait support, audio-to-video, retake, and extend. Text, image, and audio-driven creation up to 4K at 50 FPS.

vision · image · audio
Input: free

ltx-2.3-fast

lightricks

Lightning-fast video generation with portrait support, camera controls, and synchronized audio. Up to 20 seconds at 1080p, 4K at 50 FPS.

audio · free
Input: free

ace-step-1.5

fishaudio

ACE-Step 1.5, an open-source music generation model

audio · free
Input: free

kling-v3-omni-video

kwaivgi

Kling Video 3.0 Omni: Unified multimodal video generation with reference images, video editing, native audio, and multi-shot control

vision · image · audio
Input: free

q3-turbo

vidu

Fast video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.

vision · image · audio
Input: free

veo-3.1-lite

google

Google's cost-efficient video generation model with native audio, optimized for high-volume applications

audio · free
Input: free

q3-pro

vidu

High-fidelity video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.

vision · image · audio
Input: free

p-video

prunaai

Fast video generation with built-in draft mode for rapid creative iteration. Text-to-video, image-to-video, and audio-to-video in a single endpoint.

vision · image · audio
Input: free

seedance-2.0-fast

bytedance

A faster variant of Seedance 2.0 for quicker video generation with multimodal inputs and native audio.

vision · audio · free
Input: free

lofi

frow

Lo-fi hip-hop music generation with ACE-Step 1.5 + LoRA

audio · free
Input: free

music-cover

minimax

Reimagine any song in a different style: change voice, instruments, genre, and arrangement while keeping the original melody

audio · free
Input: free

music-2.5

minimax

Generate full-length songs with vocals, lyrics, and rich instrumentation from a text prompt

audio · free
Input: free

seedance-2.0

bytedance

ByteDance's multimodal video generation model with native audio, multimodal reference inputs, and intelligent duration control.

vision · audio · free
Input: free

dotted-waveform-visualizer

lucataco

Create a dotted waveform video from an audio file

audio · free
Input: free

veo-3.1-fast

google

New and improved version of Veo 3 Fast, with higher-fidelity video, context-aware audio and last frame support

audio · free
Input: free

veo-3.1

google

New and improved version of Veo 3, with higher-fidelity video, context-aware audio, reference image and last frame support

vision · audio · free
Input: free

wan-2.7-t2v

wan-video

Generate videos with audio from text prompts using Alibaba's Wan 2.7 model. 1080p, up to 15 seconds, with audio synchronization.

audio · free
Input: free

wan-2.7-i2v

wan-video

Generate videos from images, with support for first-and-last-frame control, clip continuation, and audio synchronization using Alibaba's Wan 2.7 model

vision · image · audio
Input: free

music-2.6

minimax

Generate full-length songs or instrumentals from a text prompt, with optional auto-generated lyrics

audio · free
Input: free

zai-org/GLM-ASR-Nano-2512

zai-org

zai-org/GLM-ASR-Nano-2512 is an automatic speech recognition model on Hugging Face with ~160,973 monthly downloads. Open access.

audio · open-source
Input: $0.00/1M

Auto Router

openrouter

Your prompt will be processed by a meta-model and routed to one of dozens of models (see below), optimizing for the best possible output. To see which model was used,...

text · vision · multimodal
Context: 2,000,000 · Input: free

Google: Gemini 2.0 Flash

google

Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to Gemini Flash 1.5, while maintaining quality on par with larger models like Gemini Pro 1.5. It...

text · vision · multimodal
Context: 1,000,000 · Input: $0.10/1M

Google: Gemini 2.0 Flash Lite

google

Gemini 2.0 Flash Lite offers a significantly faster time to first token (TTFT) compared to Gemini Flash 1.5, while maintaining quality on par with larger models like Gemini Pro 1.5,...

text · vision · multimodal
Context: 1,048,576 · Input: $0.07/1M

Google: Gemini 2.5 Pro Preview 05-06

google

Gemini 2.5 Pro is Google's state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs "thinking" capabilities, enabling it to reason through responses with enhanced accuracy...

text · vision · multimodal
Context: 1,048,576 · Input: $1.25/1M