Audio & Speech
125 models · Page 3 of 4
sam-audio-base
A foundation model for isolating any sound in audio using text, visual, or temporal prompts
gpt-4o-audio-preview-2025-06-03
speech-2.8-hd
Minimax Speech 2.8 HD focuses on high-fidelity audio generation with features like studio-grade quality, flexible emotion control, multilingual support, and voice cloning capabilities
sam-audio-large
SAM-Audio is a foundation model for isolating any sound in audio using text
gpt-audio-mini-2025-10-06
gpt-4o-mini-audio-preview
gpt-4o-audio-preview-2024-12-17
ace-step-1.5
Music generation
gpt-audio-2025-08-28
wan2.6-i2v-flash
Image-to-video generation with optional audio, multi-shot narrative support, and faster inference
ultimate_rvc
An extension of AiCoverGen that provides several new features and improvements, enabling users to generate audio-related content using RVC with ease. Ideal for people who want to incorporate singing functionality into their AI assistant, chatbot, or VTuber,...
heart_mula
HeartMuLa: A Family of Open Sourced Music Foundation Models
ltx-2.3-pro
High-fidelity video generation with portrait support, audio-to-video, retake, and extend. Text, image, and audio-driven creation up to 4K at 50 FPS.
ltx-2.3-fast
Lightning-fast video generation with portrait support, camera controls, and synchronized audio. Up to 20 seconds at 1080p, 4K at 50 FPS.
ace-step-1.5
Ace Step 1.5 open source music generation model
kling-v3-omni-video
Kling Video 3.0 Omni: Unified multimodal video generation with reference images, video editing, native audio, and multi-shot control
q3-turbo
Fast video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.
veo-3.1-lite
Google's cost-efficient video generation model with native audio, optimized for high-volume applications
q3-pro
High-fidelity video generation with text-to-video, image-to-video, and start-end-to-video modes. Up to 16 seconds at 1080p with synchronized audio.
p-video
Fast video generation with built-in draft mode for rapid creative iteration. Text-to-video, image-to-video, and audio-to-video in a single endpoint.
seedance-2.0-fast
A faster variant of Seedance 2.0 for quicker video generation with multimodal inputs and native audio.
lofi
Lo-fi hip-hop music generation with ACE-Step 1.5 + LoRA
music-cover
Reimagine any song in a different style: change voice, instruments, genre, and arrangement while keeping the original melody
music-2.5
Generate full-length songs with vocals, lyrics, and rich instrumentation from a text prompt
seedance-2.0
ByteDance's multimodal video generation model with native audio, multimodal reference inputs, and intelligent duration control.
dotted-waveform-visualizer
Create a dotted waveform video from an audio file
veo-3.1-fast
New and improved version of Veo 3 Fast, with higher-fidelity video, context-aware audio and last frame support
veo-3.1
New and improved version of Veo 3, with higher-fidelity video, context-aware audio, reference image and last frame support
wan-2.7-t2v
Generate videos with audio from text prompts using Alibaba's Wan 2.7 model. 1080p, up to 15 seconds, with audio synchronization.
wan-2.7-i2v
Generate videos from images, with support for first-and-last-frame control, clip continuation, and audio synchronization using Alibaba's Wan 2.7 model
music-2.6
Generate full-length songs or instrumentals from a text prompt, with optional auto-generated lyrics
zai-org/GLM-ASR-Nano-2512
zai-org/GLM-ASR-Nano-2512 is an automatic speech recognition model on Hugging Face with ~160,973 monthly downloads. Open access.
Auto Router
Your prompt will be processed by a meta-model and routed to one of dozens of models (see below), optimizing for the best possible output. To see which model was used,...
Google: Gemini 2.0 Flash
Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5). It...
Google: Gemini 2.0 Flash Lite
Gemini 2.0 Flash Lite offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5),...
Google: Gemini 2.5 Pro Preview 05-06
Gemini 2.5 Pro is Google's state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs "thinking" capabilities, enabling it to reason through responses with enhanced accuracy...
