New from Meta AI

Segment Anything for Audio

Isolate any sound from complex audio mixtures using natural language, visual cues, or time-based prompts. Meta's state-of-the-art foundation model for audio source separation.

Three Ways to Isolate Sound

SAM-Audio understands multiple input modalities, giving you flexible control over audio separation.

Text Prompts

Describe what you want to isolate in natural language. Type "dog barking", "singing voice", or "thunderstorm" and SAM-Audio extracts it.

"acoustic guitar strumming"

Visual Prompts

Click on a person or instrument in a video frame. SAM-Audio uses visual-audio correspondence to isolate that source's sound.

Click on speaker in video

Temporal Prompts

Mark a time range where your target sound occurs. SAM-Audio uses that segment as a reference and extracts similar sounds throughout the rest of the recording.

0:05 - 0:08
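
A time range like the one above has to be mapped onto audio samples before it can mark a segment. The following is a minimal sketch of that conversion, not SAM-Audio's actual API: the function names and the 16 kHz sample rate used in the usage note are assumptions chosen for illustration.

```python
def parse_timestamp(text: str) -> float:
    """Convert 'M:SS' or 'H:MM:SS' into seconds."""
    seconds = 0.0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_time_range(text: str) -> tuple[float, float]:
    """Convert a range such as '0:05 - 0:08' into (start, end) in seconds."""
    start_text, end_text = text.split("-")
    start, end = parse_timestamp(start_text), parse_timestamp(end_text)
    if end <= start:
        raise ValueError("end of range must come after start")
    return start, end


def to_sample_span(start: float, end: float, sample_rate: int) -> tuple[int, int]:
    """Map a (start, end) range in seconds onto sample indices."""
    return round(start * sample_rate), round(end * sample_rate)
```

For example, `to_sample_span(*parse_time_range("0:05 - 0:08"), 16000)` covers the three seconds between samples 80000 and 128000, assuming a 16 kHz recording.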

How SAM-Audio Works

A deep dive into the architecture powering state-of-the-art audio separation.

1. Input Encoding

Your audio mixture and prompts (text, visual, or temporal) are encoded into a shared multimodal representation space using PE-AV (Perception Encoder Audio-Visual).

2

Flow-Matching Diffusion

The model uses a flow-matching diffusion transformer to iteratively refine the separation, generating clean target audio from the mixture conditioned on your prompt.
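
To make "iteratively refine" concrete, here is a toy sketch of flow-matching sampling: start from an initial vector and follow a velocity field from t=0 to t=1 with Euler steps. In the real model the velocity would come from the transformer, conditioned on the mixture and the prompt. Here `make_toy_velocity` is a hypothetical stand-in that simply points at a known target, so this illustrates only the sampling loop, not SAM-Audio's implementation.

```python
def euler_sample(start, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 using Euler steps."""
    x = list(start)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the learned network: a straight-line flow to `target`."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.2, -0.4, 0.9]    # starting point (assumed; a real sampler would draw noise)
clean = [0.0, 0.5, -0.5]    # pretend "clean target" that the toy field flows toward
result = euler_sample(noise, make_toy_velocity(clean), steps=8)
```

More steps give a finer approximation of the path when the velocity field is learned rather than analytic, at the cost of more network evaluations.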

3. Quality Re-ranking

Multiple candidate separations are generated and ranked using CLAP (text-audio similarity), Judge (separation quality), and ImageBind (visual-audio alignment).
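
The re-ranking step can be pictured as scoring each candidate and sorting. The sketch below assumes the three scores are combined as a weighted sum with equal weights. The page does not say how SAM-Audio actually combines them, so `WEIGHTS`, `Candidate`, and `rerank` are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    clap: float        # text-audio similarity
    judge: float       # separation quality
    imagebind: float   # visual-audio alignment


# Assumed equal weighting; the real combination rule is not documented here.
WEIGHTS = {"clap": 1.0, "judge": 1.0, "imagebind": 1.0}


def combined_score(c: Candidate) -> float:
    """Weighted sum of the three per-candidate scores."""
    return (WEIGHTS["clap"] * c.clap
            + WEIGHTS["judge"] * c.judge
            + WEIGHTS["imagebind"] * c.imagebind)


def rerank(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates ordered from best to worst combined score."""
    return sorted(candidates, key=combined_score, reverse=True)


candidates = [
    Candidate("a", clap=0.5, judge=0.5, imagebind=0.5),
    Candidate("b", clap=0.9, judge=0.8, imagebind=0.7),
    Candidate("c", clap=0.1, judge=0.9, imagebind=0.2),
]
best = rerank(candidates)[0]
```

For a text-only prompt, one would presumably set the `imagebind` weight to zero, since there is no video to align against.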

4. Output Generation

You receive both the isolated target audio and the residual (everything else), giving you complete control over your audio.
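
Having both outputs means you can rebalance them yourself. As a minimal sketch, assuming the target and residual arrive as equal-length lists of float samples in the range [-1, 1] (an assumption about the data layout, not a documented format), a remix with per-stem gains might look like this:

```python
def remix(target, residual, target_gain=1.0, residual_gain=1.0):
    """Mix the two stems back together with independent gains, clipping to [-1, 1]."""
    if len(target) != len(residual):
        raise ValueError("target and residual must have the same length")
    mixed = [target_gain * t + residual_gain * r for t, r in zip(target, residual)]
    return [max(-1.0, min(1.0, s)) for s in mixed]


target = [0.5, -0.25]
residual = [0.25, 0.25]
solo = remix(target, residual, target_gain=1.0, residual_gain=0.0)     # target only
boosted = remix(target, residual, target_gain=2.0, residual_gain=2.0)  # louder, clipped
```

Setting `target_gain=0.0` instead gives the mixture with the target removed, which is the noise-removal use of the same two stems.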

Real-time Factor: 0.7x (near real-time processing)
Model Sizes: 3 (Small, Base, Large)
Performance: SOTA (Speech, Music, SFX)
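
To read the real-time factor, assume the usual definition: processing time divided by audio duration. Under that assumption, a quick estimate of how long a clip takes to process is a single multiplication. The page does not state which hardware or model size the 0.7x figure refers to, so treat the result as a rough guide only.

```python
def processing_seconds(audio_seconds: float, real_time_factor: float = 0.7) -> float:
    """Estimate processing time, assuming RTF = processing time / audio duration."""
    if audio_seconds < 0 or real_time_factor <= 0:
        raise ValueError("duration must be non-negative and RTF must be positive")
    return audio_seconds * real_time_factor


estimate = processing_seconds(60.0)  # one minute of audio at the quoted 0.7x
```

Under this definition, one minute of audio would take roughly 42 seconds to process.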

What Can You Do With SAM-Audio?

From professional production to accessibility applications.

Music Production

Isolate vocals, drums, bass, or any instrument from existing tracks. Perfect for remixing, sampling, or creating stems.

Podcast & Video

Remove background noise, isolate speech from multiple speakers, or extract specific sound effects from recordings.

Film & TV Post

Clean dialogue, separate foley, or re-balance audio elements in post-production without access to original stems.

Accessibility

Help hearing-impaired users focus on specific sounds. Meta partners with hearing-aid maker Starkey and with 2gether International.

Audio Analysis

Extract and analyze specific audio events for research, forensics, or quality assurance applications.

Creative Tools

Build new audio applications, integrate separation into DAWs, or create interactive sound experiences.

Technical Details

For developers and researchers who want to understand SAM-Audio's capabilities.

Model Variants

Size    Parameters   Use Case
Small   ~500M        Fast inference, edge deployment
Base    ~1B          Balanced performance
Large   ~3B          Maximum quality

Each variant also has a TV-specialized version optimized for visual prompting.
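
When choosing a variant, a back-of-the-envelope memory estimate can help. The sketch below multiplies the approximate parameter counts from the table by an assumed 2 bytes per parameter (16-bit weights). It covers weights only, ignoring activations and any auxiliary encoders or ranking models, so real requirements will be higher.

```python
# Approximate parameter counts taken from the table above.
APPROX_PARAMS = {"small": 500e6, "base": 1e9, "large": 3e9}


def weight_memory_gb(variant: str, bytes_per_param: int = 2) -> float:
    """Rough size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return APPROX_PARAMS[variant] * bytes_per_param / 1e9


estimates = {name: weight_memory_gb(name) for name in APPROX_PARAMS}
```

With 16-bit weights this comes to about 1 GB, 2 GB, and 6 GB for Small, Base, and Large respectively.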

Supported Formats

.wav .mp3 .mp4 .mov
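
If you are batching files, it can be worth checking extensions up front. This is a small sketch using only the four extensions listed above. The audio/video split and the `classify_input` helper are assumptions for illustration, not part of SAM-Audio.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3"}
VIDEO_EXTENSIONS = {".mp4", ".mov"}


def classify_input(path: str):
    """Return 'audio', 'video', or None for an unsupported extension."""
    suffix = Path(path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None
```

Only inputs classified as `"video"` carry frames, so visual prompts would apply to those alone.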

Quality Assessment

  • CLAP Score: Text-audio similarity measurement
  • Judge Score: Separation quality assessment
  • ImageBind: Visual-audio alignment scoring
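
Similarity scores of this kind are commonly computed as the cosine similarity between two embedding vectors, one for the prompt and one for the separated audio. The sketch below shows that computation on plain lists. Whether SAM-Audio computes its scores exactly this way is an assumption, and the embeddings themselves would come from the respective models, not from this code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

Identical directions score 1, orthogonal vectors score 0, and opposite directions score -1.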

Benchmarks

SAM-Audio achieves state-of-the-art results on SAM-Audio-Bench, a comprehensive evaluation covering:

  • Speech separation and isolation
  • Music and instrument separation
  • Sound effects extraction
  • Multi-modal prompt handling

Frequently Asked Questions

Everything you need to know about SAM-Audio.

What is SAM-Audio?

SAM-Audio (Segment Anything for Audio) is Meta's foundation model for audio source separation. It can isolate any sound from complex audio mixtures using text descriptions, visual cues from video, or time-based prompts. It's part of Meta's "Segment Anything" family of AI models.

How is SAM-Audio different from other audio separation tools?

Unlike traditional separation tools that only work with predefined categories (vocals, drums, bass), SAM-Audio can isolate any sound you describe. Want to extract just the "bird chirping in the background"? Just type it. This flexibility comes from its multimodal training on text, visual, and temporal prompts.

What audio formats does SAM-Audio support?

SAM-Audio supports common audio and video formats including WAV, MP3, MP4, and MOV files. For video files, it can use the visual content to help identify and separate audio sources.

Can I use SAM-Audio for commercial projects?

SAM-Audio is released under the SAM License. Check the GitHub repository for the full license terms and usage guidelines for your specific use case.

How accurate is SAM-Audio?

SAM-Audio achieves state-of-the-art results on speech, music, and sound effect separation benchmarks. However, it may struggle with highly similar sources (like distinguishing individual singers in a choir or specific instruments in an orchestra). For best results, use clear, specific prompts.

What are the known limitations?

SAM-Audio currently cannot use audio as a prompt input, perform fully blind source separation without prompts, or reliably distinguish between very similar sound sources. It works best with clear, distinct audio events that can be described textually or identified visually.

Is there a free way to try SAM-Audio?

Yes! You can try SAM-Audio for free at TwoShot. Upload your audio, describe what you want to extract, and get results instantly without any setup or installation.

Ready to Isolate Any Sound?

Try SAM-Audio now. No installation required.
