UAT: Unified Audio-Text Diffusion for Audio Generation, Editing, and Captioning

Project Overview

Overview

UAT unifies audio generation and understanding in a single diffusion framework. It couples continuous latent diffusion for acoustic synthesis with masked discrete diffusion for text prediction, enabling bidirectional audio-text modeling without relying on autoregressive audio tokens.

Audio Generation: Text prompts are rendered into audio through continuous latent diffusion.

Audio Editing: Existing audio can be modified according to text instructions within the same model family.

Audio Captioning: A lightweight text stream enables non-autoregressive text generation from audio.
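
To make the pairing of the two diffusion processes concrete, here is a minimal sketch of how an audio latent and a caption might each be corrupted at a shared noise level t. It is an illustration under stated assumptions, not UAT's implementation: the cosine schedule for the latent, the linear masking rate for the text, the `[MASK]` token, and the function names `noise_latent` and `mask_tokens` are all hypothetical choices.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token; the real vocabulary is not specified here


def noise_latent(z0, t, rng):
    """Continuous forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps.

    Assumes a cosine schedule a = cos^2(pi * t / 2) with t in [0, 1].
    """
    alpha_bar = math.cos(0.5 * math.pi * t) ** 2
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [signal * z + noise * rng.gauss(0.0, 1.0) for z in z0]


def mask_tokens(tokens, t, rng):
    """Discrete forward step: mask each token independently with probability t.

    Assumes a linear masking schedule (an illustrative choice).
    """
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
latent = [0.3, -1.2, 0.8, 0.1]               # toy audio latent
caption = ["a", "dog", "barks", "outside"]   # toy caption tokens

noisy_latent = noise_latent(latent, 0.5, rng)
masked_caption = mask_tokens(caption, 0.5, rng)
```

In this sketch a denoiser would be trained to recover the clean latent from `noisy_latent` and the original tokens from `masked_caption`; conditioning one stream on the other is what would allow generation in either direction.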

Demo Section 1

Audio Generation

Side-by-side text-to-audio outputs for the same prompts. The examples below compare UAT with reference audio and representative unified audio-text baselines.

Demo Section 2

Audio Editing

Text-guided audio editing demos. Each card shows the source recording, a baseline result, and UAT's output. The demos cover three operation types: adding new sounds, deleting existing sounds, and replacing one sound with another.

Demo Section 3

Audio Captioning

Non-autoregressive audio captioning results. Each card presents the audio clip, a human reference caption, and the captions predicted by UAT and a representative baseline.
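
For readers unfamiliar with non-autoregressive decoding, the sketch below shows one common way a masked discrete diffusion model can produce a caption: start from a fully masked sequence and commit the most confident predictions in parallel over a few steps. This is a generic confidence-based unmasking loop, not necessarily the sampler UAT uses. The names `decode` and `toy_predictor`, the `[MASK]` token, and the commit schedule are hypothetical, and the predictor is a stand-in for an audio-conditioned denoiser.

```python
import math

MASK = "[MASK]"  # hypothetical mask token


def toy_predictor(tokens, target):
    """Stand-in for an audio-conditioned denoiser (hypothetical).

    For each masked position, returns (proposed token, confidence).
    Here it simply proposes the target with confidence falling off by position.
    """
    return {i: (target[i], 1.0 / (1 + i))
            for i, tok in enumerate(tokens) if tok == MASK}


def decode(length, steps, predict):
    """Iterative parallel unmasking: commit the most confident slots each step."""
    tokens = [MASK] * length
    for step in range(steps):
        proposals = predict(tokens)
        if not proposals:
            break  # nothing left to fill
        # Spread the remaining masked slots evenly over the remaining steps.
        n_commit = math.ceil(len(proposals) / (steps - step))
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


target = ["a", "dog", "barks", "in", "rain"]
caption = decode(len(target), steps=3,
                 predict=lambda toks: toy_predictor(toks, target))
```

Because several tokens are committed per step, the number of model calls is set by `steps` rather than by the caption length, which is the practical difference from token-by-token autoregressive decoding.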