UAT: Unified Audio-Text Diffusion for Audio Generation, Editing, and Captioning

A diffusion-centric framework that extends a pre-trained text-to-audio backbone to unified audio generation, editing, and captioning.

Continuous audio latent diffusion · Masked discrete text diffusion · Coupled dual-stream modeling

Overview

UAT unifies audio generation and understanding in a single diffusion framework. It couples continuous latent diffusion for acoustic synthesis with masked discrete diffusion for text prediction, enabling bidirectional audio-text modeling without relying on autoregressive audio tokens.
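
The coupling of the two diffusion processes can be illustrated with a toy forward (corruption) step. This is a minimal sketch, not UAT's implementation: the variance-preserving noise rule, the independent per-stream noise levels `t_audio` and `t_text`, the `[MASK]` token, and all function names are assumptions introduced here for illustration.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the UAT source


def noise_audio_latent(latent, t, rng):
    """Continuous corruption: x_t = sqrt(1 - t) * x_0 + sqrt(t) * eps, eps ~ N(0, 1)."""
    keep, add = math.sqrt(1.0 - t), math.sqrt(t)
    return [keep * x + add * rng.gauss(0.0, 1.0) for x in latent]


def mask_text_tokens(tokens, t, rng):
    """Discrete corruption: each token is independently replaced by MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def corrupt_pair(latent, tokens, t_audio, t_text, seed=0):
    """Corrupt both streams; separate noise levels per stream are an assumption."""
    rng = random.Random(seed)
    return noise_audio_latent(latent, t_audio, rng), mask_text_tokens(tokens, t_text, rng)


# Captioning-like setting: clean audio latent, fully masked text.
audio, text = corrupt_pair([1.0, 2.0], ["a", "dog"], t_audio=0.0, t_text=1.0)

# Generation-like setting: fully noised audio latent, clean text prompt.
audio, text = corrupt_pair([1.0, 2.0], ["a", "dog"], t_audio=1.0, t_text=0.0)
```

Under these assumptions, the direction of the task is set by which stream is kept clean: a model trained to denoise both streams jointly could be conditioned on either one to predict the other.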

Figure: UAT model structure.
Audio Generation: Text prompts are rendered into audio through continuous latent diffusion.
Audio Editing: Existing audio can be modified according to text instructions within the same model family.
Audio Captioning: A lightweight text stream enables non-autoregressive text generation from audio.
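
To make "non-autoregressive text generation" concrete, the sketch below decodes a caption by iterative parallel unmasking: every position starts masked, and each step reveals the most confident predictions. The confidence-based reveal schedule, the `toy_predictor` stand-in (which ignores audio and returns a fixed caption), and all names are hypothetical; the source does not state which decoding schedule UAT uses.

```python
import math

MASK = "[MASK]"  # hypothetical mask token
TARGET = ["a", "dog", "barks", "twice"]  # fixed toy caption, for illustration only


def toy_predictor(tokens):
    """Stand-in for a trained, audio-conditioned denoiser.

    Returns one (token, confidence) pair per position; supports up to len(TARGET) positions.
    """
    return [(TARGET[i], 1.0 / (i + 1)) for i in range(len(tokens))]


def parallel_unmask(length, steps, predictor):
    """Fill a fully masked sequence in at most `steps` parallel refinement passes."""
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        preds = predictor(tokens)
        # Reveal an even share of the remaining masked positions, most confident first.
        n_reveal = max(1, math.ceil(len(masked) / (steps - step)))
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_reveal]:
            tokens[i] = preds[i][0]
    return tokens


caption = parallel_unmask(length=4, steps=2, predictor=toy_predictor)
print(" ".join(caption))
```

Unlike left-to-right decoding, the number of model calls here is set by `steps` rather than by the caption length, which is the usual motivation for masked parallel decoding.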

Audio Generation

Side-by-side text-to-audio outputs for the same prompts. The examples below compare UAT with reference audio and representative unified audio-text baselines.

Audio Editing

Text-guided audio editing demos. Each card shows the source recording, a baseline result, and our model's output for three operation types: adding new sounds, deleting existing sounds, and replacing one sound with another.