The field of spatial audio generation has reached a new pinnacle with the introduction of OmniAudio technology by the Tongyi Lab’s speech team. This groundbreaking innovation allows for the direct generation of FOA (First-order Ambisonics) audio from 360° videos, opening up unprecedented possibilities for virtual reality and immersive entertainment experiences.

Spatial audio, which simulates real-world auditory environments, is crucial for immersive experiences. Existing video-to-audio generation techniques, however, have predominantly produced non-spatial audio and have relied on fixed-perspective videos, so they neither meet the 3D sound-localization demands of immersive applications nor exploit the rich spatial and visual context that 360° panoramic videos provide. With the proliferation of 360° cameras and the advancement of virtual reality technology, generating spatial audio that matches panoramic videos has become an urgent problem to address.
To tackle this challenge, Tongyi Lab proposed the 360V2SA (360-degree Video to Spatial Audio) task. Its target output is FOA (First-order Ambisonics), a standard 3D spatial audio format that represents sound using four channels (W, X, Y, Z), capturing the directionality of sound and enabling true 3D audio reproduction while keeping sound localization accurate as the listener's head rotates.
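To make the four-channel format concrete, the minimal sketch below (not taken from the OmniAudio codebase) encodes a mono source at a given azimuth and elevation into W/X/Y/Z channels using standard first-order Ambisonics panning gains; the exact channel ordering and normalization convention the model uses internally is an assumption here.

```python
import numpy as np

def encode_foa(mono: np.ndarray, azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Encode a mono signal into four FOA channels (W, X, Y, Z).

    Standard first-order Ambisonics panning; ordering/normalization are
    illustrative and may differ from OmniAudio's internal convention.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    w = mono / np.sqrt(2.0)              # omnidirectional component
    x = mono * np.cos(az) * np.cos(el)   # front-back axis
    y = mono * np.sin(az) * np.cos(el)   # left-right axis
    z = mono * np.sin(el)                # up-down axis
    return np.stack([w, x, y, z], axis=0)  # shape: (4, num_samples)

# Example: a 1 kHz tone placed 45 degrees to the left, slightly above the listener
sr = 16000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 1000 * t), azimuth_deg=45.0, elevation_deg=10.0)
```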
Data is the cornerstone of machine learning models, but paired 360° video and spatial audio data is scarce. To address this, the research team meticulously constructed the Sphere360 dataset, which includes over 103,000 real-world video clips covering 288 audio events and totaling 288 hours. The dataset pairs 360° visual content with FOA audio. During construction, the team applied strict selection and cleaning criteria, using multiple algorithms to ensure high-quality audio-visual alignment.
The OmniAudio training method is divided into two phases. The first phase is self-supervised coarse-to-fine flow-matching pre-training. In the coarse stage, the team leveraged large-scale non-spatial audio resources, converting stereo audio into a "pseudo-FOA" format and feeding it into a four-channel VAE encoder to obtain latent representations. Random time-window masking was then applied to the latent sequences with a certain probability, and both the masked and complete sequences served as conditions for the flow-matching model. This yields self-supervised learning of audio timing and structure, enabling the model to grasp general audio features and macroscopic temporal patterns. In the subsequent fine stage, the team used only real FOA audio data under the same masked flow-matching framework, strengthening the model's representation of sound-source directionality and its reconstruction of high-fidelity spatial audio detail.

The second phase is supervised fine-tuning based on a dual-branch video representation. After self-supervised pre-training, the model is combined with a dual-branch video encoder and fine-tuned to "sculpt" FOA latent trajectories from noise that align with visual cues, outputting four-channel spatial audio that is tightly aligned with the 360° video and has precise directionality.
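As a rough illustration of the first phase's data preparation, the sketch below shows one plausible way to build pseudo-FOA audio from stereo and to mask random time windows of a latent sequence; the exact mid/side mapping, masking probability, and window sizes are assumptions, since the article does not specify them.

```python
import numpy as np

def stereo_to_pseudo_foa(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Convert a stereo clip into a rough four-channel "pseudo-FOA" signal.

    Illustrative approximation (not necessarily OmniAudio's exact mapping):
    the mid signal becomes the omnidirectional W channel, the side signal
    becomes the left-right Y channel, and the front-back (X) and up-down (Z)
    channels, which stereo cannot capture, are left at zero.
    """
    w = (left + right) / np.sqrt(2.0)   # mid / omnidirectional
    y = (left - right) / np.sqrt(2.0)   # side / left-right
    x = np.zeros_like(left)             # front-back: unknown from stereo
    z = np.zeros_like(left)             # up-down: unknown from stereo
    return np.stack([w, x, y, z], axis=0)

def mask_time_windows(latents: np.ndarray, mask_prob: float = 0.5,
                      max_window: int = 32, rng=None) -> np.ndarray:
    """Randomly zero out contiguous time windows of a latent sequence.

    `latents` has shape (channels, time). Window length and probability are
    illustrative hyperparameters, not values reported for OmniAudio.
    """
    rng = rng or np.random.default_rng()
    masked = latents.copy()
    t, T = 0, latents.shape[-1]
    while t < T:
        window = int(rng.integers(1, max_window + 1))
        if rng.random() < mask_prob:
            masked[:, t:t + window] = 0.0
        t += window
    return masked
```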
In the experiments, the research team conducted supervised fine-tuning and evaluation on the Sphere360-Bench and YT360-Test datasets, using both objective and subjective metrics to measure the quality of the generated audio. Results show that OmniAudio significantly outperforms all baselines on both test sets: on YT360-Test it substantially reduces the FD, KL, and ΔAngular metrics, and on Sphere360-Bench it likewise achieves excellent scores. In human subjective evaluations, OmniAudio scored much higher than the best baseline on spatial audio quality and audio-visual alignment, demonstrating the superiority of its synthesized results in clarity, spatiality, and synchronization with the visuals. In addition, ablation studies confirmed the contributions of the pre-training strategy, the dual-branch design, and model scale to the performance gains.
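For intuition on the spatial metric, ΔAngular compares the direction of the generated audio against the reference. One common way to estimate a dominant direction from FOA is the correlation between the W channel and the X/Y/Z channels (an intensity-vector-style estimate); the sketch below uses that heuristic and is not the paper's exact metric implementation.

```python
import numpy as np

def foa_direction(foa: np.ndarray) -> np.ndarray:
    """Estimate a dominant sound direction from an FOA clip of shape (4, time).

    Intensity-vector-style heuristic: correlate W with X, Y, Z. This is a
    common estimator, not necessarily the one behind the paper's metric.
    """
    w, x, y, z = foa
    v = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def delta_angular(generated: np.ndarray, reference: np.ndarray) -> float:
    """Angular error in degrees between the estimated directions of two FOA clips."""
    cos_sim = np.clip(np.dot(foa_direction(generated), foa_direction(reference)), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_sim)))
```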
Project Homepage: https://omniaudio-360v2sa.github.io/
Open-source code and data repository: https://github.com/liuhuadai/OmniAudio
Paper: https://arxiv.org/abs/2504.14906