
Breaking New Ground in Spatial Audio Generation with OmniAudio Technology


Tongyi Lab's speech team has recently introduced OmniAudio, a technology that generates First-order Ambisonics (FOA) audio directly from 360° videos. This milestone in spatial audio generation opens new possibilities for virtual reality and immersive entertainment.


Spatial audio, which simulates real auditory environments, is key to immersive experiences. Existing video-to-audio generation methods, however, mostly work from fixed-perspective videos and produce non-spatial audio: they discard the rich visual context and spatial information of 360° panoramic videos and cannot provide the 3D sound localization that immersion demands. With the proliferation of 360° cameras and the growth of virtual reality, generating spatial audio that matches panoramic video has become an urgent problem to solve.

To address this challenge, Tongyi Lab proposed the 360V2SA (360-degree Video to Spatial Audio) task: generating spatial audio in First-order Ambisonics (FOA) format from a 360° video. FOA is a standard 3D spatial audio format that represents sound with four channels (W, X, Y, Z); it captures the directionality of sound, enables true 3D audio reproduction, and keeps sound localization accurate as the listener turns their head.
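
To make the four-channel format concrete, the snippet below sketches how a mono source at a given azimuth and elevation can be encoded into FOA. It is a minimal illustration only: the gain convention (SN3D-style rather than FuMa), the sample rate, and the function name are assumptions, not details taken from the OmniAudio paper.

```python
import numpy as np


def encode_foa(mono: np.ndarray, azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Pan a mono signal to first-order Ambisonics, returned as a (4, T) array
    in the (W, X, Y, Z) channel order described above."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    w = mono                                # omnidirectional component
    x = mono * np.cos(az) * np.cos(el)      # front-back
    y = mono * np.sin(az) * np.cos(el)      # left-right
    z = mono * np.sin(el)                   # up-down
    return np.stack([w, x, y, z])


# Example: a 1 kHz tone placed 90 degrees to the listener's left, at ear level.
sr = 16_000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 1000 * t), azimuth_deg=90.0, elevation_deg=0.0)
print(foa.shape)  # (4, 16000)
```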

Data is the cornerstone of machine learning models, but paired 360° video and spatial audio data are scarce. The research team therefore meticulously constructed the Sphere360 dataset: over 103,000 real-world video clips covering 288 audio event types, with a total duration of 288 hours. The dataset pairs 360° visual content with FOA audio. During construction, the team applied strict screening and cleaning criteria and used multiple algorithms to ensure high-quality audio-visual alignment.

OmniAudio is trained in two stages. The first stage is self-supervised coarse-to-fine flow matching pre-training, which exploits large-scale non-spatial audio. Stereo audio is converted into a "pseudo-FOA" format and passed through a four-channel VAE encoder to obtain a latent representation. A random temporal window of the latent sequence is then masked with a certain probability, and the masked sequence conditions the flow matching model as it reconstructs the complete sequence. This self-supervised objective teaches the model audio temporality and structure, letting it capture general audio features and macro-temporal patterns.

The second stage is supervised fine-tuning built on a dual-branch video representation. Using only real FOA audio data, the team continues the masked flow matching framework to strengthen the model's representation of sound-source direction and improve the reconstruction of high-fidelity spatial detail. The pre-trained model is combined with a dual-branch video encoder, so that during fine-tuning it selectively "sculpts" FOA latent trajectories out of noise to match the visual cues, producing four-channel spatial audio that is tightly aligned with the 360° video and carries a precise sense of direction.
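
The core operations of the first stage, converting stereo audio into pseudo-FOA, masking a random temporal window of the VAE latent, and training with a flow matching objective, can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the stereo-to-pseudo-FOA mapping, the masking schedule, and the `TinyVelocityNet` stand-in for the conditional flow matching network are all hypothetical.

```python
import torch
import torch.nn as nn


def stereo_to_pseudo_foa(stereo: torch.Tensor) -> torch.Tensor:
    """Map a (B, 2, T) stereo waveform to a (B, 4, T) pseudo-FOA signal.

    One plausible mapping: mid/side into W/Y, with X and Z left empty,
    since stereo carries no front-back or height information. The exact
    conversion used by OmniAudio may differ.
    """
    left, right = stereo[:, 0], stereo[:, 1]
    w = (left + right) / 2.0      # omnidirectional (mid)
    y = (left - right) / 2.0      # left-right (side)
    x = torch.zeros_like(w)       # front-back: unknown from stereo
    z = torch.zeros_like(w)       # up-down: unknown from stereo
    return torch.stack([w, x, y, z], dim=1)


def mask_random_window(latent: torch.Tensor, max_frac: float = 0.5) -> torch.Tensor:
    """Zero out one random temporal window of a (B, C, T) latent sequence."""
    masked = latent.clone()
    B, _, T = latent.shape
    for b in range(B):
        width = int(torch.randint(1, max(2, int(max_frac * T)), (1,)))
        start = int(torch.randint(0, T - width + 1, (1,)))
        masked[b, :, start:start + width] = 0.0
    return masked


class TinyVelocityNet(nn.Module):
    """Toy stand-in for the conditional flow matching network (hypothetical)."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.net = nn.Conv1d(2 * channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, xt, t, cond):
        t_map = t.expand(-1, 1, xt.shape[-1])  # broadcast the timestep over frames
        return self.net(torch.cat([xt, cond, t_map], dim=1))


def flow_matching_loss(model: nn.Module, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Rectified-flow style objective: predict the velocity x1 - x0 at a random
    point on the straight path between Gaussian noise x0 and the clean latent x1."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1
    return torch.mean((model(xt, t, cond) - (x1 - x0)) ** 2)


# Toy usage: pretend the VAE latents have 8 channels and 64 frames.
latents = torch.randn(4, 8, 64)          # stands in for VAE(pseudo-FOA audio)
cond = mask_random_window(latents)       # masked sequence used as conditioning
net = TinyVelocityNet(channels=8)
loss = flow_matching_loss(net, latents, cond)
loss.backward()
```

In the second stage, the same kind of objective would be driven by conditioning from the dual-branch video encoder rather than by masked audio latents alone, but that encoder is well beyond the scope of this sketch.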

After supervised fine-tuning, the research team evaluated OmniAudio on the Sphere360-Bench and YT360-Test test sets, using both objective and subjective metrics to measure the quality of the generated audio. OmniAudio significantly outperforms all baselines on both sets: on YT360-Test it substantially reduces the FD, KL, and ΔAngular metrics, and it achieves similarly strong results on Sphere360-Bench. In human evaluations, OmniAudio scores well above the best baseline on spatial audio quality and audio-visual alignment, indicating that its output is clearer, more spatial, and better synchronized with the visuals. Ablation studies further confirm the contributions of the pre-training strategy, the dual-branch design, and model scale.
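
The ΔAngular metric compares the spatial direction of the generated sound field with that of the reference. As a rough illustration of how such an angular error can be computed from FOA signals, the sketch below estimates a dominant source direction from the time-averaged acoustic intensity vector W·(X, Y, Z) and measures the angle between the two estimates; the actual metric in the paper may aggregate over time or frequency differently, and the function names here are illustrative.

```python
import numpy as np


def foa_direction(foa: np.ndarray) -> np.ndarray:
    """Estimate a dominant source direction (unit vector) from (4, T) FOA audio
    via the time-averaged acoustic intensity vector W * (X, Y, Z)."""
    w, x, y, z = foa
    intensity = np.array([(w * x).mean(), (w * y).mean(), (w * z).mean()])
    norm = np.linalg.norm(intensity)
    return intensity / norm if norm > 0 else intensity


def angular_error_deg(foa_generated: np.ndarray, foa_reference: np.ndarray) -> float:
    """Angle, in degrees, between the estimated directions of two FOA clips."""
    u, v = foa_direction(foa_generated), foa_direction(foa_reference)
    return float(np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))))


# Toy usage with synthetic signals standing in for generated and reference clips.
ref = np.random.randn(4, 16_000)
gen = ref + 0.1 * np.random.randn(4, 16_000)
print(f"angular error: {angular_error_deg(gen, ref):.1f} deg")
```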

Project Homepage: https://omniaudio-360v2sa.github.io/

Code and Data Open Source Repository: https://github.com/liuhuadai/OmniAudio

Paper Address: https://arxiv.org/abs/2504.14906
