Recently, the Tongyi large model team released CoGenAV, an innovative speech recognition technology built around audio-visual synchronization that directly targets noise interference in speech recognition. Traditional speech recognition degrades sharply in noisy environments; CoGenAV instead learns the temporal alignment between audio, visual, and textual signals to build a more robust, general-purpose speech representation framework, systematically improving performance across multiple speech-centric tasks, including speech recognition (VSR/AVSR), speech reconstruction (AVSS/AVSE), and speech synchronization (ASD).
Technically, CoGenAV adopts a "contrastive-generative synchronization" strategy. In the feature extraction stage, the model uses a ResNet3D CNN to analyze the speaker's lip movements in the video, capturing the dynamic correlation between sound and lip shape, while a Transformer encoder extracts speech information from the audio and precisely aligns the features of the two modalities. Contrastive-generative synchronization training then strengthens the model's understanding through two complementary objectives. Contrastive synchronization applies sequence-to-sequence (Seq2Seq) contrastive learning to reinforce the frame-level correspondence between audio and video features, introducing a ReLU activation to filter out interfering frames. Generative synchronization leverages a pre-trained ASR model to align the audio-visual features with its acoustic-text representations, with a lightweight adaptation module designed to improve the efficiency of cross-modal fusion.
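To make the two training signals concrete, here is a minimal PyTorch sketch of the feature extractors and the frame-level contrastive synchronization loss. All module sizes, layer choices, and the exact ReLU gating scheme are illustrative assumptions based on the description above, not the released implementation (in particular, the real visual front-end is a full ResNet3D, not a single 3D convolution).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoGenAVSketch(nn.Module):
    """Illustrative sketch: 3D-CNN lip encoder + Transformer audio encoder.
    Dimensions and depths are placeholder assumptions."""

    def __init__(self, dim=256):
        super().__init__()
        # Visual front-end: 3D conv over (T, H, W) mouth crops; a stand-in
        # for the ResNet3D CNN described in the article.
        self.visual_frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space away
        )
        self.visual_proj = nn.Linear(64, dim)
        # Audio branch: Transformer encoder over per-frame acoustic features,
        # assumed here to be 80-dim log-mel frames at the video frame rate.
        self.audio_proj = nn.Linear(80, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, lips, mels):
        # lips: (B, 1, T, H, W) grayscale mouth crops; mels: (B, T, 80) aligned frames.
        v = self.visual_frontend(lips)                 # (B, 64, T, 1, 1)
        v = v.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        v = self.visual_proj(v)                        # (B, T, dim)
        a = self.audio_encoder(self.audio_proj(mels))  # (B, T, dim)
        return v, a

def seq2seq_contrastive_loss(v, a, tau=0.07):
    """Frame-level Seq2Seq contrastive synchronization. The ReLU zeroes out
    negatively correlated frame pairs, a rough stand-in for the paper's
    interference-frame filtering."""
    v = F.normalize(v, dim=-1)
    a = F.normalize(a, dim=-1)
    sim = torch.einsum("btd,bsd->bts", v, a)  # (B, T, T) cosine similarities
    logits = F.relu(sim) / tau                # gate interfering frames
    # Each video frame t should match its temporally aligned audio frame t.
    labels = torch.arange(v.size(1), device=v.device).expand(v.size(0), -1)
    return F.cross_entropy(logits.transpose(1, 2), labels)

# Usage with dummy tensors: a batch of 3-second clips at 25 fps.
lips = torch.randn(2, 1, 75, 88, 88)
mels = torch.randn(2, 75, 80)
model = CoGenAVSketch()
v, a = model(lips, mels)
seq2seq_contrastive_loss(v, a).backward()
```

The generative half of training would add a second loss that pushes these fused audio-visual features toward the acoustic-text representations of a pre-trained ASR model through a small adapter; it is omitted here because it depends on the specific ASR backbone.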
Thanks to these innovations, CoGenAV achieves breakthrough results on multiple benchmark datasets. In Visual Speech Recognition (VSR), trained on only 223 hours of lip-movement video, it reaches a 20.5% Word Error Rate (WER) on the LRS2 dataset, comparable to traditional models trained on thousands of hours of data. In Audio-Visual Speech Recognition (AVSR), combined with the Whisper Medium model, it achieves a 1.27% WER on the same dataset, setting a new SOTA record; in a 0 dB noise environment its performance improves by over 80%, significantly outperforming audio-only models. In audio-visual speech enhancement and separation (AVSE/AVSS), used as a visual feature extractor, it reaches an SDRi of 16.0 dB on the LRS2 speech separation task, surpassing AV-HuBERT by 1.6 dB and AV-SepFormer by 0.3 dB; on the speech enhancement task its SDRi is 9.0 dB, 1.6 dB better than AV-HuBERT. In Active Speaker Detection (ASD), it achieves 96.3% mean average precision (mAP) on the Talkies dataset, leading existing methods.
CoGenAV can be plugged directly into mainstream speech recognition models such as Whisper, without modification or fine-tuning, to add visual speech recognition capability, lowering the deployment barrier while demonstrating strong noise robustness and data efficiency. This significantly reduces training cost and improves the model's practicality and extensibility. The CoGenAV code and models have been open-sourced on GitHub, Hugging Face, and ModelScope, with the paper available on arXiv, for researchers and developers to use.
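As a rough illustration of why this plug-in integration is possible without modifying Whisper: openai-whisper's decoding path skips its audio encoder when handed a tensor whose shape already matches the encoder output, which is exactly the hook that encoder-space-aligned visual features can use. The visual feature tensor below is a random placeholder; a real pipeline would produce it from lip-crop video with the released CoGenAV extractor.

```python
import torch
import whisper  # pip install openai-whisper

model = whisper.load_model("medium")

# Placeholder for CoGenAV visual features already projected into Whisper's
# acoustic representation space. For the medium model that space is
# (n_audio_ctx=1500, n_audio_state=1024). Random values here, so the
# decoded text will be meaningless.
visual_feats = torch.randn(1, 1500, 1024)

# whisper.decode() detects the encoder-output shape and skips the audio
# encoder, letting aligned visual features stand in for audio.
options = whisper.DecodingOptions(language="en", fp16=False, without_timestamps=True)
result = whisper.decode(model, visual_feats, options)
print(result[0].text)
```

In practice, the transcript quality depends entirely on how well the visual features were aligned to Whisper's representation space during CoGenAV's generative synchronization training.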