Enabling AI to retrieve and reason over key information in visually rich content such as images, tables, and design documents has long been a challenge. Traditional Retrieval-Augmented Generation (RAG) methods are text-centric and struggle with rich visual information because they cannot process content such as images and charts. Moreover, existing visual RAG methods are constrained by a fixed retrieval-generation pipeline, which limits how much crucial knowledge they can extract from visual sources.
Introducing VRAG-RL, a visual perception-driven multimodal RAG inference framework developed by the natural language intelligence team at Tongyi Labs. The framework addresses these challenges through three key innovations: reinforcement learning-based training of a multimodal agent, a purpose-built visual perception mechanism, and joint optimization of retrieval and reasoning.
VRAG-RL introduces a set of visual perception actions, such as region selection, cropping, and scaling, that let the model focus on information-dense areas at progressively finer granularity and accurately extract key visual information. This coarse-to-fine perception approach not only deepens the model's understanding of visual content but also significantly improves retrieval efficiency.
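To make the coarse-to-fine idea concrete, the sketch below shows how a region-selection-and-crop action might be applied to a retrieved page image. The function name, the normalized bounding-box format, and the upscaling step are illustrative assumptions, not VRAG-RL's actual API.

```python
from PIL import Image

def crop_and_zoom(page_image: Image.Image, bbox, upscale: float = 2.0) -> Image.Image:
    """Illustrative perception action: crop a model-selected region and enlarge it.

    bbox is assumed to be normalized (x0, y0, x1, y1) coordinates in [0, 1],
    e.g. emitted by the policy as part of a region-selection action.
    """
    w, h = page_image.size
    x0, y0, x1, y1 = bbox
    region = page_image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    # Enlarging the crop gives the model a higher effective resolution on the
    # information-dense area (e.g. a table cell or a chart legend).
    new_size = (max(1, int(region.width * upscale)), max(1, int(region.height * upscale)))
    return region.resize(new_size, Image.BICUBIC)

# Example: zoom into the lower-right quadrant of a retrieved document page.
# page = Image.open("retrieved_page.png")
# detail = crop_and_zoom(page, (0.5, 0.5, 1.0, 1.0))
```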
During training, VRAG-RL employs a multi-expert sampling strategy that combines the reasoning capabilities of large models with the precise annotation capabilities of expert models, allowing the agent to learn more effective visual perception strategies. Furthermore, its fine-grained reward mechanism integrates factors such as retrieval efficiency, pattern consistency, and generation quality, guiding the model to continuously optimize its retrieval and reasoning paths while interacting with the search engine. This multi-dimensional reward drives retrieval and reasoning in both directions, forming a closed optimization loop.
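As a rough illustration of how such a multi-term reward could be combined, the sketch below mixes a retrieval-efficiency term, a pattern-consistency term, and an answer-quality term with fixed weights. The term names, weights, and normalization are assumptions for illustration; the framework's exact formulation may differ.

```python
def composite_reward(retrieval_efficiency: float,
                     pattern_consistency: float,
                     answer_quality: float,
                     weights=(0.2, 0.2, 0.6)) -> float:
    """Illustrative fine-grained reward: a weighted sum of three signals,
    each assumed to be pre-normalized to [0, 1].

    - retrieval_efficiency: fewer / more relevant search calls score higher
    - pattern_consistency: the trajectory follows the expected action format
    - answer_quality: agreement of the final answer with the reference
    """
    w_r, w_p, w_a = weights
    return w_r * retrieval_efficiency + w_p * pattern_consistency + w_a * answer_quality

# Example: an efficient, well-formatted trajectory with a correct answer.
# reward = composite_reward(0.9, 1.0, 1.0)  # -> 0.98
```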
VRAG-RL also adopts the industry-leading GRPO algorithm and simulates real-world application scenarios by deploying a local search engine during training, making search-engine calls cost-free and model training more efficient. This training approach not only strengthens the model's generalization capability but also enables it to perform well across different domains and types of visual tasks.
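For context, GRPO (Group Relative Policy Optimization) estimates advantages by normalizing each sampled trajectory's reward against the other trajectories in its group, avoiding a separate value network. The snippet below is a minimal sketch of that group-relative normalization, not VRAG-RL's training code.

```python
import numpy as np

def group_relative_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage estimate: standardize rewards within a group of
    rollouts sampled for the same query, so no learned critic is required."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: 4 rollouts for the same question, scored by the composite reward above.
# advantages = group_relative_advantages(np.array([0.98, 0.40, 0.75, 0.10]))
```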
Experimental results show that VRAG-RL outperforms existing methods on multiple visual-language benchmarks covering a range of visually rich scenarios, from single-hop to multi-hop reasoning and from plain text understanding to chart recognition and complex layout parsing. It delivers stronger overall performance than both traditional prompt-based methods and reinforcement learning-based methods.
In addition, VRAG-RL supports multi-round interaction: during reasoning, it progressively narrows in on information-dense areas, acquiring information from coarse to fine. By optimizing both retrieval efficiency and reasoning paths, the method significantly improves performance on visual tasks while remaining highly efficient.
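To illustrate the multi-round, coarse-to-fine flow, the loop below alternates between querying a search engine and applying perception actions until the agent emits an answer. The action names (`search`, `crop`, `answer`), the `agent.step` interface, and the search-engine helper are hypothetical placeholders rather than the framework's real interface; it reuses the `crop_and_zoom` sketch from above.

```python
def run_episode(agent, search_engine, question: str, max_turns: int = 8) -> str:
    """Illustrative multi-round loop: retrieve coarsely, then zoom into
    information-dense regions before answering.

    `agent.step` is assumed to return (action_type, action_args) given the
    dialogue context; all helper names here are placeholders.
    """
    context = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        action, args = agent.step(context)
        if action == "search":
            pages = search_engine.retrieve(args["query"], top_k=3)
            context.append({"role": "tool", "content": pages})
        elif action == "crop":
            # Coarse-to-fine: zoom into a region of a previously retrieved page
            # using the crop_and_zoom sketch defined earlier.
            detail = crop_and_zoom(args["page"], args["bbox"])
            context.append({"role": "tool", "content": detail})
        elif action == "answer":
            return args["text"]
    return ""  # reached the turn limit without answering
```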
Discover VRAG-RL on GitHub at github.com/Alibaba-NLP/VRAG. Embrace the future of AI with this cutting-edge multimodal RAG inference framework.