The international machine learning community has been buzzing with excitement following the recent unveiling of Orthus, a unified multimodal understanding and generation model developed jointly by Kuaishou and Shanghai Jiao Tong University. Built on an autoregressive Transformer backbone with self-attention, Orthus can transition seamlessly between text and images, marking a significant step forward in AI-generated content.
Orthus stands out for its computational efficiency and strong learning capability. Research shows that, with modest computational resources, Orthus outperforms existing unified understanding-and-generation models such as Chameleon and Show-o on multiple image understanding benchmarks. On the GenEval benchmark for text-to-image generation, Orthus even surpasses SDXL, a diffusion model designed specifically for that task.
Beyond managing the interplay between text and images, Orthus shows strong potential in applications such as image editing and webpage generation. Its architecture pairs a self-attention Transformer backbone with modality-specific generation heads for text and images. This design decouples the modeling of image details from the expression of text features, allowing the backbone to focus on the complex relationships between the two modalities.
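To make the decoupling concrete, here is a minimal PyTorch sketch of a shared self-attention backbone feeding two modality-specific heads. The class name, dimensions, and the choice of simple linear heads are illustrative assumptions for this article, not the actual Orthus implementation.

```python
import torch
import torch.nn as nn

class BackboneWithModalityHeads(nn.Module):
    """Shared Transformer backbone with per-modality generation heads (sketch)."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8,
                 text_vocab=32000, image_feat_dim=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Decoupled heads: discrete logits for text tokens,
        # continuous features for image patches.
        self.text_head = nn.Linear(d_model, text_vocab)
        self.image_head = nn.Linear(d_model, image_feat_dim)

    def forward(self, hidden, modality):
        # hidden: (batch, seq, d_model) fused text+image embeddings.
        causal = nn.Transformer.generate_square_subsequent_mask(hidden.size(1))
        h = self.backbone(hidden, mask=causal)
        if modality == "text":
            return self.text_head(h)   # next-token logits
        return self.image_head(h)      # next continuous image feature

model = BackboneWithModalityHeads()
fused = torch.randn(2, 10, 512)        # toy fused embeddings
print(model(fused, "text").shape)      # torch.Size([2, 10, 32000])
```

Keeping both heads outside the backbone is what lets the shared Transformer concentrate on cross-modal dependencies while each head handles its modality's output format.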
Orthus is composed of several core components: a text tokenizer, a visual autoencoder, and two modality-specific embedding modules. By fusing text and image features into a unified representation space, the backbone network can model inter-modality dependencies efficiently. During inference, the model autoregressively generates the next text token or image feature, switching between the two modes when it encounters special transition markers, which gives it remarkable flexibility.
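The marker-driven inference described above can be illustrated with a toy decoding loop, reusing the BackboneWithModalityHeads sketch from before. The BOI_TOKEN id, the greedy decoding, and the embedding modules are hypothetical placeholders rather than Orthus's real interface.

```python
import torch
import torch.nn as nn

BOI_TOKEN = 1  # hypothetical <begin_of_image> marker id

def generate(model, embed_text, embed_image, prompt_ids, max_steps=8):
    # Start from the prompt, embedded into the shared representation space.
    hidden = embed_text(prompt_ids)                 # (1, seq, d_model)
    in_image_mode, outputs = False, []
    for _ in range(max_steps):
        if in_image_mode:
            # Continuous image feature from the image head; a full system
            # would pass accumulated features to the visual decoder and
            # switch back to text on an <end_of_image> marker.
            feat = model(hidden, "image")[:, -1:]
            outputs.append(("image", feat))
            hidden = torch.cat([hidden, embed_image(feat)], dim=1)
        else:
            logits = model(hidden, "text")[:, -1]
            tok = logits.argmax(dim=-1, keepdim=True)  # greedy, for brevity
            outputs.append(("text", tok))
            if tok.item() == BOI_TOKEN:                # marker flips modality
                in_image_mode = True
            hidden = torch.cat([hidden, embed_text(tok)], dim=1)
    return outputs

# Toy usage with placeholder embedding modules.
model = BackboneWithModalityHeads()
embed_text = nn.Embedding(32000, 512)    # text-side embedding module
embed_image = nn.Linear(16, 512)         # image-side embedding module
out = generate(model, embed_text, embed_image, torch.tensor([[5, 7, 9]]))
```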
Through these designs, Orthus bridges the gap between end-to-end diffusion modeling and autoregressive mechanisms while minimizing the information loss caused by image discretization. The model can be seen as a successful extension of Kaiming He's MAR work from image generation to the multimodal realm.
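For intuition on how continuous features sidestep discretization loss, here is a toy diffusion-style loss on continuous image features, in the spirit of MAR's diffusion loss. The tiny MLP denoiser and the noise schedule are simplified assumptions, not Orthus's actual image head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionHead(nn.Module):
    """Tiny MLP denoiser conditioned on the backbone's hidden state (sketch)."""
    def __init__(self, feat_dim=16, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 1 + cond_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, noisy_feat, t, cond):
        # Predict the noise that was mixed into the continuous feature.
        return self.net(torch.cat([noisy_feat, t, cond], dim=-1))

def diffusion_loss(head, feat, cond):
    # Sample a timestep and noise, corrupt the feature, regress the noise.
    t = torch.rand(feat.size(0), 1)
    noise = torch.randn_like(feat)
    noisy = torch.sqrt(1 - t) * feat + torch.sqrt(t) * noise  # toy schedule
    return F.mse_loss(head(noisy, t, cond), noise)

head = DiffusionHead()
loss = diffusion_loss(head, torch.randn(4, 16), torch.randn(4, 512))
```

Because the head regresses real-valued features rather than picking from a fixed codebook, no quantization step ever throws image information away.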
The collaboration between Kuaishou and Shanghai Jiao Tong University has opened up new possibilities for the development of multimodal generative models, drawing attention and anticipation from both industry and academia.