Website Hosting Directory (ChinaZ.com) June 17 News: A research team from the Chinese University of Hong Kong, the Chinese Academy of Sciences, and other institutions has proposed a full-modal pre-training paradigm called MiCo (Multimodal Context). The method has delivered strong results in multimodal learning, setting new state-of-the-art (SOTA) records on 37 benchmarks.

Key Features:

  • Full-Modal Understanding: MiCo aims to build a full-modal intelligence capable of understanding any modality and learning universal representations.

  • Large-Scale Pre-Training: By scaling up the number of modalities, the volume of data, and the model parameters, MiCo simulates the human brain's multimodal cognitive processes during pre-training.

  • Neural Network Architecture Design: MiCo divides modalities into "knowledge modalities" and "interface modalities", designs a corresponding full-modal learning architecture, and aligns the two groups through generative inference methods (a minimal sketch of this split follows the list).

  • Multimodal Context and Scaling Laws: MiCo leverages multimodal context so that modalities reinforce one another, building cross-modal contextual relationships.
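
To make the knowledge/interface split concrete, here is a minimal PyTorch sketch of one plausible realization: a shared ViT-style transformer encodes the "knowledge" modalities, while a separate text encoder serves as the "interface". All class names, tokenizer shapes, and layer counts are illustrative assumptions, not MiCo's published architecture.

```python
# Illustrative sketch only: names and dimensions are assumptions, not MiCo's API.
import torch
import torch.nn as nn

class OmniModalEncoder(nn.Module):
    """Hypothetical shared ViT-style encoder for the 'knowledge' modalities
    (video frames, audio spectrograms, depth maps, normal maps). Each modality
    gets its own lightweight tokenizer; the transformer backbone is shared."""
    def __init__(self, dim=768, depth=4, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.tokenizers = nn.ModuleDict({
            "video":  nn.Linear(3 * 16 * 16, dim),  # flattened RGB patches
            "audio":  nn.Linear(1 * 16 * 16, dim),  # spectrogram patches
            "depth":  nn.Linear(1 * 16 * 16, dim),
            "normal": nn.Linear(3 * 16 * 16, dim),
        })

    def forward(self, patches, modality):
        tokens = self.tokenizers[modality](patches)  # (B, N, dim)
        feats = self.backbone(tokens)                # same weights for every modality
        return feats.mean(dim=1)                     # pooled clip-level embedding

class TextEncoder(nn.Module):
    """Hypothetical 'interface' encoder: text anchors the shared semantic space."""
    def __init__(self, vocab=32000, dim=768, depth=2, heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, token_ids):
        return self.backbone(self.embed(token_ids)).mean(dim=1)

# Toy usage: a batch of 2 clips with 196 patches per modality.
enc, txt = OmniModalEncoder(), TextEncoder()
video_emb = enc(torch.randn(2, 196, 3 * 16 * 16), "video")  # (2, 768)
text_emb = txt(torch.randint(0, 32000, (2, 32)))            # (2, 768)
```

One consequence of this kind of split is extensibility: supporting a new knowledge modality only requires a new tokenizer, since the transformer backbone is shared.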

Experimental Results:

  • Across 10 single-modality perception benchmarks spanning different modalities, MiCo achieved 7 SOTA results.

  • Across 25 cross-modal understanding tasks, covering retrieval, question answering, and captioning, MiCo secured 20 SOTA results.

  • Across 18 multimodal large language model benchmarks, MiCo achieved 10 SOTA results.

MiCo's Pre-Training Method:

The team jointly pre-trained on video paired with its corresponding audio, text descriptions, depth maps, and normal maps, simulating the human brain's visual, auditory, and spatiotemporal perception.

A full-modal encoder (such as a ViT) extracts the multimodal features, and a text encoder extracts the text features; aligning the two builds the multimodal contextual relationships, as sketched below.
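
The article does not specify the training objective, so the self-contained sketch below uses a CLIP-style symmetric contrastive (InfoNCE) loss as one plausible way to align each knowledge modality with its paired text description. The placeholder linear encoders stand in for the full-modal and text encoders; both they and the loss choice are assumptions for illustration, not MiCo's confirmed method.

```python
# Self-contained sketch of cross-modal contextual alignment with a CLIP-style
# symmetric InfoNCE loss. Encoders and loss are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 256
# Placeholder per-modality encoders; in practice a single shared full-modal
# encoder (e.g., a ViT) would produce these embeddings.
encoders = {
    "video":  torch.nn.Linear(512, dim),
    "audio":  torch.nn.Linear(128, dim),
    "depth":  torch.nn.Linear(512, dim),
    "normal": torch.nn.Linear(512, dim),
}
text_encoder = torch.nn.Linear(300, dim)  # placeholder text encoder

def alignment_loss(batch, text_feats, temperature=0.07):
    """Align every knowledge modality with the paired text description.
    batch maps modality -> (B, feat_dim) raw features; text_feats is (B, 300).
    Row i of every tensor comes from the same underlying video clip."""
    t = F.normalize(text_encoder(text_feats), dim=-1)       # (B, dim)
    total = 0.0
    for modality, feats in batch.items():
        m = F.normalize(encoders[modality](feats), dim=-1)  # (B, dim)
        logits = m @ t.t() / temperature                    # (B, B) similarities
        targets = torch.arange(logits.size(0))              # diagonal = true pairs
        # Symmetric loss: modality->text and text->modality directions.
        total = total + 0.5 * (F.cross_entropy(logits, targets)
                               + F.cross_entropy(logits.t(), targets))
    return total / len(batch)

B = 4  # four paired (video, audio, depth, normal, caption) samples
batch = {
    "video":  torch.randn(B, 512),
    "audio":  torch.randn(B, 128),
    "depth":  torch.randn(B, 512),
    "normal": torch.randn(B, 512),
}
print(alignment_loss(batch, torch.randn(B, 300)).item())
```

Because every modality is pulled toward the same text anchor, the modalities also become mutually aligned, which is consistent with the "mutual reinforcement" described above.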

Conclusion and Future Work:

The MiCo project represents a significant attempt to replicate the human brain's multimodal cognition in artificial intelligence. The team anticipates that it will inspire future research and the development of more powerful full-modal foundation models.

Future plans include incorporating additional modalities, such as optical flow, IMU data, and event data, to further strengthen full-modal joint pre-training.