Researchers at Meta AI, working with academic partners, have developed MILS (Multimodal Iterative LLM Solver), a system that lets large language models handle images, video, and audio without any specialized training. Rather than depending on extensive multimodal training data, MILS leverages the natural problem-solving capabilities of language models.


MILS works by pairing two AI models: a "generator" that proposes solutions and a "scorer" that evaluates how good each proposal is. Feedback from the scorer guides the generator to keep refining its answers until a satisfactory result is reached. In image captioning, for example, MILS progressively refines its descriptions until they capture the image's details at every level.
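Here is a minimal, runnable sketch of this generator-scorer loop, with the structure inferred from the description above. Both models are replaced with toy stand-ins; in MILS the generator would be an LLM and the scorer a model like CLIP:

```python
import random

# Toy stand-ins for the two models; names and structure are illustrative.
CANDIDATE_POOL = ["a dog", "a dog on grass", "a brown dog running on green grass"]

def generate_candidates(feedback, n=4):
    """Stand-in generator: in MILS this would prompt an LLM,
    conditioning on the top-scored candidates from the previous round."""
    return random.choices(CANDIDATE_POOL, k=n)

def score(candidate):
    """Stand-in scorer: in MILS this would be e.g. CLIP similarity
    between a candidate caption and the target image."""
    return len(candidate)  # toy heuristic: longer caption = more detail

def mils_loop(steps=5, keep=2):
    feedback, best = [], ("", float("-inf"))
    for _ in range(steps):
        candidates = generate_candidates(feedback)
        scored = sorted(((c, score(c)) for c in candidates),
                        key=lambda x: x[1], reverse=True)
        feedback = scored[:keep]  # scorer feedback steers the next round
        if scored[0][1] > best[1]:
            best = scored[0]
    return best[0]

print(mils_loop())
```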

MILS is particularly strong at image description. Using the Llama-3.1-8B model as the generator and the CLIP model as the scorer, MILS produces image descriptions comparable to, or even more detailed than, current leading methods, even though CLIP was never trained specifically for image description. MILS also improves text-to-image generation by iteratively refining text prompts, and it can combine AI-generated prompts with image processing tools for tasks such as style transfer in image editing.
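To illustrate the scorer side, here is a minimal sketch of ranking candidate captions against an image with CLIP via the Hugging Face transformers API. The checkpoint and file path are illustrative, and this is not necessarily the exact scoring setup used in the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # illustrative path
captions = ["a dog", "a dog on grass", "a brown dog running on green grass"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one image-text similarity per caption;
# softmax turns them into relative preferences for ranking.
scores = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, s in sorted(zip(captions, scores.tolist()), key=lambda x: -x[1]):
    print(f"{s:.3f}  {caption}")
```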


Caption accuracy increases with the number of refinement steps between the generator and the scorer. | Image: Ashutosh et al.

MILS is not limited to images; it extends to video and audio as well. Tested on the MSR-VTT video dataset, MILS outperformed existing models at describing video content. Because MILS never modifies model parameters, it can map different kinds of data into readable text, allowing information from multiple sources such as images and audio to be merged and transformed into the desired format, which opens new possibilities for multimodal information fusion.
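Since only the scorer is modality-specific, extending the loop to video amounts to swapping in a video-aware scorer. One plausible training-free choice, shown here as an assumption rather than the paper's exact recipe, is to score a caption against sampled video frames with CLIP and average the per-frame similarities:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def video_score(caption: str, frames: list[Image.Image]) -> float:
    """Score one caption against sampled frames; the generator loop
    stays unchanged, only this scorer differs per modality."""
    inputs = processor(text=[caption], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: one caption-frame similarity per frame
    return out.logits_per_image.squeeze(-1).mean().item()
```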

Tests indicate that larger generator and scorer models yield more accurate results, and that increasing the number of candidate solutions generated per step significantly improves performance.


Landscape descriptions evolve from basic summaries into richer representations with more precise details and natural elements. | Image: Ashutosh et al.

The approach MILS takes aligns with a broader trend in AI toward stronger reasoning at inference time. The Meta team also indicated that MILS may prove valuable in areas such as 3D data processing, further advancing the development of multimodal AI.

With the rapid advances of OpenAI's GPT-4 and open-source alternatives such as Meta's Llama 3.2, Mistral's Pixtral, and DeepSeek's Janus Pro, multimodal AI systems are moving quickly into everyday applications and laying an important foundation for the future development of artificial intelligence.