Groundlight, a research team focused on enabling AI to understand the world, has recently made a significant breakthrough by open-sourcing a novel AI framework! This framework aims to tackle a major challenge in computer vision – complex visual reasoning – enabling AI to not only "see" objects but also to deduce deeper insights from images, much like Sherlock Holmes.
While current AI excels at recognizing cats and dogs, it often struggles with understanding the logical relationships within images and performing complex reasoning. Groundlight researchers point out that current Vision-Language Models (VLMs) often lack sufficient understanding of images themselves, making it even more difficult to accomplish tasks requiring in-depth interpretation.
Although Large Language Models (LLMs) have made tremendous progress in text reasoning, similar breakthroughs in the visual domain remain limited. Existing VLMs often underperform when requiring combined visual and textual clues for logical deduction, highlighting a key deficiency in their capabilities. Simply identifying objects in an image is insufficient; understanding the relationships between objects and the contextual information is crucial.
Reinforcement Learning and GRPO: Powering a "Super Brain"
To enhance the visual reasoning capabilities of VLMs, Groundlight's research team ingeniously employed a reinforcement learning approach and innovatively utilized GRPO (Group Relative Policy Optimization) to improve learning efficiency.
Previous work on advanced reasoning in language models, such as DeepSeek's, has rarely extended these techniques to the VLM domain. To validate their approach, the researchers designed a cipher decryption task requiring simultaneous processing of visual and textual information. The model needed to use a randomly generated decoder image to decipher encoded information. Ultimately, a model with only 3 billion parameters achieved a remarkable 96% accuracy! Attention analysis showed the model actively engaged with the visual input, focusing on relevant decoder regions while solving the task.
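To make the setup concrete, here is a minimal, hypothetical sketch of such a cipher task as a simple substitution cipher. The mapping and message below are illustrative only; in Groundlight's actual task the decoder is rendered as an image that the model must read, not handed over as a dictionary.

```python
import random
import string

def make_cipher_example(message: str, seed: int = 0):
    """Build a toy substitution-cipher example: a random decoder mapping
    (rendered as an image in the real task) plus the encoded message."""
    rng = random.Random(seed)
    letters = list(string.ascii_uppercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    encode = dict(zip(letters, shuffled))          # plaintext letter -> cipher symbol
    decoder = {v: k for k, v in encode.items()}    # what the decoder image conveys
    # Spaces between symbols keep the task at the character level (see below).
    encoded = " ".join(encode[c] for c in message.upper() if c in encode)
    return decoder, encoded

decoder, encoded = make_cipher_example("HELLO")
print(encoded)  # e.g. "M Q X X T" -- the model must invert this using the decoder
```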
Training VLMs using GRPO wasn't without its challenges, particularly regarding tokenization and reward design. Since models typically process text as tokens rather than individual characters, difficulties can arise for tasks requiring precise character-level reasoning.
To mitigate this, the researchers added spaces between letters in the messages to simplify the decoding process. Reward design was another crucial aspect, as reinforcement learning models require well-structured feedback to learn effectively. The researchers used three reward types: a format reward to ensure output consistency; a decoding reward to encourage meaningful transformation of scrambled text; and a correctness reward to improve accuracy. By carefully balancing these rewards, the researchers successfully prevented the model from learning unintended "shortcuts," ensuring it genuinely improved its cipher decryption capabilities.
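A rough sketch of how these three reward signals could be combined is shown below. The function names, output tags, weights, and scoring rules are assumptions chosen for illustration, not Groundlight's actual implementation.

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the model's final answer out of an r1-style <answer>...</answer> tag."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else ""

def format_reward(completion: str) -> float:
    """Reward outputs that follow the expected answer-tag structure."""
    return 1.0 if extract_answer(completion) else 0.0

def decoding_reward(completion: str, encoded: str) -> float:
    """Reward any meaningful transformation of the scrambled text, so the model
    attempts a decode rather than copying the input verbatim."""
    answer = extract_answer(completion)
    return 0.5 if answer and answer != encoded else 0.0

def correctness_reward(completion: str, target: str) -> float:
    """Reward character-level agreement with the true decoded message."""
    answer = extract_answer(completion)
    if not answer:
        return 0.0
    matches = sum(a == b for a, b in zip(answer, target))
    return matches / max(len(target), 1)

def total_reward(completion: str, encoded: str, target: str) -> float:
    # Weights are illustrative; balancing them is what prevents reward "shortcuts".
    return (0.25 * format_reward(completion)
            + 0.25 * decoding_reward(completion, encoded)
            + 1.0 * correctness_reward(completion, target))
```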
GRPO optimizes the learning process by comparing groups of sampled outputs against one another rather than relying on a separate critic model, leading to greater stability during training. By generating multiple responses for each query and scoring them relative to each other, the method achieves a smoother learning curve. This research also highlights the potential of VLMs in reasoning-based tasks while acknowledging the high computational cost of complex visual models.
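In code, the group-relative part of GRPO reduces to normalizing each sampled response's reward against the other responses drawn for the same prompt. The sketch below shows only that advantage computation (the policy-gradient update and KL term are omitted); tensor shapes and names are chosen for illustration.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -- one scalar reward per sampled response.
    Each response is scored relative to the other responses for the same prompt,
    so no separate value (critic) network is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[0.1, 0.9, 0.4, 0.4],
                        [1.0, 1.0, 0.2, 0.6]])
print(group_relative_advantages(rewards))
```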
To address efficiency issues, they proposed techniques such as selective model upgrading, using more expensive models only when ambiguity arises. Furthermore, they suggested integrating pre-trained object detection, segmentation, and depth estimation models to enhance reasoning capabilities without significantly increasing computational overhead. This tool-based approach provides a scalable alternative to training large end-to-end models, emphasizing the balance between efficiency and accuracy.
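The selective-upgrade idea can be expressed as a simple confidence-gated routing policy. The threshold, model handles, and confidence measure below are hypothetical placeholders, since the strategy is only described at a high level.

```python
def answer_with_escalation(image, question, small_vlm, large_vlm, threshold=0.8):
    """Query a cheap model first and fall back to the expensive one only
    when the cheap model's confidence is low (i.e., the case is ambiguous)."""
    answer, confidence = small_vlm(image, question)  # assumed to return (text, probability)
    if confidence >= threshold:
        return answer
    answer, _ = large_vlm(image, question)
    return answer
```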
Groundlight's team has made significant progress in enhancing VLMs by integrating reinforcement learning techniques, particularly GRPO. They tested their approach on a cipher decryption task, and the model demonstrated impressive accuracy.
Project: https://github.com/groundlight/r1_vlm
Demo: https://huggingface.co/spaces/Groundlight/grpo-vlm-decoder