Groundlight, a research team focused on enabling AI to understand the world, has recently made a significant breakthrough by open-sourcing a novel AI framework! This framework aims to tackle a major challenge in computer vision – complex visual reasoning – enabling AI to not only "see" objects but also to deduce deeper insights from images, much like Sherlock Holmes.
While current AI excels at recognizing cats and dogs, it often struggles with understanding the logical relationships within images and performing complex reasoning. Groundlight researchers point out that current Vision-Language Models (VLMs) often lack sufficient understanding of images themselves, making it even more difficult to accomplish tasks requiring in-depth interpretation.
Although Large Language Models (LLMs) have made tremendous progress in text reasoning, similar breakthroughs in the visual domain remain limited. Existing VLMs often underperform when requiring combined visual and textual clues for logical deduction, highlighting a key deficiency in their capabilities. Simply identifying objects in an image is insufficient; understanding the relationships between objects and the contextual information is crucial.
Reinforcement Learning and GRPO: Powering a "Super Brain"
To enhance the visual reasoning capabilities of VLMs, Groundlight's research team ingeniously employed a reinforcement learning approach and innovatively utilized GRPO (Group Relative Policy Optimization) to improve learning efficiency.
Previous work on advanced reasoning in language models, such as DeepSeek's, has rarely extended these techniques to the VLM domain. To validate their approach, the researchers designed a cipher decryption task requiring simultaneous processing of visual and textual information. The model needed to use a randomly generated decoder image to decipher encoded information. Ultimately, a model with only 3 billion parameters achieved a remarkable 96% accuracy! Attention analysis showed the model actively engaged with the visual input, focusing on relevant decoder regions while solving the task.
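To make the setup concrete, here is a minimal, hypothetical sketch of such a cipher task as a simple substitution cipher. The mapping and message below are illustrative only; in Groundlight's actual task the decoder is rendered as an image that the model must read, not handed over as a dictionary.

```python
import random
import string

def make_cipher_example(message: str, seed: int = 0):
    """Build a toy substitution-cipher example: a random decoder mapping
    (rendered as an image in the real task) plus the encoded message."""
    rng = random.Random(seed)
    letters = list(string.ascii_uppercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    encode = dict(zip(letters, shuffled))          # plaintext letter -> cipher symbol
    decoder = {v: k for k, v in encode.items()}    # what the decoder image conveys
    # Spaces between symbols keep the task at the character level (see below).
    encoded = " ".join(encode[c] for c in message.upper() if c in encode)
    return decoder, encoded

decoder, encoded = make_cipher_example("HELLO")
print(encoded)  # e.g. "M Q X X T" -- the model must invert this using the decoder
```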
Training VLMs using GRPO wasn't without its challenges, particularly regarding tokenization and reward design. Since models typically process text as tokens rather than individual characters, difficulties can arise for tasks requiring precise character-level reasoning.
To mitigate this, the researchers added spaces between letters in the messages to simplify the decoding process. Reward design was another crucial aspect, as reinforcement learning models require well-structured feedback to learn effectively. The researchers used three reward types: a format reward to ensure output consistency; a decoding reward to encourage meaningful transformation of scrambled text; and a correctness reward to improve accuracy. By carefully balancing these rewards, the researchers successfully prevented the model from learning unintended "shortcuts," ensuring it genuinely improved its cipher decryption capabilities.
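A rough sketch of how these three reward signals could be combined is shown below. The function names, output tags, weights, and scoring rules are assumptions chosen for illustration, not Groundlight's actual implementation.

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the model's final answer out of an r1-style <answer>...</answer> tag."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else ""

def format_reward(completion: str) -> float:
    """Reward outputs that follow the expected answer-tag structure."""
    return 1.0 if extract_answer(completion) else 0.0

def decoding_reward(completion: str, encoded: str) -> float:
    """Reward any meaningful transformation of the scrambled text, so the model
    attempts a decode rather than copying the input verbatim."""
    answer = extract_answer(completion)
    return 0.5 if answer and answer != encoded else 0.0

def correctness_reward(completion: str, target: str) -> float:
    """Reward character-level agreement with the true decoded message."""
    answer = extract_answer(completion)
    if not answer:
        return 0.0
    matches = sum(a == b for a, b in zip(answer, target))
    return matches / max(len(target), 1)

def total_reward(completion: str, encoded: str, target: str) -> float:
    # Weights are illustrative; balancing them is what prevents reward "shortcuts".
    return (0.25 * format_reward(completion)
            + 0.25 * decoding_reward(completion, encoded)
            + 1.0 * correctness_reward(completion, target))
```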
GRPO optimizes the learning process by comparing groups of sampled outputs against one another rather than relying on a separate critic model, leading to greater stability during training. By generating multiple responses for each query and scoring them relative to each other, the method achieves a smoother learning curve. This research also highlights the potential of VLMs in reasoning-based tasks while acknowledging the high computational cost of complex visual models.
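In code, the group-relative part of GRPO reduces to normalizing each sampled response's reward against the other responses drawn for the same prompt. The sketch below shows only that advantage computation (the policy-gradient update and KL term are omitted); tensor shapes and names are chosen for illustration.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -- one scalar reward per sampled response.
    Each response is scored relative to the other responses for the same prompt,
    so no separate value (critic) network is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[0.1, 0.9, 0.4, 0.4],
                        [1.0, 1.0, 0.2, 0.6]])
print(group_relative_advantages(rewards))
```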
To address efficiency issues, they proposed techniques such as selective model upgrading, using more expensive models only when ambiguity arises. Furthermore, they suggested integrating pre-trained object detection, segmentation, and depth estimation models to enhance reasoning capabilities without significantly increasing computational overhead. This tool-based approach provides a scalable alternative to training large end-to-end models, emphasizing the balance between efficiency and accuracy.
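The selective-upgrade idea can be expressed as a simple confidence-gated routing policy. The threshold, model handles, and confidence measure below are hypothetical placeholders, since the strategy is only described at a high level.

```python
def answer_with_escalation(image, question, small_vlm, large_vlm, threshold=0.8):
    """Query a cheap model first and fall back to the expensive one only
    when the cheap model's confidence is low (i.e., the case is ambiguous)."""
    answer, confidence = small_vlm(image, question)  # assumed to return (text, probability)
    if confidence >= threshold:
        return answer
    answer, _ = large_vlm(image, question)
    return answer
```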
Groundlight's team has made significant progress in enhancing VLMs by integrating reinforcement learning techniques, particularly GRPO. They tested their approach on a cipher decryption task, and the model demonstrated impressive accuracy.
Project: https://github.com/groundlight/r1_vlm
Demo: https://huggingface.co/spaces/Groundlight/grpo-vlm-decoder