Humans often draw while reasoning: sketching auxiliary lines to solve geometry problems, marking and circling locations on maps, or making rough sketches to clarify their thoughts. Current multimodal language models (LMs), however, lack this ability. In recent research, scientists have introduced "Sketchpad," which gives multimodal LMs a visual drawing board and drawing tools so that they can reason visually.


Product Entry: https://top.aibase.com/tool/visual-sketchpad

Operation Mechanism: Sketchpad allows GPT-4 to generate intermediate sketches while working through a reasoning task. Given a visual input and a query, such as proving that the angles of a triangle sum to 180°, the drawing board lets the model draw auxiliary lines that help solve the geometric problem. For computer vision tasks, Sketchpad can call on specialist vision models to do the sketching, for instance using "Grounding DINO" to draw bounding boxes or "Segment Anything" to draw masks.
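To make the triangle example concrete, here is a minimal Python sketch of the auxiliary-line idea. It is an illustration only, not Sketchpad's actual code: the function names (`angle_at`, `auxiliary_parallel`, `to_svg`) and the sample coordinates are assumptions made for this example. It draws a line through one vertex parallel to the opposite side and checks numerically that the three angles formed on that line match the triangle's interior angles and sum to 180°.

```python
import math

def angle_at(p, q, r):
    """Angle in degrees at point p between the rays p->q and p->r."""
    a1 = math.atan2(q[1] - p[1], q[0] - p[0])
    a2 = math.atan2(r[1] - p[1], r[0] - p[0])
    d = abs(a1 - a2) % (2 * math.pi)
    if d > math.pi:
        d = 2 * math.pi - d
    return math.degrees(d)

def auxiliary_parallel(a, b, c, half_len=1.0):
    """Segment through c parallel to side ab (the auxiliary line)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy)
    ux, uy = dx / norm, dy / norm
    start = (c[0] - half_len * ux, c[1] - half_len * uy)
    end = (c[0] + half_len * ux, c[1] + half_len * uy)
    return start, end

def to_svg(a, b, c, aux):
    """Render the triangle and the dashed auxiliary line as an SVG string."""
    pts = " ".join(f"{x},{y}" for x, y in (a, b, c))
    (x1, y1), (x2, y2) = aux
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="-1 -1 6 5">'
        f'<polygon points="{pts}" fill="none" stroke="black" stroke-width="0.05"/>'
        f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" '
        'stroke="red" stroke-width="0.05" stroke-dasharray="0.2"/>'
        "</svg>"
    )

# Hypothetical sample triangle.
A, B, C = (0.0, 0.0), (4.0, 0.0), (1.0, 3.0)
aux = auxiliary_parallel(A, B, C)

# The three angles that lie along the auxiliary line at vertex C.
left = angle_at(C, aux[0], A)    # alternate angle to the interior angle at A
middle = angle_at(C, A, B)       # interior angle at C
right = angle_at(C, B, aux[1])   # alternate angle to the interior angle at B

svg = to_svg(A, B, C, aux)
print(round(left + middle + right, 6))
```

Because the auxiliary line is straight, the three angles along it must add up to a straight angle, which is the visual argument the sketch supports. The SVG string can be saved to a file and opened in a browser to see the drawing.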

Unlike previous work that used text-to-image models to enable LMs to draw, Sketchpad allows LMs to draw using lines, boxes, marks, etc., which is closer to human sketching and more conducive to reasoning. Additionally, Sketchpad can employ specialized visual models during the drawing process, such as using object detection models to draw bounding boxes and segmentation models to draw masks, further enhancing visual perception and reasoning capabilities.
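The idea of letting a specialist vision model do the drawing can be illustrated with a small Python sketch. Everything here is an assumption for illustration: `stub_detector` is a hypothetical stand-in that returns fixed boxes (it does not call Grounding DINO or any real model), and the "image" is a character grid rather than real pixels. The point is only to show detections being overlaid on a copy of the input so that a model could then reason over the annotated picture.

```python
from dataclasses import dataclass

@dataclass
class Box:
    label: str
    x0: int  # left column (inclusive)
    y0: int  # top row (inclusive)
    x1: int  # right column (inclusive)
    y1: int  # bottom row (inclusive)

def stub_detector(image):
    """Hypothetical detector: returns hard-coded boxes instead of running a model."""
    return [Box("cat", 1, 1, 4, 3), Box("dog", 6, 2, 9, 5)]

def draw_boxes(image, boxes):
    """Return a copy of the image with numbered box outlines drawn on it."""
    canvas = [row[:] for row in image]  # leave the original untouched
    for idx, box in enumerate(boxes, start=1):
        for x in range(box.x0, box.x1 + 1):
            canvas[box.y0][x] = "#"
            canvas[box.y1][x] = "#"
        for y in range(box.y0, box.y1 + 1):
            canvas[y][box.x0] = "#"
            canvas[y][box.x1] = "#"
        canvas[box.y0][box.x0] = str(idx)  # numbered tag in the top-left corner
    return canvas

# A blank 12x7 "image" made of dots.
image = [["."] * 12 for _ in range(7)]
boxes = stub_detector(image)
annotated = draw_boxes(image, boxes)

print("\n".join("".join(row) for row in annotated))
for idx, box in enumerate(boxes, start=1):
    print(f"{idx}: {box.label}")
```

Numbering each box lets a follow-up text query refer to a region by its tag, which is one plausible way an annotated sketch could feed back into the reasoning step.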

Experimental results show that Sketchpad significantly improves the performance of multimodal large language models on mathematical tasks (including geometry, functions, graphs, and chess) and on complex visual reasoning tasks. Compared with strong baseline models that cannot draw, Sketchpad boosts LM performance by an average of 12.7% on mathematical tasks and 8.6% on visual tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK Spatial Reasoning (83.9%), and Visual Correspondence (80.8%).

The findings suggest that, given a visual drawing board and tools, multimodal LMs can work through complex reasoning tasks in a way that is closer to human thinking, which improves their performance in mathematics and visual reasoning. The approach is expected to influence the development of both language models and vision models and to open up new possibilities for artificial intelligence.