The Apple AI/ML team, in collaboration with Columbia University, has developed a multimodal large model named "Ferret" that has successfully cracked Google's CAPTCHA human-verification test: it can pick out the traffic lights in the image grid, and more broadly it improves the accuracy of large models on tasks that require seeing, describing, and answering.

Ferret's innovation lies in unifying referring (understanding which image region a prompt points to) and grounding (localizing the objects a description mentions) in a single model, so that it comprehends both semantics and spatial targets, unlike traditional multimodal models. By using a hybrid region representation that combines discrete coordinates with continuous visual features, the model performs exceptionally well in multi-task evaluations, particularly on referring and visual grounding tasks.

The work was carried out by a team of Chinese researchers, highlighting China's strength in research on multimodal large models, and it opens new directions for image understanding and multimodal tasks. Ferret's capabilities are expected to enable significant breakthroughs in areas such as human-computer interaction and intelligent search.
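To make the "hybrid region representation" idea concrete, here is a minimal, hypothetical sketch: a region is encoded both as discrete coordinate tokens (quantized box corners) and as a continuous feature pooled from the image features inside the box. The function names, the binning scheme, and the toy feature map are illustrative assumptions, not Ferret's actual implementation.

```python
def discretize_box(box, image_size, num_bins=1000):
    """Quantize pixel coords (x1, y1, x2, y2) into discrete bin indices (tokens)."""
    w, h = image_size
    x1, y1, x2, y2 = box
    return (
        int(x1 / w * (num_bins - 1)),
        int(y1 / h * (num_bins - 1)),
        int(x2 / w * (num_bins - 1)),
        int(y2 / h * (num_bins - 1)),
    )

def pool_region_feature(feature_map, box, image_size):
    """Average-pool a 2D grid of feature vectors over the cells the box covers."""
    rows, cols = len(feature_map), len(feature_map[0])
    w, h = image_size
    x1, y1, x2, y2 = box
    c1, c2 = int(x1 / w * cols), min(cols - 1, int(x2 / w * cols))
    r1, r2 = int(y1 / h * rows), min(rows - 1, int(y2 / h * rows))
    cells = [feature_map[r][c] for r in range(r1, r2 + 1) for c in range(c1, c2 + 1)]
    dim = len(cells[0])
    return [sum(v[d] for v in cells) / len(cells) for d in range(dim)]

# A region is then represented as the pair (discrete tokens, continuous feature).
image_size = (640, 480)
box = (64, 48, 320, 240)
tokens = discretize_box(box, image_size)
# Toy 6x8 feature map with 1-dimensional features, standing in for a vision encoder's output.
grid = [[[float(r + c)] for c in range(8)] for r in range(6)]
feature = pool_region_feature(grid, box, image_size)
print(tokens, feature)  # → (99, 99, 499, 499) [3.5]
```

The discrete tokens give the language model an exact, vocabulary-friendly handle on location, while the pooled continuous feature carries the visual content of the region; combining the two is the gist of what the article describes.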