Researchers from Apple and Columbia University have jointly developed Ferret, a multimodal large language model designed for fine-grained image understanding and description. The model can refer to and reason about arbitrary regions of an image, processing free-form text and referred regions together in a single prompt, and it outperforms prior models on region-level tasks. To train and evaluate the model, the researchers built the GRIT dataset; across multiple benchmarks, Ferret demonstrates strong referring and grounding capabilities, with promising applications in areas such as human-computer interaction and intelligent search.
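As a rough illustration of what "processing free-form text and referred regions together" means in practice, the sketch below builds a mixed prompt in which a referred image region is spliced into the text as normalized bounding-box coordinates, in the general style described in the Ferret paper. The helper names (`format_region`, `build_referring_prompt`) and the exact prompt template are illustrative assumptions, not Apple's released API.

```python
# Illustrative sketch: splicing a referred image region into a free-text
# prompt as normalized box coordinates. Function names and the prompt
# template are assumptions for illustration, not the model's actual API.

def format_region(box, image_w, image_h):
    """Normalize a pixel-space box (x1, y1, x2, y2) to [0, 1] coordinates."""
    x1, y1, x2, y2 = box
    return (f"[{x1 / image_w:.3f}, {y1 / image_h:.3f}, "
            f"{x2 / image_w:.3f}, {y2 / image_h:.3f}]")

def build_referring_prompt(question, box, image_w, image_h):
    """Insert the region coordinates where the text contains '<region>'."""
    return question.replace("<region>", format_region(box, image_w, image_h))

if __name__ == "__main__":
    # A user draws a box around an object in a 640x480 image and asks a
    # free-text question about that specific region.
    prompt = build_referring_prompt(
        "What is the object in <region>, and what is it used for?",
        box=(120, 80, 310, 260),
        image_w=640,
        image_h=480,
    )
    print(prompt)
    # -> What is the object in [0.188, 0.167, 0.484, 0.542], and what is it used for?
```

A prompt of this shape lets a single text stream carry both the question and the pointer to the region it asks about, which is the interaction pattern the referring-and-grounding tasks evaluate.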