Quickly converting the text in images into editable text is a common and important need. A new Optical Character Recognition (OCR) model called GOT (General Optical Character Recognition Theory) is presented by its authors as the start of OCR's 2.0 era. The model combines the strengths of traditional OCR systems with those of large language models, aiming to create a more efficient and intelligent text recognition tool.
The GOT model uses an end-to-end architecture, which saves resources and extends its capabilities beyond plain text recognition. The model consists of an image encoder with approximately 80 million parameters and a decoder with about 500 million parameters (the 0.5B-parameter Qwen model). The image encoder compresses images of up to 1024x1024 pixels into tokens, and the decoder converts these tokens into text of up to about 8,000 tokens.
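The size figures above can be put into a small sketch. The class below is hypothetical and is not part of the released GOT code. It only records the numbers cited in the article, taking the decoder as Qwen-0.5B (about 500 million parameters). The `scale_to_fit` helper shows one simple way to bring an oversized image under the 1024x1024 limit, which is an assumption and not the model's actual preprocessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GotBudget:
    """Illustrative container for the sizes cited in the article (hypothetical)."""

    encoder_params: int = 80_000_000    # image encoder, approximate
    decoder_params: int = 500_000_000   # Qwen-0.5B decoder, approximate
    max_image_side: int = 1024          # largest supported side in pixels
    max_output_len: int = 8000          # output length limit cited in the article

    @property
    def total_params(self) -> int:
        # Encoder and decoder together make up the whole end-to-end model.
        return self.encoder_params + self.decoder_params

    def scale_to_fit(self, width: int, height: int) -> tuple[int, int]:
        """Shrink (width, height) so the longest side fits, keeping aspect ratio.

        This is an assumed strategy for illustration only.
        """
        longest = max(width, height)
        if longest <= self.max_image_side:
            return width, height
        scale = self.max_image_side / longest
        return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these figures, `GotBudget().total_params` comes to roughly 580 million parameters, which is small compared with general-purpose vision-language models.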
GOT's strength lies in its versatility: it can recognize and convert not only English and Chinese documents and scene text but also mathematical and chemical formulas, musical notation, simple geometric shapes, and various charts. This makes GOT a true all-rounder.
To train the model, the research team first trained the encoder on text recognition tasks, then attached Alibaba's Qwen-0.5B as the decoder and fine-tuned the combined model on a range of synthetic data. They generated millions of image-text pairs for training using rendering tools such as LaTeX, Mathpix-markdown-it, and Matplotlib.
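The idea of synthetic image-text pairs can be illustrated with a minimal sketch. This is not the research team's pipeline. The function names, the chart layout, and the JSON label format are all assumptions made for illustration. The point is that the ground-truth text is known before the image is rendered, so no manual annotation is needed.

```python
import json
import random


def make_chart_sample(seed: int) -> tuple[dict, str]:
    """Create a random bar-chart spec and its ground-truth text label.

    Hypothetical helper: the label format (sorted JSON) is an assumption.
    """
    rng = random.Random(seed)  # seeded so each sample is reproducible
    categories = ["A", "B", "C", "D"]
    values = [rng.randint(1, 100) for _ in categories]
    spec = {
        "title": f"Synthetic chart {seed}",
        "categories": categories,
        "values": values,
    }
    label = json.dumps(
        {"title": spec["title"], "data": dict(zip(categories, values))},
        sort_keys=True,
    )
    return spec, label


def render_chart(spec: dict, path: str) -> None:
    """Render the spec to an image file with Matplotlib.

    Matplotlib is imported here so that label generation works without it.
    """
    import matplotlib

    matplotlib.use("Agg")  # headless backend, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(spec["categories"], spec["values"])
    ax.set_title(spec["title"])
    fig.savefig(path, dpi=100)
    plt.close(fig)  # release the figure's memory
```

Calling `make_chart_sample(i)` followed by `render_chart(spec, f"chart_{i}.png")` for a range of seeds would yield as many image-text pairs as needed, each with an exact label.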
Another highlight of the OCR2.0 technology is its ability to extract formatted text and headings, and even to process multi-page images, converting them into structured digital formats. This opens up new possibilities for automatic processing and analysis in fields such as science, music, and data analysis.
In tests across various OCR tasks, GOT performed strongly: it achieved leading results in document and scene text recognition, and in chart recognition it even surpassed many specialized models and large language models. Whether the input is a complex chemical structural formula, musical notation, or a data visualization, OCR2.0 can convert it into a machine-readable format.
To let more users try the technology, the research team has released a free demo and the code on the Hugging Face platform. OCR2.0 promises to make handling text in images both more efficient and more flexible.