Title: OCR2.0: The Next-Generation Optical Character Recognition Model for Effortless Image-to-Text Conversion

Recently, researchers have developed a new universal Optical Character Recognition (OCR) model named GOT (General OCR Theory). In their paper, they introduced the concept of "OCR2.0" for the first time, aiming to combine the strengths of traditional OCR systems with the powerful capabilities of large language models.

GOT pairs an image encoder of approximately 80 million parameters with a decoder of roughly 500 million parameters, for about 580 million in total. The encoder compresses a 1024x1024-pixel image into a compact sequence of tokens, and the decoder converts those tokens into text, with output sequences of up to about 8,000 tokens. This capacity lets the OCR2.0 model handle far more than plain text.
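To make these sizes concrete, here is a back-of-the-envelope sketch of the arithmetic. It rests on two assumptions not stated in this article: the decoder is taken as roughly 0.5 billion parameters (in line with the Qwen-0.5B decoder named later), and the encoder is assumed to emit 256 tokens per image, the figure given in the GOT paper. The constant and function names are illustrative only.

```python
import math

ENCODER_PARAMS = 80_000_000    # approximate, as stated in the article
DECODER_PARAMS = 500_000_000   # assumption: Qwen-0.5B taken as ~0.5B parameters
IMAGE_SIDE = 1024              # input resolution in pixels, as stated in the article
IMAGE_TOKENS = 256             # assumption: token count taken from the GOT paper


def total_params(encoder: int, decoder: int) -> int:
    """Combined parameter count of encoder and decoder."""
    return encoder + decoder


def pixels_per_token(side: int, tokens: int) -> int:
    """How many input pixels each image token has to summarize."""
    return (side * side) // tokens


def equivalent_patch_side(side: int, tokens: int) -> int:
    """Side length of the square pixel patch that one token corresponds to."""
    return math.isqrt(pixels_per_token(side, tokens))


if __name__ == "__main__":
    print("total parameters:", total_params(ENCODER_PARAMS, DECODER_PARAMS))
    print("pixels per token:", pixels_per_token(IMAGE_SIDE, IMAGE_TOKENS))
    print("patch side:", equivalent_patch_side(IMAGE_SIDE, IMAGE_TOKENS))
```

Under these assumptions, each token stands in for a 64x64-pixel region of the input, which is why a full page can fit comfortably inside the decoder's context.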

The appeal of the new model lies in the range of visual information it can recognize and convert: English and Chinese scene text and document text, mathematical and chemical formulas, musical notation, simple geometric shapes, and charts. These capabilities open new possibilities for automation in fields such as science, music, and data analysis.


To make training more efficient, the research team first trained the encoder on text recognition tasks alone, then attached Alibaba's Qwen-0.5B as the decoder and fine-tuned the whole model on diverse synthetic data. They generated millions of image-text pairs for training using rendering tools such as LaTeX, Mathpix-markdown-it, TikZ, Verovio, Matplotlib, and Pyecharts.
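The idea behind synthetic training data is that the ground-truth text is known before the image exists: you write the source first, then render it. The following is a minimal sketch of that pairing for LaTeX formulas, not the research team's actual pipeline. The formula generator, the template, and all function names are hypothetical, and the rendering step (compiling each source to an image with a LaTeX toolchain) is deliberately left out.

```python
import random

# Hypothetical minimal document wrapper; FORMULA is replaced per sample.
TEX_TEMPLATE = (
    "\\documentclass{article}\n"
    "\\pagestyle{empty}\n"
    "\\begin{document}\n"
    "$FORMULA$\n"
    "\\end{document}\n"
)

VARIABLES = ["x", "y", "z"]


def random_formula(rng: random.Random) -> str:
    """Build a small random LaTeX formula such as \\frac{x^{2}}{y} + 3."""
    numerator, denominator = rng.sample(VARIABLES, 2)
    power = rng.randint(2, 5)
    constant = rng.randint(1, 9)
    return (
        "\\frac{" + numerator + "^{" + str(power) + "}}{"
        + denominator + "} + " + str(constant)
    )


def make_pair(rng: random.Random) -> dict:
    """Return one training sample: the source to render and its text label."""
    formula = random_formula(rng)
    return {
        "tex_source": TEX_TEMPLATE.replace("FORMULA", formula),
        "label": formula,
    }


def make_dataset(n: int, seed: int = 0) -> list:
    """Generate n reproducible samples from a seeded random generator."""
    rng = random.Random(seed)
    return [make_pair(rng) for _ in range(n)]


if __name__ == "__main__":
    for sample in make_dataset(3, seed=42):
        print(sample["label"])
```

Each `tex_source` would then be compiled and rasterized to produce the image half of the pair, while `label` serves as the target text. The same pattern applies to the other renderers mentioned above, with chart specifications or sheet-music source in place of formulas.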


GOT's modular design allows new features to be added without retraining the entire model, which makes the system considerably cheaper to update. The researchers also report that GOT performs strongly across a variety of OCR tasks, particularly document and scene text recognition, and that it even surpasses some specialized models and large language models in chart recognition.


The research team has released a free demo and the code for GOT on Hugging Face so that others can use the model and build on it. The release is likely to spur further development of OCR technology and broaden its range of applications.

Demo link: https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo

Key Points:

📌 GOT (General OCR Theory) is a new OCR model that combines the strengths of traditional OCR systems with those of large language models, an approach its authors call OCR2.0.

📌 The model can recognize and convert many kinds of visual information, including text, formulas, musical notation, and charts, which makes it useful across a wide range of fields.

📌 Its modular design makes GOT easy to extend, and training on synthetic data helps it perform strongly across multiple OCR tasks.