The OCR 2.0 Model Is Here! Converting Charts, Geometric Shapes, and Musical Symbols into Editable Text

AIbase基地

Published in AI News · 5 min read · Oct 15, 2024

OCR2.0: The Next-Generation Optical Character Recognition Model for Effortless Image-to-Text Conversion

Recently, researchers have developed a new universal Optical Character Recognition (OCR) model named GOT (General OCR Theory). In their paper, they introduced the concept of "OCR2.0" for the first time, aiming to combine the strengths of traditional OCR systems with the powerful capabilities of large language models.

GOT pairs an image encoder of approximately 80 million parameters with a decoder of roughly 500 million parameters (the Qwen-0.5B model described below). The image encoder compresses a 1024x1024-pixel image into a compact sequence of tokens, and the decoder converts those tokens into output text of up to about 8,000 tokens. This design lets the OCR2.0 model handle far more than plain text.
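The size figures above can be put side by side in a short back-of-the-envelope sketch. The encoder size and the 1024x1024 input come from the article, and the decoder size follows from the Qwen-0.5B model named as the decoder (roughly 500 million parameters). The number of image tokens is not stated in the article, so the value of 256 used here is an assumption for illustration only, and the names `GotBudget`, `total_params`, and `pixels_per_token` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GotBudget:
    """Rough size figures for an encoder-decoder OCR model (illustrative only)."""

    encoder_params: int  # ~80 million, as stated in the article
    decoder_params: int  # ~500 million, implied by the Qwen-0.5B decoder
    image_side: int      # input images are image_side x image_side pixels
    image_tokens: int    # ASSUMPTION: the article does not give this number

    def total_params(self) -> int:
        """Combined parameter count of encoder and decoder."""
        return self.encoder_params + self.decoder_params

    def pixels_per_token(self) -> float:
        """How many input pixels each image token has to summarize."""
        return (self.image_side ** 2) / self.image_tokens


budget = GotBudget(
    encoder_params=80_000_000,
    decoder_params=500_000_000,
    image_side=1024,
    image_tokens=256,  # assumed value, for illustration only
)

print(f"total parameters: {budget.total_params():,}")
print(f"pixels per image token: {budget.pixels_per_token():,.0f}")
```

Under these assumptions the model is small by large-language-model standards, with most of its parameters in the decoder.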

The appeal of this approach lies in the range of visual information it can recognize and convert: English and Chinese scene text and document text, mathematical and chemical formulas, musical notation, simple geometric shapes, and charts together with their components. These capabilities open new possibilities for automation in fields such as science, music, and data analysis.

To make training more efficient, the research team first trained the encoder on text recognition tasks alone, then attached Alibaba's Qwen-0.5B as the decoder and fine-tuned the model on diverse synthetic data. They generated millions of image-text training pairs with rendering tools such as LaTeX, Mathpix-markdown-it, TikZ, Verovio, Matplotlib, and Pyecharts.
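The idea behind synthetic training pairs can be sketched in a few lines: a program writes the source for a figure, a rendering tool turns that source into an image, and the source itself becomes the text the model should learn to output. The example below is a minimal sketch of that idea for simple geometric shapes in TikZ, not the research team's actual pipeline. The function names `make_tikz_sample` and `build_dataset` are hypothetical, and the rendering step (compiling the TikZ source to an image with a LaTeX toolchain) is deliberately left out.

```python
import random


def make_tikz_sample(rng: random.Random) -> dict:
    """Build one synthetic sample: TikZ source for a circle plus a line segment.

    The same source string serves as the input for a renderer and as the
    target text a model would be trained to emit for the rendered image.
    """
    cx, cy = rng.randint(0, 4), rng.randint(0, 4)
    radius = rng.randint(1, 3)
    x1, y1, x2, y2 = (rng.randint(0, 6) for _ in range(4))
    body = [
        f"\\draw ({cx},{cy}) circle ({radius});",
        f"\\draw ({x1},{y1}) -- ({x2},{y2});",
    ]
    source = "\\begin{tikzpicture}\n" + "\n".join(body) + "\n\\end{tikzpicture}"
    return {"render_source": source, "target_text": source}


def build_dataset(n: int, seed: int) -> list:
    """Generate n samples reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_tikz_sample(rng) for _ in range(n)]


pairs = build_dataset(3, seed=0)
print(pairs[0]["target_text"])
```

Because the ground truth is produced by the same program that produces the figure, labels are exact and the dataset can grow as large as compute allows, which is presumably what makes this route attractive for formulas, sheet music, and charts as well.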

The modular design of GOT allows for flexible expansion of new features without retraining the entire model, significantly enhancing the system's update efficiency. Additionally, researchers report that GOT performs excellently across various OCR tasks, especially in document and scene text recognition, and even surpasses some specialized models and large language models in chart recognition.

The research team has released a free demo and the code for GOT on Hugging Face for others to use and build on. The model is expected to push OCR technology forward and open up a broader range of applications.

Demo link: https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo

Key Points:
📌 GOT (General OCR Theory) is a new OCR model that combines the strengths of traditional OCR systems with those of large language models, an approach its authors call OCR2.0.
📌 The model recognizes and converts many kinds of visual information, including text, formulas, musical notation, and charts, making it applicable to a wide range of fields.
📌 Its modular design and synthetic-data training let GOT be extended flexibly and perform well across multiple OCR tasks.

This article is from AIbase Daily
