Recently, Google announced the launch of a new open-source AI model, DataGemma, aimed at addressing the "hallucination" issue commonly encountered by large language models (LLMs) when processing statistical data.

This hallucination phenomenon can lead models to give inaccurate answers to questions involving numbers and statistics. The introduction of DataGemma marks a notable step forward in Google's AI work.

Image: survey and data report illustration (AI-generated, provided by the image licensing service Midjourney).

Reducing Hallucinations in Statistical Queries

DataGemma comprises two model variants, each using a distinct method to improve the accuracy of answers to statistical queries. Both are grounded in Google's Data Commons data-sharing platform, which gives the models a solid factual foundation.

Both new models are available on Hugging Face for academic and research use. They are built on the existing Gemma family of open models and ground their answers in extensive real-world data from Data Commons, a public platform whose open knowledge graph contains more than 240 billion data points from trusted organizations across economics, science, health, and other fields.

Model collection: https://huggingface.co/collections/google/datagemma-release-66df7636084d2b150a4e6643
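For reference, a checkpoint from this collection could be loaded with the Hugging Face transformers library roughly as sketched below. The repository id, prompt, and generation settings are illustrative assumptions, not instructions from Google; check the collection page above for the exact model names, licenses, and hardware requirements.

```python
# Sketch: loading a DataGemma checkpoint from Hugging Face with transformers.
# The repository id is an assumed example taken from the linked collection;
# large checkpoints require substantial GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/datagemma-rag-27b-it"  # assumed id; verify on the collection page

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is the unemployment rate in California?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```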

Google researchers stated that they examined several aspects of model hallucination to identify its causes. Traditional models sometimes perform poorly on logical and arithmetic problems, and public statistical data often comes in diverse formats with complex context, making it difficult for models to interpret correctly.

To address these issues, Google researchers combined two new methods. The first, Retrieval-Interleaved Generation (RIG), improves accuracy by checking the model's generated answer against the relevant statistics in Data Commons. To do this, the fine-tuned LLM generates natural language queries describing the statistical values it has just produced. A multi-model post-processing pipeline then converts those queries into structured data queries, runs them to retrieve the relevant statistics from Data Commons, and returns or corrects the LLM's output with the corresponding citations.
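To make the RIG flow concrete, here is a minimal Python sketch of the post-processing step, under stated assumptions: the inline annotation format, the query_data_commons helper, and its dummy return value are hypothetical placeholders for illustration, not DataGemma's actual output format or the Data Commons API.

```python
import re

# Assumed format: the fine-tuned LLM interleaves annotations such as
# [DC("unemployment rate in California") -> "3.9%"] into its draft answer,
# pairing a natural language query with the value it generated itself.
ANNOTATION = re.compile(r'\[DC\("(?P<query>[^"]+)"\) -> "(?P<value>[^"]+)"\]')

def query_data_commons(nl_query: str):
    """Placeholder: convert the natural language query into a structured
    Data Commons query, run it, and return the retrieved statistic."""
    return "<value retrieved from Data Commons>"  # dummy value for illustration

def correct_with_rig(draft_answer: str) -> str:
    """Replace each model-generated value with the retrieved statistic,
    keeping the model's own value when retrieval returns nothing."""
    def substitute(match):
        retrieved = query_data_commons(match.group("query"))
        return retrieved if retrieved is not None else match.group("value")
    return ANNOTATION.sub(substitute, draft_answer)

draft = 'Unemployment stands at [DC("unemployment rate in California") -> "3.9%"].'
print(correct_with_rig(draft))
# -> Unemployment stands at <value retrieved from Data Commons>.
```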

The second method, Retrieval-Augmented Generation (RAG), lets the model pull relevant data from Data Commons before answering. Here, the fine-tuned Gemma model extracts the relevant variables from the original statistical question and generates natural language queries for Data Commons. Those queries are run against the database to retrieve the relevant statistics and tables, and the retrieved values are combined with the original user query to prompt a long-context LLM (in this case, Gemini 1.5 Pro), which generates the final, more accurate answer.
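A minimal sketch of the RAG flow, under similar assumptions: the helper functions, prompt wording, and the stand-in generate callable are illustrative placeholders. In the pipeline described above, the queries come from the fine-tuned Gemma model and the final answer from Gemini 1.5 Pro.

```python
def extract_data_commons_queries(question: str) -> list:
    """Placeholder for the fine-tuned Gemma step that turns the user's question
    into natural language queries for Data Commons."""
    return [f"Statistics relevant to: {question}"]  # dummy query for illustration

def fetch_statistics(queries: list) -> list:
    """Placeholder for running the queries against Data Commons and returning
    the retrieved statistics/tables serialized as text."""
    return [f"[table retrieved for: {q}]" for q in queries]  # dummy rows

def answer_with_rag(question: str, generate) -> str:
    """Prepend the retrieved tables to the user's question and hand the
    augmented prompt to a long-context model (Gemini 1.5 Pro in the article)."""
    tables = fetch_statistics(extract_data_commons_queries(question))
    prompt = (
        "Answer the question using only the statistics below.\n\n"
        + "\n\n".join(tables)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)

# Example with a stand-in generate callable; a real run would call Gemini 1.5 Pro.
print(answer_with_rag("What is the unemployment rate in California?",
                      generate=lambda p: f"(model answer grounded in)\n{p}"))
```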

Significantly Improved Accuracy

In preliminary tests, the DataGemma model using the RIG method raised the baseline model's factual accuracy from 5-17% to about 58%. The RAG method, while slightly less effective, still outperformed the baseline model.

Test data shows that DataGemma could answer 24-29% of statistical questions with figures drawn from Data Commons; the retrieved numbers themselves were 99% accurate, but the model still drew incorrect conclusions from them 6-20% of the time.

Google hopes that the release of DataGemma will spur further research and lay a more solid foundation for future Gemma and Gemini models. The company says its research will continue, and it expects these improvements to be integrated into more models after rigorous testing.

Key Points:

🌟 Google introduces the DataGemma model to reduce errors in AI statistical queries.

📊 DataGemma leverages Google's data-sharing platform to enhance the accuracy of model responses.

🔍 Preliminary tests show significant improvements in the accuracy of statistical queries with DataGemma.