The rise of large language models (LLMs) has brought revolutionary changes to artificial intelligence applications, yet these models still fall short when it comes to handling tabular data. A research team from the Zhejiang University Institute of Computing Innovation has tackled this gap with a new model called TableGPT2, which integrates and processes tabular data directly and efficiently, opening up new possibilities for business intelligence (BI) and other data-driven applications.

The core innovation of TableGPT2 is its dedicated table encoder, designed to capture both the structural information of tables and the content of individual cells. This strengthens the model's ability to handle the ambiguous queries, missing column names, and irregular tables common in real-world applications. TableGPT2 is built on the Qwen2.5 architecture and underwent extensive pre-training and fine-tuning involving over 593,800 tables and 2.36 million high-quality query-table-output tuples, a scale of table-related data unmatched in prior research.


To strengthen TableGPT2's coding and reasoning capabilities, the researchers performed continual pre-training (CPT), with 80% of the data consisting of meticulously annotated code to ensure strong coding ability. They also collected extensive reasoning data and textbooks containing domain-specific knowledge to bolster the model's reasoning. The final CPT corpus comprised 86 billion rigorously filtered tokens, equipping TableGPT2 with the coding and reasoning skills needed for complex BI and related tasks.
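As a rough illustration of what such a data mixture might look like in practice, here is a minimal Python sketch of sampling a CPT batch with an approximately 80% code share. The bucket names, the non-code split, and the toy corpus are assumptions made purely for illustration; the paper does not publish its exact data pipeline.

```python
import random

# Hypothetical corpus buckets; the real CPT corpus and its filters are not public.
CORPUS = {
    "annotated_code": ["def mean(xs): return sum(xs) / len(xs)",
                       "SELECT AVG(price) FROM sales"],
    "reasoning_text": ["Step 1: group the rows by region, then sum q3_sales."],
    "domain_textbooks": ["In double-entry bookkeeping, every debit has a credit."],
}

# Target mixture: roughly 80% code per the paper; the remaining split is assumed.
MIX = {"annotated_code": 0.80, "reasoning_text": 0.12, "domain_textbooks": 0.08}

def sample_batch(n: int) -> list[str]:
    """Draw a CPT batch whose composition follows the target mixture."""
    buckets = list(MIX)
    weights = [MIX[b] for b in buckets]
    return [random.choice(CORPUS[random.choices(buckets, weights)[0]])
            for _ in range(n)]

print(sample_batch(4))
```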

To adapt TableGPT2 to specific BI tasks and scenarios, the researchers then performed supervised fine-tuning (SFT). They constructed a dataset covering a range of critical, realistic scenarios, including multi-turn dialogue, complex reasoning, tool usage, and highly business-oriented queries, combining manual annotation with expert-driven automatic annotation to ensure quality and relevance. The SFT stage used 2.36 million samples, further refining the model for BI and other table-centric environments.
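The paper does not publish its SFT sample format, but a query-table-output tuple with multi-turn dialogue could be represented along these lines; all field names below are hypothetical:

```python
from dataclasses import dataclass, field

# Illustrative schema only; the actual SFT format is not specified in the paper.
@dataclass
class TableSFTSample:
    table: list[dict]                                    # rows as column -> value maps
    dialogue: list[dict] = field(default_factory=list)   # multi-turn chat history

sample = TableSFTSample(
    table=[{"region": "East", "q3_sales": 1200},
           {"region": "West", "q3_sales": 950}],
    dialogue=[
        {"role": "user", "content": "Which region sold the most in Q3?"},
        {"role": "assistant", "content": "East, with 1200 units."},
        {"role": "user", "content": "And the gap to the runner-up?"},  # follow-up turn
        {"role": "assistant", "content": "250 units more than West."},
    ],
)
print(sample.dialogue[-1]["content"])
```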

TableGPT2 also introduces a novel semantic table encoder that takes an entire table as input and generates a set of compact embeddings for each column. The architecture is tailored to the distinctive properties of tabular data, capturing relationships across rows and columns through a bidirectional attention mechanism and a hierarchical feature extraction process. A column-wise contrastive learning objective further encourages the model to learn meaningful, structure-aware semantic representations of tables.
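A minimal PyTorch sketch of this idea, assuming one token id per cell, mean-pooling over rows to get per-column embeddings, and an InfoNCE-style loss where matching columns of two views of the same table are positives; the paper's actual encoder and augmentations are more elaborate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TableEncoder(nn.Module):
    """Toy column-embedding table encoder; dimensions are illustrative."""
    def __init__(self, vocab_size=1000, d=64, heads=4, layers=2):
        super().__init__()
        self.cell_emb = nn.Embedding(vocab_size, d)
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)  # bidirectional self-attention

    def forward(self, cells):                    # cells: (rows, cols) of token ids
        r, c = cells.shape
        x = self.cell_emb(cells).reshape(1, r * c, -1)  # flatten grid into one sequence
        h = self.encoder(x).reshape(r, c, -1)
        return h.mean(dim=0)                     # pool over rows -> one vector per column

def column_contrastive_loss(z1, z2, tau=0.1):
    """InfoNCE over columns: column i of view 1 matches column i of view 2."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

enc = TableEncoder()
view1 = torch.randint(0, 1000, (8, 5))           # 8 rows x 5 columns of cell token ids
view2 = torch.randint(0, 1000, (8, 5))           # e.g. a row-subsampled view of the table
loss = column_contrastive_loss(enc(view1), enc(view2))
loss.backward()
```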

To integrate TableGPT2 seamlessly with enterprise-grade data analysis tools, the researchers designed an agent workflow runtime framework with three core components: runtime prompt engineering, a secure code sandbox, and an agent evaluation module, which together improve the agent's capability and reliability. The workflow supports complex data analysis tasks through modular steps (input normalization, agent execution, and tool invocation) that jointly manage and monitor the agent's behavior. By incorporating retrieval-augmented generation (RAG) for efficient context retrieval and a code sandbox for safe execution, the framework ensures TableGPT2 delivers accurate, context-relevant insights on practical problems.
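The following sketch shows the general shape of such a loop: normalize the query, retrieve context, let the model propose code, and execute it in isolation. Every function here is a hypothetical stand-in rather than TableGPT2's actual runtime API, and the subprocess-with-timeout "sandbox" is only a minimal placeholder for a real secure sandbox:

```python
import subprocess
import sys

def normalize_input(query: str) -> str:
    return " ".join(query.split())                 # input normalization step

def retrieve_context(query: str) -> str:
    return "columns: region, q3_sales"             # RAG stub: schema snippets, docs, etc.

def agent_generate_code(query: str, context: str) -> str:
    # A real system would prompt the LLM here; we return canned code instead.
    return "print(max([1200, 950]))"

def run_in_sandbox(code: str, timeout: float = 5.0) -> str:
    """Run code in an isolated subprocess with a timeout; a production
    sandbox would also drop privileges and restrict filesystem/network I/O."""
    out = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True, timeout=timeout)
    return out.stdout.strip()

query = normalize_input("Which region  sold the most in Q3?")
code = agent_generate_code(query, retrieve_context(query))
print(run_in_sandbox(code))                        # -> 1200
```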

The researchers evaluated TableGPT2 extensively on a range of widely used tabular and general benchmarks, showing that it excels at table understanding, processing, and reasoning: the 7-billion-parameter model's average performance improved by 35.20% and the 72-billion-parameter model's by 49.32% over comparable baselines, while both maintained strong general capabilities. For a fair comparison, TableGPT2 was evaluated only against open-source, benchmark-neutral models (such as Qwen and DeepSeek), ensuring balanced, versatile performance across tasks without overfitting to any single benchmark. The team also introduced, and partially released, a new benchmark, RealTabBench, which emphasizes unconventional tables, anonymous fields, and complex queries that better reflect real-world scenarios.

Despite TableGPT2's state-of-the-art experimental performance, challenges remain in deploying LLMs in real-world BI environments. The researchers highlight several directions for future work:

Domain-specific encoding: enabling LLMs to adapt quickly to enterprise-specific domain-specific languages (DSLs) or pseudocode, so as to better serve the particular needs of enterprise data infrastructure.

Multi-agent design: exploring how to effectively integrate multiple LLMs into a unified system to handle the complexity of real-world applications.

Versatile table processing: improving the model's robustness to irregular tables, such as the merged cells and inconsistent structures common in Excel and Pages files, so it can cope with the full variety of real-world tabular data (a small illustration follows below).
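As a toy example of one such irregularity: a merged cell exported from a spreadsheet typically surfaces as blanks on the rows it spanned, which a forward fill can repair. The pandas snippet below is purely illustrative and not part of TableGPT2:

```python
import pandas as pd

# A merged "region" cell exported from a spreadsheet becomes None on later rows.
df = pd.DataFrame({
    "region": ["East", None, "West", None],
    "quarter": ["Q3", "Q4", "Q3", "Q4"],
    "sales": [1200, 1100, 950, 990],
})
df["region"] = df["region"].ffill()   # propagate the merged value downward
print(df)
```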

The introduction of TableGPT2 marks a significant advance in LLMs' ability to handle tabular data, opening new possibilities for business intelligence and other data-driven applications. As research progresses, TableGPT2 is poised to play an increasingly important role in data analysis.

Paper link: https://arxiv.org/pdf/2411.02059v1