Natural Language to SQL (NL2SQL) is a rapidly evolving area of Natural Language Processing (NLP). It converts natural language questions into Structured Query Language (SQL) statements, making it much easier for users without a technical background to retrieve valuable information from complex databases. Across industries, NL2SQL opens up large databases for exploration and can improve work efficiency and decision-making.

However, NL2SQL implementations face a trade-off between query accuracy and adaptability: some methods struggle to generate accurate SQL while also adapting to different types of databases. Existing solutions often rely on Large Language Models (LLMs), using prompt engineering to generate multiple candidate queries and then selecting the best one, but this increases the computational burden and is not suitable for real-time applications. Supervised Fine-Tuning (SFT), by contrast, can achieve targeted SQL generation but faces challenges in cross-domain applications and complex database operations. These limitations highlight the need for new frameworks.

A research team at Alibaba has launched XiYan-SQL, an NL2SQL framework built on a multi-generator ensemble strategy that combines the advantages of prompt engineering and SFT. A key innovation of XiYan-SQL is M-Schema, a semi-structured schema representation that improves the system's understanding of a database's hierarchical structure, including data types, primary keys, and example values, and thereby its ability to generate accurate and contextually relevant SQL queries.
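To make the idea of a semi-structured schema description more concrete, here is a minimal sketch that renders a SQLite database as text with column types, primary-key flags, and a few example values. The layout and the `describe_schema` helper are assumptions made for illustration; this is not the actual M-Schema format or XiYan-SQL code.

```python
import sqlite3

def describe_schema(conn, max_examples=3):
    """Render each table as a text block (hypothetical layout, not M-Schema itself)."""
    lines = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        lines.append(f"# Table: {table}")
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, name, col_type, _notnull, _default, pk in columns:
            # Sample a few distinct non-null values to show what the column holds.
            examples = [row[0] for row in conn.execute(
                f'SELECT DISTINCT "{name}" FROM "{table}" '
                f'WHERE "{name}" IS NOT NULL LIMIT {int(max_examples)}')]
            parts = [f"{name}: {col_type or 'UNKNOWN'}"]
            if pk:
                parts.append("primary key")
            if examples:
                parts.append(f"examples: {examples}")
            lines.append("  (" + ", ".join(parts) + ")")
    return "\n".join(lines)

# Toy database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO singer VALUES (1, 'Ana', 'Spain'), (2, 'Li', 'China');
""")
schema_text = describe_schema(conn)
print(schema_text)
```

A description like this could be placed in a model prompt so that the generator sees types, keys, and sample values alongside table and column names.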

XiYan-SQL employs a three-stage process to generate and optimize SQL queries.

First, the system identifies relevant database elements through schema linking, reducing redundant information and focusing on key structures. Next, it generates SQL candidates using generators based on In-Context Learning (ICL) and SFT. Finally, it refines the candidates with error correction models and uses selection models to choose the best query. XiYan-SQL integrates these steps into an efficient pipeline, surpassing traditional methods.
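The three stages can be sketched as a small pipeline. Everything below is a stand-in chosen for illustration: keyword matching replaces schema linking, hard-coded lambdas replace the ICL and SFT generators, and the final stage simply drops candidates that fail to execute and takes a majority vote over execution results, rather than using the error correction and selection models the framework describes. All function names are hypothetical.

```python
import sqlite3
from collections import Counter

def link_schema(question, schema):
    """Stage 1 stand-in: keep tables whose name or columns appear in the question."""
    words = set(question.lower().replace("?", "").split())
    kept = {table: cols for table, cols in schema.items()
            if table in words or words & set(cols)}
    return kept or schema  # fall back to the full schema if nothing matches

def generate_candidates(question, schema, generators):
    """Stage 2: collect one SQL candidate from each generator."""
    return [generate(question, schema) for generate in generators]

def run_or_none(conn, sql):
    """Execute a candidate; return its rows, or None if execution fails."""
    try:
        return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def select_candidate(conn, candidates):
    """Stage 3 stand-in: discard failing candidates, then vote on result sets."""
    executed = [(sql, run_or_none(conn, sql)) for sql in candidates]
    valid = [(sql, rows) for sql, rows in executed if rows is not None]
    if not valid:
        return None
    top_rows, _count = Counter(rows for _sql, rows in valid).most_common(1)[0]
    return next(sql for sql, rows in valid if rows == top_rows)

# Toy database and generators used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE concert (concert_id INTEGER PRIMARY KEY, year INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 'Spain'), (2, 'Li', 'China'), (3, 'Eva', 'Spain');
""")
schema = {"singer": ["singer_id", "name", "country"],
          "concert": ["concert_id", "year"]}
generators = [
    lambda q, s: "SELECT COUNT(*) FROM singer WHERE country = 'Spain'",
    lambda q, s: "SELECT COUNT(singer_id) FROM singer WHERE country = 'Spain'",
    lambda q, s: "SELECT COUNT(*) FROM singers",  # wrong table name, fails to run
]

question = "How many singer rows have country Spain?"
linked = link_schema(question, schema)
candidates = generate_candidates(question, linked, generators)
best_sql = select_candidate(conn, candidates)
print(linked)
print(best_sql)
```

In this toy run the `concert` table is filtered out before generation, and the failing third candidate is discarded before the vote.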

In benchmarks, XiYan-SQL has shown outstanding performance across multiple standard test sets, achieving an execution accuracy of 89.65% on the Spider test set, significantly outperforming previous top models.
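Execution accuracy is commonly understood as the share of questions for which the predicted SQL returns the same result as the reference (gold) SQL when both are run against the database. The sketch below illustrates that idea under simplifying assumptions: it compares results as unordered collections of rows and counts any predicted query that fails to run as wrong. It is not the official Spider evaluation script, and the helper names are hypothetical.

```python
import sqlite3
from collections import Counter

def same_result(conn, predicted_sql, gold_sql):
    """True if both queries return the same rows, ignoring row order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that cannot execute counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(same_result(conn, predicted, gold) for predicted, gold in pairs)
    return hits / len(pairs)

# Toy database and query pairs used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO singer VALUES (1, 'Ana', 'Spain'), (2, 'Li', 'China'), (3, 'Eva', 'Spain');
""")
pairs = [
    # Same rows in a different order: counted as a match.
    ("SELECT name FROM singer WHERE country = 'Spain' ORDER BY name DESC",
     "SELECT name FROM singer WHERE country = 'Spain'"),
    # Different SQL text, same result: counted as a match.
    ("SELECT COUNT(*) FROM singer",
     "SELECT COUNT(singer_id) FROM singer"),
    # Wrong filter: different rows.
    ("SELECT name FROM singer WHERE country = 'China'",
     "SELECT name FROM singer WHERE country = 'Spain'"),
    # Misspelled column: fails to execute.
    ("SELECT nme FROM singer",
     "SELECT name FROM singer"),
]
accuracy = execution_accuracy(conn, pairs)
print(f"{accuracy:.2%}")
```

Because the metric compares results rather than query text, two differently written queries can both count as correct.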

Moreover, XiYan-SQL also adapts well to non-relational datasets, reaching an accuracy of 41.20% on the NL2GQL test set. These results indicate that XiYan-SQL is both flexible and accurate across a range of scenarios.

GitHub: https://github.com/XGenerationLab/XiYan-SQL

Highlights:

🌟 Innovative schema representation: M-Schema enhances the understanding of database hierarchies, improving query accuracy.

📊 Advanced candidate generation: XiYan-SQL utilizes multiple generators to produce diverse SQL candidates, improving query quality.

✅ Superior adaptability: In benchmarks, XiYan-SQL demonstrates outstanding performance across various databases, setting a new standard for NL2SQL frameworks.