In recent years, Large Language Models (LLMs) have garnered extensive attention in the field of data management, with their applications continuously expanding, including data integration, database tuning, query optimization, and data cleaning. However, handling unstructured data, especially complex documents, still presents numerous challenges.


Currently, some LLM-based frameworks for unstructured data processing prioritize cost reduction over processing accuracy. The problem is especially pronounced in complex analysis tasks, where LLM outputs often fail to precisely meet users' specific requirements.

Consider an investigative reporting project at the University of California, Berkeley, in which researchers set out to analyze a large corpus of police records, obtained through public records requests, to uncover officer misconduct and potential procedural violations. This task, called Police Misconduct Identification (PMI), requires handling heterogeneous document types, extracting and summarizing key information, and aggregating evidence across multiple files to produce detailed summaries of officer behavior. Existing methods typically run an LLM over each document in a single pass, which often falls short in accuracy; in particular, when a document exceeds the model's context limit, critical information can be silently dropped.
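One concrete failure mode above is documents that exceed the context window. A common remedy is to split a long record into overlapping chunks so that information near chunk boundaries is not lost. The sketch below illustrates this idea with whitespace tokens; the limit and overlap values are illustrative assumptions, not DocETL's actual defaults.

```python
def split_into_chunks(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    """Split `text` into whitespace-token chunks that overlap by `overlap` tokens.

    A real system would count model tokens rather than whitespace words;
    this is a minimal sketch of the chunking strategy.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [text]
    chunks = []
    step = max_tokens - overlap  # advance by less than a full chunk to create overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the final chunk already reaches the end of the document
    return chunks
```

Each chunk is then processed separately, and the per-chunk results are aggregated downstream, which is exactly the kind of multi-step plan a single-pass approach misses.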

To address these issues, a research team from the University of California, Berkeley, and Columbia University proposed a system called DocETL, which optimizes complex document processing and addresses the limitations of existing LLM-based approaches. The system offers a declarative interface that lets users flexibly define processing pipelines, and it uses an agent-based framework to optimize them automatically. Key features of DocETL include logical rewrite rules tailored to LLM tasks, an agent-guided plan evaluation mechanism, and an efficient optimization algorithm for identifying the most promising processing plans.
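To make the declarative idea concrete, here is a toy sketch of a map-reduce style pipeline in the spirit of DocETL's interface. The operation schema, the `run_pipeline` executor, and the stubbed `fake_llm` call are all simplified stand-ins invented for illustration, not DocETL's real API.

```python
def fake_llm(prompt: str, text: str) -> str:
    """Stand-in for a real LLM call, so the sketch is self-contained."""
    return f"{prompt} -> ({len(text.split())} tokens processed)"

# A declarative pipeline: each step names an operation and its prompt.
PIPELINE = [
    {"op": "map", "prompt": "Extract alleged misconduct from this record"},
    {"op": "reduce", "prompt": "Aggregate the per-document findings"},
]

def run_pipeline(pipeline: list[dict], docs: list[str], llm=fake_llm) -> str:
    """Interpret the pipeline: map runs per document, reduce folds all results."""
    results = docs
    for step in pipeline:
        if step["op"] == "map":
            results = [llm(step["prompt"], d) for d in results]
        elif step["op"] == "reduce":
            results = [llm(step["prompt"], "\n".join(results))]
    return results[0]
```

Because the user only declares *what* each step should do, an optimizer is free to rewrite the plan, for example inserting a chunking step before the map, without changing the user-facing specification.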

In evaluating the Police Misconduct Identification task, DocETL was run on a set of 227 California police records, facing challenges such as document lengths exceeding LLM context limits. Across several pipeline variants, DocETL demonstrated its distinctive ability to optimize complex document processing tasks.

Human evaluations and LLM-based reviews showed that DocETL's outputs were 1.34 times as accurate as those of traditional single-pass methods, underscoring the system's effectiveness on complex document tasks.

In summary, as an innovative declarative system, DocETL not only effectively addresses many challenges in complex document processing but also lays a solid foundation for future research and applications.

Paper: https://arxiv.org/abs/2410.12189v1

Project: https://github.com/ucbepic/docetl

Key Points:

🌟 LLMs face significant accuracy challenges when processing complex documents.

📄 The DocETL system provides a flexible declarative interface and automatic optimization for document processing.

🤖 Human evaluations show a significant improvement in output quality, with accuracy 1.34 times that of traditional methods.