Understanding long contexts has long been a challenge in natural language processing. Although large language models (LLMs) perform exceptionally well on a wide range of language tasks, they face limitations when text exceeds their context window size. To overcome this limitation, researchers have been working to enhance LLMs' ability to understand long texts, which matters not only for academic research but also for real-world applications such as domain-specific knowledge understanding, long dialogue generation, and long story or code generation.
In this study, the authors introduce a new benchmark, LooGLE (Long Context Generic Language Evaluation), designed specifically to assess the long-context understanding capabilities of LLMs. The benchmark contains 776 ultra-long documents published after 2022, averaging 19.3k words each, and 6,448 test instances covering multiple domains such as academia, history, sports, politics, the arts, events, and entertainment.
Features of LooGLE
Ultra-long real documents: The documents in LooGLE far exceed the context window size of LLMs, requiring the models to memorize and understand longer texts.
Manually designed short and long dependency tasks: The benchmark comprises 7 major tasks, spanning both short and long dependency tasks, to evaluate how well LLMs understand content with varying dependency spans.
Relatively new documents: All documents were published after 2022, so most modern LLMs are unlikely to have seen them during pre-training, allowing a more accurate assessment of their in-context learning capabilities.
Cross-domain generic data: The benchmark's data is sourced from popular open-source documents, including arXiv papers, Wikipedia articles, and movie and TV scripts (a sketch of loading the data follows this list).
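To make the data description concrete, here is a minimal sketch of how one might load and inspect a LooGLE task with the Hugging Face `datasets` library. The dataset path, config name, and field names used here ("bigainlco/LooGLE", "shortdep_qa", "input", "qa_pairs") are assumptions and should be checked against the repository's README.

```python
# Minimal sketch: inspecting LooGLE data via the Hugging Face `datasets` library.
# The dataset path, config name, and field names below are assumptions;
# confirm them against the LooGLE GitHub repository before use.
from datasets import load_dataset

# Hypothetical config name for the short-dependency QA task.
data = load_dataset("bigainlco/LooGLE", "shortdep_qa", split="test")

sample = data[0]
# Each instance is expected to pair an ultra-long source document with
# question-answer annotations; the exact field names may differ.
print(len(sample["input"].split()))  # rough word count of the long document
print(sample["qa_pairs"])            # question-answer annotations
```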
Researchers conducted a comprehensive evaluation of 8 state-of-the-art LLMs, revealing the following key findings:
Commercial models outperform open-source models.
LLMs excel in short dependency tasks but face challenges in more complex long dependency tasks.
Approaches based on in-context learning and chain-of-thought prompting provide only limited improvements in long context understanding.
Retrieval-based techniques show a clear advantage in short question answering (a minimal sketch of this strategy follows this list), while strategies that extend the context window through modified Transformer architectures or positional encodings have limited impact on long context understanding.
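As an illustration of the retrieval-based strategy mentioned in the last finding, here is a minimal sketch: split the long document into chunks, score each chunk against the question, and pass only the top-ranked chunks to the model. It uses TF-IDF similarity from scikit-learn purely as a stand-in retriever; the actual evaluation may use a different retriever (e.g., dense embeddings), and the function and parameter names here are illustrative.

```python
# Minimal sketch of retrieval-based short-dependency QA:
# chunk the long document, retrieve the most relevant chunks,
# and build a prompt containing only those chunks.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_chunks(document: str, question: str, chunk_size: int = 300, top_k: int = 3) -> list[str]:
    # Split the document into fixed-size word chunks.
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

    # Score each chunk against the question with TF-IDF cosine similarity.
    vectorizer = TfidfVectorizer().fit(chunks + [question])
    chunk_vecs = vectorizer.transform(chunks)
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, chunk_vecs)[0]

    # Keep the top-k chunks, restored to their original document order.
    best = scores.argsort()[::-1][:top_k]
    return [chunks[i] for i in sorted(best)]


def build_prompt(document: str, question: str) -> str:
    context = "\n\n".join(retrieve_chunks(document, question))
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

This kind of pipeline helps short-dependency QA because the required evidence is usually local to a few passages; long-dependency tasks such as summarization or timeline reordering depend on information spread across the whole document, which chunk retrieval alone cannot recover.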
The LooGLE benchmark not only provides a systematic and comprehensive evaluation scheme for long-context LLMs but also guides the development of models with "truly long context understanding" capabilities. All evaluation code has been released on GitHub for the research community to reference and use.
Paper link: https://arxiv.org/pdf/2311.04939
Code link: https://github.com/bigai-nlco/LooGLE