A team from Peking University, led by Zhang Muhan, has proposed a novel framework, Long Input Fine-Tuning (LIFT), that enables any short-context model to handle long texts by fine-tuning the long input directly into the model's parameters. The approach departs from traditional long-text processing: rather than endlessly expanding the context window, it internalizes long-text knowledge into the model parameters, mirroring how humans convert working memory into long-term memory.
Large language models currently face two major challenges in processing long texts:
First, the quadratic complexity of standard attention incurs enormous computational and memory overhead on long inputs. Second, models struggle to capture long-range dependencies scattered throughout a long text.
Existing solutions like RAG and long-context adaptation have limitations:
RAG depends on accurate retrieval and is vulnerable to retrieval noise, which can lead to hallucinations, while long-context adaptation carries high inference complexity and its context window remains bounded.
LIFT's Technological Innovation
The LIFT framework comprises three key components:
Dynamic and Efficient Long Input Training
Through segmented language modeling, the long input is divided into overlapping segments, avoiding the inference-time blow-up of an excessively long context while the overlaps preserve dependencies that cross segment boundaries. Training complexity grows linearly with the length of the input.
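To make this concrete, here is a minimal sketch of overlapping segmentation; the segment length and overlap values are illustrative assumptions, not the paper's settings:

```python
def segment_tokens(tokens, seg_len=2048, overlap=256):
    """Split a long token sequence into overlapping segments.

    Each segment shares `overlap` tokens with its predecessor, so
    dependencies that cross a segment boundary are still seen during
    training, and the number of segments grows linearly with the
    input length.
    """
    stride = seg_len - overlap
    return [
        tokens[start:start + seg_len]
        for start in range(0, max(len(tokens) - overlap, 1), stride)
    ]

# A 10,000-token input yields 6 segments of at most 2,048 tokens each,
# rather than one quadratic-cost 10k-token context.
```

Fine-tuning then applies an ordinary language-modeling loss to each segment in turn, which is what keeps the overall cost linear in the input length.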
A Gated Memory Adapter for Balancing Model Capabilities
A dedicated Gated Memory Adapter architecture dynamically balances the original model's in-context learning ability against its memory of the long input acquired through LIFT, letting the model automatically adjust, query by query, how much of the LIFT memory to use.
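The paper's exact adapter design is not reproduced here, but the gating idea can be sketched as a small PyTorch module with hypothetical layer shapes: a learned memory branch is blended with the base model's hidden state through a per-token gate.

```python
import torch
import torch.nn as nn

class GatedMemorySketch(nn.Module):
    """Illustrative gated adapter: blend the base model's hidden state
    with a learned memory branch, with a per-token sigmoid gate deciding
    how much internalized long-input memory to use for the current query."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.memory = nn.Linear(hidden_size, hidden_size)  # learned long-input memory branch
        self.gate = nn.Linear(hidden_size, 1)              # scalar gate per token

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(h))           # 0 = rely on the base model, 1 = rely on memory
        return g * self.memory(h) + (1 - g) * h   # convex combination of the two paths
```

Because the gate is input-dependent, a query unrelated to the memorized text can drive the gate toward 0 and fall back on the model's original behavior.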
Auxiliary Task Training
Pre-trained LLMs are used to automatically generate question-answering auxiliary tasks from the long text. This compensates for capabilities potentially lost during segmented training and teaches the model to use information from the long text to answer questions.
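As a rough illustration of how such auxiliary data might be produced (the prompt wording and the `generate` callable are assumptions for this sketch, not the authors' pipeline):

```python
QA_PROMPT = (
    "Read the following passage and write one question that can only "
    "be answered from it, followed by the answer.\n\n"
    "Passage:\n{segment}\n"
)

def make_auxiliary_tasks(segments, generate):
    """Build synthetic question-answer training pairs from segments of
    the long input. `generate` is any text-completion callable wrapping
    a pretrained LLM."""
    return [generate(QA_PROMPT.format(segment=seg)) for seg in segments]
```

The resulting pairs are mixed into fine-tuning alongside the segment-level language-modeling objective.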
Experimental Results
LIFT achieved significant improvements on several long-context benchmarks:
On LooGLE long-dependency question answering, Llama-3-8B's accuracy increased from 15.44% to 29.97%. On LooGLE short-dependency question answering, Gemma-2-9B's accuracy increased from 37.37% to 50.33%. On LongBench, Llama-3 with LIFT showed significant improvements in 4 of 5 sub-tasks.
Ablation experiments showed that the Gated Memory architecture improved the GPT-4 score on the LooGLE ShortQA dataset by 5.48% compared to the original model fine-tuned using PiSSA.
Limitations and Future Directions
Despite LIFT's significant achievements, some limitations remain:
It still performs poorly on "needle-in-a-haystack" tasks that require precise information extraction. The model's ability to retrieve the parametric knowledge internalized by LIFT still needs improvement. The design of the auxiliary tasks depends heavily on the downstream test tasks, limiting generality. And how to better balance the memorized content against the model's original capabilities remains a key open question.
The research team encourages the community to explore LIFT's potential with broader training data, a wider range of models, more advanced auxiliary task designs, and greater computational resources.
Conclusion
LIFT offers a novel paradigm for long-text processing, transforming contextual knowledge into parameterized knowledge, similar to how humans convert short-term memory into long-term memory. While a complete solution to the long-context challenge remains elusive, LIFT opens up a highly promising research direction.
Paper Address: https://arxiv.org/abs/2502.14644