Researchers at the Hebrew University of Jerusalem recently found that in Retrieval-Augmented Generation (RAG) systems, the number of retrieved documents significantly affects language model performance, even when the total context length is held constant.
The team ran experiments on 2,417 questions from the MuSiQue validation set, each paired with 20 Wikipedia paragraphs: two to four paragraphs contained information needed for the answer, while the rest served as distractors. To isolate the effect of document count, the team created multiple data partitions, gradually reducing the number of documents from 20 down to only the 2-4 relevant ones. To keep the total token count constant, the retained documents were extended with additional text from their original Wikipedia articles.
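The partition construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the function name, the whitespace "tokenizer", and the padding strategy are all assumptions made for clarity.

```python
def build_partition(supporting, distractors, n_docs, target_tokens, pad_text):
    """Select n_docs documents (all supporting paragraphs first, then
    distractors) and pad with text from the source article so the
    context keeps a constant total token count.

    Illustrative sketch only: tokens are approximated by whitespace
    splitting; a real setup would use the model's tokenizer.
    """
    assert n_docs >= len(supporting), "must keep every supporting paragraph"
    docs = supporting + distractors[: n_docs - len(supporting)]

    # Count how many tokens we are short of the original 20-document budget.
    used = sum(len(d.split()) for d in docs)
    deficit = max(0, target_tokens - used)

    # Extend the first retained document with filler text from its
    # original Wikipedia article (here stood in by pad_text).
    if deficit and docs:
        docs[0] = docs[0] + " " + " ".join(pad_text.split()[:deficit])
    return docs
```

For example, reducing a 10-token context from four documents to three keeps the token budget by padding the retained documents back up to 10 tokens.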
In most cases, reducing the number of documents improved language model performance by roughly 10%. The study evaluated several open-source models, including Llama-3.1, Qwen2, and Gemma2. Qwen2 was the notable exception, remaining relatively stable as the document count varied, whereas Llama-3.1 and Gemma2 degraded significantly as more documents were added.
When models were given only the documents containing supporting information, all of them improved markedly. This suggests that similar-but-irrelevant documents, which are common in RAG retrieval results, confuse models and reduce performance. Interestingly, models coped better with clearly unrelated random documents than with these near-miss distractors, indicating they are better at identifying and filtering out obviously unrelated content.
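The conditions compared above (supporting paragraphs alone, supporting plus similar distractors, supporting plus random unrelated paragraphs) could be assembled along these lines. This is an illustrative sketch under assumed names and data shapes, not the study's actual pipeline.

```python
import random

def make_context(supporting, extra_pool, n_total, seed=0):
    """Build one retrieval context: every supporting paragraph plus
    filler documents from extra_pool -- similar-topic distractors in
    one condition, random unrelated paragraphs in another -- shuffled
    so document position gives nothing away.

    Illustrative sketch; names and structure are assumptions.
    """
    rng = random.Random(seed)
    fillers = extra_pool[: max(0, n_total - len(supporting))]
    docs = supporting + fillers
    rng.shuffle(docs)
    return docs

# Hypothetical example inputs for the three conditions.
supporting = ["Gold paragraph 1.", "Gold paragraph 2."]
similar = [f"Related but unhelpful paragraph {i}." for i in range(18)]
unrelated = [f"Random off-topic paragraph {i}." for i in range(18)]

gold_only = make_context(supporting, [], 2)            # supporting only
with_distractors = make_context(supporting, similar, 20)
with_random = make_context(supporting, unrelated, 20)
```

Holding the supporting paragraphs fixed while swapping only the filler pool is what lets the comparison attribute performance differences to distractor similarity rather than to the answer content itself.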
The researchers emphasize the need to balance relevance and diversity when designing retrieval systems in order to mitigate information conflicts. They also acknowledge limitations of the study, including that the effects of prompt variations and document order were not analyzed. The team has publicly released the dataset to facilitate further research in this area.