RULER
A synthetic benchmark for evaluating long-context language models.
RULER is a new synthetic benchmark that provides a more comprehensive evaluation of long-context language models. It extends the standard retrieval test to cover different types and quantities of "needles" (hidden information points). It also introduces new task categories, such as multi-hop tracing and aggregation, to test behaviors beyond simple retrieval from context. We evaluated 10 long-context language models on RULER across 13 representative tasks. Despite achieving near-perfect accuracy on the standard retrieval test, all models showed large performance drops as context length increased. Only four models (GPT-4, Command-R, Yi-34B, and Mixtral) maintained satisfactory performance at a context length of 32K. We make RULER publicly available to promote comprehensive evaluation of long-context language models.
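To make the retrieval setup concrete: a needle-in-a-haystack test hides key-value "needles" in long filler text and asks the model to recover one of them. The sketch below is only an illustration of the idea, not RULER's actual implementation; the function name, needle phrasing, and filler text are invented for this example.

```python
import random

def make_niah_prompt(num_needles: int, context_words: int):
    """Build a toy multi-needle retrieval prompt (illustrative, not RULER's code).

    Hides `num_needles` key-value pairs in ~`context_words` words of filler,
    then asks for the value of one randomly chosen key.
    Returns (prompt, expected_answer).
    """
    # Generate distinct keys with random numeric values to retrieve.
    needles = {f"key-{i}": str(random.randint(100000, 999999))
               for i in range(num_needles)}

    # Repeating filler sentences stand in for the long distractor context.
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    words = (filler * (context_words // 10)).split()

    # Insert each needle sentence at a random depth in the haystack.
    for key, value in needles.items():
        pos = random.randrange(len(words))
        words.insert(pos, f"One of the special magic numbers for {key} is {value}.")

    # Query a single needle; harder variants query several.
    query_key = random.choice(list(needles))
    prompt = " ".join(words) + f"\nWhat is the special magic number for {query_key}?"
    return prompt, needles[query_key]
```

Varying `num_needles` and `context_words` is what lets a benchmark of this shape probe how retrieval accuracy degrades with more distractors and longer contexts.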
RULER Traffic Over Time
- Monthly visits: 20,899,836
- Bounce rate: 46.04%
- Pages per visit: 5.2
- Avg. visit duration: 00:04:57