RULER
A synthetic benchmark for the comprehensive evaluation of long-context language models.
RULER is a new synthetic benchmark that provides a more comprehensive evaluation of long-context language models. It extends the standard needle-in-a-haystack retrieval test with flexible configurations covering different types and quantities of "needles" (information points). RULER also introduces new task categories, such as multi-hop tracing and aggregation, to test behaviors beyond simple retrieval from context. Ten long-context language models were evaluated on RULER across 13 representative tasks. Despite achieving near-perfect accuracy on the standard retrieval test, these models showed large performance drops as context length increased: only four (GPT-4, Command-R, Yi-34B, and Mixtral) maintained reasonable performance at a length of 32K tokens. RULER is publicly available to promote comprehensive evaluation of long-context language models.
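For illustration, here is a minimal sketch of how a configurable needle-in-a-haystack test of the kind RULER generalizes can be generated. This is an assumption-laden example, not RULER's actual code: the function name make_niah_example, the needle phrasing, and the parameters are hypothetical, but it shows how context length and the number of needles become tunable knobs.

```python
import random
import string

# Hypothetical sketch (not RULER's implementation): build a synthetic
# needle-in-a-haystack prompt with a configurable context length and
# a configurable number of key-value "needles" to retrieve.
def make_niah_example(context_words: int = 4000, num_needles: int = 4, seed: int = 0):
    rng = random.Random(seed)
    filler = ["The grass is green.", "The sky is blue.", "The sun is yellow."]

    # Each needle is a random key paired with a random numeric value.
    needles = {
        "".join(rng.choices(string.ascii_lowercase, k=8)): str(rng.randint(100000, 999999))
        for _ in range(num_needles)
    }

    # Pad with filler sentences up to the target length, then insert
    # each needle at a random position in the haystack.
    haystack = []
    while sum(len(s.split()) for s in haystack) < context_words:
        haystack.append(rng.choice(filler))
    for key, value in needles.items():
        pos = rng.randrange(len(haystack))
        haystack.insert(pos, f"The special magic number for {key} is {value}.")

    # Query one needle; the returned value is the gold answer for scoring.
    query_key = rng.choice(list(needles))
    prompt = " ".join(haystack) + f"\nWhat is the special magic number for {query_key}?"
    return prompt, needles[query_key]

prompt, answer = make_niah_example(context_words=8000, num_needles=8)
print(answer)  # gold answer to compare against the model's output
```

Because the prompt is generated programmatically, both the context length and the needle count can be scaled independently, which is what lets a benchmark of this style measure how accuracy degrades as the context grows.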
RULER Visit Over Time
Monthly Visits: 19,075,321
Bounce Rate: 45.07%
Pages per Visit: 5.5
Average Visit Duration: 00:05:32