RULER

A benchmark for evaluating the rationality of long-text language models.

CommonProductProductivityLong-textLanguage model
RULER is a new synthetic benchmark that provides a more comprehensive evaluation of long-text language models. It extends standard retrieval tests to cover different types and quantities of information points. Additionally, RULER introduces new task categories, such as multi-hop tracking and aggregation, to test behaviors beyond retrieving from context. 10 long-text language models were evaluated on RULER and achieved performance on 13 representative tasks. Despite achieving near-perfect accuracy on standard retrieval tests, these models performed poorly as context length increased. Only four models (GPT-4, Command-R, Yi-34B, and Mixtral) performed reasonably well at a length of 32K. We make RULER publicly available to promote comprehensive evaluation of long-text language models.
Visit

RULER Visit Over Time

Monthly Visits

19075321

Bounce Rate

45.07%

Page per Visit

5.5

Visit Duration

00:05:32

RULER Visit Trend

RULER Visit Geography

RULER Traffic Sources

RULER Alternatives