From the abstract of Pang et al.:
To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).
Tracking and predicting progress on this dataset seems especially relevant to Transformative AI timelines and strategy given:
- Those who have short Transformative AI timelines, or who expect Transformative AI to look similar to today’s AI, generally expect language model capabilities to continue to scale quickly.
- Language models have surpassed human performance on many benchmarks, e.g. SuperGLUE, that require only reasoning over relatively short texts.
- Reasoning requiring understanding of long texts is a strong candidate for a capability that won’t improve much as language models get scaled up and tweaked.
The QuALITY dataset is chosen here because it showcases a task that requires no domain expertise, has an objective evaluation metric, and still shows a large gap between language models and humans. Sam Bowman, whose group also created SuperGLUE, tweets: “Yes, the acronym is a little tone-deaf, but this is almost certainly the best benchmark or dataset release from my group so far.”
The hard subset of the dataset was chosen for prediction because, as described in the paper, it consists of the questions that annotators answer poorly under strict time constraints. This is intended as a proxy for questions that require understanding the full text to answer correctly. The SAT-style score was chosen because it penalizes wrong answers, so a model gains nothing in expectation by guessing when it is very uncertain; see here for details. As shown on the leaderboard, the human SAT-style score on the hard subset is 85.4, while the best baseline model scores 28.1.
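As a rough illustration of how a score that penalizes wrong answers behaves, the sketch below assumes the usual SAT-style rule for four-option questions: one point per correct answer, minus one third of a point per incorrect answer, zero for an abstention, averaged over all questions and reported on a 0-100 scale. The exact formula is an assumption here and should be checked against the leaderboard's own definition; the function name `sat_style_score` and its parameters are hypothetical.

```python
from fractions import Fraction


def sat_style_score(correct, incorrect, abstained=0, num_options=4):
    """Return an SAT-style score on a 0-100 scale.

    Assumed rule (not taken from the text above): +1 per correct answer,
    -1/(num_options - 1) per incorrect answer, 0 per abstention,
    divided by the total number of questions.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    # With 4 options the penalty is 1/3, so uniform random guessing
    # has an expected score of exactly zero.
    penalty = Fraction(1, num_options - 1)
    return float(100 * (correct - penalty * incorrect) / total)


# Hypothetical counts, for illustration only.
print(sat_style_score(correct=25, incorrect=75))               # chance-level guessing
print(sat_style_score(correct=50, incorrect=0, abstained=50))  # abstaining when unsure
print(sat_style_score(correct=50, incorrect=50))               # guessing wrong instead
```

Under this assumed rule, a model that abstains on questions it cannot answer scores higher than one that answers them incorrectly, which is the behavior the metric is meant to reward.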
What will be the best non-human SAT-style score on the hard subset of the QuALITY dataset by January 1, 2025?
Resolution will be according to the leaderboard maintained by the dataset authors. If the leaderboard is no longer maintained, the question will be resolved based on evidence presented in credible reports, as judged by Metaculus moderators.