From the abstract of Pang et al.:

To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).
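For concreteness, here is a minimal sketch (not the official loader) of what a single QuALITY-style item might look like and how one multiple-choice prediction would be checked; the field names and indexing convention are illustrative assumptions rather than the dataset's exact schema.

```python
# Illustrative sketch of a QuALITY-style multiple-choice QA item.
# Field names and the 0-indexed label convention are assumptions for this example.
example = {
    "article": "<~5,000-token context passage in English>",
    "question": "Why does the narrator return to the station?",
    "options": [
        "To retrieve a forgotten letter",
        "To confront the stationmaster",
        "To catch the last train home",
        "To say goodbye to a friend",
    ],
    "gold_label": 2,  # assumed: 0-indexed position of the correct option
}

def is_correct(predicted_index: int, item: dict) -> bool:
    """Return True if the predicted option index matches the gold label."""
    return predicted_index == item["gold_label"]

print(is_correct(2, example))  # True
```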

Tracking and predicting progress on this dataset seems especially relevant to Transformative AI timelines and strategy given:

  • Those with short Transformative AI timelines or who expect Transformative AI to look similar to today’s AI generally expect language model capabilities to continue to scale quickly.
  • Language models have surpassed human performance on many benchmarks that require reasoning only over relatively short texts, e.g. SuperGLUE.
  • Reasoning that requires understanding long texts is a strong candidate for a capability that won't improve much as language models are scaled up and tweaked.

The QuALITY dataset was chosen here because it showcases a task that does not require domain expertise, has a clean, objective evaluation metric, and still shows a huge gap between language models and humans. Sam Bowman, whose group also created SuperGLUE, tweeted: “Yes, the acronym is a little tone-deaf, but this is almost certainly the best benchmark or dataset release from my group so far.”

The hard subset of the dataset was chosen for prediction because, as described in the paper, these are the questions that annotators cannot answer well under strict time constraints; this serves as a proxy for requiring understanding of the full text to answer correctly. The SAT-style score was chosen because it penalizes wrong answers, rewarding models that abstain when they are very uncertain; see here for details. As shown on the leaderboard, human performance is 85.4, while the best baseline model scores 28.1.
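As a rough illustration of the metric, the sketch below assumes the common SAT-style convention of +1 for a correct answer, 0 for abstaining, and a penalty of 1/(k-1) for a wrong answer on a k-option question, so that random guessing has an expected score of 0; the leaderboard's exact definition and scaling may differ from this assumption.

```python
from typing import Optional, Sequence

def sat_style_score(predictions: Sequence[Optional[int]],
                    gold_labels: Sequence[int],
                    num_options: int = 4) -> float:
    """Sketch of an SAT-style score (assumed convention, not the official formula):
    +1 for a correct answer, 0 for abstaining (prediction of None), and
    -1/(num_options - 1) for a wrong answer, so random guessing has expected
    score 0. Returned on a 0-100 scale."""
    penalty = 1.0 / (num_options - 1)
    total = 0.0
    for pred, gold in zip(predictions, gold_labels):
        if pred is None:          # model abstains when it is very uncertain
            continue
        total += 1.0 if pred == gold else -penalty
    return 100.0 * total / len(gold_labels)

# Example: 3 correct, 1 wrong, 1 abstention on 4-option questions
print(sat_style_score([0, 2, 1, 3, None], [0, 2, 1, 1, 2]))  # ~53.3
```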