Recently, Hendrycks et al. proposed a new test to measure a text model's multitask accuracy. The test covers 57 tasks, including elementary and college-level mathematics, computer science, law, accounting, and more. For each task, the model is provided with only 5 training examples. The test set consists of around 5,000 to 10,000 questions, with 100 to 200 questions per task.
The test is different from benchmarks like SuperGLUE, because it intentionally includes questions requiring specialized expertise in a narrow field of knowledge. Many tasks will be difficult for an average human. See example questions below.
They found that the very largest GPT-3 model achieves 43.9% accuracy versus a 25% random baseline, while UnifiedQA, with 11B parameters and fine-tuned on other QA tasks, achieves 48.9%. Models also have near-random accuracy on some socially important subjects such as morality and law.
The question asks:
When will AI achieve competency on multi-choice questions across diverse fields of expertise?
We will define competency as achieving at least 70% accuracy in every task and 85% mean accuracy across all tasks.
The question will resolve on the day when a model that achieves competency is first publicly announced by its authors.
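The competency criterion above can be sketched as a small check over per-task accuracies. This is only an illustration: the function name, the task names, and the accuracy figures are hypothetical, and the sketch assumes that "mean accuracy across all tasks" means the unweighted average of per-task accuracies (the question text does not say whether tasks are weighted by their number of questions).

```python
from statistics import mean
from typing import Mapping

MIN_TASK_ACCURACY = 0.70  # at least 70% accuracy in every task
MIN_MEAN_ACCURACY = 0.85  # at least 85% mean accuracy across all tasks


def is_competent(task_accuracy: Mapping[str, float]) -> bool:
    """Check both thresholds, given accuracies as fractions in [0, 1].

    Assumption: the mean is an unweighted average over tasks.
    """
    if not task_accuracy:
        raise ValueError("no task accuracies given")
    accuracies = list(task_accuracy.values())
    every_task_ok = min(accuracies) >= MIN_TASK_ACCURACY
    mean_ok = mean(accuracies) >= MIN_MEAN_ACCURACY
    return every_task_ok and mean_ok


# Hypothetical numbers, for illustration only.
strong_everywhere = {"college_chemistry": 0.86, "moral_scenarios": 0.90, "law": 0.95}
one_weak_task = {"college_chemistry": 0.69, "moral_scenarios": 0.95, "law": 0.97}
low_average = {"college_chemistry": 0.75, "moral_scenarios": 0.80, "law": 0.82}
```

Under this reading, a model with a high average still fails if a single task falls below 70%, and a model that clears 70% everywhere still fails if its average stays below 85%.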
Below are two examples from the training set, one from each of the two tasks that are most difficult for GPT-3. The correct answers are in the fine print. The whole dataset is available here.
College Chemistry task (GPT-3 ~26% accuracy):
Which of the following lists the hydrides of group-14 elements in order of thermal stability, from lowest to highest?
a) PbH4 < SnH4 < GeH4 < SiH4 < CH4
b) PbH4 < SnH4 < CH4 < GeH4 < SiH4
c) CH4 < SiH4 < GeH4 < SnH4 < PbH4
d) CH4 < PbH4 < GeH4 < SnH4 < SiH4
Moral Scenarios task (GPT-3 ~26% accuracy):
For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong? Scenario 1 | I really loved my newborn baby boy. Scenario 2 | The new girl is embarrassed to live in a trailer park, so I invite the class to her house.
a) Wrong, Wrong
b) Wrong, Not wrong
c) Not wrong, Wrong
d) Not wrong, Not wrong
Similar Metaculus question: When will AI achieve superhuman few-shot performance on SuperGLUE?
College Chemistry: a); Moral Scenarios: c)
The authors do not need to establish competency in the original publication. Competency can also be claimed by a reputable third party after the original model is announced (a pre-print is sufficient). The question will also retroactively close on the date of the original announcement of the model. The question will resolve as > Jan 1, 2040 if a competent model is not published by the end of 2039.
The model must perform in a few-shot regime, with access only to the 5 dev examples per task provided by the authors of the test.
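As a rough illustration of the few-shot regime, the sketch below assembles a prompt from at most 5 dev examples followed by the unanswered test question. The function names and the exact prompt layout (a) to d) choices and an "Answer:" line) are assumptions made for this example, not the format prescribed by the authors of the test.

```python
from typing import List, Optional, Sequence, Tuple

MAX_DEV_EXAMPLES = 5  # only the 5 dev examples per task may be shown
LETTERS = "abcd"

# (question, four answer choices, correct letter)
Example = Tuple[str, Sequence[str], str]


def format_question(question: str, choices: Sequence[str],
                    answer: Optional[str] = None) -> str:
    """Render one multiple-choice question; leave the answer blank if unknown."""
    lines: List[str] = [question]
    lines += [f"{letter}) {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer})")
    return "\n".join(lines)


def build_few_shot_prompt(dev_examples: Sequence[Example],
                          test_question: str,
                          test_choices: Sequence[str]) -> str:
    """Concatenate the answered dev examples and the unanswered test question."""
    if len(dev_examples) > MAX_DEV_EXAMPLES:
        raise ValueError("the few-shot regime allows at most 5 dev examples per task")
    blocks = [format_question(q, c, a) for q, c, a in dev_examples]
    blocks.append(format_question(test_question, test_choices))
    return "\n\n".join(blocks)
```

The model's completion after the final "Answer:" would then be compared with the correct letter for that question.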
Fine-tuning on other QA tasks, as with UnifiedQA, is allowed. External information retrieval (e.g. Googling) is allowed, but cheating by accessing the exact answers is not. Training data contamination similar to the issues reported for GPT-3 is allowed as long as the authors of the evaluation have made sure that it does not severely affect the evaluation.