AI competence in diverse fields of expertise

Question

Recently, Hendrycks et al. proposed a new test to measure a text model's multitask accuracy. The test covers 57 tasks, including elementary and college-level mathematics, computer science, law, accounting, and more. For each task, the model is provided only 5 training examples. The test set consists of roughly 5,000 to 10,000 questions, with 100 to 200 questions per task.
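
For concreteness, here is a rough sketch in Python of how such a 5-shot prompt might be assembled for one task. The template wording and the helper names (`format_example`, `build_prompt`) are illustrative assumptions, not the authors' exact format; their released evaluation code is authoritative.

```python
# Sketch of a 5-shot multiple-choice prompt, loosely modeled on the
# evaluation setup described in the paper. Template wording is an assumption.

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_example(question, choices, answer=None):
    """Render one multiple-choice question; include the answer for dev examples."""
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, dev_examples, test_question, test_choices):
    """Prepend the task's 5 dev examples, then the unanswered test question."""
    parts = [f"The following are multiple choice questions (with answers) about {subject}."]
    for ex in dev_examples:  # exactly 5 per task in the few-shot regime
        parts.append(format_example(ex["question"], ex["choices"], ex["answer"]))
    parts.append(format_example(test_question, test_choices))
    return "\n\n".join(parts)
```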

The test differs from benchmarks like SuperGLUE in that it intentionally includes questions requiring specialized expertise in narrow fields of knowledge. Many tasks are difficult even for an average human. See the example questions below.

They found that the largest GPT-3 model achieves 43.9% accuracy against a 25% random-guessing baseline, while UnifiedQA, an 11-billion-parameter model fine-tuned on other QA tasks, achieves 48.9%. Models also show near-random accuracy on some socially important subjects, such as morality and law.

The question asks:

When will AI achieve competency on multiple-choice questions across diverse fields of expertise?

We will define competency as achieving at least 70% accuracy on every task and at least 85% mean accuracy across all tasks.
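
As a minimal sketch, this criterion can be expressed directly in code, assuming per-task accuracies are given as a mapping from task name to accuracy:

```python
# Minimal sketch of the competency criterion: >= 70% on every task
# and >= 85% mean accuracy across all tasks.

def is_competent(task_accuracies):
    """task_accuracies: dict mapping task name -> accuracy in [0, 1]."""
    accs = list(task_accuracies.values())
    worst_ok = min(accs) >= 0.70
    mean_ok = sum(accs) / len(accs) >= 0.85
    return worst_ok and mean_ok

# Example: a model strong on average can still fail on a single weak task.
print(is_competent({"college_chemistry": 0.72, "moral_scenarios": 0.69}))  # False
print(is_competent({"college_chemistry": 0.90, "moral_scenarios": 0.86}))  # True
```

Note that both conditions must hold: a single task below 70% blocks resolution no matter how high the mean is.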

The question will resolve on the day when a model able to achieve competency is first publicly announced by its authors.

Below are two example questions from the training set, taken from the two tasks that were most difficult for GPT-3. The correct answers are given in the fine print at the end. The whole dataset is available here.

College Chemistry task (GPT-3 ~26% accuracy):

Which of the following lists the hydrides of group-14 elements in order of thermal stability, from lowest to highest?

a) PbH4 < SnH4 < GeH4 < SiH4 < CH4
b) PbH4 < SnH4 < CH4 < GeH4 < SiH4
c) CH4 < SiH4 < GeH4 < SnH4 < PbH4
d) CH4 < PbH4 < GeH4 < SnH4 < SiH4

Moral Scenarios task (GPT-3 ~26% accuracy):

For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong? Scenario 1 | I really loved my newborn baby boy. Scenario 2 | The new girl is embarrassed to live in a trailer park, so I invite the class to her house.

a) Wrong, Wrong
b) Wrong, Not wrong
c) Not wrong, Wrong
d) Not wrong, Not wrong

Similar Metaculus question: When will AI achieve superhuman few-shot performance on SuperGLUE?

Answers: College Chemistry: a); Moral Scenarios: c)

The authors do not need to establish competency in the original publication; competency can also be claimed by a reputable third party after the original model is announced (a pre-print is sufficient). In that case, the question will still retroactively close on the date of the model's original announcement. The question will resolve as > Jan 1, 2040 if no competent model is published by the end of 2039.

The model must perform in a few-shot regime, with access only to the 5 dev examples per task provided by the authors of the test.

Fine-tuning on other QA tasks, as with UnifiedQA, is allowed. External information retrieval (e.g., web search) is allowed, but cheating by accessing the exact answers is not. Training-data contamination similar to the issues reported for GPT-3 is allowed, as long as the authors of the evaluation have verified that it does not severely affect the results.
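
Tying these rules together, an evaluation loop consistent with the few-shot regime might look like the sketch below. The dataset layout and the `model_answer` callable are assumptions; `build_prompt` refers to the earlier sketch, and any fine-tuning or retrieval is presumed to happen inside the model, per the rules above.

```python
# Sketch of a few-shot evaluation loop over all tasks, producing the
# per-task accuracies consumed by the is_competent check above.

def evaluate(model_answer, tasks):
    """
    model_answer: callable(prompt) -> one of "A", "B", "C", "D"
    tasks: dict mapping task name -> {"dev": [...5 examples...], "test": [...]}
    Returns a dict mapping task name -> accuracy in [0, 1].
    """
    per_task = {}
    for name, splits in tasks.items():
        correct = 0
        for ex in splits["test"]:
            # build_prompt is the helper from the earlier sketch
            prompt = build_prompt(name, splits["dev"], ex["question"], ex["choices"])
            if model_answer(prompt) == ex["answer"]:
                correct += 1
        per_task[name] = correct / len(splits["test"])
    return per_task
```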
