
ML Makes Near-Human Level Forecasts


Recently, Zou et al. (2022) introduced Autocast:

we introduce Autocast, a new dataset for measuring ML models’ forecasting ability. Autocast includes thousands of forecasting questions collected from human forecasting tournaments. The questions vary in the forecasting horizon from days to decades, in the topic (including politics, economics and science), and in the answer format (e.g. multiple-choice vs. predicting a number). The questions are pre-selected for public interest, and there is a strong human baseline (the crowd aggregate of many competitive forecasters).

Using this dataset, they test a few language models' ability to make forecasts. Models are provided access to contemporaneous (and historical) news sources to precisely simulate the conditions under which humans made past forecasts (avoiding leakage from the future).

They find that language models perform far below the human expert baseline, though performance improves with model size and with access to relevant information. Human performance was found to be 82.5, while the best language model, FiD Static with 2.8B parameters, achieved a score of 40.6.

When will a Machine Learning model achieve 90% of human-crowd performance on Autocast?

This question will resolve on the earliest date when a credible report indicates that a Machine Learning model achieves a score of at least 74.25 (90% of the 82.5 human baseline), under a setting similar to Zou et al. (2022) that enables rigorous retrodiction testing by ensuring, with at least as much effort as by Zou et al., that models are exposed only to news from before the outcomes being forecast.

The score metric is a performance metric between 0 and 100 defined in their Section 4.2. It combines percent accuracy on true/false and multiple-choice questions with the L1 distance for numerical questions.
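To make the metric concrete, here is a minimal sketch of how such a combined 0-to-100 score could be computed. This is a hypothetical illustration, not the exact aggregation from Zou et al.'s Section 4.2: the function name, the equal weighting of the three question types, and the assumption that numerical L1 errors are normalized to [0, 1] are all assumptions made for this example.

```python
def combined_score(tf_correct, tf_total, mc_correct, mc_total, l1_errors):
    """Hypothetical combined score in [0, 100].

    tf_correct / tf_total: counts for true/false questions.
    mc_correct / mc_total: counts for multiple-choice questions.
    l1_errors: per-question L1 distances for numerical questions,
               assumed already normalized to [0, 1].
    """
    tf_acc = tf_correct / tf_total          # accuracy on true/false
    mc_acc = mc_correct / mc_total          # accuracy on multiple-choice
    num_score = 1 - sum(l1_errors) / len(l1_errors)  # 1 minus mean L1 error
    # Equal weighting across the three categories (an assumption here),
    # rescaled to a 0-100 range.
    return 100 * (tf_acc + mc_acc + num_score) / 3

# Example: 8/10 true/false, 6/10 multiple-choice, mean L1 error 0.3
print(combined_score(8, 10, 6, 10, [0.2, 0.4]))  # → 70.0
```

Under this toy weighting, a model would need strong performance across all three question formats to reach the 74.25 threshold.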

If no publications come out that use the dataset for a period of 3 consecutive years, this question resolves ambiguously.

If Autocast is superseded by a similar later dataset, this question resolves positively when a model performs at 90% of crowd-level performance on that dataset.
