Are We Surprised by AI Milestones?

by AABoyles

  • This essay was submitted to the AI Progress Essay Contest, an initiative that focused on the timing and impact of transformative artificial intelligence. You can read the results of the contest and the winning essays here.

    One way we might aspire to forecast the arrival of Transformative AI is by tracking AI milestones (though the possibility of sufficiently hard takeoff scenarios means we shouldn't think of this as an AGI fire alarm). And progress seems to be fast and probably accelerating. For example, Miles Brundage (of OpenAI) tweeted this:

    This is a rather cute way to frame recent progress in AI systems. Interestingly, Metaculus has been keeping predictions, in one way or another, on almost all of these developments. This gives us a unique window into forecasting the broader question. Was the community surprised by the progress of AI in a given domain? If the answer is generally "yes", it seems reasonable that we should update in the direction of shorter timelines. So, were we surprised?

    Protein Folding

    The easiest way to discern improvement in protein folding prediction is the Global Distance Test, or GDT. Metaculus' most prominent predictions in this space set the threshold for a "breakthrough" at a GDT score of 90. And it looks like it's almost certain to happen by 2031:

    The driving force behind this near-certainty is Alphabet's AlphaFold. At CASP, the biennial competition in which algorithms are judged on a battery of protein-folding-related metrics, AlphaFold is the two-time reigning champion:

    As we can see from the above graph, AlphaFold 2 got extremely close to 90 GDT in CASP14. (The final score was 87.) Given the speed of AlphaFold's progress, there simply isn't much space left in the metric for absolute or relative improvement. Accordingly, any further improvement must be, in a sense, less impressive than AlphaFold's improvement over previous approaches. Perhaps we should submit a question about whether there will even be a CASP16.

    Sidenote: The next CASP is 15, to be held in August 2022. Will Alphabet win again?

    Metaculus community says "Yeah, probably."

    So, by the time we asked the question, progress in protein folding could no longer surprise us.

    Text-based Image Generation

    This one's a little tough for prediction. While we can point to extremely compelling examples (undoubtedly Brundage's basis for the claim), it's difficult to operationalize progress in Text-based Image Generation in the way we can with highly precise error metrics (as in Protein Folding). How are we to judge whether this happens? OpenAI solved this problem by training a new model and just asking people which model worked better. This is fine (if somewhat less satisfying) for an investment by OpenAI, but not a great proxy for Metaculus. Probably a better approach is to examine the markets that would be affected by this technology. Metaculus has been tracking Autonomous Vehicles in this way for quite a while. And, as luck would have it, we have a fine example of this approach for Image Generation:

    There's little reason to claim that this isn't possible, either technically or stylistically. The question is merely whether Shonen Jump specifically will leverage the technology, and credit it with authorship. So it's interesting and related, but not a perfect proxy for validating Brundage's implied claim. If we accept this proxy with its limitations, it's notable that the predicted date has been rapidly approaching as the community updates (and the question was notably launched after the publication of the first iteration of DALL-E).

    Sidenote: I'm using "predicted date got closer to present" as a proxy for "surprise." This is an extremely simplistic way of approaching a complex and hairy set of questions, and easy to envision as fundamentally incorrect. I don't believe that this invalidates any conclusions by default, but it's a caveat worth noting.

    This suggests to me that we have been progressively surprised, AI progress is unexpectedly quick, and we should update in favor of a shorter timeline.


    Starcraft

    There are a variety of ways to judge performance on complex tasks like games. We can devise complex metrics. We can review latencies and confidences and yada yada yada. Or we can just enter the AI in a competition with humans.

    AlphaStar scored an impressive defeat of a professional human on Jan 24, 2019. At the moment it did, the community estimated the victory was still 6 months away. And before the competition that resulted in the resolution began, the community predicted the date at 18 months away. In this case, I feel safe claiming that the community was surprised. That said, we can't call StarCraft "solved" just yet...

    Just as AGI and nuclear fusion for commercial power generation are perpetually 30 years away, AI coming to dominate Starcraft seems to be perpetually 7 months away. From our vantage point, this is curious. If it continues in this pattern, we will be surprised when it happens, but we wouldn't have been 7 months earlier. This makes me feel wary about using this specific question as evidence for updating in either direction. Nevertheless, that we were clearly surprised about progress in Starcraft on the earlier question pushes me to put another tick mark in the "early timeline" column.
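
    The "predicted date got closer to present" proxy for surprise can be made concrete with the AlphaStar numbers quoted above. This is only a minimal sketch: the function names are hypothetical, and "6 months away" is approximated as the same calendar day six months later, which is an assumption rather than the exact community median.

```python
from datetime import date


def surprise_days(predicted: date, resolved: date) -> int:
    """Days by which the event beat the forecast (negative if it arrived late)."""
    return (predicted - resolved).days


def direction(gap_days: int) -> str:
    """Label the sign of the gap between forecast and outcome."""
    if gap_days > 0:
        return "earlier than predicted"
    if gap_days < 0:
        return "later than predicted"
    return "on schedule"


# AlphaStar's win, as dated in the essay.
alphastar_resolved = date(2019, 1, 24)
# Assumption: "still 6 months away" is read as the same day six months later.
alphastar_predicted = date(2019, 7, 24)

gap = surprise_days(alphastar_predicted, alphastar_resolved)
print(gap, direction(gap))  # 181 earlier than predicted
```

    A positive gap counts toward the "early timeline" column. A negative gap is the opposite kind of surprise, where a predicted date passes without the milestone being met.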

    3+ Player Poker

    This is one place where the Metaculus community hasn't been looking recently. In 2016 we forecasted this, but it was a bit early and the time horizon a bit short.

    The question resolved "no" on schedule, two years before the resolving breakthrough occurred. Sadly we didn't re-run the question in time to capture the milestone. But I'd note that the community closed this one near the middle of the probability range, implying we were genuinely uncertain about whether it would happen, and thus wouldn't really be surprised no matter the outcome.

    Winograd Schemas

    We were watching progress on the Winograd Schemas. We didn't expect outrageous progress.

    But then the competition was cancelled. We refocused on the following year's Winograd Schema Competition. And we did expect significant progress.

    But then the competition was cancelled. Now the Winograd Challenge appears to have been permanently shut down. So this one's a little unfair for the AIs to claim victory. Or it would be if not for the GLUE battery of benchmarks, which includes a Winograd test. And the latest entries have absolutely shown significant progress. So it appears unlikely we would have been surprised at the speed of progress in the counterfactual world where there were 2019 and 2020 Winograd Competitions.

    So what?

    We can validate with Metaculus data that these benchmarks have been met (or variations of them are likely to be met soon), but that doesn't tell us much more than the tweet does. So, what can we learn from Metaculus, beyond the fact that Miles Brundage knows a few things about recent AI progress?

    Perhaps the punchline is that we are indeed running short of AI benchmarks. But I think that we haven't looked in the right place to draw that specific conclusion. After all, we haven't said much about benchmarks we haven't hit yet. And it turns out, Metaculus is keeping track of quite a few! To frame these, here's an alternative to Brundage's Tweet:

    Isn't it the future yet? I was promised AIs that grokked Montezuma's Revenge, could do physical stuff like a human, scored well on Academic Tests, and programmed for me!

    Montezuma's Revenge

    Among the pantheon of otherwise simple Atari games, Montezuma's Revenge will hold a special place in history for requiring whole new AI architectures to make sense of the game. Unlike others, where the AI can be trained to maximize a score (which it can quickly learn to do by manipulating the game controls), Montezuma's Revenge requires a complex and precise sequence of actions to achieve the very first increase in score.

    First, let's note that the community's prediction has elapsed. In a sense, we should be surprised it hasn't happened yet. The current state of the art, as far as I can tell, falls slightly short of one of this question's resolution criteria (it doesn't quite search all the rooms) and well short of the other (it achieves its performance in 1.2B frames, 24 times the 50M prescribed by this question).
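
    The "24 times" figure is just the ratio of the two frame counts. As a quick check (the function and constant names here are hypothetical, chosen only for illustration):

```python
def budget_overrun(frames_used: int, frames_allowed: int) -> float:
    """How many times over the allowed frame budget a result is."""
    return frames_used / frames_allowed


SOTA_FRAMES = 1_200_000_000   # 1.2B frames used by the current state of the art
QUESTION_LIMIT = 50_000_000   # 50M frames allowed by the question

overrun = budget_overrun(SOTA_FRAMES, QUESTION_LIMIT)
print(f"{overrun:.0f}x over budget")  # 24x over budget
```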

    We have a little more runway on this prediction. There are two-and-a-half years between the time currently predicted and now, which is certainly a window in which surprises can comfortably occur. However, the fact that the preceding prediction on Montezuma's Revenge hasn't resolved (now coincidentally two-and-a-half years beyond the predicted date) makes me suspect that a surprise from progress on Montezuma's Revenge is presently unlikely.

    Physical Action

    One way in which AI seems to lag far behind human capabilities is robotics, in spite of the best efforts of Boston Dynamics and (of course) Elon Musk. But, as with anything, the proof is in the delivery. So, what can robots do?

    Well, it'll be a surprise if they start assembling Lego sets in the next few years!

    And it'll be a big surprise if they start winning soccer matches!

    Academic Tests

    Here we find a comparatively crisply defined outcome: is your AI smarter than a Fifth-Grader?

    If this resolves positively, we should call it a modest surprise. That said, given the short time-window remaining, this is objectively unlikely to resolve positively. What about a longer window?

    Here we see that with another 3 years before the resolution deadline, we have much stronger expectations about performance. We'd be slightly surprised if it failed to happen.

    However, taken together, these two questions represent an extremely interesting and radical prediction: state-of-the-art in AI will improve academically by roughly 7 grade levels in just 3 years. This by itself seems like a warning shot of stronger AIs sooner, but I'll just leave that claim there--right or wrong, there are many caveats to explore and unpack in making a bailey claim like this, and its motte deserves its own separate handling.


    Programming

    As a programmer myself, I would like to believe that my career will be relatively late in the list of jobs to be automated away. Unfortunately, we predicted AI would learn to code pretty soon.

    Doubly unfortunate, it happened sooner. Another case where we were unequivocally surprised. However, perhaps merely writing code isn't sufficiently impressive. After all, Codex is extremely impressive, but can rarely produce code that both works and generates the desired effect for non-trivial cases.

    The community estimate seems quite pessimistic to me on this one (and I've forecasted accordingly!). But there's still a little room for me to be surprised, and a lot of room for the community to be. Until that happens, we can still say AI learning programming is happening surprisingly quickly.

    The Final Tally

    Basically all the milestones Brundage calls out as formerly "very difficult challenges for AI" have seen huge improvements recently. But they all leave at least some room for improvement. And we're tracking plenty of problems that AI hasn't cracked yet, though it's only a matter of time for any of them (even forecasting!).

    More importantly, it is possible to surprise us with AI progress. Here's the final tally:

    Field             | Surprise?
    Protein Folding   | No
    Image Generation  | Kinda
    Starcraft         | Yes
    Poker             | No
    Winograd Schemas  | No
    Montezuma         | Don't Know Yet, but probably not
    Physical Activity | Don't Know Yet
    Academic Tests    | Don't Know Yet
    Programming       | Yes
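
    For anyone who wants to count the tally mechanically, here is a minimal sketch. It assumes that the "Don't Know Yet" rows are left out and that "Kinda" does not count as a "Yes". That grouping is one reading of the table, not something the table itself specifies.

```python
from collections import Counter
from fractions import Fraction

# Verdicts copied from the tally above.
TALLY = {
    "Protein Folding": "No",
    "Image Generation": "Kinda",
    "Starcraft": "Yes",
    "Poker": "No",
    "Winograd Schemas": "No",
    "Montezuma": "Don't Know Yet, but probably not",
    "Physical Activity": "Don't Know Yet",
    "Academic Tests": "Don't Know Yet",
    "Programming": "Yes",
}


def is_decided(verdict: str) -> bool:
    """Assumption: any verdict starting with "Don't Know" is still open."""
    return not verdict.startswith("Don't Know")


decided = [v for v in TALLY.values() if is_decided(v)]
counts = Counter(decided)
yes_share = Fraction(counts["Yes"], len(decided))

print(counts["Yes"], "of", len(decided), "decided fields:", yes_share)
# 2 of 6 decided fields: 1/3
```

    Counting "Kinda" as half a surprise, or including the open rows in the denominator, would give a different fraction, so the number depends heavily on how the rows are grouped.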

    We should fight the urge to count up the "Yes"es and say "we're only surprised 1/3 of the time." These aren't all perfectly- (or even especially well-) framed by the forecasting questions we've devised for them. And we didn't decide on an advance threshold for "surprise", pre-register the definitions and approach, etc. etc. However, I claim the salient fact is that we have been surprised at how soon advances have occurred at all. We can see that it's possible to stealthily create and deploy an AI to develop a skill to peak human levels. It's possible to surprise us with an AI sophisticated enough to actually write functioning code sometimes. If we can be caught off-guard by superhuman narrow applications, it isn't a huge inductive leap to claim that it will be possible to catch us off-guard with general and transformative AI technology.

    There's also good reason to be suspicious of this conclusion! A well-calibrated forecaster will be surprised by overestimating about as often as by underestimating. While we weren't explicitly keeping tally of the directionally-inconvenient surprises, we noted that it did happen (Montezuma's Revenge). A proper contrarian would turn the conclusion back on us, saying "See! We can be caught off-guard by the slowness of technological advancement, and should thus be suspicious of transformative forecasts." The contrarian is partly right: As with all forecasts, we must be suspicious. But we must prepare for discontinuous transformations before they occur, and if we are uncertain about how long it will take for them to occur, the optimal time to do so is probably now.
