Are We Surprised by AI Milestones?
This essay was submitted to the AI Progress Essay Contest, an initiative that focused on the timing and impact of transformative artificial intelligence. You can read the results of the contest and the winning essays here.
One way we might aspire to forecast the arrival of Transformative AI is by tracking AI milestones (though the possibility of sufficiently hard takeoff scenarios means we shouldn't think of this as an AGI fire alarm). And progress seems to be fast and probably accelerating. For example, Miles Brundage (of OpenAI) tweeted this:
This is a rather cute way to frame recent progress in AI systems. Interestingly, Metaculus has been keeping predictions, in one way or another, on almost all of these developments. This gives us a unique window into the broader question: was the community surprised by the progress of AI in a given domain? If the answer is generally "yes", it seems reasonable that we should update in the direction of shorter timelines. So, were we surprised?
Protein Folding
The easiest way to measure improvement in protein folding prediction is the Global Distance Test, or GDT. Metaculus' most prominent predictions in this space set the threshold for a "breakthrough" at a GDT score of 90. And it looks like it's almost certain to happen by 2031:
As we can see from the above graph, AlphaFold 2 got extremely close to 90 GDT in CASP14. (The final score was 87.) Given the speed of AlphaFold's progress, there simply isn't that much space left in the metric in terms of absolute or relative improvement. Accordingly, any improvement must be, in a sense, less impressive than the improvement of AlphaFold over previous approaches. Perhaps we should submit a question about whether there will even be a CASP16.
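For readers unfamiliar with the metric, the GDT total score can be sketched in a few lines. This is a simplified illustration, assuming the standard CASP GDT_TS definition (average, over cutoffs of 1, 2, 4, and 8 Å, of the percentage of residues within that cutoff of the experimental structure); it skips the superposition step that CASP performs before measuring distances, and the input deviations below are made up for illustration.

```python
def gdt_ts(residue_distances):
    """GDT_TS sketch: residue_distances are per-residue deviations
    (in Ångströms) between predicted and experimental structures,
    assumed to be measured after optimal superposition."""
    n = len(residue_distances)
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    # Fraction of residues within each distance cutoff.
    fractions = [sum(d <= c for d in residue_distances) / n for c in cutoffs]
    # Average the four fractions and express as a percentage.
    return 100 * sum(fractions) / len(cutoffs)

# Toy prediction: most residues land within 1-2 Å of the true structure.
print(gdt_ts([0.5, 0.9, 1.5, 3.0, 0.2]))  # 85.0
```

The bounded 0-100 range is exactly why the essay's point holds: once a model sits at 87, there is very little headroom left for a "surprising" jump.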
Sidenote: The next CASP is 15, to be held in August 2022. Will Alphabet win again?
So, by the time we asked the question, we couldn't be surprised by the progress in protein folding.
Text-based Image Generation
This one's a little tough for prediction. While we can point to extremely compelling examples (undoubtedly Brundage's basis for the claim), it's difficult to operationalize progress in Text-based Image Generation in the same way we can point to highly precise error metrics (as in Protein Folding). How are we to judge whether this happens? OpenAI solved this problem by training a new model and just asking people which model worked better. This is fine (if somewhat less satisfying) for an investment by OpenAI, but not a great proxy for Metaculus. A better approach is probably to examine the markets that would be affected by this technology. Metaculus has been tracking Autonomous Vehicles in this way for quite a while. And, as luck would have it, we have a fine example of this approach for Image Generation:
Sidenote: I'm using "predicted date got closer to present" as a proxy for "surprise." This is an extremely simplistic way of approaching a complex and hairy set of questions, and easy to envision as fundamentally incorrect. I don't believe that this invalidates any conclusions by default, but it's a caveat worth noting.
This suggests to me that we have been progressively surprised, AI progress is unexpectedly quick, and we should update in favor of a shorter timeline.
Starcraft
There are a variety of ways to judge performance on complex tasks like games. We can devise complex metrics. We can review latencies and confidences and yada yada yada. Or we can just enter the AI in a competition with humans.
3+ Player Poker
This is one place where the Metaculus community hasn't been looking recently. In 2016 we forecasted this, but it was a bit early and the time horizon a bit short.
Winograd Schemas
We were watching progress on the Winograd Schemas. We didn't expect outrageous progress.
So what?
We can validate with Metaculus data that these benchmarks have been met (or variations of them are likely to be met soon), but that doesn't tell us much more than the tweet does. So, what can we learn from Metaculus, beyond the fact that Miles Brundage knows a few things about recent AI progress?
Perhaps the punchline is that we are indeed running short of AI benchmarks. But I think that we haven't looked in the right place to draw that specific conclusion. After all, we haven't said much about benchmarks we haven't hit yet. And it turns out, Metaculus is keeping track of quite a few! To frame these, here's an alternative to Brundage's Tweet:
Isn't it the future yet? I was promised AIs that grokked Montezuma's Revenge, could do physical stuff like a human, scored well on Academic Tests, and programmed for me!
Montezuma's Revenge
Among the pantheon of otherwise simple Atari games, Montezuma's Revenge will hold a special place in history for requiring whole new AI architectures to make sense of the game. Unlike others, where the AI can be trained to maximize a score (which it can quickly learn to do by manipulating the game controls), Montezuma's Revenge requires a complex and precise sequence of actions to achieve the very first increase in score.
Physical Action
One way in which AI seems to lag far behind human capabilities is robotics, in spite of the best efforts of Boston Dynamics and (of course) Elon Musk. But, as with anything, the proof is in the delivery. So, what can robots do?
Academic Tests
Here we find a comparatively crisply defined outcome: is your AI smarter than a fifth-grader?
However, taken together, these two questions represent an extremely interesting and radical prediction: state-of-the-art in AI will improve academically by roughly 7 grade levels in just 3 years. This by itself seems like a warning shot of stronger AIs sooner, but I'll just leave that claim there--right or wrong, there are many caveats to explore and unpack in making a bailey claim like this, and its motte deserves its own separate handling.
Programming
As a programmer myself, it would be nice to believe that my career will be relatively late in the list of jobs to be automated away. Unfortunately, we predicted AI would learn to code pretty soon.
The Final Tally
Basically all the milestones Brundage calls out as formerly "very difficult challenges for AI" have seen huge improvements recently. But they all leave at least some room for improvement. And we're tracking plenty of problems that AI hasn't cracked yet, though it's only a matter of time for any of them (even forecasting!).
More importantly, it is possible to surprise us with AI progress. Here's the final tally:
Field | Surprise?
--- | ---
Protein Folding | No
Image Generation | Kinda
Starcraft | Yes
Poker | No
Winograd Schemas | No
Montezuma's Revenge | Don't know yet, but probably not
Physical Action | Don't know yet
Academic Tests | Don't know yet
Programming | Yes
We should fight the urge to count up the "Yes"es and say "we're only surprised 1/3 of the time." These aren't all perfectly (or even especially well) framed by the forecasting questions we've devised for them. And we didn't decide on a threshold for "surprise" in advance, pre-register the definitions and approach, etc. However, I claim the salient fact is that we have been surprised at how soon advances have occurred at all. We can see that it's possible to stealthily create and deploy an AI to develop a skill to peak human levels. It's possible to surprise us with an AI sophisticated enough to actually write functioning code sometimes. If we can be caught off-guard by superhuman narrow applications, it isn't a huge inductive leap to claim that it will be possible to catch us off-guard with general and transformative AI technology.
There's also good reason to be suspicious of this conclusion! A well-calibrated forecaster will be surprised by overestimating about as often as by underestimating. While we weren't explicitly keeping tally of the directionally-inconvenient surprises, we noted that it did happen (Montezuma's Revenge). A proper contrarian would turn the conclusion back on us, saying "See! We can be caught off-guard by the slowness of technological advancement, and should thus be suspicious of transformative forecasts." The contrarian is partly right: as in all forecasts, we must be suspicious. But we must prepare for discontinuous transformations before they occur, and if we are uncertain about how long it will take for them to occur, the optimal time to do so is probably now.