This essay was submitted to the AI Progress Essay Contest, an initiative focused on the timing and impact of transformative artificial intelligence. The contest is now closed. Finalists have not yet been selected. Learn more about the contest and read the other submitted essays here.
Many people claim that forecasting AI timelines holds a lot of promise, or at least sheds some light in the dark. To understand how well forecasting works as a tool for AI timelines, I evaluated forecasts of the recent past (predictions opened in December 2020, closed in January 2021, and resolved between June and December 2021).
I chose all questions from the Metaculus Forecasting AI Progress tournament that have already resolved and have more than 5 upvotes. This cutoff is ultimately arbitrary, but many questions below it seemed confusing or already answered by other questions. In total, this leaves 22 questions that I put into four categories: compute, arxiv, state-of-the-art (sota), and economic (see two examples per category in Appendix A).
I evaluate the community predictions along three metrics:
- Accuracy: Does the predicted distribution roughly resemble the resolutions?
- Calibration: Do the quantiles of the predicted distributions roughly match the resolutions?
- Implied trends: What do the change from first prediction to closing prediction, and the difference between closing prediction and resolution, imply about AI timelines?
In all cases, I look at aggregate results and additionally split results by category.
Words of caution: 22 predictions over the span of one year, with prediction windows of ~1–2 months, is not a lot. The conclusions could just be random artifacts and thus have to be treated with high uncertainty. Furthermore, I’m new to the Metaculus API and could have made rookie mistakes in my implementation. You can find my code in this Google Colab notebook.
Short summary: From looking at multiple plots and statistical estimates, I would say the accuracy and calibration of the Metaculus community are decent to good. Some individual predictions are very off, but the aggregates are decent. I find evidence that the community slightly extended its timelines as the closing date neared. In aggregate, the closing predictions have been relatively accurate; thus, the difference between closing predictions and resolutions doesn’t imply updates for timelines.
To evaluate the accuracy of the predictions, I overlay all selected predictions over time with their respective 25%- and 75%-quantiles (blue). Furthermore, I add the resolutions of all questions (red). Of course, this is not a statistical test. It is meant to give us an intuitive understanding of the predictions and whether they are completely off. The vertical dotted grey bars represent the closing dates of the predictions and the resolution dates have been moved to the closing dates for better readability.
The resolutions broadly cluster around the main bulk of the predictions indicating reasonable accuracy.
Now, we want to take a look at the final prediction distributions and resolutions. For this, I added all closing probability density functions to one aggregate distribution.
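Conceptually, the aggregation is just an average of the individual densities. Here is a minimal sketch, assuming each closing PDF has already been evaluated on a common normalized grid (the function and variable names are illustrative, not taken from my actual notebook):

```python
import numpy as np

def aggregate_pdf(pdfs, grid):
    """Average several closing PDFs (each evaluated on the same
    normalized grid) into one aggregate density, then renormalize
    so the result integrates to 1 on the grid."""
    mean_pdf = np.mean(pdfs, axis=0)
    return mean_pdf / np.trapz(mean_pdf, grid)

# toy example: two triangular densities on [0, 1]
grid = np.linspace(0.0, 1.0, 201)
pdf_a = np.maximum(0.0, 1.0 - np.abs(grid - 0.3) / 0.3)
pdf_b = np.maximum(0.0, 1.0 - np.abs(grid - 0.7) / 0.3)
agg = aggregate_pdf([pdf_a, pdf_b], grid)
```

Because every question lives on its own scale, the densities have to be mapped onto a shared normalized range before averaging; otherwise the aggregate would be dominated by questions with wide numeric ranges.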
Once again, the resolutions seem like reasonable samples from the overall aggregate distribution implying decent accuracy.
To check if the community is more accurate for some categories than others, we redo the above procedure for all four cases and add the individual predictions.
I would interpret the results of the arxiv and economic predictions as pretty plausible and those of compute and sota as slightly unbalanced. We can also see that some individual predictions, e.g. the image classification index (sota) or the GPU price (compute), resolved far outside the predicted distribution, indicating poor accuracy.
However, given the small number of samples, I still think the predictions are overall plausible, and we cannot say with high confidence that the community has blind spots for specific categories.
To test how well-calibrated the uncertainty of the community predictions is, I checked which quantile of the predicted distribution each resolution falls into. Split by category, we get the following distribution.
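The quantile check itself is simple: evaluate the predicted CDF at the resolution value and bucket the result. A sketch, assuming the closing distribution is available as a CDF on a grid (in my actual code this comes from the Metaculus API; the helper names here are illustrative):

```python
import numpy as np

def resolution_quantile(cdf_values, grid, resolution):
    """Quantile of the resolution under the predicted distribution,
    i.e. the predicted CDF evaluated at the resolution value."""
    return float(np.interp(resolution, grid, cdf_values))

def quantile_bin(q, n_bins=4):
    """Map a quantile in [0, 1] to one of n_bins equal-width bins."""
    return min(int(q * n_bins), n_bins - 1)

# toy example: a uniform predicted distribution on [0, 100]
grid = np.linspace(0.0, 100.0, 101)
cdf = grid / 100.0
q = resolution_quantile(cdf, grid, 60.0)  # -> 0.6, i.e. bin 2 of 4
```

A resolution in a low bin means the community's distribution sat above the realized value (an overestimate), and vice versa for a high bin.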
Perfect calibration would be indicated by a uniform distribution, which the figure clearly does not depict. However, we want to know whether the observed distribution is plausible under the assumption that the community is well-calibrated. To approximate this, I use three statistical tests: a chi-squared test, which is the appropriate frequentist test for this scenario, and Bayes factors under two different assumptions.
The Bayes factor has been computed by choosing a multinomial distribution with a uniform probability vector for M1, and a multinomial distribution whose probability vector is drawn from a Dirichlet distribution for M2. I drew 10,000 samples from the Dirichlet and averaged the likelihoods of the observed counts given the samples to approximate the marginal likelihood p(data | M2).
I used two different values for the alpha vector of the Dirichlet distribution. The first (a=[22/4, 22/4, 22/4, 22/4]) can be interpreted as drawing probability vectors that stay close to uniform (narrow, i.e. values very similar), and the second (a=[1,1,1,1]) as drawing probability vectors that can differ a lot from uniform (wide). Given that we have observed 22 results in 4 categories, I personally find the narrow interpretation more plausible.
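Both tests can be sketched in a few lines. The bin counts below are hypothetical placeholders standing in for the real counts of resolutions per quantile bin; the structure of the computation is what matters:

```python
import numpy as np
from scipy.stats import chisquare, dirichlet, multinomial

# hypothetical counts of resolutions per quantile bin (sum to 22)
counts = np.array([8, 5, 4, 5])

# frequentist: chi-squared goodness-of-fit against the uniform expectation
chi2_stat, p_value = chisquare(counts)

def bayes_factor(counts, alpha, n_samples=10_000, seed=0):
    """BF of M1 (uniform probability vector) vs. M2 (p ~ Dirichlet(alpha)).
    p(data | M2) is approximated by averaging the multinomial likelihood
    over Dirichlet samples."""
    n = counts.sum()
    lik_m1 = multinomial.pmf(counts, n, [1 / len(counts)] * len(counts))
    rng = np.random.default_rng(seed)
    ps = dirichlet.rvs(alpha, size=n_samples, random_state=rng)
    lik_m2 = np.mean([multinomial.pmf(counts, n, p) for p in ps])
    return lik_m1 / lik_m2

bf_narrow = bayes_factor(counts, alpha=[22 / 4] * 4)  # narrow prior
bf_wide = bayes_factor(counts, alpha=[1.0] * 4)       # wide prior
```

A Bayes factor above 1 favors the well-calibrated model M1; the narrow prior concentrates M2's probability vectors near uniform, which is why it behaves differently from the wide prior on the same counts.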
In the frequentist test and the Bayesian test with the narrow assumption, we get non-significant results, implying that the observed calibration is broadly reasonable in aggregate. The Bayes factor with the wide assumption implies a substantial difference, i.e. that the calibration is not good. As described above, I find the wide assumption less plausible and would therefore interpret the results as evidence for decent calibration.
Some of the individual categories seem slightly unbalanced, e.g. the community seems to underestimate the compute predictions. However, since we have so few data points, we should not jump to conclusions prematurely.
We not only want to understand how accurate and well-calibrated the Metaculus community is, but we also want to understand if we can learn something for the future of AI timelines.
Therefore, I evaluate the trends implied by two differences: between the first and the closing prediction, and between the closing prediction and the resolution. These can be interpreted as “how much did the community change its mind before predictions closed” and “does the resolution imply faster or slower timelines than the closing prediction”.
I operationalized this by going through every question and labelling it “down”, “neutral”, or “up”, indicating whether a downward or upward trend in the underlying quantity implies faster timelines; e.g. the price of GPUs dropping faster than expected would imply faster timelines. Neutral items are ignored and downward items are flipped, i.e. every value v becomes 1 - v, so that all items point in the same direction and we don’t have to split the analysis into different cases.
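The flipping step can be sketched as follows; the function name, the direction labels, and the example values are illustrative:

```python
def orient_toward_faster(values, directions):
    """Drop 'neutral' items and flip 'down' items (v -> 1 - v) so that
    larger values always point toward faster timelines."""
    oriented = []
    for v, d in zip(values, directions):
        if d == "neutral":
            continue
        oriented.append(1.0 - v if d == "down" else v)
    return oriented

# e.g. a GPU-price question is 'down': the price dropping faster than
# expected (a low value) implies faster timelines, so it gets flipped.
print(orient_toward_faster([0.25, 0.5, 0.9], ["down", "neutral", "up"]))
# -> [0.75, 0.9]
```

After this transformation, a single mean over all remaining items summarizes the direction of the implied timeline update.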
The aggregate effects can be seen in the following figure. The thick lines denote the mean, the shaded area indicates one standard deviation and the thin lines are the individual trends.
We see that, on average, there was a minimal decreasing trend between the first and last prediction and basically no change in trends between the closing prediction and the resolution.
As in the previous sections, I split up the analysis by category.
It seems like there is no large difference between the last prediction and the resolution for any category, and that the original arxiv and sota predictions implied faster timelines than the closing ones, i.e. the community toned down its timeline predictions a bit over the course of the prediction window.
Overall, I would interpret these results as implying no major updates for timelines, i.e. the community didn’t seem to strongly change its mind on timelines during the prediction period.
Looking at the past can sometimes tell us a bit about the future. In this piece, I evaluated the most recently resolved Metaculus community predictions for AI timelines. I evaluated them by looking at their accuracy, their calibration, and their implied trends. I find that the accuracy and calibration of the community are good, with some outliers for specific questions. The implied trends for timelines are not large, but the first predictions implied faster timelines than the closing predictions, i.e. the community corrected toward slower timelines. There is no clear trend between closing predictions and resolutions. This indicates that, in aggregate, the community’s timelines are broadly correct, at least for very immediate predictions.
I want to caution against overinterpreting these results. My sample size is very small and the results might be disproportionately influenced by a small number of outliers. Furthermore, we only look at very short timeframes and therefore shouldn’t draw strong conclusions for much longer timelines.
In the following, you can find two example predictions per category and the full lists of question IDs.
Full list of IDs: 5871, 5872, 6148, 5891, 6194, 5911, 6232
Full list of IDs: 5894, 5901, 5900
Full list of IDs: 5935, 5936, 6149, 5939
Full list of IDs: 5902, 5898, 5903, 5897, 6230, 5896, 5941, 5904