The next five years

by marius.hobbhahn

    This essay was submitted to the AI Progress Essay Contest, an initiative focused on the timing and impact of transformative artificial intelligence. You can read the results of the contest and the winning essays here.


    The ultimate goal of forecasting AI progress is to get a better understanding of the future of AI--potentially multiple decades ahead of time. In this essay, I want to do something slightly different and have a look at near-term predictions for the next five years. 


    This provides value in two ways. Firstly, we get a better sense of “how crazy” the near-term future will be according to the Metaculus community. It’s easy to get swallowed by the hype on Twitter, and I want to see whether forecasters agree. Secondly, it provides a way to test the community’s near-term calibration: in five years, we can come back to this essay and see how the community performed.


    To this end, I investigate two questions:

    1. What trends are implied by the combination of different questions, e.g. by the number of arxiv papers between 2021 and 2023 vs. 2021 and 2026? 
    2. What trends are implied by the changes in the predictions so far, e.g. did the community change their predictions in a way that would imply faster/slower timelines? 

    All code can be found in this Colab notebook.


    Executive summary: I think all the different angles we look at share the same conclusion: slightly faster timelines than anticipated. The number of papers written in each category is expected to increase, the compute predictions sit slightly above historical trends, the recent state-of-the-art (sota) questions resolved less conservatively than the community anticipated, and the slope between first and last predictions is slightly positive, implying updates towards faster timelines. However, we shouldn’t overinterpret these results: they come from a small number of data points and might carry structural biases, which I discuss in the respective sections.


    Investigating selected questions

    In this section, I choose multiple questions on a topic and see what their relationships imply. A list of all used questions can be found in Appendix A. 


    Arxiv

    There are many questions along the lines of “How many arXiv papers will be published in category X between dates Y and Z?”. I looked at different prediction horizons for the categories “RL”, “AI safety” and “few-shot” and compared the median estimates they provide. The dotted estimates have already resolved.

    Furthermore, I translated all estimates to papers/day for better comparison. 
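    The conversion is straightforward; a minimal sketch (the paper count and dates below are made up for illustration, not actual Metaculus medians):

```python
from datetime import date

def papers_per_day(total_papers: float, start: date, end: date) -> float:
    """Average publication rate over a date range, in papers per day."""
    return total_papers / (end - start).days

# Hypothetical count: 7300 papers between 2021-01-01 and 2023-01-01
rate = papers_per_day(7300, date(2021, 1, 1), date(2023, 1, 1))
print(rate)  # 10.0 papers/day
```

Using calendar dates rather than a fixed 365-day year keeps rates comparable across windows of different lengths.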

    As expected, the community predicts that there will be far fewer papers on AI safety than on RL or few-shot learning. What I found interesting, however, is that there is no expected increase (even a slight decrease!) in AI safety papers/day until 2023 (when comparing lines indexed 1 and 2). Additionally, the growth in AI safety papers until 2026 is much bigger than that of RL and few-shot when compared to their contemporary counterparts in 2021 and 2022.


    Compute

    Let’s take a look at the estimates for the maximal amount of compute used in an AI experiment for different dates. I added the predictions of Sevilla et al. (disclaimer: I’m a co-author) for large-scale and regular-scale models (shown in the screenshot below). 

    The community median estimates seem to be broadly above the naive extrapolation of the regular- and large-scale trends. Since the trends are averages over models, it is plausible that the estimates for the maximum are higher. The slopes are comparable, so this looks broadly reasonable to me.
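    A naive extrapolation of an exponential compute trend is a straight line in log space. A minimal sketch of what that means (the starting FLOP value and doubling time below are illustrative placeholders, not figures from Sevilla et al.):

```python
def extrapolate_compute(flop_now: float, doubling_months: float, months_ahead: float) -> float:
    """Extend an exponential compute trend: one doubling every `doubling_months`."""
    return flop_now * 2 ** (months_ahead / doubling_months)

# Hypothetical: 1e23 FLOP today, doubling every 6 months, projected 2 years out
projected = extrapolate_compute(1e23, 6.0, 24.0)  # 4 doublings -> 1.6e24 FLOP
```

A community median sitting above this line means forecasters expect the maximum to outpace the average trend, which is consistent with the maximum being an extreme value rather than a mean.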


    State-of-the-art

    Lastly, I take a look at the state-of-the-art (sota) performances for ImageNet and WikiText. I use the median values from the respective Metaculus predictions and additionally add some of the previous sota data points from Papers with Code (ImageNet, WikiText). Keep in mind that the Metaculus questions ask specifically about the category that doesn’t use additional training data.

    The clearest observation in both cases is that the most recent prediction was too bearish and had already been beaten quite drastically before the question resolved. My second observation is that the 2026 predictions don’t seem that bullish to me, i.e. they are less steep than a linear extrapolation from previous years would imply. I have three hypotheses for why that might be the case:

    a) There are diminishing gains, i.e. it’s just much harder to get from 95% to 100% accuracy than from 65% to 70%.

    b) The questions specifically require the “no additional training data” category. I expect that most future models will use additional training data and that this is the category most ML people care less about. Thus, most of the gains could be made in the other category.

    c) Metaculus might just be too conservative. It was too conservative with the near-term estimates for 2021--why not again for 2026?
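    The diminishing-gains hypothesis is easy to see numerically: near a 100% ceiling, a straight-line fit quickly predicts implausible accuracies. A sketch with made-up sota numbers (not the actual Metaculus or Papers with Code values):

```python
def linear_extrapolate(points, target_x):
    """Fit a least-squares line through (x, y) points and evaluate it at target_x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = sum((x - mx) * (y - my) for x, y in points) / sum(
        (x - mx) ** 2 for x, _ in points
    )
    return my + slope * (target_x - mx)

# Made-up accuracies: +2 points/year looks fine historically...
history = [(2019, 85.0), (2020, 87.0), (2021, 89.0)]
print(linear_extrapolate(history, 2026))  # 99.0 -- implausibly close to the ceiling
```

So forecasts that fall below the linear extrapolation are not necessarily pessimistic; they may simply price in the ceiling.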


    Implied trends

    The above trends can be interpreted as implying shorter timelines, but they are just a small selection. Therefore, I take a look at 21 predictions categorized into “compute”, “economic”, “arxiv” and “sota” (see Appendix B for details) and check whether their trends have implications for timelines.


    To this end, I compare the median of the first estimate with the median of the last estimate of each forecast. I assign “faster”, “slower” or “neutral” to each question to indicate whether a positive slope would imply faster or slower AI progress, e.g. more compute implies faster timelines. To fit everything into one plot, I flip all “slower” predictions, e.g. v becomes 1-v, so that all questions share the same units.
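    The flip-and-compare step can be sketched as follows (the function and variable names are mine, not from the notebook, and the medians are assumed rescaled to [0, 1]):

```python
def implied_update(first_median: float, last_median: float, direction: str) -> float:
    """
    Difference between the last and first community medians, oriented so that
    a positive result always means an update towards faster AI progress.
    direction: "faster" if a higher value implies faster progress, "slower"
    if it implies slower progress (then flip v -> 1 - v), "neutral" otherwise.
    """
    if direction == "slower":
        first_median, last_median = 1 - first_median, 1 - last_median
    return last_median - first_median

implied_update(0.40, 0.55, "faster")  # +0.15: update towards faster progress
implied_update(0.40, 0.55, "slower")  # -0.15: after flipping, towards slower progress
```

The flip is what lets "more compute" and "later milestone date" questions be averaged on the same axis.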

    Secondly, I check whether the uncertainty of the community shrinks or grows. For this, I compute the difference between the 3rd and 1st quartiles for the first and last estimates and plot it as well. The shaded area shows one standard deviation.
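    The spread measure here is just the interquartile range of each forecast distribution; a minimal sketch using the standard library (the sample values are illustrative):

```python
import statistics

def iqr(samples):
    """Interquartile range (Q3 - Q1) as a simple measure of community uncertainty."""
    q1, _, q3 = statistics.quantiles(samples, n=4)  # default "exclusive" method
    return q3 - q1

first_spread = iqr([1.0, 2.0, 3.0, 4.0])  # 2.5
```

A growing IQR between the first and last estimate is what "uncertainty increases over time" means in the plots.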

    The slope between the first and last prediction is slightly positive implying that, in aggregate, the community has updated slightly towards faster AI progress. The uncertainty estimate for the first and last data point is also slightly increasing (I would have expected the opposite). This would indicate that people get more uncertain with more time even though they get closer to the resolution date. I think this could be true but might also just be an artifact of the setup. Usually, the very first data points have fewer forecasts which often comes with smaller uncertainty bands. 


    And now the same but split by category.

    It looks like the trends from the aggregate estimate broadly continue. The difference between the first and last estimate implies small updates for faster timelines but the uncertainty estimates differ between categories. It looks like the community is more confident in their predictions on economic questions and the number of papers compared to the other two categories. However, this might just be an effect of a smaller sample size.


    Conclusions

    I think all the different angles we looked at carry the same message: slightly faster timelines. The number of papers written in each category is expected to increase, the compute estimates sit slightly above historical trends, the recent sota questions resolved less conservatively than the community anticipated, and the slope between first and last predictions is slightly positive, implying updates towards faster timelines.


    However, it should be noted that my estimates carry considerable uncertainty. I use a very small number of forecasts and very coarse methods, e.g. I just draw a line between two points, which removes all the nuance of what happens in between. Furthermore, some effects might just be artifacts of Metaculus itself, such as the uncertainty estimates for the first and last points.


    Appendix A - selected questions

    Arxiv


    Compute


    Sota


    Appendix B - all predictions

    Here is the full list of question IDs used for the section “Implied trends”:

    arxiv

    List of IDs: 6300, 5962, 5899, 6529

    compute

    List of IDs: 5951, 5949, 5960, 6559, 6517

    economic

    List of IDs: 5955, 6236

    sota

    List of IDs: 5946, 6228, 5948, 5947, 6571, 6252, 6566, 6570, 8026
