
AI safety and timelines

AI Progress Essay Contest

This essay was submitted to the AI Progress Essay Contest, an initiative that focused on the timing and impact of transformative artificial intelligence. You can read the results of the contest and the winning essays here.

This essay explores the medium-term future of artificial intelligence, focusing on what Metaculus forecasts imply about AGI timelines and the state of AI-safety research.

Some key underlying assumptions

While I tried to base this essay mostly on Metaculus' forecasts, many parts are influenced by my priors on the topics discussed. In a very simplified, unnuanced manner, the main assumptions are the following:

  • In principle, intelligence and terminal values are independent of each other.
    • And in practice, we don't have any reason to believe that AIs we make will be aligned by default with what we think of as 'human values'.
  • Intelligence is powerful.
    • Even if constrained in its available actions and hardware, an AI of a certain level of intelligence would eventually be able to overcome its constraints and gain access to the wider world and its resources.
    • Additionally, digital AIs could reproduce extremely quickly, much faster than biological entities such as humans.
  • Conclusion from the two previous main items: A sufficiently intelligent system (or even a sufficiently large group of ~human-level systems) would be more capable than all of humanity at achieving its goals. A priori, there is no reason to assume that the goals of an AI system will be aligned with those of humanity. Thus, sufficiently powerful AI systems are an existential threat to humanity.
  • Conclusion from the previous item: If humanity wants to benefit from the development of very advanced AI, learning how to properly align its goals is extremely important.
    • Other 'control' strategies could potentially suffice, or delay dangerous capability gains, when used on systems with relatively low intelligence. But they are probably not safe in the long run.

AI safety and the alignment problem

AI safety, broadly defined, is an area of research dedicated to ensuring that AI deployment does not result in undesirable outcomes. The alignment problem in particular deals with how to construct AI systems whose goals and values are aligned with what we really want them to do. This is a very difficult problem, and it gets much harder as the intelligence of the system increases. A key insight into the severity of this issue is that, especially for non-evolved systems, intelligence and goals are orthogonal to each other. The space of possible minds is huge, we want to hit a vanishingly narrow and imprecisely specified range within it, and we don't even know how to aim yet.

A naive answer to the alignment problem would be to directly specify a set of desirable objectives/conditions, rewarding the system when it meets the goal and punishing it otherwise. But objective specification is hard, even in simple environments. A reinforcement learner will naturally find solutions that optimize exactly what it was asked to optimize. A similar phenomenon, much more familiar and bounded, also occurs among humans: Goodhart's law.

Examples of specification gaming can be fun, as when an objective of high average speed is satisfied by evolving tall creatures that fall over, or when undesirable events are "avoided" only in the sense that the proxy sensory data is kept blind to the penalized outcomes. But these types of behavior become less humorous, and frighteningly worrisome, when considering similar failure modes in hypothetical systems that are much more capable, especially at the human level or higher. For a powerful enough AI, objective specification can be seen as at least as hard as safely making a wish to a completely amoral, alien genie.
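As a toy illustration of this kind of specification gaming (a hypothetical example of my own, not one of the cases mentioned above), consider a "cleaning robot" whose proxy reward counts units of dirt picked up, while the true objective is a clean room. The proxy-optimal policy is not to clean and stop, but to create more mess so there is more to collect:

```python
# Toy Goodhart's-law illustration (hypothetical example): the proxy
# reward (units of dirt picked up) diverges from the true objective
# (a clean room at the end of the episode).

def rollout(actions):
    """Simulate a tiny 'cleaning robot' environment.

    Returns (proxy_reward, true_utility): the proxy counts successful
    pick-ups; the true utility is minus the dirt left at the end.
    """
    dirt, picked_up = 3, 0
    for action in actions:
        if action == "pick_up" and dirt > 0:
            dirt -= 1
            picked_up += 1
        elif action == "dump":
            dirt += 1  # the robot makes a mess on purpose
    return picked_up, -dirt

# An 'honest' policy cleans the room and then stops.
honest = ["pick_up"] * 3 + ["wait"] * 7
# A reward-hacking policy dumps dirt just to pick it up again.
hacker = ["pick_up"] * 3 + ["dump", "pick_up"] * 3 + ["dump"]

print(rollout(honest))  # (3, 0): less proxy reward, clean room
print(rollout(hacker))  # (6, -1): more proxy reward, dirtier room
```

The proxy strictly prefers the wasteful policy even though it leaves the room dirtier; nothing in the reward signal distinguishes "cleaning" from "manufacturing dirt to clean".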

So, does the whole alignment problem reduce to finding a good enough specification? Not really: see, for example, the topics of mesa-optimization and distributional shift (see also Section 7 of this paper). Furthermore, for a powerful enough AI system, a "good enough" specification would basically need to include/extrapolate a description of something like "all human values and morality". That already seems hard enough, but it gets worse considering that human morals are based on our particular ontological and epistemic representations of reality, which may diverge wildly from those learned by an AI. Moreover, human ethics are mostly based on finite intuitions, even though we probably live in an infinite universe of some type or another; that is quite a distributional shift, and one that we ourselves don't know how to deal with. Also, the ethical intuitions of humans are not necessarily "good", coherent, or widely shared, and locking ourselves into current humanity's ethical intuitions could also be potentially disastrous.

So far, the Metaculus community is not hopeful that the alignment problem, or other control methods, will be solved before the first AGI.

The question above asks about the control problem, which includes alignment but also other approaches, such as trying to constrain the actions that the AI can take. At the time of writing, the community assigns only a 10% chance to the control problem being considered solved before the first AGI. (The question also resolves negatively if the control problem is solved after AGI but before superintelligent AI. However, the window of time for that possibility seems very short: see the section "AGI timelines" below.)


What are the risks of deploying an unaligned, generally human-level AI? If there is enough hardware overhang, even an initially slow AGI system could have huge impacts through its capacity to be copied massively or sped up. Furthermore, costs, efficiency, and capabilities will probably improve quickly after the first AGI is publicly presented.

Is having many human-level AIs a problem, even if they are unaligned? After all, human society already deals regularly with harmful human-level intelligences, namely criminals, via law enforcement. But this analogy vastly underestimates the otherness that a de novo AI will probably have. Even one based on artificial neural networks, and trained to imitate humans, will probably have less in common with a human than a human has with an arthropod, or with other creatures that are the product of biological evolution. At the very least, one can imagine entities that are as good as a human at reasoning and planning, which can reproduce extremely fast over digital substrates, and which have a completely alien set of values.

But the real worry is that, once we have artificial intelligence roughly at the level of humans, it won't be long before we have superhuman intelligence: systems that can surpass any individual human, and perhaps all of humanity as a whole, at all intellectual tasks. Given the impressive cognitive accomplishments of some mere humans, it seems very unlikely that even a single, unaligned superintelligent AI wouldn't eventually produce terrible outcomes for humanity, even if considerable efforts are made to contain it.

This analysis from November 2021, harnessing Metaculus predictions, indicates that AI is considered the most likely cause of a hypothetical near/complete extinction this century.

In the approximately four and a half months between that analysis and April 10th, 2022, the community's underlying forecast of the risk of global catastrophe this century rose from 20% to 30%. On the other hand, the probability of an AI catastrophe, conditional on some global catastrophe, moved from ~25% to ~20%, and the probability (conditional on an AI catastrophe) of extinction scenarios from AI has mostly fluctuated around 67%.

So, from these more recent community estimates, we can derive an expected chance of roughly 4% for an AI-related near/total extinction this century.
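Chaining the three estimates above is a back-of-the-envelope multiplication (assuming the conditional probabilities can simply be combined):

```python
# Combining the April 2022 community estimates cited above:
p_global_catastrophe = 0.30    # any global catastrophe this century
p_ai_given_catastrophe = 0.20  # the catastrophe is AI-related
p_extinction_given_ai = 0.67   # an AI catastrophe reaches near/total extinction

p_ai_extinction = (p_global_catastrophe
                   * p_ai_given_catastrophe
                   * p_extinction_given_ai)
print(f"{p_ai_extinction:.1%}")  # → 4.0%
```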

And, although a 4% chance of extinction this century is certainly not small, other bad outcomes are possible if we create superintelligent AIs that do not share our values and that we cannot control.

This question asks whether, conditional on the creation of a superintelligent AI, the resulting future will be considered a positive one. As of April 10th, the community assigns a 41% chance to "no". While the question resolves ambiguously if there is no such superintelligent AI before the year 2300, we will see in the next section that the arrival of such systems will probably happen much sooner.

AGI timelines

Progress in machine learning has been very impressive in recent years. Existing systems already seem to have the potential for large economic and scientific impact (text-to-image generation, protein folding prediction, code generation, robotics, et cetera). It is possible that we are not far from transformative AI, or even a strong form of it: AI systems capable of automating or considerably speeding up scientific and technological advancement. Even if such technologies do not initially require human-level AI, it seems very probable that they would quickly facilitate the development of human- or superhuman-level AI.

So, what are the community predictions regarding the arrival of AGI? The following question technically asks about an event that occurs after the actual creation of AGI. But, judging from the community median on this question, the difference is not very relevant at a timescale of years.

In this question, AGI is defined as a single software system that can achieve four particular criteria. None of them seem very far away, given the current state of the art and the pace of progress. The most difficult criterion is potentially the one about the text-based implementation of the Turing test, because sufficiently deep probing is a very high bar to pass: the AI not only needs to be cognitively human-level in several complex areas, but must also have a good enough model of human psychology and capabilities.

As of April 10th, the community puts ~50% of the probability mass on a system achieving all these criteria by 2037, and more than 25% on it happening before 2030.

If AGI is developed, what would happen next? How long between AGI and the first superintelligent AI?

So, the community puts 50% probability on less than 6 months passing between the first AGI (as defined in the previous question) and the first system that can perform any task as well as or better than the best 2021-human in the corresponding domain. The community prediction indicates that this will happen in less than 27 months with ~75% probability, and in less than 7 weeks with ~25% probability.

These numbers indicate very grim perspectives for those who plan to delay alignment research until the first AGI system has been created. The window of available time between the first AGI and an overwhelmingly intelligent unaligned system would be too narrow.

About surprising capability spikes

In October 2015, DeepMind's AlphaGo achieved the first victory against a professional go player, Fan Hui, winning 5-0. Before this achievement, go programs were only capable of playing at the level of human amateurs, and professional-level play still seemed far away.

This match was not disclosed to the public until 27 January 2016. Soon after that, the community prediction for the following question climbed from 29% to 90% (the reason it did not jump higher could be that the question asked for the game to be played in 2016, and in an official setting).

In March 2016, AlphaGo played against Lee Sedol, considered the greatest player of the past decade, and achieved a 4-1 victory. In 2017, a stronger version of AlphaGo beat the world's top-ranked go player, Ke Jie, defeating him in all three games. DeepMind later released AlphaZero, which, training purely on self-play, could achieve superhuman level in chess, shogi, and go. This work was further generalized and improved with MuZero in 2019, which was also able to attain superhuman performance on an ensemble of 57 Atari games.

In 2020, DeepMind introduced AlphaFold, which basically solved the protein folding prediction problem: determining, within a certain margin of error, the shape that proteins will fold into. AlphaFold achieves a median performance on this problem comparable to that of modern, costly, and slow experimental methods. AlphaFold's achievement probably came faster than most expected: near the end of 2020, the Metaculus community assigned an 80% chance to something like this happening before 2031, which naively corresponds to a uniform annual probability of about 15%. After the impressive AlphaFold results were announced, the community prediction promptly jumped to 99%.
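The "uniform annual probability" reading can be computed directly: under the naive assumption of a constant, independent per-year chance p, ten years at probability p must accumulate to 80%.

```python
# Naive constant-hazard reading of the 80%-by-2031 forecast:
# solve 1 - (1 - p)**10 = 0.80 for the per-year probability p.
p_total, years = 0.80, 10
p_annual = 1 - (1 - p_total) ** (1 / years)
print(f"{p_annual:.1%} per year")  # ≈ 14.9% per year
```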

In February 2019, OpenAI presented GPT-2, a language model able to complete an arbitrary input by generating somewhat realistic and coherent text. The outputs sometimes seemed meandering and distracted, or contained failures of real-world modeling, but they were very impressive nonetheless, and could seem almost human-like on a cursory reading. In May 2020, GPT-3 was formally introduced, achieving much higher text quality.

There is a question on Metaculus that tries to operationalize how surprising/sudden AI progress will be. Around the introduction of GPT-3, this question's median went from about 62% to 68%, and it has stayed above 68% ever since.

Relatedly and remarkably, perhaps due to a quick succession of impressive news, during the writing of this essay the two main questions previously mentioned with regard to AGI timelines received considerable downward updates in their forecasts. The median date for the first AGI has come down by 6 years (from 2043 to 2037), while the median time for the transition from AGI to ASI has come down by more than a month (from 7.15 to 5.95 months).

The pace of AI-safety research

Given the current pace of growth for AI capabilities, rapidly increasing investments in the area, and the precedents of unexpected capability spikes, a fundamental question is whether humanity's efforts for AI alignment (or, more generally, for solving the control problem) will manage to find solutions in time.

A simple measure of the effort being spent on a research field is counting the number of papers being published:

Between 2021-01-14 and 2022-01-14, the true value was 560, so the community is expecting the rate to at least double, on average, during this five-year period.

Here, the median prediction is not even twice that of the 2021-2026 period. Taken at face value, this implies, at some point during this 10-year period, a worrisome reduction in the amount of work devoted to this area. It does not seem very plausible to me that AI-safety research will decelerate, so maybe this result reflects the specifics of the question's resolution: over 10 years, existing terms may drift and new ones may appear, producing papers that are part of the area but do not satisfy the particular criteria of the question.

Beyond considerations about keyword drift and how representative arXiv e-prints are of the number of publications in the area, it is important not to mistake the proxy for what we actually want to measure: progress in AI-safety research. Indeed, not all AI-safety research is being published, and the proportion that goes unpublished could increase in the future: concerns about the risks of disclosure could become more common, and it might become more widely accepted to focus on results over other considerations that are more typical of academic careers.

But, unless the graphs are interpreted with optimistic assumptions, the eventual deceleration that can be inferred from these two forecasts is not encouraging.

Now, given that the number of papers is not a very reliable proxy for advancement or interest in the area, another possibility is to look at the amount of investment the area is receiving. Metaculus currently does not have many questions of this type, but there is a pair that asks about a particular grantmaker.

From the probability distributions, it can be seen that the community expects a moderate increase in the funds dedicated by Open Philanthropy to AI-risk reduction (although the expected increase, conditional on Open Philanthropy continuing to exist, is not easily legible from these graphs).

In any case, it would seem that funds are not currently the main constraint for progress in AI-risk reduction, and that there are many other, more important bottlenecks. Having more of these forecasting questions, asking about other indicators of progress and attention in the area, would be useful to form a more complete understanding of the expected trajectories of AI-safety research in the upcoming years.


Currently, the Metaculus community does not expect the control problem to be credibly solved before the first AGI, and expects that the transition time from AGI to superintelligence will be a short one. In contrast, the indicators of attention/resources dedicated to AI-safety research do not seem to be growing nearly as fast as the general progress and investment into AI capabilities.

At the moment, Metaculus does not have many questions about the expected effects that different interventions or research would have on reducing AI risks. Having more forecasts distinguishing outcomes conditional on different scenarios could be very informative when trying to understand which actions may get us closer to workable solutions or temporary mitigations for AI risk. But, for now, it seems like current trajectories have a worryingly high probability of concluding in very undesirable outcomes for humanity. And the timelines keep getting shorter.

Summary and final remarks

  • Regarding the pace of advancement in AI-safety research, the few relevant Metaculus questions, by themselves, do not give much reason to expect a substantial speed-up during this decade.
  • The Metaculus community currently believes (90%) that the control problem won't be solved before the first AGI is demonstrated. This is despite a median community prediction of 2037 for the first AGI, a value which I consider optimistic (in the sense that I expect that operationalization of AGI to be reached sooner).
  • The community estimates a median time of ~6 months from AGI to superintelligent AGIs (although this prediction could be assuming a faster takeoff due to a larger hardware overhang than if the community had predicted AGI sooner).
  • The community forecasts high chances (~41%) of non-positive outcomes from the creation of superintelligent AIs. As previously calculated in the section on risks, this possibly includes a ~4% chance of near/total human extinction.
  • Unless there is a substantial acceleration of AI-alignment research, and unless the risks of advanced AI are taken seriously by those working on general AI research, humanity may face extinction or other very undesirable outcomes within less than 2 decades.