The central argument for AI doom
I try to present the central doom argument in mostly everyday language, alongside my incomplete counterpoints
Overview
The doom argument will proceed as follows:
1. Machine learning finds general solutions to performing well on a dataset.
2. We are training modern AI models on a diverse range of RL environments.
3. Being goal-directed (i.e. taking actions aimed at achieving a particular target world state) and power-seeking (i.e. trying to acquire resources/more general capabilities) is useful for many of these RL environments.
4. Therefore, machine learning on these environments will elicit goal-directed and power-seeking behavior in models.
5. These models will eventually subvert humans to pursue their goals and/or acquire power.
My counterpoints will be:
1. Generalization is not an on/off switch with a predictable outcome. Even if we get a coherent goal-directed agent “at the limit” of something, is that something the same thing that’s increasing with the scale of AI development, and will we actually reach this limit?
2. Instrumental goals like acquiring power and avoiding shutdown are not the only goals a model is likely to have. These can trade off against each other. There’s no reason to believe the instrumental goals will necessarily dominate.
3. Even if the AI has a single coherent goal, there’s no reason to believe achieving its goal will result in humanity’s destruction.
Aside: you can’t just appeal to utility maximization being dangerous
Sometimes I hear people hand-wave arguments like “as you train an AI it will become more and more like a perfect utility maximizer, and this is dangerous because we’re unable to ensure the utility function it’s maximizing is the right one.” However, this argument doesn’t make sense without additional assumptions about which utility functions the AI will have. Almost anything can be modeled as a maximizer of some utility function (it’s maximizing the extent to which it does whatever it naturally happens to be doing). You can perhaps salvage this line of reasoning by saying that AI models will generalize toward a special class of utility functions, maybe non-myopic or time-independent ones[1]. But then you have to argue why you believe this.
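To make the triviality concrete, here is a toy sketch (my own illustration, with a made-up action set and horizon): whatever a system happens to do, there is a utility function it perfectly maximizes, namely “score 1 for behaving exactly like that, 0 otherwise.”

```python
# Toy illustration (my own, not drawn from any formal source): any fixed
# behavior is a perfect maximizer of *some* utility function, namely the one
# that asks "did you behave exactly like that?"
from itertools import product

ACTIONS = ["left", "right", "wait"]   # made-up action set
HORIZON = 3

# Whatever the agent "naturally happens to be doing":
observed_behavior = ("left", "wait", "right")

def utility(trajectory):
    # Scores 1 exactly when the agent reproduces its own behavior, 0 otherwise.
    return 1.0 if trajectory == observed_behavior else 0.0

all_trajectories = list(product(ACTIONS, repeat=HORIZON))
best = max(all_trajectories, key=utility)

assert best == observed_behavior   # the agent trivially "maximizes utility"
print("This behavior maximizes a utility function:", best)
```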
Machine learning finds general solutions
We’re training a neural network with some number of parameters by showing it a bunch of data, a batch at a time, and updating its parameters to do slightly better on that batch before moving on to the next one. (I’m describing stochastic gradient descent and its cousins.)
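For concreteness, here is a minimal sketch of that loop, using a toy two-parameter linear model rather than anything resembling a real LLM:

```python
# Minimal sketch of the training loop described above (plain NumPy, toy data):
# show the model one batch at a time and nudge parameters downhill on that batch.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ true_w + 0.1 * rng.normal(size=1000)       # toy dataset

w = np.zeros(2)                                     # the "parameter configuration"
lr, batch_size = 0.1, 32

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # pick a batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient of squared error on this batch
    w -= lr * grad                                  # do slightly better on this batch

print(w)   # ends up close to true_w
```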
Let’s imagine a particular configuration of parameters gets good performance on a batch of data. If that configuration is specifically good at scoring well on that batch, but not generally good on most batches, we expect to quickly move away from it after future updates. However, if it is general, i.e. if the configuration gets good performance on a larger proportion of batches, it’s more likely to stick around for the ride.
Furthermore, general configurations also have more degrees of freedom—because you’re replacing task-specific groups of parameters with groups of parameters capable of multiple tasks, you can tolerate more error in your remaining parameters. So, for most methods of picking parameter configurations that do well on a task, you’re more likely to pick configurations with generally-useful components. In the set of possible parameter configurations, there are many more successful combinations that involve general components than there are with specialized components, because there are many versions of <general components + other random stuff> (as you vary the “random stuff”) but only a few versions of <exactly the right specialized components>.
As an analogy, consider Bob the three-armed Cutlery Monster, who requires the functionality of a spoon, fork, and knife at all times. With regular utensils, he can only be satisfied when he’s holding one spoon, one fork, and one knife. But if we introduce a Spork, or even better, a Splayd, suddenly Bob can hold a large number of other sets of three objects. With the Splayd, his two other hands are free to hold literally anything.

So if we collect a group of every possible satisfied Cutlery Monster, as identified by the objects they are holding, many more of them will be holding Sporks or Splayds (compared to the single Bob who holds a regular set of cutlery).
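A quick toy count makes the asymmetry explicit (the inventory of objects here is made up for illustration):

```python
# Toy count for the Cutlery Monster analogy (my own numbers, purely illustrative).
# Bob holds 3 objects and needs spoon-, fork-, and knife-functionality covered.
from itertools import combinations_with_replacement

objects = {
    "spoon":  {"spoon"},
    "fork":   {"fork"},
    "knife":  {"knife"},
    "splayd": {"spoon", "fork", "knife"},              # the "general component"
    "rock": set(), "pencil": set(), "coin": set(),     # random other stuff
}

def satisfied(hand):
    covered = set().union(*(objects[o] for o in hand))
    return {"spoon", "fork", "knife"} <= covered

hands = list(combinations_with_replacement(objects, 3))
with_splayd    = [h for h in hands if satisfied(h) and "splayd" in h]
without_splayd = [h for h in hands if satisfied(h) and "splayd" not in h]

print(len(with_splayd), len(without_splayd))
# prints 28 1: many satisfied hands contain a Splayd; only (spoon, fork, knife) doesn't.
```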
We are training modern AI models on a diverse range of RL environments
Unlike AlphaFold, which is solely optimized for solving protein-folding problems, or Leela Chess Zero, which only knows how to play chess, modern LLMs like ChatGPT, Claude, Grok, or Gemini are being trained on an increasingly diverse range of reinforcement learning (RL) tasks.
On a basic level, the way these tasks work is that the model starts out being able to take certain actions (for example, choosing where to click or type on a computer screen, choosing what text/numbers to output, choosing which pixels to output). These actions affect its “environment”, which is reflected back to the model before it chooses its subsequent action. At some point the model’s performance is scored using a “reward function”—some way of quantifying how well it did at the task. Based on this score, the model’s parameters are updated to get incrementally higher reward on that task.
So, for example, we may task a model with ordering a pizza by inputting “Please order me a pizza. This is a screenshot of my current screen, you can click by outputting CLICK <X, Y>, or type by outputting TYPE <string>”. The model proceeds by outputting CLICK or TYPE instructions. Every time it outputs something, that action is automatically taken on the computer by a scaffolding system. Then the screenshot of the screen after the action is fed back to the model for it to decide its next action. You stop after either a maximum of, say, 100 steps, or when the pizza is successfully ordered. You detect whether the pizza is ordered by automatically monitoring your bank account for a payment to Pizza Co.
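To make the scaffolding concrete, here is a rough sketch of what such a loop could look like. The helper functions are stubs standing in for real screenshot/click/typing/bank integrations, and the CLICK/TYPE format is just the one from the example above; none of this is a real API.

```python
# Sketch of the scaffolding loop described above. All helpers are stubs.
import re

def take_screenshot():           return "<screenshot bytes>"        # stub
def click(x, y):                 print(f"clicking at ({x}, {y})")   # stub
def type_text(s):                print(f"typing {s!r}")             # stub
def payment_detected(merchant):  return False                       # stub: poll the bank account
def call_model(history):         return "CLICK <420, 310>"          # stub: query the LLM

MAX_STEPS = 100
PROMPT = ("Please order me a pizza. This is a screenshot of my current screen, "
          "you can click by outputting CLICK <X, Y>, or type by outputting TYPE <string>")

def run_episode():
    history = [PROMPT, take_screenshot()]
    for _ in range(MAX_STEPS):
        action = call_model(history)                     # model picks its next action
        if m := re.match(r"CLICK <(\d+),\s*(\d+)>", action):
            click(int(m[1]), int(m[2]))                  # scaffold executes it on the computer
        elif m := re.match(r"TYPE <(.*)>", action):
            type_text(m[1])
        history += [action, take_screenshot()]           # feed the new screen back to the model
        if payment_detected("Pizza Co"):
            return 1.0                                   # reward: pizza successfully ordered
    return 0.0                                           # ran out of steps, no reward

reward = run_episode()   # this reward is what the RL update pushes the model's parameters toward
```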
And AI models are learning a whole lot more than how to order pizza—they are being trained on tasks like solving hard math problems, financial modeling in spreadsheets, building all kinds of software, tutoring students, etc. Many startups are working on building increasingly complex, diverse, long, and realistic training environments, so we should expect generalization from RL to increase with time.
Being goal-directed and power-seeking is generally useful for many of these RL environments
The equivalent of the Cutlery Monster’s Splayd for an LLM’s RL training mix is “being generally good at pursuing goals”.
Here’s another intuition pump. Imagine you’re training a model to play chess at a superhuman level. Obviously if you take this model and give it a game of Go it will have no idea what to do—it can only play chess. But what if we introduce both Go and chess into the same training mix? Well, now we might get a model that can play both Go and chess, but no other game. Maybe if we introduce five games, we’ll still get a model that can play just those five games. But what if we expand to 100, or 1,000, or 10,000 games? Well, eventually the model will have to find general approaches that help it across a diverse range of games: things like quickly picking up the rules of a game from a few demonstrations, figuring out winning strategies from first principles, and simulating abstract game trees. Once the model has mastered general skills like these, it will be able to tackle games it hasn’t seen at all in training.
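As a structural sketch (toy stand-ins for the environments and the model; names and interfaces are hypothetical), the key feature of such a mix is that a single set of parameters has to do well across every game it gets dropped into:

```python
# Toy multi-game training mix: one model, many environments, sampled at random.
import random

class ToyGameEnv:
    """Stand-in for a real game environment (reset/step interface)."""
    def __init__(self, name):
        self.name, self.turns = name, 0
    def reset(self):
        self.turns = 0
        return f"{self.name}: initial board"
    def step(self, action):
        self.turns += 1
        done = self.turns >= 5
        reward = 1.0 if (done and action == "good move") else 0.0
        return f"{self.name}: board after {self.turns} moves", reward, done

class ToyModel:
    """Stand-in for the single policy that has to handle every game in the mix."""
    def act(self, observation):
        return "good move"
    def update(self, total_reward):
        pass   # a real setup would do a gradient update toward higher reward

TRAINING_MIX = ["chess", "go", "checkers", "connect4", "hex", "shogi"]  # imagine thousands

def train(model, num_episodes=100):
    for _ in range(num_episodes):
        env = ToyGameEnv(random.choice(TRAINING_MIX))   # a different game almost every episode
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(model.act(obs))
            total += reward
        model.update(total)   # the same parameters get credit across all games

train(ToyModel())
```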
Modern RL is trending in this direction, not just with games but with any economically useful task that can be done at a computer. So we should expect that, eventually, whichever skills are generally useful across a wide range of tasks will get learned by AI models. Such skills probably include things like being good at pursuing goals in general, which includes being good at acquiring power (see the literature on Instrumental Convergence).
Therefore, machine learning on these environments will elicit goal-directed and power-seeking behavior in models
The result of this extensive RL will be models that have a general propensity to effectively pursue goals.
The final step of the “doom” argument is claiming that power-hungry, goal-directed models will subvert humans to acquire power and resources, or to avoid being turned off.
Another version of the argument doesn’t rely on convergent instrumental subgoals, but simply says that because the model has a particular goal, and is also extremely smart, it will find a way to achieve that goal at all costs, and if humans stand in the way of the goal they will be destroyed.
Flaws in the basic argument
Machine learning generalization is a blessing and a curse. On the one hand, it can lead to models that have more general drives like seeking power. But on the other hand, it’s how we get models to actually understand and do what we want rather than only repeat exactly the things they have been trained on.
Generalization is not an on/off switch with a predictable outcome—going back to the “model trained on many games” example, it’s not that suddenly, at some number of games n, we get a fully general game-learning engine. Instead, we gradually develop more and more generally useful capabilities that apply to multiple games at once. Furthermore, there’s no pressure in the optimization process to be even more general than the scope of the training data. The Cutlery Monster isn’t in want of a Splayd++ that can also act as a whisk, if it never has anything to whisk. In the context of training a model on just Chess and Checkers, an algorithm that simultaneously solves Chess, Checkers, and Connect 4 is no more useful to the optimizer than an equally-sized algorithm that solves only Chess and Checkers. And so the range of RL environments matters a lot. It’s possible that if we train an AI to be superhuman on all tasks we care about, it still won’t develop a fully general power-seeking drive, instead settling on more situation-specific power-seeking tendencies (especially if some of the training is dedicated safety training that aims to discourage solutions involving overpowering humans). Even if we get a coherent goal-directed agent “at the limit” of something, is that something the same thing that’s increasing with the scale of AI development, and will we actually reach this limit?
Convergent instrumental subgoals are not the only subgoals—even if a model learns goal-directedness and a general power-seeking drive from its diverse range of RL environments, this is unlikely to be its only drive. We should expect models to acquire a number of different drives resulting from different levels of generalization from the training data, some of which look more like “general goal-directed power-seeking agent” and some of which are more contextual, like “try to understand what the human is saying in their instruction” and “don’t harm people” and “do math correctly”. When the more specific drives conflict with the more general drives, there’s no guarantee that the general drive dominates. This is intuitively similar to how humans operate—we have many different drives at different points in time, including more general ones like a drive towards power and accomplishment. At this point, some people appeal to various “coherence arguments” in an attempt to prove that the model will necessarily resolve its drives into a single coherent goal. However, all attempts to prove this appear to be either near-circular (assume that the model already has something equivalent to a coherent goal) or handwavy (appeal to analogies rather than math).
Even if the AI has a single coherent goal, there’s no reason to believe achieving its goal will result in humanity’s destruction. It’s of course near-impossible to predict what the outcome of an extremely complex optimization process (i.e. training a massive neural network on a bunch of data) will be. But intuitively, the common thread across most AI training tasks is something like “follow the human’s instruction”. Insofar as the AI generalizes a single coherent goal, there’s no good reason to believe it will be very different from that. Some people argue that it’s not enough for the AI’s goal to roughly match what we’d want and expect. Even if it misses the target ever so slightly, it will wreak havoc by violating something important to us that wasn’t included in its target. But this doesn’t imply a high likelihood of human destruction. It only implies that the AI is unlikely to end up with exactly the goal we intended. But many potential goals don’t involve badly harming or destroying humanity.
[1] Then it starts to look more like the utility-maximizer is “pursuing a goal”.

