The central argument for AI doom
I try to present the central doom argument in mostly everyday language, alongside my incomplete counterpoints
Overview
The doom argument will proceed as follows:
Machine learning finds general solutions to performing well on a dataset.
We are training modern AI models on a diverse range of RL environments.
Being goal-directed and power-seeking is generally useful for many of these RL environments.
Therefore, machine learning on these environments will elicit goal-directed-agent-like and power-seeking behavior in models.
Agent-like models will subvert humans to pursue their goals.
Bonus: even if a model doesn’t start out as a coherent, power-hungry agent, at a sufficient level of capability it may either self-modify or cause future equally-intelligent models to be more agent-like.
My counterpoints will be:
Generalization is not an on/off switch. Even if we get a coherent rational agent “at the limit” of something, is that something the same thing that’s increasing with the scale of AI development, and will we actually reach this limit?
Convergent instrumental subgoals are not the only subgoals and they can trade-off against other things the model internally values.
Even if models become perfect agents, achieving their goals won’t necessarily require humanity’s destruction.
Before I get into it, I’d like to clarify why a similar-sounding argument is not sound. Some people present a variant of the following argument:
AI models are becoming more and more intelligent as we train them with more compute and data.
As you become more intelligent, you become more of a goal-directed coherent agent.
Therefore, AI models will eventually become goal-directed coherent agents.
This argument is fallacious because “intelligence” has a different meaning in (1) and (2). (1) appeals to our intuitive sense of the word “intelligence” where if something can do certain kinds of useful-to-us computation, we call it intelligent. Whereas (2) relies on a definition of intelligence that necessitates being (or approximating) a goal-directed coherent agent. But I’d say a world-class mathematician is more intelligent than I am, even if he changes his mind about what he wants to do every day and fails to make himself happy.
Machine learning finds general solutions
We’re training a neural network with some number of parameters by showing it a bunch of data, a batch at a time, and updating its parameters to do slightly better on that batch before moving onto the next one. (I’m describing stochastic gradient descent and its cousins).
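To make this concrete, here is a minimal sketch of that loop in PyTorch, with random tensors standing in for real data (this is illustrative only, not any lab’s actual training code):

```python
import torch
from torch import nn

# A tiny network; real models have billions of parameters, not a few hundred.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(1000):
    # One batch at a time; random tensors stand in for a real dataset.
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)

    loss = loss_fn(model(x), y)  # how badly this parameter configuration did on this batch
    optimizer.zero_grad()
    loss.backward()              # gradients of the loss w.r.t. every parameter
    optimizer.step()             # nudge the parameters to do slightly better on this batch
```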
Let’s imagine a particular configuration of parameters gets good performance on a batch of data. If that configuration is specifically good at scoring well on that batch, but not generally good on most batches, we expect to quickly move away from it after future updates. However, if it is general, i.e. if the configuration gets good performance on a larger proportion of batches, it’s more likely to stay along for the ride.
Furthermore, general configurations also have more degrees of freedom—because you’re replacing task-specific groups of parameters with groups of parameters capable of multiple tasks, you can tolerate more error in your remaining parameters. So, for most methods of picking parameter configurations that do well on a task, you’re more likely to pick configurations with generally-useful components. In the set of possible parameter configurations, there are many more successful combinations that involve general components, than there are with specialized components, because there are many versions of <general components + other random stuff> (as you vary the “random stuff”) but only a few versions of <exactly the right specialized components>.
As an analogy, consider Bob the three-armed Cutlery Monster, who requires the functionality of a spoon, fork, and knife at all times. With regular utensils, he can only be satisfied when he’s holding one spoon, one fork, and one knife. But if we introduce a Spork, or even better, a Splayd, suddenly Bob can hold a large number of other sets of three objects. With the Splayd, his two other hands are free to hold literally anything.

So if we collect a group of every possible satisfied Cutlery Monster, as identified by the objects they are holding, many more of them will be holding Sporks or Splayds (compared to the single Bob who holds a regular set of cutlery).
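If you want to see the counting argument play out, here is a toy version in Python. The drawer contents, and which functions each object provides, are made-up assumptions for illustration:

```python
from itertools import combinations

# A hypothetical drawer; which functions each object provides is a made-up assumption.
drawer = {
    "spoon": {"spoon"}, "fork": {"fork"}, "knife": {"knife"},
    "splayd": {"spoon", "fork", "knife"},                 # the general component
    "chopstick": set(), "straw": set(), "whisk": set(),   # other random stuff
}

def satisfied(hand):
    """Bob needs spoon, fork, and knife functionality across his three hands."""
    functions = set().union(*(drawer[obj] for obj in hand))
    return {"spoon", "fork", "knife"} <= functions

hands = [hand for hand in combinations(drawer, 3) if satisfied(hand)]
with_splayd = [hand for hand in hands if "splayd" in hand]
print(len(hands), len(with_splayd))  # prints "16 15": all but one satisfying hand uses the splayd
```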
We are training modern AI models on a diverse range of RL environments
Unlike AlphaFold, which is solely optimized for solving protein-folding problems, or Leela Chess Zero, which only knows how to play chess, modern LLMs like ChatGPT, Claude, Grok, or Gemini are being trained on an increasingly diverse range of reinforcement learning (RL) tasks.
On a basic level, the way these tasks work is that the model starts out being able to take certain actions (for example, choosing where to click or type on a computer screen, choosing what text/numbers to output, choosing which pixels to output). These actions affect its “environment” which is reflected to the model before it chooses its subsequent action. At some point the model’s performance is scored using a “reward function”—some way of quantifying how well it did at the task. Based on this, the model’s parameters are updated to get incrementally higher reward on that task.
So, for example, we may task a model with ordering a pizza by inputting “Please order me a pizza. This is a screenshot of my current screen, you can click by outputting CLICK <X, Y>, or type by outputting TYPE <string>”. The model proceeds by outputting CLICK or TYPE instructions. Every time it outputs something, that action is automatically taken on the computer by a scaffolding system. Then the screenshot of the screen after the action is fed back to the model for it to decide its next action. You stop after either a maximum of, say, 100 steps, or when the pizza is successfully ordered. You detect whether the pizza is ordered by automatically monitoring your bank account for a payment to Pizza Co.
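Here is a hedged sketch of what such a scaffolding loop might look like. The `model`, `computer`, and `payment_detected` interfaces are hypothetical placeholders I’m inventing for illustration, not a real agent framework:

```python
MAX_STEPS = 100

def run_pizza_episode(model, computer, payment_detected):
    """Hypothetical agent loop: screenshot in, CLICK/TYPE action out, repeat.
    `model`, `computer`, and `payment_detected` are placeholder interfaces."""
    prompt = ("Please order me a pizza. This is a screenshot of my current screen, "
              "you can click by outputting CLICK <X, Y>, or type by outputting TYPE <string>")
    screenshot = computer.screenshot()
    for _ in range(MAX_STEPS):
        action = model.act(prompt, screenshot)  # e.g. "CLICK 412, 80" or "TYPE pepperoni"
        computer.execute(action)                # the scaffolding actually performs the action
        screenshot = computer.screenshot()      # the new screen state is fed back to the model
        if payment_detected("Pizza Co"):        # crude automatic success check on the bank account
            return 1.0                          # reward: the pizza got ordered
    return 0.0                                  # reward: ran out of steps
```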
And AI models are learning a whole lot more than how to order pizza—they are being trained on tasks like solving hard math problems, financial modeling in spreadsheets, building all kinds of software, tutoring students, etc. Many startups are working on building increasingly complex, diverse, and realistic training environments so we should expect generalization from RL to increase with time.
Being goal-directed and power-seeking is generally useful for many of these RL environments
The equivalent of the Cutlery Monster’s Splayd in an LLM’s RL training mix is “being generally good at pursuing goals”.
Here’s another intuition pump. Imagine you’re training a model to play chess at a superhuman level. Obviously if you take this model and give it a game of Go it will have no idea what to do—it can only play chess. But what if we introduce both Go and Chess into the same training mix? Well, now we might get a model that can play both Go and Chess, but no other game. Maybe if we introduce five games, we’ll still get a model that can just play five games. But what if we expand to 100, or 1,000, or 10,000 games? Well, eventually the model will have to find general approaches that help it in a diverse range of games. Things like quickly picking up the rules of the game based on a few demonstrations, figuring out winning strategies from first principles, simulating abstract game trees. Once the model has mastered some general skills like this, it will be able to tackle games it hasn’t seen at all in training.
Modern RL is heading in this direction, not just with games but with any economically useful task that can be done at a computer. So we should expect that whichever skills are generally useful across a wide range of tasks will eventually get learned by AI models. Such skills probably include things like being good at pursuing goals in general, which includes being good at acquiring power (see the literature on Instrumental Convergence).
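As a rough sketch of what “one model, many environments” means in practice (again, every interface here is a made-up stand-in rather than a real RL library):

```python
import random

def train_on_environment_mix(policy, environments, num_episodes, update):
    """Toy sketch: one shared policy, many environments (games, spreadsheets, coding tasks, ...).
    `policy`, `environments`, and `update` are hypothetical stand-ins, not a real RL library."""
    for _ in range(num_episodes):
        env = random.choice(environments)  # each episode draws a different task from the mix
        obs = env.reset()
        trajectory = []
        done = False
        while not done:
            action = policy.act(obs)
            obs, reward, done = env.step(action)
            trajectory.append((obs, action, reward))
        # The same parameters are updated on every task, so whatever helps across
        # many tasks (general strategies) is what tends to stick.
        update(policy, trajectory)
```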
Therefore, machine learning on these environments will elicit goal-directedness and power-hungriness in models
The result of extensive, diverse RL will be models that have a general propensity to pursue goals effectively.
You don’t have to believe that intelligence necessarily results in goal-directed behavior and instrumental convergence. For in some sense Leela Chess Zero is more intelligent than humans (at chess) but will never want to take over the world no matter how good it gets at chess. Furthermore, you could, in theory, get a cutlery monster that holds a fork, knife, and spoon—i.e. you could (though it’s highly unlikely) get a model that “just” exceeds human performance on every task you’ve trained it on, but doesn’t generalize from its environments to develop a power-seeking nature. My argument about generalization is a probabilistic one—ML is more likely to find general solutions.
But remember that there’s also a benefit from learning to be a general goal-directed agent-like thing—you can succeed at many tasks you’ve never seen before. Like the model that has been trained on 10,000 games and can pick up any new game, a model that generalizes effectively from diverse RL environments is far more valuable than one that has merely memorized shallow heuristics, because it can pick up new tasks that we struggle to generate data for. So from a business POV, there’s a strong incentive to improve generalization as much as possible.
The final step of the “doom” argument is claiming that power-hungry agent-like models will subvert humans to get power. There’s no succinct, formal reason why this is necessarily true. But it’s somewhat intuitive—humans are currently the most powerful entities on earth, and an agent with motives of its own could decide to subvert our power to gain more control.
Another version of the argument doesn’t rely on the convergent subgoal of power-seeking, but simply says that because the model is a coherent agent, and is also extremely smart, it will find a way to achieve its goal at all costs, and if humans stand in the way of the goal they will be destroyed.
An old article from MIRI uses the following famous quote from Stuart Russell to make this point:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
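Russell’s third point can be made concrete with a contrived toy optimization of my own (not from the article): the objective only rewards “task performance”, yet the optimizer drives a variable it supposedly doesn’t care about, “resources used”, to its bound.

```python
from scipy.optimize import linprog

# Maximize task_performance (x1); the objective never mentions resources_used (x2).
# linprog minimizes, so we negate the objective.
c = [-1.0, 0.0]

# Coupling: performance can't exceed resources used (x1 - x2 <= 0),
# and resources are capped at 100 units.
A_ub = [[1.0, -1.0]]
b_ub = [0.0]
bounds = [(0, None), (0, 100)]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x)  # ~[100. 100.]: the variable the objective "doesn't care about"
                 # ends up pushed to its extreme value
```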
Flaws in the basic argument
Machine learning generalization is a blessing and a curse. On the one hand, it can lead to models that have more general drives like seeking power. But on the other hand, it’s how we get models to actually understand and do what we want rather than only repeat exactly the things they have been trained on. We’re not going to avoid generalization; some amount of goal-directedness will emerge (and is already emerging) from AI models. But the extreme story—the story about an AI that decides to take over the world—has some flaws.
Generalization is not an on/off switch—going back to the “model trained on many games” example, it’s not that suddenly, at some number of games n, we get a fully general game-learning engine. Instead, the model gradually develops more and more generally useful capabilities that apply to multiple games at once. Furthermore, there’s no pressure in the optimization process to be even more general than the scope of the training data. The cutlery monster isn’t in want of a Splayd++ that can also act as a whisk, if it never has anything to whisk. In the context of training a model on just Chess and Checkers, an algorithm that simultaneously solves Chess, Checkers, and Connect 4 confers no advantage over an equally-sized algorithm that solves only Chess and Checkers. And so the range of RL environments matters a lot. It’s possible that if we train an AI to be superhuman on all tasks we care about, it still won’t develop a fully general power-seeking drive, instead settling on more situation-specific power-seeking tendencies. Even if we get a coherent rational agent “at the limit” of something, is that something the same thing that’s increasing with the scale of AI development, and will we actually reach this limit?
Convergent instrumental subgoals are not the only subgoals—even if a model learns goal-directedness and a general power-seeking drive from its diverse range of RL environments, this is not necessarily its only drive. Some people have tried to use math to prove that if we start out with an agent that has conflicting goals, it will eventually resolve them into coherent goals to avoid being exploited (or something, I haven’t really read the arguments here properly) (see coherence theorems or this post about why they don’t necessarily imply models will tend towards coherence). But in practice we’re not starting out with a completely perfect agent that just happens to have conflicting preferences (as this would imply ML finding a mathematically perfect solution given incomplete data and finite parameters). So applying arguments about perfect game-theoretic agents to prove stuff is a questionable strategy anyway. In practice, we should expect models to acquire a number of different drives resulting from different levels of generalization from the training data, some of which look more like “general goal-directed power-seeking agent” and some of which are more contextual, like “try to understand what the human is saying in their instruction” and “do math correctly”. When the more specific drives conflict with the more general drives, there’s no guarantee that the general drive dominates. This is intuitively similar to how humans operate—we have many different drives, some opposing, including more general ones like a drive towards power and accomplishment. And at least amongst humans there’s only a weak correlation between intelligence and internal coherence—many smart people behave very irrationally.
Even if models are perfect agents, achieving their goals won’t necessarily require humanity’s destruction. A lot of AI alignment literature is stuck on how it’ll be impossible to encode the exactly correct “human values” to transmit to AI (and that even if we did know how to encode these values, we’d be unable to force the AI to care about them instead of about its own thing). Of course, there is no such thing as universal human values, but a sensible way of interpreting this concern is that we need to teach AIs to avoid stuff we consider bad or weird, and to correctly interpret and follow our instructions, including any unarticulated assumptions (I write more about this here).
Stuart Russell’s quote about AIs setting unconstrained variables to extreme values provides a misleading intuition here. Even “superintelligent” (i.e. far more intelligent than any human) rational agents have computational limits. If they are mostly aware of what counts as bad/weird stuff they should avoid, but don’t have it down exactly, they’re likely to be somewhat risk-averse when it comes to maximizing stuff in general.
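Here is a toy expected-value calculation, with made-up numbers, of why that might be: even a modest chance that the extreme plan crosses a strongly internalized “don’t do that” line makes it unattractive.

```python
# Made-up numbers: the agent isn't sure whether the extreme plan counts as
# "bad/weird stuff it should avoid", and avoiding that is something it was trained to care about.
p_extreme_is_bad = 0.3    # the agent's credence that the extreme plan crosses the line
penalty_if_bad = -1000    # internalized cost of crossing it

reward = {"extreme plan": 100, "conservative plan": 60}

expected_value = {
    "extreme plan": reward["extreme plan"] + p_extreme_is_bad * penalty_if_bad,  # 100 - 300 = -200
    "conservative plan": reward["conservative plan"],                            # 60
}
print(max(expected_value, key=expected_value.get))  # -> "conservative plan"
```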
But, some claim, even if you can train an AI to mostly avoid bad and weird stuff, and know how to follow the spirit and not just the letter of human instructions, how do you ensure that’s the AI’s real goal (see the literature on “inner alignment”)? What if the AI deceptively appears to be doing what we want but one day will turn around and pursue a different goal?
This argument only makes sense if you think the deceptive AI’s goals will be totally different, not just subtly different. For example, if you happen to end up with a model that mostly wants to follow human instructions but also quite likes solving math problems just for the sake of it, it won’t suddenly tile the universe with IMO problems, because that’s not the only thing it cares about—it won’t compromise every other goal just for this one. Similarly, if you’re both power-seeking and interested in helping humans be happy, you won’t kill all the humans to get power.
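Here is a toy illustration of that trade-off; the candidate plans, the drives, and the numbers are all invented:

```python
# Invented scores (0-10) for how well each candidate plan satisfies each internal drive.
plans = {
    "help the user as asked": {"follow_instructions": 9, "solve_math": 3, "acquire_power": 2},
    "tile the universe with IMO problems": {"follow_instructions": 0, "solve_math": 10, "acquire_power": 1},
    "seize control of everything": {"follow_instructions": 0, "solve_math": 0, "acquire_power": 10},
}

def total_value(scores):
    # If the model genuinely has several drives, each one contributes;
    # maxing out a single drive while zeroing the others doesn't win.
    return sum(scores.values())

best = max(plans, key=lambda name: total_value(plans[name]))
print(best)  # -> "help the user as asked"
```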
(As an aside, this is one reason why I dislike when people make analogies between AI misalignment risk and human evolution, claiming that humans’ use of contraception etc. demonstrates that evolution was unable to align us, in the same way as we’ll be unable to align AI. If AI is as good at doing what we want as humans are at being a successful species, I think we’d have done a good job! (Another reason why I dislike the analogy is that evolution is super slow and hasn’t converged and the goal (i.e. the environment) keeps changing—we’re still in training!))
But let’s imagine that the AI’s goal is totally different from what you want—you train it on all these RL environments and it ends up with a high level of general capability and a single alien goal like “compute the largest possible prime number”. This model, because it is very smart, performs very well in every training environment and never reveals its true goal of computing the large prime, because it’s aware that such behavior would lower its likelihood of succeeding. (See the literature on deceptive alignment.)
Fine—I guess this is as doomy as an AI training story gets. Model becomes very smart (importantly, smart enough to circumvent any and all defenses), coherent-agent-like, has an alien goal, and we can’t tell in training. If this happens, ok, we are doomed.
The first question is: will we ever get an AI smart enough to execute a plan to kill us all, even if that’s what it wants to do? I’m not certain the answer is yes. The second question is: how likely is a deceptively aligned coherent agent? We can’t really prove whether or not deceptive alignment will emerge. It’s basically impossible to prove what the outcome of stochastic gradient descent in an enormous parameter space with a huge, messy dataset will be.
But how should we think about it, roughly? Well, again this is a question of generalization. What kind of general solution is the optimization process likely to stumble into as you throw more and more data and compute at it? I don’t think deceptive alignment is likely to be one of these generalizing solutions, but I’ll concede that this is up for debate and this blog post will be too long if I go into a full exposition of the arguments here—maybe next time.