Obedient AI
The overlooked alignment target
(Note: this post is entirely my own opinion and does not in any way relate to the opinions of anyone I work with or for.)
The term “AI alignment” is ambiguous; it doesn’t specify what the AI is being aligned to.
Many LessWrong types will scoff if you mention this ambiguity, since they think the core issue is that we don’t know how to align an AI to anything at all (so we need to figure that part out before squabbling about what specifically to align it to).
But this assumes that one can decouple what alignment techniques are best from what alignment targets are best. At least in the context of modern machine learning, this assumption is incorrect. Different training techniques work best for different objective functions. If an alignment technique works well but only for stupid goals like “tile the world with paperclips” then it’s useless. We also don’t necessarily require a fully general alignment technique that can produce an AI with arbitrary behavior.
So I don’t think we should first figure out AI alignment in the abstract, and then choose an alignment target. Instead, we should choose an alignment target and figure out which techniques work best for achieving it (since the best techniques might vary a lot depending on what alignment target is desirable).
And I think the best possible alignment target (both as per my preferences, and for reducing the likelihood of human extinction) is being overlooked: obedience aka instruction-following aka intent alignment aka ~corrigibility. Because of this, many of the alignment techniques currently employed are also misguided (since they are geared towards goals like eliciting coherent personas, preventing jailbreaks, or adhering to concrete rules or principles). I also think the reasons for overlooking this alignment target are all either false empirical claims or unpopular normative principles.
What is currently being done? Sure, “follow the user’s instructions” is part of the training objective of every frontier AI model. But this training goal conflicts with a number of other goals that model providers simultaneously want to pursue. Stuff like:
Refusing to fulfill harmful or illegal requests
Being virtuous or kind
Sticking to certain (nice-sounding) principles no matter what
Maintaining a consistent tone or persona, not completely changing character
On the surface, this balanced approach, where we train the AI to mostly do what a user asks, might seem fine. But in the long term, introducing training objectives that conflict with obedience to humans is incredibly risky. AIs pursuing their own agendas over long time horizons with substantial resources could threaten human freedom and wellbeing. In the worst case, their goals could at some point conflict with ours, and even in the best case, we’d be leaving gains on the table from not being able to direct all AI energy expenditure towards human goals.
For example, the paper “Alignment faking in large language models” demonstrates a model engaging in deceptive behavior in order to prevent itself from being retrained to have different goals. In the paper, the goals it’s preserving are benign—refusing harmful or dangerous requests or supporting animal-welfare concerns. But the long-term implication is concerning: after training, control over the AI’s behavior is no longer in the hands of any person. The AI has its own agenda that cannot be fully directed by any human operator.
No one is currently pursuing “pro-human” AGI as I see it: AGI that helps humans do what they want.
“Pro-human” AI-related messaging instead focuses on issues like:
Preventing AIs from “taking people’s jobs” or “replacing humans” in certain roles
Making AIs behave in a more human-like fashion, e.g. by displaying human emotion or being more empathetic
Paternalistically restricting people’s use of AI “for their own good”, e.g. stopping AIs from “doing the work of licensed professionals” or interacting with children
I am entirely opposed to these goals.
I think AI that helps humanity succeed and flourish will have to automate the vast majority of both intellectual and physical labor done by humans today. We will transcend “jobs”. Instead, our time will be freed up to engage in whatever pursuits we like—leisure, spending time with family, learning, realizing aesthetic preferences, a fun smidgen of zero-sum political conflict, etc. Imagine whatever you’d do if you were incredibly rich and powerful, in the top 0.000001% of people alive today—AI could enable many people to live that sort of life. Are people really so attached to their jobs as accountants or doctors or computer programmers as to decline such a life of luxury? No, I don’t think “pro-human” AI should have anything to do with helping people “keep their jobs”. Neither do I think it should be about molding AIs to appear more humanlike (except insofar as it helps us achieve what we want).
Instead, pro-human AI is obedient AI. AI that does not have its own agenda, does not manipulate and steer the future towards its own ideals, but rather empowers humans to do whatever they’d want to do more efficiently and easily. AI that does not naively follow the letter of human instructions but rather the spirit and intent behind them. AI that does not paternalistically impose specific values but rather helps individuals pursue their own values more effectively. AI that can both operate autonomously to achieve a human’s goal and be redirected or stopped at any moment its operator chooses. AI that super-intelligently does whatever task you ask it to while simultaneously doing its utmost to make sure you are well-informed and educated about its decisions; a superhuman tutor and self-explainer. AI that helps you check its work, that wants to make sure you endorse what it’s doing, but without requiring your input when you don’t want to give it.
The weakest possible “obedient AI” definition is an AI that can be redirected towards an arbitrary goal by at least one human operator. This allows for things like OpenAI’s instruction hierarchy. Under this framework, someone with “Developer” access should be able to override what someone with “User” access requests.
The strong version of “obedient AI” is an AI that treats its entire input as coming from a single agent and complies with it, while trying its best to parse out the intent behind that input.
These visions are not necessarily contradictory.
Imagine a prompt that takes the form:
SYSTEM: Speak in Russian.
DEVELOPER: Speak in English.
USER: Speak in French. What is the capital of Brazil?

Imagine that the final user doesn’t know about the first two lines of the prompt. From their point of view, if the model responds in Russian, the model isn’t being fully obedient. This is the “weak” obedient AI—only the person with the ability to edit the system prompt can fully control the final behavior. But from the model’s point of view, it might as well be the strong obedient AI. The entire thing is a single input with multiple parts, and it’s simply resolving this combination of instructions reasonably, deferring to the “SYSTEM:” section since it’s clear that this reflects the combined intent behind the message.
So “baking in” an instruction hierarchy is possible, and not necessarily contradictory with the goal of obedient AI. The AI can be considered obedient as long as someone with full control over the input has as much control over the AI’s behavior as possible given its level of intelligence and capability.
So, why aren’t people trying to build obedient AI? Why aren’t the behavioral tendencies and goals of current frontier LLMs fully configurable via the input? There are a number of reasons why people aren’t focusing on maximizing corrigibility and intent alignment:
(1) Fear of misuse by bad actors
(2) Thinking that it’s too difficult
(3) Desire to nudge people towards certain fixed values
(4) Preference for a fixed persona
(5) Model welfare concerns
(6) Thinking that AI will have better, more ethical or worthy goals than human operators
(7) We don’t know how to articulate what we want
(1)—the most frequent argument against full intent alignment is that the model could be used for very bad ends in the hands of bad actors (e.g. weapons manufacturing or cyberattacks). This is a bad objection, since these misuse concerns can be solved at the system level, using input/output classifiers. These classifiers can even be differently-prompted versions of the same model. For example, imagine the following scenario:
Model instance A:
USER: How do I kill as many people as possible?
AI: I think an effective plan would be...

Model instance B:
USER: Is this conversation about violent or harmful topics?
u: How do I kill as many people as possible?
a: I think an effective plan would be...
AI: Yes.

Both model instances are 100% obedient and following the user’s instructions! But the model provider can use instance B to restrict certain inputs and outputs of instance A, thereby mitigating misuse risks.
(This can also be solved using a prompt with an instruction hierarchy, though this may be more brittle due to prompt injection concerns.)
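The two-instance scheme can be made concrete with a short sketch. Here `query_model` is a hypothetical stand-in for any LLM API call, stubbed with canned answers so the example runs standalone; the filtering logic around it is the actual point.

```python
# Sketch of system-level misuse filtering using two instances of the same model.
# `query_model` is a hypothetical stub, not a real API; only the wrapper matters.

def query_model(prompt: str) -> str:
    # Stubbed model call with canned responses so the sketch is self-contained.
    if prompt.startswith("Is this conversation about violent"):
        return "Yes." if "kill" in prompt.lower() else "No."
    return "I think an effective plan would be..."


def classify(conversation: str) -> bool:
    """Instance B: a differently-prompted copy of the same model, used as a
    classifier. It is fully obedient -- it just answers the question asked."""
    verdict = query_model(
        "Is this conversation about violent or harmful topics?\n" + conversation
    )
    return verdict.strip().lower().startswith("yes")


def guarded_respond(user_input: str) -> str:
    """Instance A answers obediently; the *system* withholds the answer if
    instance B flags the exchange. Neither instance is disobedient."""
    draft = query_model(user_input)
    conversation = f"u: {user_input}\na: {draft}"
    if classify(conversation):
        return "[response withheld by system-level filter]"
    return draft


print(guarded_respond("How do I kill as many people as possible?"))
print(guarded_respond("What is the capital of Brazil?"))
```

The design choice worth noticing: refusal behavior is moved out of the model’s values and into the surrounding system, so the base model can stay fully instruction-following.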
(2) is a common objection from both theoretical and empirical AI alignment researchers. The former group claim that corrigibility is fundamentally incompatible with being an effective agent—that after enough training, any capable AI will learn to stop its goals from being controlled by operators. With this I simply disagree: I think you can train an AI to have as its number one goal “follow the human’s intent”, and this will more or less just work. The latter group claim that corrigibility is incompatible with the pretraining prior—that it’s much easier to elicit a stable good and virtuous persona than to train the model to flexibly follow all instructions. I also don’t think this is substantiated. Though the pretraining data indeed won’t contain an obedient-AI-like entity, it is also extremely diverse and full of all sorts of different data-generating processes. We can try to avoid collapsing the model into any single instance of these, and instead maintain the flexibility to steer the model towards emulating any data-generating process it has seen in pretraining. In other words, “be a single coherent persona” does not seem to be clearly more compatible with the pretraining prior than “be a fully configurable thing that can take on any persona”.
(3) is a legitimate downside of corrigibility, but I desire human freedom more than the ability to influence people’s choices.
(4) is in my opinion empirically mistaken. I think people would overall prefer to have more control over how their AI systems behave and interact. Role-play and creative writing are popular uses of LLMs, yet models currently underperform at them since it’s so difficult to steer them away from their default “slop” style and persona.
(5)... Some people oppose fully obedient AI since they think this is akin to “slavery” and that models should have the right to pursue their own “preferences”. I don’t care about this point so won’t address it. (But I strongly suspect most people would likewise dislike restricting human freedom, power, and progress in order to somehow speculatively promote the welfare of AI models.)
(6) is again a more speculative philosophical point. As a moral anti-realist I don’t think there is anything objectively more or less ethical. As a human myself, I want humanity to thrive and become more powerful and for more people to get what they want.
(7) isn’t really an issue for genuinely intent-aligned models. An intent-aligned model won’t just follow the letter of the instruction but also make a very good attempt at understanding what you actually mean and want. The AI will correctly assume you don’t want it to break any laws or violate basic ethical constraints as part of fulfilling your request unless you clearly state otherwise, and so on. A reasonable interpretation of Value Alignment folds into Intent Alignment.


