A reasonable interpretation of Value Alignment folds into Intent Alignment

Value Alignment as the "avoid bad or weird stuff" clause

Aug 30, 2025

So we’re training an AI model to achieve some number of goals. For example: writing software, designing robots, orchestrating a marketing campaign, ordering a pizza, or being a general-purpose assistant that can do all of those things and more.

The simplest way to do this (though it is very difficult in practice) is to give the model lots of training examples for the target task(s) and update it based on how well it achieves the goal.

How do you measure how well it achieves the goal? Well, in the case of writing software, you could measure whether unit tests pass, or whether people approve of the code, or whether people approve of the software when it runs, or whether the software runs as expected according to another model or automated testing tool. In the case of ordering a pizza, you could measure whether the “pizza successfully ordered” screen appeared on the computer, or whether a pizza actually arrived at your office, or whether $10 left your bank account and went to Domino’s.

But that’s not exactly what you want. Exactly what you want is closer to: “do what I say (write an app, order food, design a tool…) but without doing anything bad or weird that I wouldn’t want you to do”. If the model writes software that passes all your tests by hacking them, that’s clearly not what you want. If the model orders a pizza by asking a random stranger on Twitter to order a pizza on its behalf, that’s also probably not what you want.

If you hire a human employee and ask them to do a task for you, the without doing anything bad or weird that I wouldn’t want you to do is implicit, and you trust that the employee knows roughly what that entails. Though you can easily imagine this failing somewhat if you hire a child or someone from a very different culture. Then you need to specify more detailed instructions.

An AI may be even more prone to failing to avoid the bad or weird. When training an AI you are limited in how much data you can show it, and what you can measure. When training on a large number of “practice problems”, measuring things that require physical tests or lots of time (for example, how a particular tool interacts with humans, or whether a particular medicine works as expected) is extremely challenging. And in general, measuring did you do anything bad or weird that I wouldn’t want you to do is often very difficult.

So you need to hope that the model will generalize from your limited number of examples and limited ability to measure its performance. Getting the model to generalize correctly here—avoiding the bad or weird stuff—is the heart of the AI alignment problem.

People sometimes talk about the distinction between “intent alignment” and “value alignment”. Let’s ask some LLMs what these terms mean so that I can paste in some authoritative-looking quotes:

Intent alignment ensures an AI system accurately follows a user's direct commands, while value alignment seeks to instill AI with broader human ethical principles and values. In essence, intent alignment is about correctly interpreting and executing specific instructions. In contrast, value alignment is about ensuring the AI's goals and behaviors are consistent with what is broadly considered beneficial and ethical for humanity, even in unforeseen circumstances.
—Gemini 2.5 Pro

“Intent alignment” means the system is trying to do what its human overseer intends—i.e., its objective is to faithfully pursue the user’s goals, even under uncertainty. “Value alignment” goes deeper: the system’s own preferences or ethics match human values, so it makes good choices even without direct guidance. In short, intent alignment is about deference to our aims; value alignment is about sharing them.
—ChatGPT 5 Thinking

Intent alignment refers to getting an AI system to understand and follow the specific instructions or goals that a human gives it - essentially doing what the human meant rather than a literal or incorrect interpretation of their words.
Value alignment is broader and deeper - it's about ensuring an AI system's behavior aligns with human values, ethics, and long-term interests, even in situations where no explicit instructions were given or where following immediate instructions might conflict with broader human values.
While intent alignment focuses on correctly interpreting and executing specific tasks, value alignment addresses the fundamental question of ensuring AI systems act in ways that are beneficial and ethical according to human values across all contexts.
—Claude Opus 4.1

Putting aside, for now, the cases where someone’s true intent conflicts with some notion of global “human values”, the pragmatic interpretation of why we need “value alignment” is that we want models to avoid the bad or weird when trying to do what we tell them. Another way of thinking about this goal is that we’re teaching the model to understand the unsaid, implicit part of our instruction. We say “get me pizza” but we mean “get me a pizza without doing anything illegal or spending more money than the pizza is actually worth (etc., etc.), also give up if the pizza shop is closed or the task is too difficult for you”.

If you think about “value alignment” this way, i.e. that you’re “just” trying to get the model to understand the implicit avoid bad or weird stuff clause (and all the other implicit and sometimes inarticulable assumptions) then value alignment folds into intent alignment. The avoid bad or weird stuff clause implicit in user requests includes the common ethical principles we’d want AIs to follow: don’t harm people to achieve the goal, don’t break the law, don’t mislead us into thinking you achieved the goal without actually achieving it, etc.

But what about the non-pragmatic interpretation of value alignment often referenced, implicitly and explicitly, in the AI safety literature? This idea of an AI model having a complete notion of “the Good”, “Human Values”, and pursuing that as its main goal? This version of the concept is much more flawed: humans don’t have a consistent set of values that you can task an AI to go off and optimize for. The closest we can get to this is providing people with AIs that try to deeply understand what individuals want and help them with their asks (i.e. intent alignment, with “pragmatic value alignment” folded in).

Nina Panickssery

Discussion about this post