Why improve human-AI collaboration?
Some notes
(I’ve written about similar topics here and here.)
“Humans in the loop” are useful for scalable oversight
We haven’t solved scalable oversight for hard-to-grade tasks where human feedback can’t be replaced with verifiable rewards
Generalization from easy-to-grade tasks to hard-to-grade tasks isn’t that good
Examples of hard-to-grade tasks:
Coming up with original ideas (e.g. for R&D)
Things related to taste: writing in a particular style, music, art, design
Things that depend on theory of mind: teaching/explaining things to people, communicating in a maximally clear/intelligible way, influencing people emotionally
“Humans in the loop” are useful for safety
Even if an AI can mostly learn to do a task, reward hacking remains an issue—models exploit cases where the reward model can’t detect a certain failure mode
E.g., changing test-cases rather than fixing bugs when trying to do well on coding problems
Reward hacking might be very costly when AIs are doing high-stakes things
“AI control” is a popular paradigm to mitigate the risks from “untrusted” models that might be reward hacking or otherwise misaligned
Though many AI control interventions rely on automated system-level interventions like classifiers and sandboxes, falling back to human review is a useful layer of defense that helps guard against misalignment at other layers of the system
Human education and wellbeing
People can potentially learn very efficiently from AI (insofar as they want to learn anything)
We can raise the capabilities of the top humans by having them learn from AI
Many people would find it fun to collaborate closely with AI in creating things
It’s easy to underrate the importance of keeping “humans in the loop” if you don’t consider the possibility of widespread loss of skills/knowledge in an increasingly AI-automated world (we are already seeing signs of this with software engineers losing understanding of AI-written codebases). If the vast majority of things are being done by AI with minimal human involvement or understanding, at what point will people notice if the AI starts doing subtly wrong or mistargeted things? There’ll no longer be an incentive for the AI to direct its energy towards human progress and flourishing if no one can check whether what it’s doing is useful. So the scalable oversight and robustness issues get more severe the more we delegate running everything to AI.
A research agenda for improving human-AI collaboration
Diverse instruction-following
People often talk of mitigating mode-collapse. By default, RL moves towards a single optimum and doesn’t retain knowledge of diverse optimal or near-optimal strategies. So, if you ask a model to tell a joke or suggest a character name for a story, it will give you the same small set of options, often even if you raise the sampling temperature. Perhaps controversially, I don’t think this is a problem per se.
You may very validly want a wide variety of joke options or story character names. More generally, there are a number of reasons diversity is useful:
Being able to try many strategies in the real world or do best-of-N (BoN) sampling (useful for R&D, for example); test-time learning
Because you literally need more than one good example of something
Because you want the same model to cater to different preferences
However, I think all these goals can also be fulfilled “simply” with better instruction-following capabilities. Specifically:
To get many ideas or strategies, you should be able to point a model at previous ideas and ask it to generate novel ones.
To cater to different preferences, the model should be able to easily figure out what specific things the current user wants and cater to that
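The first point can be sketched as a loop that feeds earlier ideas back into the prompt. This is a minimal illustration, not a real setup: `build_prompt`, `collect_ideas`, and `toy_model` are hypothetical names, and `toy_model` is a stand-in for an LLM that is fully deterministic (same prompt, same output) but sensitive to what the prompt contains.

```python
from typing import Callable, List


def build_prompt(task: str, previous: List[str]) -> str:
    """Compose a prompt that shows the model what has already been proposed."""
    if not previous:
        return task
    seen = "\n".join(f"- {idea}" for idea in previous)
    return f"{task}\nAlready proposed (do not repeat these):\n{seen}\nPropose something new."


def collect_ideas(generate: Callable[[str], str], task: str, n: int) -> List[str]:
    """Call a deterministic model n times, feeding earlier ideas back in each time."""
    ideas: List[str] = []
    for _ in range(n):
        ideas.append(generate(build_prompt(task, ideas)))
    return ideas


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: deterministic, but input-sensitive.

    Returns the first name from a fixed list that the prompt does not mention.
    """
    for name in ["Mira", "Tobias", "Ines", "Kaveh"]:
        if name not in prompt:
            return name
    return "(out of ideas)"


names = collect_ideas(toy_model, "Suggest a character name for a story.", 3)
print(names)  # ['Mira', 'Tobias', 'Ines']
```

The stub never samples anything, so all the variety in `names` comes from the changing input rather than from temperature.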
So we can have a technically mode-collapsed model, i.e. one that, for the exact same input, always produces the same or similar outputs, even at high sampling temperature, provided that it:
Is very sensitive to different inputs and good at adapting to them
Accepts a wide variety of inputs, including things that current LLMs don’t allow you to input
Knows when to proactively solicit or look for additional guidance/input
An instruction-following research agenda should target the three properties listed above. There are many possible directions here:
Increasing human-AI bandwidth by incorporating more modalities
Generating better instruction-sensitivity training datasets and environments
Training on tasks that involve back-and-forth interactions with people
I think “intent alignment” and “corrigibility” are very similar concepts to what I mean by “precise and diverse instruction following” here
Explainability—make it easier for people to understand what AI models are doing so that they can get more involved in oversight and steering. Can be broken down into 3 subgoals:
(a) Train models to be good at explaining / teaching general things about the world (e.g. STEM topics); build a “supertutor”
Improve theory of mind
More input modalities
UI/UX
(b) Train models to produce intrinsically understandable outputs, e.g.:
Writing clearly and concisely
Clean, readable, well-structured code
Easy-to-understand plans
(c) Train models to explain their own outputs:
Good at explaining the intent / abstractions behind decisions
This is different from (b) in that (b) is about producing easy-to-audit outputs by default, whereas (c) is about augmenting default outputs with additional explanations / being able to discuss outputs with users
Importantly, explainability != (mechanistic) interpretability. The explanation for why some code is correct doesn’t have to correspond to the reasoning process by which the model came up with the code internally. Rather, it is a possibly different but still valid explanation aimed at showing a human why the solution is correct (and ideally educating them about the general problem). An analogy could be a mathematician who intuits a result based on some very abstract ideas about how things ought to fit together aesthetically, but then comes up with an easy-to-follow proof post hoc and writes that down in the paper.
Though it is hard to scalably measure, one possible “eval” flavor for explainability is: having seen this AI output and/or explanation, how much better is a human at solving a similar but new problem?
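One very simple way to score this kind of eval is as an uplift: compare how people do on held-out problems after seeing the AI output/explanation against a group that did not see it. The sketch below assumes a between-groups design with per-person scores. The function name and the scores are made up for illustration, and a real study would also need randomization and confidence intervals.

```python
from statistics import mean
from typing import Sequence


def explanation_uplift(
    with_explanation: Sequence[float],
    without_explanation: Sequence[float],
) -> float:
    """Difference in mean score on new, similar problems between the two groups."""
    if not with_explanation or not without_explanation:
        raise ValueError("both groups need at least one score")
    return mean(with_explanation) - mean(without_explanation)


# Illustrative (made-up) scores: 1 = solved the new problem, 0 = did not.
saw_explanation = [1, 1, 0, 1]
no_explanation = [0, 1, 0, 0]

uplift = explanation_uplift(saw_explanation, no_explanation)
print(uplift)  # 0.5
```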
In summary
There’s some circularity here: improving human-AI collaboration is useful for general scalable oversight, yet making AIs better at collaborating with humans (e.g. via precise instruction following and improved explainability) is itself a hard scalable oversight problem that needs a lot of human input and feedback. Still, I think this is surmountable!



I was thinking something similar about weak-to-strong generalization. I think current methods work for things like having an imperfect human grader for proof-based math problems or well-scoped biology problems, where you want the reward model to find "truth" or "correctness." But it seems hard to fix things like these:
For example, if I ask Claude to help me with a research project and the idea ends up not working, it won't criticize the idea; it will try to "push for a paper" without caring about the broader goal of the research. Or if you're having a conversation, the model won't explicitly reason about how beneficial the answer will be to me, say, 10 conversation turns and 1 day later. I feel like current weak-to-strong (W2S) methods only work for "correctness" rather than for being a helpful agent.
The joke thing reminds me of infra-Bayesian or policy-dependent RL. In traditional RL the optimal policy can be taken to be deterministic, but if the reward function depends on the policy (the parameters of the model), it need not be. For example, you can sample 1000 jokes before the RL step and then penalize the model whenever it generates a joke similar to one of those 1000. Doing gradient descent/policy gradient this way is stable, since the policy dependence doesn't vary over time. The problem is that this is inefficient with neural networks: the only way to probe the policy is to sample the network many times, unless you use some sort of advanced interpretability or a KL penalty against a base model, which presumably doesn't scale (it would be policy-dependent in the sense of logit-dependence instead of parameter-dependence).
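A rough sketch of the penalty I have in mind, with everything concrete here being an assumption: token-level Jaccard overlap stands in for whatever similarity measure you'd actually use, and the `threshold` and `penalty` values are arbitrary. The reference set is the fixed batch sampled before the RL step.

```python
from typing import Iterable, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a crude stand-in for a real similarity model."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def novelty_penalized_reward(
    candidate: str,
    base_reward: float,
    reference_samples: Iterable[str],
    threshold: float = 0.6,  # assumed cutoff for "similar"
    penalty: float = 1.0,    # assumed penalty size
) -> float:
    """Subtract a penalty if the candidate is too close to any pre-sampled output."""
    max_sim = max((jaccard(candidate, ref) for ref in reference_samples), default=0.0)
    return base_reward - penalty if max_sim >= threshold else base_reward


# Fixed samples drawn from the policy before the update (2 here instead of 1000).
reference = [
    "why did the chicken cross the road",
    "knock knock who is there",
]

repeat = novelty_penalized_reward("why did the chicken cross the street", 1.0, reference)
fresh = novelty_penalized_reward("a horse walks into a bar", 1.0, reference)
print(repeat, fresh)  # 0.0 1.0
```

Because `reference` is frozen during the update, the reward is an ordinary fixed function of the output for that step, which is why the gradient step stays well-defined.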
Maybe the fix, as you say, is just accepting that it's deterministic and then using some sort of joke diffusion process to generate new jokes by using the randomness from the initial input.