Why improve human-AI collaboration?
Some notes
(I’ve written about similar topics here and here.)
“Humans in the loop” are useful for scalable oversight
We haven’t solved scalable oversight for hard-to-grade tasks where human feedback can’t be replaced with verifiable rewards
Generalization from easy-to-grade tasks to hard-to-grade tasks isn’t that good
Examples of hard-to-grade tasks:
Coming up with original ideas (e.g. for R&D)
Things related to taste: writing in a particular style, music, art, design
Things that depend on theory of mind: teaching/explaining things to people, communicating in a maximally clear/intelligible way, influencing people emotionally
“Humans in the loop” are useful for safety
Even if an AI can mostly learn to do a task, reward hacking remains an issue—models exploit cases where the reward model can’t detect a certain failure mode
E.g., changing test-cases rather than fixing bugs when trying to do well on coding problems
Reward hacking might be very costly when AIs are doing high-stakes things
“AI control” is a popular paradigm to mitigate the risks from “untrusted” models that might be reward hacking or otherwise misaligned
Though many AI control interventions rely on automated system-level interventions like classifiers and sandboxes, falling back to human review is a useful layer of defense that helps guard against misalignment at other layers of the system
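As a rough illustration of how an automated check and human review can be layered, here is a minimal sketch for the test-editing example above. It assumes a coding agent's work is available as two snapshots of a repository, each a mapping from file path to file contents. The function names, the `tests/` path convention, and the routing labels are hypothetical and are not taken from any particular AI control framework.

```python
import hashlib


def fingerprint(files: dict[str, str]) -> dict[str, str]:
    """Map each file path to a SHA-256 digest of its contents."""
    return {
        path: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for path, text in files.items()
    }


def tampered_tests(before: dict[str, str], after: dict[str, str],
                   test_prefix: str = "tests/") -> list[str]:
    """Return pre-existing test files that were modified or deleted."""
    old, new = fingerprint(before), fingerprint(after)
    return sorted(
        path for path in old
        if path.startswith(test_prefix) and new.get(path) != old[path]
    )


def route(before: dict[str, str], after: dict[str, str]) -> str:
    """Escalate to a person whenever existing tests were touched (assumed policy)."""
    return "human_review" if tampered_tests(before, after) else "auto_accept"
```

This sketch only catches edits to existing test files. Newly added tests, or subtler hacks such as special-casing test inputs in the source code, would pass through, which is part of why human review remains useful as a further layer.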
Human education and wellbeing
People can potentially learn very efficiently from AI (insofar as they want to learn anything)
We can raise the capabilities of the top humans by having them learn from AI
Many people would find it fun to collaborate closely with AI in creating things
It’s easy to underrate the importance of keeping “humans in the loop” if you don’t consider the possibility of widespread loss of skills/knowledge in an increasingly AI-automated world (we are already seeing signs of this with software engineers losing understanding of AI-written codebases). If the vast majority of things are being done by AI with minimal human involvement or understanding, at what point will people notice if the AI starts doing subtly wrong or mistargeted things? There’ll no longer be an incentive for the AI to direct its energy towards human progress and flourishing if no one can check whether what it’s doing is useful. So the scalable oversight and robustness issues get more severe the more we delegate running everything to AI.
A research agenda for improving human-AI collaboration
Diverse instruction-following
People often talk of mitigating mode-collapse. By default, RL moves towards a single optimum and doesn’t retain knowledge of diverse optimal or near-optimal strategies. So, if you ask a model to tell a joke or suggest a character name for a story, it will give you the same small set of options, often even if you raise the sampling temperature. Perhaps controversially, I don’t think this is a problem per se.
You may very validly want a wide variety of joke options or story character names. More generally, there are a number of reasons why diversity is useful:
Being able to try many strategies in the real world or do best-of-N (BoN) sampling (useful for R&D, for example); test-time learning
Because you literally need more than one good example of something
Because you want the same model to cater to different preferences
However, I think all these goals can also be fulfilled “simply” with better instruction-following capabilities. Specifically:
To get many ideas or strategies, you should be able to point a model at previous ideas and ask it to generate novel ones.
To cater to different preferences, the model should be able to easily figure out what specific things the current user wants and cater to that.
So we can have a technically mode-collapsed model, i.e. one that always produces the same or similar outputs for the exact same input, even at high sampling temperature, as long as it also:
Is very sensitive to different inputs and good at adapting to them
Accepts a wide variety of inputs, including things that current LLMs don’t allow you to input
Knows when to proactively solicit or look for additional guidance/input
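To make the claim concrete, here is a toy sketch of getting diverse outputs from a fully deterministic "model" by feeding its earlier ideas back in as input. Everything here is invented for illustration: `toy_model` is a stand-in for an LLM call, and the candidate pool is made up.

```python
CANDIDATES = ["Ada", "Bram", "Celeste", "Dmitri"]  # toy idea pool (made up)


def toy_model(request: str, avoid: tuple[str, ...] = ()) -> str:
    """Deterministic stand-in for a model: the same input always gives the same output."""
    for name in CANDIDATES:
        if name not in avoid:
            return name
    raise ValueError("toy pool exhausted")


def diverse_ideas(request: str, n: int) -> list[str]:
    """Collect n distinct ideas by passing all earlier ideas back in as input."""
    ideas: list[str] = []
    for _ in range(n):
        ideas.append(toy_model(request, avoid=tuple(ideas)))
    return ideas
```

Calling `toy_model` twice with the same request returns the same name, which is mode collapse in the narrow sense. Yet `diverse_ideas` still yields distinct names, because the diversity comes from changing the input rather than from sampling noise.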
An instruction-following research agenda should target the goals listed above. There are many possible directions here:
Increasing human-AI bandwidth by incorporating more modalities
Generating better instruction sensitivity training datasets and environments
Training on tasks that involve back-and-forth interactions with people
I think “intent alignment” and “corrigibility” are very similar concepts to what I mean by “precise and diverse instruction following” here
Explainability—make it easier for people to understand what AI models are doing so that they can get more involved in oversight and steering. Can be broken down into 3 subgoals:
Train models to be good at explaining / teaching general things about the world (e.g. STEM topics, etc.); build a “supertutor”
Improve theory of mind
More input modalities
UI/UX
Train models to produce intrinsically understandable outputs, e.g.:
Writing clearly and concisely
Clean, readable, well-structured code
Easy to understand plans
Train models to explain their own outputs:
Good at explaining the intent / abstractions behind decisions
This is different from the previous subgoal (intrinsically understandable outputs): that one is about producing easy-to-audit outputs by default, whereas this one is about augmenting default outputs with additional explanations and being able to discuss outputs with users
Importantly, explainability != (mechanistic) interpretability. The explanation for why some code is correct doesn’t have to correspond to the reasoning process by which the model came up with the code internally. Rather, it is a possibly different but still valid explanation aimed at showing a human why the solution is correct (and ideally educating them about the general problem). An analogy could be a mathematician who intuits a result based on some very abstract ideas about how things ought to fit together aesthetically, but then comes up with an easy-to-follow proof post hoc and writes that down in the paper.
Though it is hard to scalably measure, one possible “eval” flavor for explainability is: having seen this AI output and/or explanation, how much better is a human at solving a similar but new problem?
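One way to put a number on this "eval" flavor is a simple two-group comparison. The sketch below assumes a study design that the text does not specify: one group of participants sees the AI output and/or explanation before attempting the new problem, a control group does not, and each participant gets a numeric score. The function name `uplift` is hypothetical.

```python
from statistics import mean


def uplift(control_scores: list[float], treated_scores: list[float]) -> float:
    """Mean score of people who saw the explanation minus mean score of those who did not."""
    if not control_scores or not treated_scores:
        raise ValueError("both groups need at least one score")
    return mean(treated_scores) - mean(control_scores)
```

For example, if two of four control participants solve the new problem (scores `[0, 1, 0, 1]`) and three of four solve it after seeing the explanation (scores `[1, 1, 0, 1]`), the uplift is 0.25. A real study would also need enough participants to separate this difference from noise.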
In summary
There’s some circularity here: improving human-AI collaboration is useful for general scalable oversight, yet making AIs better at collaborating with humans (e.g. via precise instruction following and improved explainability) is itself a hard scalable oversight problem that needs a lot of human input and feedback. Still, I think this is surmountable!


