Discussion about this post

Julian

>[Empirical AI safety researchers think] that it’s much easier to try to elicit a stable good and virtuous persona than to train the model to flexibly follow all instructions

I think this is true at current levels of capability: i.e., models are currently not capable of flexibly following all instructions *while correctly inferring what the user means and wants* from severely underspecified prompts, at least not at the output quality you get from a specific "good and virtuous persona".

Hopefully this changes in the future, although the fear of bad actors makes it less likely that labs will try.

This feels analogous to me to how it's easier for a bounded agent to be moral via virtue ethics than by trying to apply utilitarianism in a non-naive way.
