Discussion about this post

Emanuel Ruzak

I was thinking something similar about weak-to-strong (W2S) generalization. I think current methods work for things like having an imperfect human grader for proof-based math problems or well-scoped biology problems, where you want the reward model to find "truth" or "correctness." But problems like the following seem hard to fix:

For example, if I ask Claude to help me with a research project and the idea ends up not working, it won't criticize the idea; it will try to "push for a paper" without caring about the broader goal of the research. Or, in a conversation, the model won't explicitly reason about how beneficial its answer will be to me, say, 10 conversation turns and one day later. I feel like current W2S only works for "correctness" rather than for being a helpful agent.

The joke thing reminds me of infra-Bayesianism or policy-dependent RL. In traditional RL there is always a deterministic optimal policy, but if the reward function depends on the policy itself (the parameters of the model), that no longer holds. For example, you can sample 1000 jokes before the RL step and then penalize the model whenever it generates a joke similar to one of those 1000. Doing gradient descent/policy gradient this way is stable, since the policy dependence is fixed by the snapshot and doesn't vary over time. The problem is that this is inefficient with neural networks: the only way to probe the policy is to sample the network many times, unless you use some sort of advanced interpretability or a KL penalty against a base model, which presumably doesn't scale (that would be policy-dependent in the sense of logit-dependence rather than parameter-dependence).
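To make the snapshot idea concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption for illustration: the function names (`snapshot_policy`, `penalized_reward`), the word-overlap similarity, and the threshold and penalty values are all made up, and a real setup would probably use embeddings instead of word sets.

```python
import random


def jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; a crude stand-in for a real similarity measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def snapshot_policy(sample_fn, n=1000):
    """Probe the policy by sampling it n times BEFORE the RL step.

    The returned list is frozen, so the reward below does not change
    while the policy is being updated.
    """
    return [sample_fn() for _ in range(n)]


def penalized_reward(joke, base_reward, reference, threshold=0.8, penalty=1.0):
    """Subtract a penalty if the joke is too close to any snapshot sample."""
    max_sim = max((jaccard(joke, ref) for ref in reference), default=0.0)
    if max_sim >= threshold:
        return base_reward - penalty
    return base_reward


if __name__ == "__main__":
    # Hypothetical toy "policy": picks from a tiny fixed pool of jokes.
    pool = [
        "why did the chicken cross the road",
        "knock knock who is there",
    ]
    reference = snapshot_policy(lambda: random.choice(pool), n=20)
    print(penalized_reward("why did the chicken cross the road", 1.0, reference))
    print(penalized_reward("a totally new pun about gradients", 1.0, reference))
```

The cost shows up in `snapshot_policy`: each probe of the policy is a full generation, which is the inefficiency I mean.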

Maybe the fix, as you say, is just accepting that it's deterministic and then using some sort of joke diffusion process to generate new jokes by using the randomness from the initial input.
