Crome: Google DeepMind’s Causal Framework for Robust Reward Modeling in LLM Alignment

Reward models are fundamental components for aligning LLMs with human feedback, yet they are susceptible to reward hacking. These models latch onto superficial attributes such as response length or formatting rather than…
