Crome: Google DeepMind’s Causal Framework for Robust Reward Modeling in LLM Alignment

Reward models are fundamental components for aligning LLMs with human feedback, yet they are vulnerable to reward hacking. These models tend to focus on superficial attributes such as response length or formatting rather than…