Reinforcement learning (RL) has emerged as a fundamental approach in LLM post-training, utilizing supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it…
Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment
