The Paper
"GDPO: Group reward-Decoupled Normalization Policy Optimization" was released in January 2026 by Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov - a team of researchers at NVIDIA. The paper's central claim is that applying Group Relative Policy Optimization (GRPO) directly to multi-reward settings is fundamentally flawed: normalizing a combined reward signal causes distinct rollout advantages to collapse into identical values, eroding the training signal and causing instability. The authors introduce GDPO, which decouples normalization to operate per reward before aggregation, preserving advantage resolution and enabling stable, consistent improvement across tool calling, math reasoning, and coding tasks.
The Problem Before This Paper
GRPO, introduced as part of DeepSeek-R1's training pipeline, computes advantages by normalizing rewards across a group of rollouts from the same prompt. It was designed for single-reward settings, where the only signal is, say, answer correctness. Modern RLHF pipelines increasingly combine multiple rewards - correctness, format adherence, output length constraints - to shape more nuanced behavior. The straightforward approach is to aggregate these rewards into a weighted sum and apply GRPO as-is. The NVIDIA team shows this is not a safe default: when rewards of varying scale and difficulty are summed before normalization, rollouts that differ meaningfully on individual reward components end up with nearly identical normalized advantages. The gradient signal becomes nearly uniform across the group, reducing effective batch diversity and, in the case of math reasoning with an added length reward, causing training collapse around step 400.
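To make the collapse concrete, here is a minimal NumPy sketch with made-up reward values (not the paper's data): two binary rewards over four rollouts from one prompt, summed and then group-normalized as plain GRPO would do.

import numpy as np

# Toy illustration of advantage collapse (illustrative values, not the paper's).
correctness = np.array([1.0, 1.0, 0.0, 0.0])   # harder reward
formatting  = np.array([0.0, 1.0, 1.0, 1.0])   # easier reward

combined = correctness + formatting             # aggregate first (w_i = 1)
adv = (combined - combined.mean()) / (combined.std() + 1e-8)
print(adv.round(3))                             # [-0.577  1.732 -0.577 -0.577]
# Rollouts 0, 2, and 3 have very different reward profiles (correct but badly
# formatted vs. wrong but well formatted) yet receive identical advantages:
# the per-component signal is erased and group diversity shrinks.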
What They Built
GDPO's core change is to normalize each reward on its own before aggregation, rather than normalizing the aggregated sum. For each reward r_i in the multi-reward set, a group-wise mean and standard deviation are computed independently across the rollout group, producing a per-reward normalized advantage A_i. These per-reward advantages are then summed and passed through a final batch-level normalization to produce the final advantage A used in the policy gradient update. This two-level normalization - per-reward group normalization followed by batch normalization - ensures that each reward retains its relative ordering across rollouts regardless of the absolute scale differences between reward types. The paper also introduces reward conditioning as a complementary technique: when one reward is significantly easier to satisfy than another (e.g., format correctness versus answer correctness in math, where correctness is far harder to earn), the easier reward's gradient is gated on whether the harder reward was already satisfied, preventing the model from gaming the easier signal at the cost of the harder one.
// GRPO (broken in multi-reward):
A = normalize( sum_i( w_i * r_i ) )
// GDPO (decoupled):
A_i = group_normalize( r_i ) // per-reward, independent
A = batch_normalize( sum_i( w_i * A_i ) )
// Reward conditioning (difficulty-aware gating):
A_easy = A_easy * I[ r_hard > threshold ]
// Easy reward gradient fires only when hard reward is satisfied
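A runnable NumPy translation of the pseudocode above, applied to the same toy rollouts as the earlier sketch (function names, the epsilon, and the gating threshold are my own illustrative choices, not the paper's reference implementation):

import numpy as np

def group_normalize(r, eps=1e-8):
    # Normalize one reward across the rollout group for a single prompt.
    return (r - r.mean()) / (r.std() + eps)

def gdpo_advantage(rewards, weights, eps=1e-8):
    # Per-reward group normalization, then a weighted sum, then one more
    # batch-level normalization, mirroring the pseudocode above.
    per_reward = {k: group_normalize(r, eps) for k, r in rewards.items()}
    summed = sum(weights[k] * a for k, a in per_reward.items())
    return (summed - summed.mean()) / (summed.std() + eps), per_reward

rewards = {
    "correctness": np.array([1.0, 1.0, 0.0, 0.0]),   # harder reward
    "formatting":  np.array([0.0, 1.0, 1.0, 1.0]),   # easier reward
}
weights = {"correctness": 1.0, "formatting": 1.0}

adv, per_reward = gdpo_advantage(rewards, weights)
print(adv.round(3))           # -> approx [-0.796  1.716 -0.46  -0.46]

# Reward conditioning: the easy reward's advantage contributes only where the
# hard reward is already satisfied (0.5 is an illustrative threshold).
gate = (rewards["correctness"] > 0.5).astype(float)
conditioned = per_reward["correctness"] + gate * per_reward["formatting"]
print(conditioned.round(3))   # -> approx [-0.732  1.577 -1.0  -1.0]

Compared with the GRPO sketch earlier, rollout 0 (correct but badly formatted) is no longer tied with rollouts 2 and 3 (wrong but well formatted); the remaining tie between rollouts 2 and 3 is expected, since their reward profiles are identical.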
Key Findings
- Advantage collapse is a structural problem, not a tuning issue. As rollout count or reward count increases, the number of distinct advantage groups under GRPO drops sharply. GDPO maintains high group diversity across all configurations tested.
- GRPO collapses on math reasoning; GDPO does not. On DeepSeek-R1-1.5B with correctness and length rewards, GRPO training diverges around step 400. GDPO runs to completion with stable loss curves.
- Removing std normalization from GRPO ("GRPO w/o std") is insufficient. This variant marginally improves advantage diversity but still produces 0% format correctness on BFCL-v3 tool calling - the same failure mode as standard GRPO.
- Reward conditioning outperforms weight adjustment. Manually down-weighting easier rewards reduces their contribution but does not prevent the model from satisfying them at the wrong time. Conditioning gates the gradient more precisely.
- Gains are consistent across model sizes and task types. GDPO improves over GRPO at both 1.5B and 3B parameter scales and across all three task domains tested, with no settings where GRPO is competitive.
Results
On tool calling with Qwen2.5-Instruct models evaluated on BFCL-v3, GDPO raises average accuracy from 30.18% to 32.81% at 1.5B parameters, and from 39.20% to 40.87% at 3B - while also improving format correctness from 76.33% to 80.66% (1.5B) and 81.64% to 82.23% (3B). On math reasoning with DeepSeek-R1-1.5B, GDPO achieves 29.4% on AIME-24 versus GRPO's 23.1%, and 86.2% on MATH500 versus 83.6%, while also reducing the length-exceed rate from 10.8% to 6.5% - a direct indicator of improved length reward adherence. On coding with DeepSeek-R1-7B evaluated on Codeforces, pass rate improves from 68.1% to 71.2% and bug ratio drops from 7.0% to 5.6%. Across all tasks, no single setting shows a regression under GDPO relative to GRPO.
Why This Matters for AI and Automation
- Multi-reward RLHF is the production default. Single-reward training is a research simplification. Any real alignment pipeline - whether for assistants, agents, or domain-specific models - combines at least correctness, safety, and format signals. This paper exposes a silent failure mode that practitioners likely already hit without knowing the cause.
- Drop-in fix with no architectural change. GDPO changes only where normalization is applied - per reward before aggregation instead of once after it - rather than introducing a new algorithm. It requires no changes to model architecture, reward model design, or rollout strategy. The integration cost is low relative to the reliability gain.
- Training stability directly reduces compute waste. The GRPO collapse on math at step 400 means hours of GPU time and reward model queries are lost to a failed run. GDPO eliminates this category of failure, making RL fine-tuning runs more predictable and resource-efficient at scale.
- Reward conditioning is broadly applicable. The insight that reward difficulty disparities require gating, not just reweighting, generalizes to any pipeline where one signal is harder to satisfy than another - a common scenario in enterprise LLM fine-tuning for structured output or compliance requirements.
My Take
This is an important paper precisely because it is narrow and empirically honest. The contribution is not a new architecture or a new training paradigm - it is a documented failure mode in a widely adopted method, with a principled fix and rigorous ablations. The collapse behavior GRPO exhibits at step 400 in math training is the kind of issue that would appear in an internal experiment as "unstable training run" and get attributed to hyperparameters or data quality. The team here does the work to isolate the cause: advantage homogenization from pre-normalization aggregation. The solution, decoupled group normalization, is elegant and the ablation showing GRPO w/o std still fails on format rewards is particularly well-designed - it rules out the obvious partial fix before proposing the full one. The open question is whether GDPO's advantage of preserving per-reward resolution holds when reward count grows to five or more signals, as some enterprise alignment pipelines require, or when rewards are correlated rather than independent. The paper tests up to two rewards, and the interaction dynamics in higher-dimensional reward spaces remain unexplored.
Discussion question: If GRPO's normalization collapses advantages when rewards are aggregated before normalization, what does this imply about reward model design choices - specifically, should practitioners design reward models that output independent scalar signals rather than composite scores, even when the behaviors being evaluated are inherently correlated?