GRPO (single-action)
Update: minimize reward-weighted NLL(action | prompt).
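A minimal sketch of the single-action update target, assuming the policy is a categorical distribution over a small discrete action set and that `reward` is a scalar (possibly already baseline-subtracted). The function name `reward_weighted_nll` and the `logits` argument are hypothetical and not taken from the runner.

```python
import math

def reward_weighted_nll(logits, action, reward):
    """Return reward * NLL(action | prompt) for one sampled action.

    `logits` are unnormalized scores the policy assigns to each action
    for a given prompt (hypothetical interface).
    """
    # Numerically stable log-sum-exp for the log partition function.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    # NLL of the chosen action is log Z minus its logit.
    nll = log_z - logits[action]
    return reward * nll
```

Minimizing this quantity raises the probability of actions with positive reward and lowers it for actions with negative reward.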
GSPO (grouped)
We minimize: sum_i q_i · NLL_i + β · KL(pi || pi_ref)
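The grouped objective can be sketched as follows. This is an illustration under stated assumptions, not the runner's implementation: the weights q_i are assumed to be a softmax of the group rewards divided by a temperature `eta`, and the KL term is computed exactly between two small categorical distributions `p` (current policy) and `p_ref` (reference). All function names are hypothetical.

```python
import math

def softmax_weights(rewards, eta):
    # Assumption: q_i = softmax(r_i / eta) over the K samples in the group.
    scaled = [r / eta for r in rewards]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_kl(p, p_ref):
    # KL(p || p_ref); assumes p_ref is nonzero wherever p is nonzero.
    return sum(a * math.log(a / b) for a, b in zip(p, p_ref) if a > 0)

def gspo_loss(nlls, rewards, p, p_ref, eta=1.0, beta=0.1):
    # sum_i q_i * NLL_i + beta * KL(p || p_ref)
    q = softmax_weights(rewards, eta)
    weighted_nll = sum(qi * nll for qi, nll in zip(q, nlls))
    return weighted_nll + beta * categorical_kl(p, p_ref)
```

With equal rewards the weights are uniform, and with `p == p_ref` the KL penalty vanishes, so the loss reduces to the mean NLL of the group.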
Implementation hooks
Flags (runner)
--samples
: group size K
--eta
: weight temperature
--adv-norm
: softmax | zscore | rank | baseline
--beta
: KL weight
--beta-schedule
: fixed | target; --target-kl
--ema-ref-decay
: EMA decay for reference policy
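As a rough illustration of how these flags might be wired up, here is a sketch using `argparse`. Only the flag names and the listed choices come from the list above. The types, the default values, the exact meaning of each `--adv-norm` mode, and the helper names `build_parser`, `normalize`, and `ema_update` are all assumptions made for the example.

```python
import argparse
import math

def build_parser():
    # Flag names follow the list above; types and defaults are placeholders.
    p = argparse.ArgumentParser(description="hypothetical runner")
    p.add_argument("--samples", type=int, default=4, help="group size K")
    p.add_argument("--eta", type=float, default=1.0, help="weight temperature")
    p.add_argument("--adv-norm", default="softmax",
                   choices=["softmax", "zscore", "rank", "baseline"])
    p.add_argument("--beta", type=float, default=0.1, help="KL weight")
    p.add_argument("--beta-schedule", default="fixed", choices=["fixed", "target"])
    p.add_argument("--target-kl", type=float, default=None)
    p.add_argument("--ema-ref-decay", type=float, default=0.99,
                   help="EMA decay for reference policy")
    return p

def normalize(rewards, mode, eta=1.0):
    # One plausible reading of each --adv-norm mode (assumed, not confirmed).
    k = len(rewards)
    mean = sum(rewards) / k
    if mode == "softmax":
        m = max(rewards)
        exps = [math.exp((r - m) / eta) for r in rewards]
        z = sum(exps)
        return [e / z for e in exps]
    if mode == "zscore":
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
        return [(r - mean) / (std + 1e-8) for r in rewards]
    if mode == "rank":
        # Map the lowest reward to 0 and the highest to 1.
        order = sorted(range(k), key=lambda i: rewards[i])
        ranks = [0.0] * k
        for pos, i in enumerate(order):
            ranks[i] = pos / (k - 1) if k > 1 else 0.0
        return ranks
    if mode == "baseline":
        return [r - mean for r in rewards]
    raise ValueError(f"unknown adv-norm mode: {mode}")

def ema_update(ref_params, params, decay):
    # Standard exponential moving average: ref <- decay * ref + (1 - decay) * current.
    return [decay * r + (1.0 - decay) * p for r, p in zip(ref_params, params)]
```

For example, `build_parser().parse_args(["--samples", "8", "--adv-norm", "rank"])` yields a namespace with `samples == 8` and `adv_norm == "rank"`; `argparse` converts dashes in flag names to underscores.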