Publications
Other (International) A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning
Akifumi Wachi, Hirota Kinoshita (Toyota Technological Institute at Chicago), Shokichi Takakura, Rei Higuchi (University of Tokyo/RIKEN AIP), Taiji Suzuki (University of Tokyo/RIKEN AIP)
arXiv.org (arXiv)
2026.2.2
Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a relative-budget theory explaining this variation through a single quantity called relative budget ξ := H/E[T], where H is the generation horizon (token budget) and T denotes the number of tokens until the first correct solution under a base policy. We show that ξ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the deficient regime (ξ → 0), informative trajectories are rare and the sample complexity explodes; in the balanced regime (ξ = Θ(1)), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the ample regime (ξ → ∞), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget ξ ∈ [1.5, 2.0] that maximizes learning efficiency and coincides with peak reasoning performance.
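The sketch below is a minimal, illustrative example (not the authors' implementation) of how the relative budget ξ = H/E[T] from the abstract could be estimated from base-policy rollouts and mapped to the three regimes. The function names and the regime cutoffs outside the reported [1.5, 2.0] band are assumptions made for illustration only.

```python
import statistics

def relative_budget(horizon_tokens, first_correct_token_counts):
    """Estimate xi = H / E[T] from sampled rollouts.

    horizon_tokens: generation horizon H (token budget per rollout).
    first_correct_token_counts: observed token counts until the first
        correct solution under the base policy, one per sampled prompt.
    """
    expected_t = statistics.mean(first_correct_token_counts)
    return horizon_tokens / expected_t

def regime(xi, balanced_low=1.5, balanced_high=2.0):
    """Map xi to the regimes described in the abstract.

    The [1.5, 2.0] band is the empirically identified sweet spot reported
    in the abstract; treating it as the exact boundary of the balanced
    regime is an illustrative assumption.
    """
    if xi < balanced_low:
        return "deficient (informative trajectories are rare)"
    if xi <= balanced_high:
        return "balanced (maximally sample-efficient)"
    return "ample (stable but diminishing marginal gains)"

# Example: H = 4096 tokens and hypothetical T values from base-policy rollouts.
xi = relative_budget(4096, [1800, 2400, 2100, 3000])
print(f"xi = {xi:.2f}: {regime(xi)}")
```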
Paper:
A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning
(external site)