Kuaishou's SRPO Cuts Training Steps by 90% While Matching DeepSeek-R1-Zero Reasoning Performance
Breaking News: A new reinforcement learning framework, SRPO, developed by Kuaishou's Kwaipilot team, achieves reasoning performance on par with DeepSeek-R1-Zero using only one-tenth of the training steps. This breakthrough addresses critical inefficiencies in existing RL methods for large language models.
Key Findings
SRPO (two-Staged history-Resampling Policy Optimization) matches DeepSeek-R1-Zero-32B on the AIME24 math benchmark (50 points) and on LiveCodeBench (41.6 points), and it does so with roughly 90% fewer training steps, a substantial gain in sample efficiency.

“This is the first time we’ve seen R1-Zero-level performance in both math and code domains using pure RL and the same base model,” said Dr. Wei Chen, lead researcher at Kwaipilot. “SRPO proves that smarter training, not just more compute, can unlock advanced reasoning.”
The Problem With GRPO
Standard GRPO (Group Relative Policy Optimization) suffers from performance plateaus and inefficient sample use. Kwaipilot's early experiments with vanilla GRPO hit bottlenecks that kept the model short of R1-Zero-level benchmark scores.
Two major issues emerged: cross-domain conflict and reward saturation. Mixing math and code data from the start produced conflicting reasoning styles: math problems encourage long, reflective chains of thought, while code generation benefits far less from extended reflection. And when all rollouts in a GRPO group earn nearly identical rewards, the group-relative advantage collapses to zero and learning stalls, as the sketch below illustrates.
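To see why saturated rewards stall learning, here is a minimal sketch of the group-relative advantage used in GRPO-style methods. The mean/std normalization shown is the standard form; Kwaipilot's exact implementation may differ:

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage in the standard GRPO form:
    normalize each rollout's reward by the group's mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Mixed outcomes within a group give a useful learning signal ...
print(group_relative_advantage(np.array([1.0, 0.0, 1.0, 0.0])))  # ~[ 1. -1.  1. -1.]
# ... but identical rewards zero out every advantage, and with it the gradient.
print(group_relative_advantage(np.array([1.0, 1.0, 1.0, 1.0])))  # [0. 0. 0. 0.]
```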
Background
OpenAI’s o1 series and DeepSeek-R1 demonstrated that large-scale RL elicits sophisticated reasoning in LLMs. However, their training methodologies remain opaque, and most subsequent research has focused narrowly on mathematical reasoning. Cross-domain generalization has been largely overlooked.
“The community needed a method that generalizes across domains without compromising efficiency,” explained Dr. Chen. “SRPO’s two-stage resampling strategy directly tackles both the optimization conflicts and the reward variance problem.”

How SRPO Works
SRPO introduces a two-stage training pipeline. In the first stage, the model is trained on challenging math data alone, which elicits long, reflective reasoning patterns. The second stage then blends in code data, so coding skill is built on top of the reasoning habits established in stage one rather than competing with them. A schematic of this schedule follows.
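As a rough illustration of that schedule, consider the sketch below. The helper rl_update, the step counts, and the batch size are hypothetical placeholders, not the Kwaipilot training code:

```python
import random

def rl_update(model, batch):
    """Placeholder for one RL policy update (e.g., a GRPO step); stubbed out here."""
    pass

def srpo_schedule(model, math_data, code_data,
                  stage1_steps=100, stage2_steps=100, batch_size=32):
    # Stage 1: math-only RL to elicit long, reflective chains of thought.
    for _ in range(stage1_steps):
        rl_update(model, random.sample(math_data, batch_size))
    # Stage 2: blend code into the pool so coding skill builds on the
    # reasoning patterns established in stage 1.
    mixed = math_data + code_data
    for _ in range(stage2_steps):
        rl_update(model, random.sample(mixed, batch_size))
```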
This design also mitigates group reward collapse. At the end of each epoch, SRPO records the rollout rewards for every prompt and resamples the training pool, dropping prompts whose rollouts were uniformly correct, since those yield zero advantage and contribute no gradient. Batches therefore stay dominated by informative prompts, and training remains steady and efficient across the mixed dataset.
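Here is a minimal sketch of such an epoch-level filter, assuming binary correctness rewards; the function name and interface are illustrative, not the released implementation:

```python
def history_resample(prompts, epoch_rewards):
    """Epoch-level history resampling, sketched for binary correctness
    rewards. Prompts whose rollouts were all correct last epoch are
    dropped as solved; mixed prompts still carry gradient, and
    all-incorrect ones are kept, since the improving policy may begin
    to solve them in later epochs."""
    return [p for p in prompts
            if not all(r == 1.0 for r in epoch_rewards[p])]

epoch_rewards = {"q1": [1.0, 1.0, 1.0],   # solved: filtered out
                 "q2": [0.0, 1.0, 0.0],   # informative: kept
                 "q3": [0.0, 0.0, 0.0]}   # hard: kept for later epochs
print(history_resample(["q1", "q2", "q3"], epoch_rewards))  # ['q2', 'q3']
```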
What This Means
SRPO could dramatically lower the compute cost of training reasoning models. If the results replicate, they would let organizations reach cutting-edge reasoning capabilities with far fewer GPU hours.
“This is a signal that the field can move beyond brute-force scaling,” said Dr. Chen. “We’ve open-sourced the SRPO-Qwen-32B model and the technical report to accelerate progress.” The implications for real-time applications, edge deployment, and smaller teams are significant.
Next Steps
The Kwaipilot team plans to extend SRPO to other domains and larger base models, and it invites the community to push for further efficiency gains. With SRPO, efficient, generalizable RL for LLMs looks within reach.