[algo] Add SPO (Single-stream Policy Optimization) recipe implementation #3503
Conversation
- Add SPO algorithm implementation with KL-adaptive value tracker
- Implement single-stream architecture eliminating group synchronization
- Add prioritized sampling and global advantage normalization
- Include comprehensive README with performance results and usage guide
- Add configuration files and training scripts
- Achieve +3.4 pp improvement on math benchmarks vs GRPO
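For context, a minimal sketch of the single-stream idea behind these changes: a per-prompt value tracker supplies the baseline (with a step size that could be modulated by policy KL), and advantages are normalized across the whole batch rather than per prompt group. The class names, update rule, and parameters below are illustrative assumptions, not the recipe's actual code.

```python
# Illustrative sketch only: per-prompt value tracker plus globally
# normalized single-stream advantages. The exact update rule here
# (a learning-rate blend scaled by a KL-based factor) is an assumption.
from collections import defaultdict
import numpy as np


class ValueTracker:
    def __init__(self, init_value=0.5, base_lr=0.1):
        self.values = defaultdict(lambda: init_value)  # v_hat per prompt id
        self.base_lr = base_lr

    def update(self, prompt_id, reward, kl=0.0):
        # Larger policy drift (KL) -> trust the old estimate less.
        lr = min(1.0, self.base_lr * (1.0 + kl))
        self.values[prompt_id] += lr * (reward - self.values[prompt_id])

    def advantage(self, prompt_id, reward):
        # Single stream: one rollout per prompt, baseline is the tracked value.
        return reward - self.values[prompt_id]


def normalize_global(advantages, eps=1e-6):
    # Normalize over the whole batch instead of within a prompt group.
    adv = np.asarray(advantages, dtype=np.float32)
    return (adv - adv.mean()) / (adv.std() + eps)
```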
Remove Chinese language comments from spo_ray_trainer.py to improve code readability and maintain English-only codebase standards.
…990407/verl_spo_dev into feature/spo-implementation
Code Review
This pull request introduces the Single-stream Policy Optimization (SPO) algorithm, a novel reinforcement learning method for Large Language Models. The changes primarily consist of new files for the SPO recipe, including configuration, the main training script, a run script, and the core Ray trainer implementation. My review has identified two critical issues. First, the run_spo.sh script uses an undefined variable which will cause the training to fail at launch. Second, the spo_ray_trainer.py contains unsafe exception handling during data resampling, which could lead to silent data corruption and hard-to-debug training failures. Addressing these issues is crucial for the correctness and stability of the new algorithm's implementation.
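On the second point, a hedged sketch of the kind of resampling guard the review is asking for: bound the retries, log each failure, and re-raise instead of continuing with possibly corrupt data. The function name and dataloader usage are placeholders, not the trainer's real API.

```python
import logging

logger = logging.getLogger(__name__)


def resample_batch(dataloader, max_retries=3):
    """Placeholder resampling helper: retry a bounded number of times and
    surface the failure instead of silently returning a bad batch."""
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            return next(iter(dataloader))
        except Exception as err:  # would be narrowed to expected errors in real code
            last_err = err
            logger.warning("Resampling attempt %d/%d failed: %s", attempt, max_retries, err)
    # Fail fast rather than training on possibly corrupt data.
    raise RuntimeError("Data resampling failed after retries") from last_err
```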
Could you pin the verl commit in your readme?
recipe/spo/README.md
Outdated
```bash
# Enable SPO training mode
export SPO_ENABLE=True
export SPO_OFFLINE_VALUES="/path/to/offline/values.json"
```
what is the purpose of this file?
see Appendix A
This file is an offline value estimate (Appendix A); I have added a link to Hugging Face in the README.
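For context on how such a file might be consumed, here is a minimal sketch assuming the two environment variables from the README snippet above; the actual wiring in spo_ray_trainer.py may differ.

```python
import json
import os

# Assumed consumption of the environment variables shown above; the real
# trainer may read its configuration differently.
spo_enabled = os.environ.get("SPO_ENABLE", "False").lower() == "true"

offline_values = {}
offline_path = os.environ.get("SPO_OFFLINE_VALUES")
if spo_enabled and offline_path:
    with open(offline_path) as f:
        # Assumed file layout: {"<prompt_id>": <v_hat>, ...}
        offline_values = json.load(f)
```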
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
I have updated it in the README file.
- Switch offline values from local JSON file to HuggingFace dataset loading
- Update README with offline value generation instructions
- Add debug mode support with RAY_DEBUG flag in config
- Fix config name reference from ppo_trainer to spo_trainer
- Update batch sizes and paths to use environment variables
- Change custom module paths from retool to spo directory
- Switch multi-turn format from retool_paper to hermes
- Adjust offline value threshold from 0 to 0.5 for binary classification

This improves the SPO training pipeline by using centralized dataset storage and providing better configuration flexibility through environment variables.
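As a rough illustration of the dataset-based loading and the 0.5 threshold mentioned in this commit (the dataset path and column names below are hypothetical; substitute the ones linked in the README):

```python
from datasets import load_dataset

# Hypothetical dataset path and columns, for illustration only.
ds = load_dataset("your-org/spo-offline-values", split="train")

offline_values = {
    row["prompt_id"]: 1.0 if row["value"] >= 0.5 else 0.0  # binarize at 0.5
    for row in ds
}
```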
@wuxibin89 @vermouth1992 @tongyx361 @PeterSH6 Thanks for all the great feedback! I have updated the code based on the review comments and pushed the changes. Please take a quick look when you have a chance. Thanks!
…dataset processing and execution
@hustnn @wuxibin89 @vermouth1992 @tongyx361 @PeterSH6 Hey folks, can you review this PR?
Hi @zhongwen-xu, I am not on the verl team, but I am interested in your work and am testing it on my own case (a coding model). If it works, I will let you know. Do I need to first run it to get offline values before training?
Hi @hustnn, thanks for your interest. For your question, please refer to Appendix E in https://arxiv.org/abs/2509.13232, especially Figure 6(c), the "Offline Initialization Ablation". Short answer: the offline initialization helps in the early training steps; in the long run, results with and without the initialization are similar. My recommendation is still to get the offline values before training: a good v_hat estimate from offline sampling is relatively cheap (the same cost as the sampling people already do to filter prompts), and the values can be shared across multiple experiment runs. Happy to answer any further questions by email!
Hi @zhongwen-xu, thanks for your great work on SPO. I'm testing your implementation using the training scripts provided in this PR (
Hi @zsgvivo, thanks for your interest in our work! We are working on a full replication of what we had in the paper, where the tool call protocol uses , rather than the JSON function calls of the ReTool implementation. (Note that we present all the core code in this PR, so if you modify the tool call protocol as described, it should just work without waiting for our implementation.) Stay tuned!
Hi @zhongwen-xu and SPO authors, thanks a lot for sharing your work and the initial SPO implementation in this PR. From your last comment, I understand that this PR version didn't pass your verification run and that you're working on a full replication with the updated tool-calling protocol (instead of the original ReTool JSON function calls). I'm very interested in reproducing the SPO results and applying the method to my own use cases. May I ask if there is currently a validated/recommended implementation of SPO available (e.g., a newer branch, updated scripts, or a different repo) that you would suggest users follow? Thanks again for the great work, and really appreciate any pointers!
@longls777 SPO has been merged into https://github.com/verl-project/verl-recipe
Thanks so much for sharing this implementation - really appreciate your help!

What does this PR do?
This PR implements Single-stream Policy Optimization (SPO), proposed in the paper https://arxiv.org/abs/2509.13232.
Checklist Before Starting
- Title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If the PR is breaking, add `[BREAKING]` to the beginning of the title, like `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)