[recipe] feat: @register_policy_loss("geo_mean"); Geometric-Mean Policy Optimization #2795

MzeroMiko · 2025-07-29T06:20:03Z

What does this PR do?

This is the official implementation of the paper Geometric-Mean Policy Optimization (https://arxiv.org/abs/2507.20673).

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

The code has been trained for 100 iterations so far, and the run is still in progress.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

A new policy loss function has been added to "verl/trainer/ppo/core_algos.py":

@register_policy_loss("geo_mean")
def compute_policy_loss_geo_mean(
    old_log_prob: torch.Tensor,
    log_prob: torch.Tensor,
    advantages: torch.Tensor,
    response_mask: torch.Tensor,
    loss_agg_mode: str = "token-mean",
    config: Optional[DictConfig | AlgoConfig] = None,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    ...

We also added the directory "examples/gmpo_trainer" as a quick start.
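As a rough illustration of the idea behind the loss name: the geometric mean of per-token importance ratios equals the exponential of the mean log-ratio over the unmasked tokens. The following is a minimal pure-Python sketch under that assumption, not the code merged in this PR. `geo_mean_ratio` and its list-based arguments are hypothetical stand-ins for one row of the `old_log_prob`, `log_prob`, and `response_mask` tensors.

```python
import math


def geo_mean_ratio(old_log_prob, log_prob, response_mask):
    """Geometric mean of per-token ratios exp(log_prob - old_log_prob).

    Hypothetical helper for illustration only. Each argument is a list of
    floats (or 0/1 mask values) for a single response sequence.
    """
    n_tokens = sum(response_mask)
    if n_tokens == 0:
        # This sketch simply rejects an empty mask instead of guarding it.
        raise ValueError("response_mask selects no tokens")
    # Sum of log-ratios over unmasked tokens only.
    log_ratio_sum = sum(
        m * (new - old)
        for old, new, m in zip(old_log_prob, log_prob, response_mask)
    )
    # exp(mean of logs) is the geometric mean of the ratios.
    return math.exp(log_ratio_sum / n_tokens)


# Two unmasked tokens with ratios 2 and 8; the third token is masked out.
ratio = geo_mean_ratio(
    [0.0, 0.0, 0.0],
    [math.log(2.0), math.log(8.0), 5.0],
    [1, 1, 0],
)
```

Here `ratio` is approximately 4, the square root of 2 × 8. Working in log space avoids multiplying many per-token ratios together directly.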

Design & Code Changes

See above.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)

gemini-code-assist

Code Review

This pull request introduces Geometric-Mean Policy Optimization (GMPO) by adding a new policy loss function. The implementation is a good step towards incorporating this new algorithm. However, I've identified a critical issue related to potential division-by-zero errors that could cause training instability, and another high-severity issue concerning redundant code that impacts maintainability. Addressing these points will make the implementation more robust and cleaner.
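The division-by-zero concern can be illustrated with a masked mean whose denominator is the number of unmasked tokens. The sketch below assumes an epsilon-based guard, which is one common choice and not necessarily the fix applied in this PR. `masked_mean` and `eps` are hypothetical names, and plain Python lists stand in for tensors.

```python
def masked_mean(values, mask, eps=1e-8):
    """Mean of `values` over positions where `mask` is 1.

    Hypothetical helper for illustration only. Adding `eps` to the
    denominator keeps the result finite when the mask selects no tokens.
    """
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / (count + eps)


# An all-zero mask would otherwise divide by zero.
empty = masked_mean([1.0, 2.0], [0, 0])
# With a non-empty mask the result is the ordinary mean, up to eps.
normal = masked_mean([1.0, 3.0], [1, 1])
```

With float tensors in PyTorch, an unguarded 0/0 produces NaN instead of raising, so the problem can surface later as a NaN loss.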

verl/trainer/ppo/core_algos.py

CLAassistant · 2025-07-29T06:22:41Z

All committers have signed the CLA.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

examples/gmpo_trainer/run_qwen2_5-7b_math.sh

examples/gmpo_trainer/gmpo_teaser.png

vermouth1992 · 2025-07-29T12:40:45Z

Please fix the pre-commit failures according to the README, and then let's merge it.

MzeroMiko added 3 commits July 28, 2025 23:25

update gmpo

8a87117

Merge remote-tracking branch 'upstream/main'

b1372ce

update readme

4e49963

MzeroMiko requested review from PeterSH6, eric-haibin-lin, tongyx361 and vermouth1992 as code owners July 29, 2025 06:20

gemini-code-assist bot reviewed Jul 29, 2025

verl/trainer/ppo/core_algos.py (outdated, resolved)

verl/trainer/ppo/core_algos.py (resolved)

MzeroMiko and others added 2 commits July 29, 2025 14:23

Update verl/trainer/ppo/core_algos.py to avoid response_mask.sum() == 0

d53b0f0

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

rename parameter: rename clip_ratio for clarity

361e365

vermouth1992 reviewed Jul 29, 2025

examples/gmpo_trainer/run_qwen2_5-7b_math.sh (resolved)

update script with dapo

0b0293b

vermouth1992 reviewed Jul 29, 2025

examples/gmpo_trainer/gmpo_teaser.png (outdated, resolved)

MzeroMiko added 3 commits July 29, 2025 20:09

Update README.md

d64500b

Delete examples/gmpo_trainer/gmpo_teaser.png

1633cb1

Update core_algos.py (for code line length limitation)

a2dc825

MzeroMiko changed the title from "[recipe] {feat}: {@register_policy_loss("geo_mean"); Geometric-Mean Policy Optimization}" to "[recipe] feat: @register_policy_loss("geo_mean"); Geometric-Mean Policy Optimization" on Jul 29, 2025

vermouth1992 approved these changes Jul 29, 2025

regular format with ruff

dc45707

vermouth1992 approved these changes Jul 29, 2025

vermouth1992 merged commit 977b7d9 into volcengine:main Jul 29, 2025
46 of 51 checks passed

erictang000 mentioned this pull request Jan 2, 2026

[skyrl-train] Add GMPO NovaSky-AI/SkyRL#834

Open
