[1/N][rollout] feat: support vllm/sglang native http server #3456
Conversation
Code Review
This pull request introduces a significant and valuable refactoring to support native HTTP servers for vLLM and SGLang, which improves modularity and decoupling. The new RolloutReplica abstraction is clean and simplifies the server management logic. However, I've identified a few critical issues, particularly in the sglang integration, that need to be addressed to ensure correctness and maintainability, especially for multi-node and on-policy training scenarios. These are related to known FIXMEs in the code regarding device_mesh correctness and potential random weight loading on resume. Additionally, there's a fragile mock for vllm that should be replaced with a more robust solution.
Thank you for your contribution. May I ask which configuration changes are used to switch between these modes?
What's the difference between the server mode implemented in #3090 and the current HYBRID mode? Wasn't the previous server mode also running in separate processes?
May I ask why skip_tokenizer_init is set to True?
Because we want token-in-token-out as the default.
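To illustrate what token-in/token-out means in practice, here is a minimal sketch of sending pre-tokenized input to an sglang-style `/generate` endpoint. The field names (`input_ids`, `output_ids`), sampling parameters, and the server address are illustrative assumptions, not verl's exact request schema.

```python
# Hedged sketch: token-in/token-out generation against an sglang-style
# /generate endpoint. Field names and the address below are assumptions.
import requests
from transformers import AutoTokenizer

SERVER_URL = "http://127.0.0.1:30000"  # hypothetical rollout server address

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
prompt_ids = tokenizer.encode("What is reinforcement learning?")

# With skip_tokenizer_init=True the server never (de)tokenizes text itself:
# the client sends token ids and receives token ids back, so rollout and
# training always share exactly one tokenizer.
resp = requests.post(
    f"{SERVER_URL}/generate",
    json={
        "input_ids": prompt_ids,
        "sampling_params": {"max_new_tokens": 64, "temperature": 1.0},
    },
)
output_ids = resp.json()["output_ids"]  # response field name is an assumption
print(tokenizer.decode(output_ids))
```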
I'd like to ask: in a multi-node setup, can the other nodes reach the rollout server's address normally?
@Cesilina Yes, they can.
Got it, thanks.
Hi there, I'd like to understand the differences and relationship between the AsyncHttpServerAdapter implemented in #3090 and the current SGLangHttpServer used in HYBRID mode. From my reading of the code: when rollout_mode is set to async, the system uses SGLangHttpServer (as a Ray actor), and the two differ in their call paths (internal API calls vs. HTTP requests).
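For readers comparing the two call paths, here is a rough, hypothetical sketch of the HYBRID-mode shape discussed above: the inference server runs inside a Ray actor, and other components reach it over plain HTTP, which is also why other nodes can access it by address. The class and method names below are made up for illustration and are not verl's actual API.

```python
# Hypothetical sketch of the Ray-actor-plus-HTTP call path; names are made up.
import ray
import requests


@ray.remote(num_gpus=1)
class HttpServerActor:
    """Actor that would launch the engine's native HTTP server on its node."""

    def __init__(self, port: int = 30000):
        self.port = port
        # The real implementation would start the sglang/vllm server here.

    def get_address(self) -> str:
        # Advertise the node IP so callers on other nodes can reach the server.
        return f"http://{ray.util.get_node_ip_address()}:{self.port}"


# Usage sketch:
# ray.init()
# server = HttpServerActor.remote()
# address = ray.get(server.get_address.remote())
# requests.post(f"{address}/generate", json={"input_ids": [...]})  # plain HTTP, cross-node
```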
In this mode I noticed that AgentLoopWorker does not pass top_k, and config.calculate_log_probs defaults to False. Is this expected?
Hi @wuxibin89 and @vermouth1992, thank you for this PR. Is it possible for users to use the SGLang Router with the current code in main? Based on my analysis, we are not passing the router IP and port from the top-level trainer modules. I looked into the […] and only see the […]. Do we have to make other changes before we can enable SGLang Router support, given that you have added it as the last item in your list of items to complete?
May I ask which vLLM version this PR supports? On 0.9.1 it raises "No module named 'vllm.v1.engine.utils'".
@Ericnano >0.9.1 |
What does this PR do?
This is the first part of supporting the vllm/sglang native HTTP server in server-mode rollout. In native HTTP server mode, the inference services are launched separately from the training engine, and the model runner shares the GPU with the training engine but runs in a different process.
We're going to support three deployment modes:
- **hybrid mode**: Training engine and model runner share the GPU but run in different processes. To sync weights, there is a server adapter in the training process: an HTTP client that sends wake_up/sleep/update_weights requests to the inference server (see the sketch after this list). This is used for on-policy training.
- **standalone mode**: Training engine and inference services have separate GPU resources (disaggregated architecture). This is used for off-policy training.
- **colocated mode**: Like hybrid mode, but without the server adapter, since there is no need to sync weights. This is mainly used for a GRM service (LLM as a judge).
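As referenced in the hybrid-mode item above, a minimal sketch of what such a server adapter could look like is shown below. The endpoint paths, payloads, and class name are illustrative assumptions rather than the exact verl/sglang/vllm API.

```python
# Hedged sketch of a hybrid-mode "server adapter": a small HTTP client in the
# training process that asks the colocated inference server to release GPU
# memory, reload weights, and wake up again. Paths and payloads are assumptions.
import requests


class RolloutServerAdapter:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict | None = None) -> None:
        resp = requests.post(f"{self.base_url}{path}", json=payload or {})
        resp.raise_for_status()

    def sleep(self) -> None:
        # Free inference-side state so the training engine can use the full
        # GPU during the update step.
        self._post("/sleep")

    def wake_up(self) -> None:
        # Re-allocate inference state before the next rollout phase.
        self._post("/wake_up")

    def update_weights(self, checkpoint_path: str) -> None:
        # Point the server at the latest policy weights for on-policy rollout.
        self._post("/update_weights", {"model_path": checkpoint_path})


# Usage sketch:
# adapter = RolloutServerAdapter("http://127.0.0.1:30000")
# adapter.sleep()            # before the training step
# adapter.update_weights("/tmp/ckpt")  # after the optimizer step
# adapter.wake_up()          # before the next rollout
```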
Following PRs will be:
- [2/N] support DP+EP
- [3/N] standalone rollout with weight transfer by NCCL/UCX
- [4/N] colocated GRM service with wake_up/sleep (without weight synchronization)
- [5/N] switch to the `/generate` HTTP API with token-in-token-out: currently sglang has a `/generate` API but may need some effort to support multi-modal, while vllm still lacks a `/generate` API
- [6/N] switch to the sglang/vllm router for better kv-cache-aware load balancing

The native HTTP server is inspired by the design of slime, thanks to their prior work. Also credit to @ChangyiYang @zhaochenyang20 #3090 @SuperCB #3102 for their prior contributions.