Conversation

@wuxibin89 (Collaborator) commented Sep 11, 2025

What does this PR do?

This is the first part of supporting a vllm/sglang native HTTP server in server mode rollout. In native HTTP server mode, the inference services are launched separately from the training engine, and the model runner shares the GPU with the training engine but runs in a different process.

We're going to support three deployment modes:

  • hybrid mode: The training engine and model runner share the GPU but run in different processes. To sync weights, a server adapter lives in the training process: an HTTP client that sends wake_up/sleep/update_weights requests to the inference server (see the sketch after the diagram below). This mode is used for on-policy training.
  • standalone mode: The training engine and inference services use separate GPU resources in a disaggregated architecture. This mode is used for off-policy training.
  • colocated mode: Like hybrid mode, but without the server adapter, since there is no need to sync weights. This is mainly used for the GRM service (LLM as a judge).
[architecture diagram: https://github.com/user-attachments/assets/2c1adf2d-adb5-4563-8a1a-8948f93b09b7]
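To make the hybrid-mode flow above concrete, here is a minimal sketch of what such a server adapter could look like. This is an illustration only: the class name and the endpoint paths (`/wake_up`, `/sleep`, `/update_weights`) are assumptions for the example, not the actual classes or routes introduced by this PR.

```python
# Illustrative sketch only -- class name and endpoint paths are hypothetical,
# not the actual API added by this PR.
import requests


class ServerAdapterSketch:
    """HTTP client living in the training process that drives a colocated
    inference server through its native HTTP API (hybrid mode)."""

    def __init__(self, server_address: str):
        # e.g. "10.0.0.1:30000", host:port of the rollout HTTP server
        self.base_url = f"http://{server_address}"

    def wake_up(self) -> None:
        # Ask the inference server to re-occupy GPU memory before generation.
        requests.post(f"{self.base_url}/wake_up", timeout=60).raise_for_status()

    def sleep(self) -> None:
        # Release KV cache / weights so the training engine can use the GPU.
        requests.post(f"{self.base_url}/sleep", timeout=60).raise_for_status()

    def update_weights(self, weights_meta: dict) -> None:
        # Tell the server to refresh its weights (payload format is hypothetical).
        resp = requests.post(f"{self.base_url}/update_weights", json=weights_meta, timeout=600)
        resp.raise_for_status()
```

In hybrid mode, the trainer would roughly call sleep() before each optimizer step and wake_up() plus update_weights() right before the next rollout, which is what keeps generation on-policy.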

Following PRs will be:

  • [2/N] support DP+EP
  • [3/N] standalone rollout with weight transfer by NCCL/UCX
  • [4/N] colocated GRM service with wake_up/sleep (without weight synchronization)
  • [5/N] switch to the `/generate` HTTP API with token-in-token-out: sglang already has a `/generate` API but may need some effort to support multi-modal, while vllm still lacks a `/generate` API
  • [6/N] switch to the sglang/vllm router for better kv-cache-aware load balancing

The native HTTP server is inspired by the design of slime (https://github.com/THUDM/slime); thanks for their prior work. Credit also to @ChangyiYang and @zhaochenyang20 (#3090), and @SuperCB (#3102), for their prior contributions.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant and valuable refactoring to support native HTTP servers for vLLM and SGLang, which improves modularity and decoupling. The new RolloutReplica abstraction is clean and simplifies the server management logic. However, I've identified a few critical issues, particularly in the sglang integration, that need to be addressed to ensure correctness and maintainability, especially for multi-node and on-policy training scenarios. These are related to known FIXMEs in the code regarding device_mesh correctness and potential random weight loading on resume. Additionally, there's a fragile mock for vllm that should be replaced with a more robust solution.

@wuxibin89 force-pushed the wuxibin/native_http_rollout branch from 6db64c2 to bf5d5c1 on September 12, 2025
@wuxibin89 force-pushed the wuxibin/native_http_rollout branch from a32d75d to 4059692 on September 15, 2025
@lizipao commented Sep 15, 2025

Thank you for your contribution. May I ask what configuration changes are needed to switch between these modes?

@lizipao commented Sep 15, 2025

What's the difference between the server mode implemented in #3090 and the current HYBRID mode? Wasn't the previous server mode also running in separate processes?

@wuxibin89 (Collaborator, Author) replied

> Thank you for your contribution. May I ask what configuration changes are needed to switch between these modes?

@lizipao

  1. hybrid mode: this is the default mode; no additional configuration change is needed.
  2. colocated mode: this is for GRM (generative reward model) only and is still to be developed in 4/N.
  3. standalone mode: this is for off-policy training; we will refactor verl/recipe/one_step_off_policy on top of this PR in 3/N. The configuration changes are in:
    https://github.com/volcengine/verl/blob/main/recipe/one_step_off_policy/config/one_step_off_ppo_trainer.yaml#L9-L14

@lizipao commented Sep 15, 2025

May I ask why skip_tokenizer_init is set to True?

@wuxibin89 (Collaborator, Author) replied

> May I ask why skip_tokenizer_init is set to True?

Because we want token-in-token-out as the default.
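For context, `skip_tokenizer_init=True` tells the inference server to skip building a tokenizer, so requests carry prompt token IDs and responses return generated token IDs. Below is a rough sketch of what such a token-in-token-out request could look like against an SGLang-style `/generate` endpoint; the model name, address, and request/response field names are assumptions and may differ across server versions.

```python
# Rough illustration of token-in-token-out generation against a native HTTP
# server started with skip_tokenizer_init=True. Field names, model name, and
# the server address are assumptions for the example.
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# The trainer tokenizes on its side (e.g. when applying the chat template)...
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    add_generation_prompt=True,
)

# ...and ships raw token IDs to the rollout server instead of text.
resp = requests.post(
    "http://localhost:30000/generate",  # address of the rollout HTTP server
    json={
        "input_ids": prompt_ids,
        "sampling_params": {"temperature": 1.0, "max_new_tokens": 256},
    },
    timeout=600,
)
resp.raise_for_status()

# With skip_tokenizer_init the server returns token IDs, which feed directly
# into log-prob computation without any re-tokenization on the trainer side.
output_ids = resp.json().get("output_ids", [])
print(tokenizer.decode(output_ids))
```

The benefit of this default is that the token IDs the policy is trained on are exactly the token IDs the server generated, avoiding tokenize/detokenize mismatches between rollout and training.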

@vermouth1992 merged commit fd8ae66 into volcengine:main on Sep 16, 2025 (59 of 60 checks passed)
@Cesilina commented

In a multi-node setup, can the other nodes reach the rollout server's address normally?

@wuxibin89 (Collaborator, Author) commented Sep 24, 2025

> In a multi-node setup, can the other nodes reach the rollout server's address normally?

@Cesilina Yes, they can.

@Cesilina replied

> In a multi-node setup, can the other nodes reach the rollout server's address normally?
>
> @Cesilina Yes, they can.

Got it, thank you!

@lizipao commented Sep 25, 2025

Hi there,

I'd like to understand the differences and relationship between the AsyncHttpServerAdapter implemented in #3090 and the current SGLangHttpServer used in HYBRID mode.

From my code analysis, I found that:

  • When rollout_mode is set to async, the system uses SGLangHttpServer (as a Ray actor)
  • Otherwise, it uses AsyncHttpServerAdapter (as an HTTP client adapter)

Could you briefly explain the differences between these two implementations from both performance and architectural design perspectives? Specifically:

  • The differences in their call paths (internal API vs. HTTP requests)
  • Their respective use cases and performance characteristics
  • Why we need two different implementations

Thanks for your time!

vermouth1992 pushed a commit that referenced this pull request on Sep 26, 2025: "Following #3456, support vllm/sglang DP+EP in server mode."
@lizipao commented Sep 28, 2025

In this mode, I noticed that AgentLoopWorker does not pass top_k, and config.calculate_log_probs defaults to False. Is this expected?

@bhks commented Oct 6, 2025

Hi @wuxibin89 and @vermouth1992,

Thank you for this PR.

Is it possible for users to use the SGLang Router with the current code in main?

Based on my analysis, we are not passing the router IP and port from the top-level trainer modules. I looked into the SGLangRollout class and the _init_inference_engine method and don't see us passing any router IP and port yet.

I only see that AsyncHttpServerAdapter and HttpServerAdapter have the router IP and port. Wondering what changes we need to make to add that support?

Do we have to make other changes before we can enable SGLang router support, given that you have added it as the last item in your list of items to complete?

@bhks mentioned this pull request on Nov 16, 2025
@Ericnano commented

May I ask which vLLM version this PR supports? On 0.9.1 it reports "No module named 'vllm.v1.engine.utils'".

@ji-huazhong (Collaborator) replied

@Ericnano It requires a vLLM version newer than 0.9.1.
