231 changes: 231 additions & 0 deletions .claude/skills/deployment/SKILL.md
---
name: deployment
description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).
license: Apache-2.0
---

# Deployment Skill

Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).

## Quick Start

Prefer `scripts/deploy.sh` for standard local deployments: it handles quant detection, health checks, and server lifecycle. The examples below assume your working directory is `.claude/skills/deployment/`; from the repository root, invoke `.claude/skills/deployment/scripts/deploy.sh` instead. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment.

```bash
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop
```

The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing.

## Decision Flow

### 0. Check workspace (multi-user / Slack bot)

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:

```bash
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
```

If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
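
A minimal lookup sketch for that case. The helper name and the assumption that workspace directories are named after the model or run are illustrative, not part of the skill:

```bash
# Hypothetical helper (not part of the skill): print the most recently
# modified workspace under the root whose name contains a hint, e.g. a
# fragment of the model name mentioned in the user's request.
find_workspace() {
  local hint=$1
  ls -1dt "$MODELOPT_WORKSPACE_ROOT"/*"$hint"* 2>/dev/null | head -n 1
}
```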

### 1. Identify the checkpoint

Determine what the user wants to deploy:

- **Local quantized checkpoint** (from ptq skill or manual export): look for `hf_quant_config.json` in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: `output/`, `outputs/`, `exported_model/`, or the `--export_path` used in the PTQ command.
- **HuggingFace model hub** (e.g., `nvidia/Llama-3.1-8B-Instruct-FP8`): use directly
- **Unquantized model**: deploy as-is (BF16) or suggest quantizing first with the ptq skill

> **Note:** This skill expects HF-format checkpoints (from PTQ with `--export_fmt hf`). TRT-LLM format checkpoints should be deployed directly with TRT-LLM — see `references/trtllm.md`.

Check the quantization format if applicable:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
```

If not found, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither exists, the checkpoint is unquantized.

### 2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
|-----------|-------------|-----|
| General use | **vLLM** | Widest ecosystem, easy setup, OpenAI-compatible |
| Best SGLang model support | **SGLang** | Strong DeepSeek/Llama 4 support |
| Maximum optimization | **TRT-LLM** | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | **TRT-LLM AutoDeploy** | Only option for AutoQuant checkpoints |

Check the support matrix in `references/support-matrix.md` to confirm the model + format + framework combination is supported.

### 3. Check the environment

Read `skills/common/environment-setup.md` for GPU detection, local vs remote, and SLURM/Docker/bare metal detection. After completing it you should know: GPU model/count, local or remote, and execution environment.

Then check the **deployment framework** is installed:

```bash
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
```

If not installed, consult `references/setup.md`.

**GPU memory estimate** (to determine tensor parallelism):

- BF16: `params × 2 bytes` (8B ≈ 16 GB)
- FP8: `params × 1 byte` (8B ≈ 8 GB)
- FP4: `params × 0.5 bytes` (8B ≈ 4 GB)
- Add ~2-4 GB for KV cache and framework overhead

If the model exceeds single GPU memory, use tensor parallelism (`--tp <num_gpus>` with the script, or the framework's equivalent flag).
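
The arithmetic above can be sketched as a helper. Illustrative only: real memory use also depends on context length, KV-cache configuration, and the framework, so treat the result as a starting point:

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weights_gb(params_b: float, fmt: str) -> float:
    """Weight memory in GB for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[fmt]


def min_tensor_parallel(params_b: float, fmt: str,
                        gpu_gb: float = 80.0, overhead_gb: float = 4.0) -> int:
    """Smallest power-of-two TP size whose per-GPU weight share plus overhead fits."""
    tp = 1
    while weights_gb(params_b, fmt) / tp + overhead_gb > gpu_gb:
        tp *= 2
    return tp
```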

### 4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
|-----------|---------------|
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

**Quick-start commands** (for common cases):

#### vLLM

```bash
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
--model <checkpoint_path> \
--quantization modelopt \
--tensor-parallel-size <num_gpus> \
--host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

#### SGLang

```bash
python -m sglang.launch_server \
--model-path <checkpoint_path> \
--quantization modelopt \
--tp <num_gpus> \
--host 0.0.0.0 --port 8000
```

#### TRT-LLM (direct)

```python
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
```

#### TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see `references/trtllm.md`.

### 5. Verify the deployment

After the server starts, verify it's healthy:

```bash
# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model_name>",
"prompt": "The capital of France is",
"max_tokens": 32
}' | python -m json.tool
```

All checks must pass before reporting success to the user.
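
Large models can take a minute or more to load, so poll the health endpoint rather than checking once. A hedged sketch (the function name and timings are arbitrary choices, not part of the skill):

```bash
wait_for_health() {
  # Poll url ($1) every 2s for up to $2 attempts (default 30).
  # Returns 0 once the endpoint answers, 1 on timeout.
  local url=$1 attempts=${2:-30} i
  for ((i = 0; i < attempts; i++)); do
    if curl -sf "$url" > /dev/null; then
      return 0
    fi
    sleep 2
  done
  return 1
}
```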

### 6. Remote deployment (SSH/SLURM)

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

1. **Source remote utilities:**

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster
remote_check_ssh
remote_detect_env
```

2. **Sync the checkpoint** (only if it was produced locally):

If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:

```bash
remote_sync_to <local_checkpoint_path> checkpoints/
```

3. **Deploy based on remote environment:**

- **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.

- **Bare metal / Docker** — use `remote_run` to start the server directly:

```bash
remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
```

4. **Verify remotely:**

```bash
remote_run "curl -s http://localhost:8000/health"
remote_run "curl -s http://localhost:8000/v1/models"
```

5. **Report the endpoint** — include the remote hostname and port so the user can connect (e.g., `http://<node_hostname>:8000`). For SLURM, note that the port is only reachable from within the cluster network.

For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically.

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.5 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |
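
When debugging the version-too-old error, compare versions numerically rather than as strings, or multi-digit components like `0.10.1` mislead. A hypothetical helper (for real code, prefer `packaging.version`):

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Component-wise numeric compare; handles only plain dotted versions."""
    as_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)
```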


## Success Criteria

1. Server process is running and healthy (`/health` returns 200)
2. Model is listed at `/v1/models`
3. Test generation produces coherent output
4. Server URL and port are reported to the user
5. If benchmarking was requested, throughput/latency numbers are reported
106 changes: 106 additions & 0 deletions .claude/skills/deployment/references/setup.md
# Deployment Environment Setup

## Framework Installation

### vLLM

```bash
pip install vllm
```

Minimum version: 0.10.1

### SGLang

```bash
pip install "sglang[all]"
```

Minimum version: 0.5

### TRT-LLM

TRT-LLM is best installed via NVIDIA container:

```bash
docker pull nvcr.io/nvidia/tensorrt-llm/release:<version>
```

Or via pip (requires CUDA toolkit):

```bash
pip install tensorrt-llm
```

Minimum version: 0.17.0

## SLURM Deployment

For SLURM clusters, deploy inside a container. Container flags MUST be on the `srun` line:

```bash
#!/bin/bash
#SBATCH --job-name=deploy
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<num_gpus>
#SBATCH --time=04:00:00
#SBATCH --output=deploy_%j.log

srun \
--container-image="<path/to/container.sqsh>" \
--container-mounts="<data_root>:<data_root>" \
--container-workdir="<workdir>" \
--no-container-mount-home \
bash -c "python -m vllm.entrypoints.openai.api_server \
--model <checkpoint_path> \
--quantization modelopt \
--tensor-parallel-size <num_gpus> \
--host 0.0.0.0 --port 8000"
```

To access the server from outside the SLURM node, note the allocated hostname:

```bash
squeue -u $USER -o "%j %N %S" # Get the node name
# Then SSH tunnel or use the node's hostname directly
```
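
To reach the server from a workstation outside the cluster, a local port forward through the login node works. A sketch that only builds the command (hostnames are placeholders; adapt to your cluster's SSH setup):

```bash
# Hypothetical helper: print the ssh local-forward command that exposes
# the allocated node's serving port on the workstation's localhost.
tunnel_cmd() {
  local login_host=$1 node=$2 port=${3:-8000}
  printf 'ssh -N -L %s:%s:%s %s\n' "$port" "$node" "$port" "$login_host"
}
```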

## Docker Deployment

### Official Images (recommended)

| Framework | Image | Source |
|-----------|-------|--------|
| vLLM | `vllm/vllm-openai:latest` | <https://hub.docker.com/r/vllm/vllm-openai> |
| SGLang | `lmsysorg/sglang:latest` | <https://hub.docker.com/r/lmsysorg/sglang> |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:latest` | <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/> |

Example with the official vLLM image:

```bash
docker run --gpus all -p 8000:8000 \
-v /path/to/checkpoint:/model \
vllm/vllm-openai:latest \
--model /model \
--quantization modelopt \
--host 0.0.0.0 --port 8000
```

### Custom Image (optional)

A Dockerfile is also available at `examples/vllm_serve/Dockerfile` if you need a custom build:

```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

docker run --gpus all -p 8000:8000 \
-v /path/to/checkpoint:/model \
vllm-modelopt \
python -m vllm.entrypoints.openai.api_server \
--model /model \
--quantization modelopt \
--host 0.0.0.0 --port 8000
```