231 changes: 231 additions & 0 deletions .claude/skills/deployment/SKILL.md
---
name: deployment
description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).
license: Apache-2.0
---

# Deployment Skill

Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).

## Quick Start

Prefer `scripts/deploy.sh` for standard local deployments: it handles quant detection, health checks, and server lifecycle. The examples below assume your working directory is `.claude/skills/deployment/`; from the repository root, invoke `.claude/skills/deployment/scripts/deploy.sh` instead. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment.

```bash
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop
```

The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing.

## Decision Flow

### 0. Check workspace (multi-user / Slack bot)

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:

```bash
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
```

If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
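
A minimal lookup sketch for that case. The helper name and the assumption that workspace directories are named after the model or run are illustrative, not part of the skill:

```bash
# Hypothetical helper (not part of the skill): print the most recently
# modified workspace under the root whose name contains a hint, e.g. a
# fragment of the model name mentioned in the user's request.
find_workspace() {
  local hint=$1
  ls -1dt "$MODELOPT_WORKSPACE_ROOT"/*"$hint"* 2>/dev/null | head -n 1
}
```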

### 1. Identify the checkpoint

Determine what the user wants to deploy:

- **Local quantized checkpoint** (from ptq skill or manual export): look for `hf_quant_config.json` in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: `output/`, `outputs/`, `exported_model/`, or the `--export_path` used in the PTQ command.
- **HuggingFace model hub** (e.g., `nvidia/Llama-3.1-8B-Instruct-FP8`): use directly
- **Unquantized model**: deploy as-is (BF16) or suggest quantizing first with the ptq skill

> **Note:** This skill expects HF-format checkpoints (from PTQ with `--export_fmt hf`). TRT-LLM format checkpoints should be deployed directly with TRT-LLM — see `references/trtllm.md`.

Check the quantization format if applicable:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
```

If not found, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither exists, the checkpoint is unquantized.

### 2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
|-----------|-------------|-----|
| General use | **vLLM** | Widest ecosystem, easy setup, OpenAI-compatible |
| Best SGLang model support | **SGLang** | Strong DeepSeek/Llama 4 support |
| Maximum optimization | **TRT-LLM** | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | **TRT-LLM AutoDeploy** | Only option for AutoQuant checkpoints |

Check the support matrix in `references/support-matrix.md` to confirm the model + format + framework combination is supported.

### 3. Check the environment

Read `skills/common/environment-setup.md` for GPU detection, local vs remote, and SLURM/Docker/bare metal detection. After completing it you should know: GPU model/count, local or remote, and execution environment.

Then check the **deployment framework** is installed:

```bash
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
```

If not installed, consult `references/setup.md`.

**GPU memory estimate** (to determine tensor parallelism):

- BF16: `params × 2 bytes` (8B ≈ 16 GB)
- FP8: `params × 1 byte` (8B ≈ 8 GB)
- FP4: `params × 0.5 bytes` (8B ≈ 4 GB)
- Add ~2-4 GB for KV cache and framework overhead

If the model exceeds single GPU memory, use tensor parallelism (`--tp <num_gpus>` with the script, or the framework's equivalent flag).
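
The arithmetic above can be sketched as a helper. Illustrative only: real memory use also depends on context length, KV-cache configuration, and the framework, so treat the result as a starting point:

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weights_gb(params_b: float, fmt: str) -> float:
    """Weight memory in GB for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[fmt]


def min_tensor_parallel(params_b: float, fmt: str,
                        gpu_gb: float = 80.0, overhead_gb: float = 4.0) -> int:
    """Smallest power-of-two TP size whose per-GPU weight share plus overhead fits."""
    tp = 1
    while weights_gb(params_b, fmt) / tp + overhead_gb > gpu_gb:
        tp *= 2
    return tp
```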

### 4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
|-----------|---------------|
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

**Quick-start commands** (for common cases):

#### vLLM

```bash
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
--model <checkpoint_path> \
--quantization modelopt \
--tensor-parallel-size <num_gpus> \
--host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

#### SGLang

```bash
python -m sglang.launch_server \
--model-path <checkpoint_path> \
--quantization modelopt \
--tp <num_gpus> \
--host 0.0.0.0 --port 8000
```

#### TRT-LLM (direct)

```python
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
```

#### TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see `references/trtllm.md`.

### 5. Verify the deployment

After the server starts, verify it's healthy:

```bash
# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model_name>",
"prompt": "The capital of France is",
"max_tokens": 32
}' | python -m json.tool
```

All checks must pass before reporting success to the user.
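
Large models can take a minute or more to load, so poll the health endpoint rather than checking once. A hedged sketch (the function name and timings are arbitrary choices, not part of the skill):

```bash
wait_for_health() {
  # Poll url ($1) every 2s for up to $2 attempts (default 30).
  # Returns 0 once the endpoint answers, 1 on timeout.
  local url=$1 attempts=${2:-30} i
  for ((i = 0; i < attempts; i++)); do
    if curl -sf "$url" > /dev/null; then
      return 0
    fi
    sleep 2
  done
  return 1
}
```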

### 6. Remote deployment (SSH/SLURM)

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

1. **Source remote utilities:**

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster
remote_check_ssh
remote_detect_env
```

2. **Sync the checkpoint** (only if it was produced locally):

If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:

```bash
remote_sync_to <local_checkpoint_path> checkpoints/
```

3. **Deploy based on remote environment:**

- **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.

- **Bare metal / Docker** — use `remote_run` to start the server directly:

```bash
remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
```

4. **Verify remotely:**

```bash
remote_run "curl -s http://localhost:8000/health"
remote_run "curl -s http://localhost:8000/v1/models"
```

5. **Report the endpoint** — include the remote hostname and port so the user can connect (e.g., `http://<node_hostname>:8000`). For SLURM, note that the port is only reachable from within the cluster network.

For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically.

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.5 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |
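
When debugging the version-too-old error, compare versions numerically rather than as strings, or multi-digit components like `0.10.1` mislead. A hypothetical helper (for real code, prefer `packaging.version`):

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Component-wise numeric compare; handles only plain dotted versions."""
    as_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)
```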


## Success Criteria

1. Server process is running and healthy (`/health` returns 200)
2. Model is listed at `/v1/models`
3. Test generation produces coherent output
4. Server URL and port are reported to the user
5. If benchmarking was requested, throughput/latency numbers are reported
106 changes: 106 additions & 0 deletions .claude/skills/deployment/references/setup.md
# Deployment Environment Setup

## Framework Installation

### vLLM

```bash
pip install vllm
```

Minimum version: 0.10.1

### SGLang

```bash
pip install "sglang[all]"
```

Minimum version: 0.5

### TRT-LLM

TRT-LLM is best installed via NVIDIA container:

```bash
docker pull nvcr.io/nvidia/tensorrt-llm/release:<version>
```

Or via pip (requires CUDA toolkit):

```bash
pip install tensorrt-llm
```

Minimum version: 0.17.0

## SLURM Deployment

For SLURM clusters, deploy inside a container. Container flags MUST be on the `srun` line:

```bash
#!/bin/bash
#SBATCH --job-name=deploy
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<num_gpus>
#SBATCH --time=04:00:00
#SBATCH --output=deploy_%j.log

srun \
--container-image="<path/to/container.sqsh>" \
--container-mounts="<data_root>:<data_root>" \
--container-workdir="<workdir>" \
--no-container-mount-home \
bash -c "python -m vllm.entrypoints.openai.api_server \
--model <checkpoint_path> \
--quantization modelopt \
--tensor-parallel-size <num_gpus> \
--host 0.0.0.0 --port 8000"
```

To access the server from outside the SLURM node, note the allocated hostname:

```bash
squeue -u $USER -o "%j %N %S" # Get the node name
# Then SSH tunnel or use the node's hostname directly
```
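
To reach the server from a workstation outside the cluster, a local port forward through the login node works. A sketch that only builds the command (hostnames are placeholders; adapt to your cluster's SSH setup):

```bash
# Hypothetical helper: print the ssh local-forward command that exposes
# the allocated node's serving port on the workstation's localhost.
tunnel_cmd() {
  local login_host=$1 node=$2 port=${3:-8000}
  printf 'ssh -N -L %s:%s:%s %s\n' "$port" "$node" "$port" "$login_host"
}
```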

## Docker Deployment

### Official Images (recommended)

| Framework | Image | Source |
|-----------|-------|--------|
| vLLM | `vllm/vllm-openai:latest` | <https://hub.docker.com/r/vllm/vllm-openai> |
| SGLang | `lmsysorg/sglang:latest` | <https://hub.docker.com/r/lmsysorg/sglang> |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:latest` | <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/> |

Example with the official vLLM image:

```bash
docker run --gpus all -p 8000:8000 \
-v /path/to/checkpoint:/model \
vllm/vllm-openai:latest \
--model /model \
--quantization modelopt \
--host 0.0.0.0 --port 8000
```

### Custom Image (optional)

A Dockerfile is also available at `examples/vllm_serve/Dockerfile` if you need a custom build:

```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

docker run --gpus all -p 8000:8000 \
-v /path/to/checkpoint:/model \
vllm-modelopt \
python -m vllm.entrypoints.openai.api_server \
--model /model \
--quantization modelopt \
--host 0.0.0.0 --port 8000
```