Add Agent Evaluation skill for accuracy benchmarking #1132
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough: A new NeMo evaluation skill documentation suite was added, comprising comprehensive instructions for running accuracy evaluations on quantized and unquantized LLMs using NeMo Evaluator Launcher, along with supporting reference guides for model card extraction, multi-node evaluation patterns, quantization-aware benchmark selection, and test specifications validating the complete workflow.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes. Pre-merge checks: ✅ 4 passed.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 269-275: Update the SKILL.md snippet that documents
NEMO_EVALUATOR_TRUST_PRE_CMD to include a clear security warning: note that
setting NEMO_EVALUATOR_TRUST_PRE_CMD=1 enables execution of pre_cmd and post_cmd
which run arbitrary shell commands with the evaluator's privileges, instruct
users to review pre_cmd content, only trust configs from known sources, and be
cautious when using nemo_skills.* self-deployment tasks; reference the
environment variable name (NEMO_EVALUATOR_TRUST_PRE_CMD) and the config keys
(pre_cmd, post_cmd, nemo_skills.*) so readers can find and audit them.
- Around line 100-116: Update the documentation in SKILL.md to remove the
incorrect per-algorithm vLLM flag mapping and instead state that if
hf_quant_config.json exists (read quantization.quant_algo), vLLM uses a single
unified flag --quantization modelopt which auto-detects NVFP4, W4A8_AWQ, FP8,
etc.; replace the table and related lines with a concise statement: "If
hf_quant_config.json exists, vLLM auto-detects the quantization format and you
should pass --quantization modelopt (no format-specific flags required)."
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 697ac2e8-f1a5-41d4-9870-adb1ff730091
📒 Files selected for processing (3)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
- .markdownlint-cli2.yaml
.claude/skills/evaluation/SKILL.md
Outdated
**Auto-detect ModelOpt quantization format** (checkpoint paths only):

Check for `hf_quant_config.json` in the checkpoint directory:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
```

If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantization flag in `deployment.extra_args`:

| `quant_algo` | Flag to add |
|-------------|-------------|
| `FP8` | `--quantization modelopt` |
| `W4A8_AWQ` | `--quantization modelopt` |
| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |

If no `hf_quant_config.json`, the checkpoint is unquantized — no flag needed.
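For orientation, the file that snippet checks carries the algorithm name under the `quantization.quant_algo` path discussed below; an illustrative (made-up) shape might look like this — real files contain additional fields:

```bash
# Illustrative contents only — not copied from any real checkpoint.
cat <<'EOF'
{
  "quantization": {
    "quant_algo": "NVFP4"
  }
}
EOF
```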
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for existing ModelOpt quantization format references in the codebase
echo "=== Searching for hf_quant_config.json references ==="
rg -n 'hf_quant_config' -C3
echo -e "\n=== Searching for quantization.quant_algo references ==="
rg -n 'quant_algo' -C3
echo -e "\n=== Searching for FP8, W4A8_AWQ, NVFP4 algorithm references ==="
rg -n 'FP8|W4A8_AWQ|NVFP4' -C2

Repository: NVIDIA/Model-Optimizer
Length of output: 50379
🌐 Web query:
What are the supported quantization formats in vLLM for ModelOpt quantized models? Specifically, does vLLM support --quantization modelopt and --quantization modelopt_fp4 flags?
💡 Result:
vLLM supports ModelOpt quantized models via the --quantization modelopt flag. This includes formats like FP8 (ModelOpt FP8), NVFP4 (a 4-bit floating point format from NVIDIA Model Optimizer), MXFP8, and mixed precision (e.g., FP8 dense + NVFP4 MoE). There is no specific --quantization modelopt_fp4 flag; FP4 support (NVFP4) is handled under the unified modelopt quantization method, auto-detected from the model's quantization config (e.g., hf_quant_config.json or quantization_config in config.json). For CLI usage, specify --quantization modelopt when loading ModelOpt-quantized checkpoints from Hugging Face, as shown in examples like nvidia/Llama-3.1-8B-Instruct-FP8.
Citations:
- 1: https://docs.vllm.ai/en/latest/features/quantization/
- 2: https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/modelopt.html
- 3: https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/quantization/modelopt/
- 4: https://docs.vllm.ai/en/v0.9.0/features/quantization/modelopt.html
- 5: https://docs.vllm.ai/projects/llm-compressor/en/0.8.0/examples/quantization_w4a4_fp4/
- 6: https://docs.vllm.ai/en/stable/features/quantization/modelopt.html
Incorrect vLLM flag mapping for ModelOpt quantization auto-detection.
The JSON path quantization.quant_algo is correct and the algorithm names (FP8, W4A8_AWQ, NVFP4, NVFP4_AWQ) are confirmed in the ModelOpt codebase. However, the proposed flag mapping is incorrect:
- `--quantization modelopt_fp4` does not exist in vLLM. vLLM provides a single unified flag: `--quantization modelopt`, which auto-detects the quantization format from the model's quantization config (either `hf_quant_config.json` or `config.json`'s `quantization_config` field).
- NVFP4 is auto-detected, not mapped to a separate flag. vLLM automatically recognizes NVFP4, W4A8_AWQ, FP8, and other formats when `quant_algo` is present in the quantization config.

Remove the flag mapping table and replace with: "If `hf_quant_config.json` exists, vLLM auto-detects the quantization format and applies `--quantization modelopt` automatically. No additional format-specific flags are needed."
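A minimal shell sketch of the unified flow described above (the use of `jq` and the fallback to `config.json` are assumptions about how one might script the check, not part of the PR):

```bash
CKPT=/path/to/checkpoint   # placeholder

EXTRA_ARGS=""
if [ -f "$CKPT/hf_quant_config.json" ]; then
  # The specific algorithm (FP8, W4A8_AWQ, NVFP4, ...) is informational only:
  # vLLM auto-detects it, so the flag is the same either way.
  jq -r '.quantization.quant_algo' "$CKPT/hf_quant_config.json"
  EXTRA_ARGS="--quantization modelopt"
elif jq -e '.quantization_config.quant_method == "modelopt"' "$CKPT/config.json" >/dev/null 2>&1; then
  EXTRA_ARGS="--quantization modelopt"
fi
echo "deployment.extra_args: $EXTRA_ARGS"
```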
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/evaluation/SKILL.md around lines 100 - 116, Update the
documentation in SKILL.md to remove the incorrect per-algorithm vLLM flag
mapping and instead state that if hf_quant_config.json exists (read
quantization.quant_algo), vLLM uses a single unified flag --quantization
modelopt which auto-detects NVFP4, W4A8_AWQ, FP8, etc.; replace the table and
related lines with a concise statement: "If hf_quant_config.json exists, vLLM
auto-detects the quantization format and you should pass --quantization modelopt
(no format-specific flags required)."
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## main #1132 +/- ##
==========================================
- Coverage 70.20% 70.18% -0.03%
==========================================
Files 230 230
Lines 26080 26080
==========================================
- Hits 18310 18304 -6
- Misses 7770 7776 +6

☔ View full report in Codecov by Sentry.
♻️ Duplicate comments (2)
.claude/skills/evaluation/SKILL.md (2)
194-199: ⚠️ Potential issue | 🟠 Major — Add explicit security warning for `NEMO_EVALUATOR_TRUST_PRE_CMD`.

Line 195 enables trusted command execution, but the snippet lacks a caution that `pre_cmd`/`post_cmd` run arbitrary shell commands with evaluator privileges. Please add an explicit warning in this section.

Proposed doc fix:

-# If using pre_cmd or post_cmd:
+# If using pre_cmd or post_cmd:
+# WARNING: NEMO_EVALUATOR_TRUST_PRE_CMD=1 allows `pre_cmd`/`post_cmd` execution
+# from config with your current privileges. Review `pre_cmd`, `post_cmd`, and
+# `nemo_skills.*` task settings, and only run trusted configs.
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 194 - 199, Add a clear security warning near the NEMO_EVALUATOR_TRUST_PRE_CMD export: explain that setting NEMO_EVALUATOR_TRUST_PRE_CMD=1 allows evaluator to run arbitrary shell commands via pre_cmd and post_cmd with evaluator privileges, warn against enabling it in untrusted environments or on production hosts, and recommend alternatives (avoid enabling, run in isolated sandbox/container, or validate commands) while referencing the environment variable name NEMO_EVALUATOR_TRUST_PRE_CMD and the pre_cmd/post_cmd hooks so readers know exactly which settings are risky.
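For reference, the warning the bot is asking for could sit directly above the export, along these lines (comment wording is illustrative; the variable name and the `pre_cmd`/`post_cmd`/`nemo_skills.*` keys come from the comments above):

```bash
# If using pre_cmd or post_cmd:
# WARNING: setting NEMO_EVALUATOR_TRUST_PRE_CMD=1 lets the evaluator run arbitrary
# shell commands from pre_cmd/post_cmd (and nemo_skills.* self-deployment tasks)
# with your privileges. Review those keys and only use configs from trusted sources.
export NEMO_EVALUATOR_TRUST_PRE_CMD=1
```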
108-115: ⚠️ Potential issue | 🔴 Critical — Incorrect ModelOpt flag mapping for NVFP4 in vLLM docs path.

Line 114 maps `NVFP4`/`NVFP4_AWQ` to `--quantization modelopt_fp4`, which is likely invalid; this should use the unified ModelOpt quantization flag flow instead. This can misconfigure deployments at runtime.

Proposed doc fix:

-| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
-| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure |
+| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt` |
+| Other values | Use `--quantization modelopt`; verify backend support in docs if unsure |

Verification question: What are the valid vLLM CLI values for `--quantization` when loading NVIDIA ModelOpt-quantized checkpoints, and is `modelopt_fp4` a supported value?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 108 - 115, The doc incorrectly maps NVFP4/NVFP4_AWQ to the unsupported flag `--quantization modelopt_fp4`; update the mapping logic so that when reading quantization.quant_algo you add the unified ModelOpt flag (e.g., `--quantization modelopt`) into deployment.extra_args for NVFP4/NVFP4_AWQ instead of `modelopt_fp4`, and add a note clarifying that vLLM/SGLang use `modelopt` for NVIDIA ModelOpt formats and to consult vLLM docs for any newer flag names; ensure the table rows referencing `NVFP4`, `NVFP4_AWQ`, `modelopt_fp4` are replaced with `--quantization modelopt` and adjust any code/comments that conditionally look for `modelopt_fp4`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 194-199: Add a clear security warning near the
NEMO_EVALUATOR_TRUST_PRE_CMD export: explain that setting
NEMO_EVALUATOR_TRUST_PRE_CMD=1 allows evaluator to run arbitrary shell commands
via pre_cmd and post_cmd with evaluator privileges, warn against enabling it in
untrusted environments or on production hosts, and recommend alternatives (avoid
enabling, run in isolated sandbox/container, or validate commands) while
referencing the environment variable name NEMO_EVALUATOR_TRUST_PRE_CMD and the
pre_cmd/post_cmd hooks so readers know exactly which settings are risky.
- Around line 108-115: The doc incorrectly maps NVFP4/NVFP4_AWQ to the
unsupported flag `--quantization modelopt_fp4`; update the mapping logic so that
when reading quantization.quant_algo you add the unified ModelOpt flag (e.g.,
`--quantization modelopt`) into deployment.extra_args for NVFP4/NVFP4_AWQ
instead of `modelopt_fp4`, and add a note clarifying that vLLM/SGLang use
`modelopt` for NVIDIA ModelOpt formats and to consult vLLM docs for any newer
flag names; ensure the table rows referencing `NVFP4`, `NVFP4_AWQ`,
`modelopt_fp4` are replaced with `--quantization modelopt` and adjust any
code/comments that conditionally look for `modelopt_fp4`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 626f8ed2-3b5d-4a8f-b618-c68007312945
📒 Files selected for processing (15)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/evals/base-model-local-execution.json
- .claude/skills/evaluation/evals/external-deployment-eval.json
- .claude/skills/evaluation/evals/interceptor-configuration.json
- .claude/skills/evaluation/evals/multi-node-evaluation.json
- .claude/skills/evaluation/evals/nel-not-installed.json
- .claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
- .claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json
- .claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json
- .claude/skills/evaluation/evals/reasoning-model-sglang.json
- .claude/skills/evaluation/evals/safety-multilingual-benchmarks.json
- .claude/skills/evaluation/evals/wandb-export-code-benchmarks.json
- .claude/skills/evaluation/evals/workspace-reuse-from-ptq.json
- .claude/skills/evaluation/references/model-card-research.md
- .claude/skills/evaluation/references/multi-node.md
✅ Files skipped from review due to trivial changes (13)
- .claude/skills/evaluation/evals/nel-not-installed.json
- .claude/skills/evaluation/evals/interceptor-configuration.json
- .claude/skills/evaluation/evals/external-deployment-eval.json
- .claude/skills/evaluation/evals/reasoning-model-sglang.json
- .claude/skills/evaluation/evals/wandb-export-code-benchmarks.json
- .claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json
- .claude/skills/evaluation/evals/workspace-reuse-from-ptq.json
- .claude/skills/evaluation/evals/base-model-local-execution.json
- .claude/skills/evaluation/evals/safety-multilingual-benchmarks.json
- .claude/skills/evaluation/evals/multi-node-evaluation.json
- .claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json
- .claude/skills/evaluation/references/model-card-research.md
- .claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
Edwardf0t1 left a comment
Left a few comments - I think overall it's in a good shape and aligned well with the design we discussed 👍
.claude/skills/evaluation/SKILL.md
Outdated
@@ -0,0 +1,310 @@
---
name: evaluation
description: Evaluate accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Use when user says "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel", or needs to measure how quantization affects model quality. Handles model deployment, config generation, and evaluation execution.
Similar to my comment in the deployment skills PR, we can add some negative triggers as well.
If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed.

**Quantization-aware benchmark defaults:**
I think we can consider extracting quantization benchmarks including benchmark sensitivity ranking and recommended sets to a reference file, e.g., references/quantization-benchmarks.md, so it can be reused by the compare-results skills later.
The reason to have a compare-results skill is that evaluation is about configuring and running NEL, while compare-results is about interpreting and acting on results from multiple runs.
Added references/quantization-benchmarks.md.
When you have all the answers, run the script to build the base config:

```bash
nel skills build-config --execution <local|slurm> --deployment <none|vllm|sglang|nim|trtllm> --model_type <base|chat|reasoning> --benchmarks <standard|code|math_reasoning|safety|multilingual> [--export <none|mlflow|wandb>] [--output <OUTPUT>]
```
I think we need to verify benchmark categories to match NEL's build-config CLI - If NEL's categories changed, we need to update accordingly.
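As a concrete check against the current CLI, an invocation following the template quoted above might look like this (all values are illustrative, and — per the comment — the available categories should be verified against the installed NEL version):

```bash
# Illustrative invocation of the documented template; every value here is an assumption.
nel skills build-config \
  --execution local \
  --deployment vllm \
  --model_type reasoning \
  --benchmarks standard \
  --output eval-config.yaml
```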
.claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json
Outdated
Force-pushed from 68fe89e to 0a4a607.
♻️ Duplicate comments (1)
.claude/skills/evaluation/SKILL.md (1)
110-116: ⚠️ Potential issue | 🔴 Critical — CRITICAL: Incorrect vLLM quantization flag mapping persists.

This issue was previously flagged but remains unresolved. The table mapping `quant_algo` to vLLM flags is incorrect: `--quantization modelopt_fp4` does not exist in vLLM. vLLM provides only `--quantization modelopt`, which auto-detects all ModelOpt formats (FP8, W4A8_AWQ, NVFP4, NVFP4_AWQ) from `hf_quant_config.json` or `config.json`.

Remove the per-algorithm flag mapping. Replace lines 108-116 with a single statement: "If `hf_quant_config.json` or `config.json` contains `quant_method: "modelopt"`, add `--quantization modelopt` to `deployment.extra_args`. vLLM auto-detects the specific format (FP8, NVFP4, W4A8_AWQ, etc.)."

🔧 Proposed fix — replace the current block:

-**Auto-detect ModelOpt quantization format** (checkpoint paths only):
-
-Check for `hf_quant_config.json` in the checkpoint directory:
-
-```bash
-cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
-```
-
-If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantization flag in `deployment.extra_args`:
-
-| `quant_algo` | Flag to add |
-|-------------|-------------|
-| `FP8` | `--quantization modelopt` |
-| `W4A8_AWQ` | `--quantization modelopt` |
-| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
-| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure |
-
-If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed.
+**Auto-detect ModelOpt quantization format** (checkpoint paths only):
+
+Check for `hf_quant_config.json` in the checkpoint directory:
+
+```bash
+cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
+```
+
+If found (or if `config.json` contains `quantization_config` with `quant_method: "modelopt"`), add `--quantization modelopt` to `deployment.extra_args`. vLLM auto-detects the specific quantization format (FP8, W4A8_AWQ, NVFP4, NVFP4_AWQ, etc.) from the config. No format-specific flags are needed.
+
+If neither config is found, the checkpoint is unquantized — no quantization flag needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 110 - 116, Replace the per-algorithm mapping logic that reads quantization.quant_algo and emits algorithm-specific flags with a single, deterministic check: if hf_quant_config.json exists in the checkpoint directory or config.json contains quantization_config with quant_method: "modelopt", then add "--quantization modelopt" to deployment.extra_args (vLLM will auto-detect FP8/NVFP4/W4A8_AWQ formats); otherwise emit no quantization flag. Update any references to quantization.quant_algo, the table mapping entries, and the logic that would add "--quantization modelopt_fp4" so they instead perform the single modelopt presence check and set deployment.extra_args accordingly.
🧹 Nitpick comments (3)
.claude/skills/evaluation/SKILL.md (3)
3-3: Consider adding negative triggers for skill activation.

The description includes positive triggers ("evaluate model", "benchmark accuracy", etc.) that indicate when to use this skill. Consider adding negative triggers to clarify when NOT to use it, e.g., "Do not use for: deploying models (use deployment skill), training/fine-tuning, or inference-only workflows."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md at line 3, Update the SKILL.md description entry for this skill to include negative triggers clarifying when not to use it: append a "Do not use for" clause after the existing trigger phrases (the description line containing "Evaluate accuracy of quantized or unquantized LLMs..." and the positive triggers like "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel"). List concise exclusions such as "Do not use for: deploying models (use deployment skill), training/fine-tuning, inference-only workflows, or model serving/ops tasks" so the skill activation is unambiguous.
229-289: Consider moving monitoring section to a standalone skill.

The monitoring section (lines 229-289) provides detailed `nel status`, `nel logs`, and SSH log inspection guidance. This could be extracted to a reusable run-and-monitor skill that handles progress tracking, log inspection, and failure diagnosis for both evaluation and deployment workflows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 229 - 289, Extract the "Monitoring Progress" section (the block starting with "Monitoring Progress" that documents `nel status`, `nel logs`, `nel info`, and SSH log inspection) into a new standalone skill named e.g. "run-and-monitor" so it can be reused by other skills; create a new SKILL.md for that skill containing the full procedures (check job status, stream logs, SSH inspection, grep search) and update the original .claude/skills/evaluation/SKILL.md by removing the extracted block and adding a short cross-reference or link to the new "run-and-monitor" skill; ensure references to the commands `nel status`, `nel logs`, and `nel info` remain identical so examples and searchability are preserved.
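For reference, the kind of monitoring loop that section describes might look like the following; the command names come from the comment above, while the invocation-id argument, host, paths, and grep pattern are assumptions:

```bash
# Check the status of a running evaluation (invocation-id argument form is assumed).
nel status <invocation_id>

# Inspect logs for the same invocation.
nel logs <invocation_id>

# On SLURM, logs can also be searched over SSH (host and path are placeholders).
ssh <cluster_host> "grep -iE 'error|traceback' <remote_log_dir>/*.log"
```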
119-128: Consider extracting quantization benchmarks to a reference file.

The quantization-aware benchmark recommendations (sensitivity ranking, recommended sets) are valuable knowledge that could be reused by other skills (e.g., a future compare-results skill). Consider extracting this to `references/quantization-benchmarks.md` for consistency and reusability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 119 - 128, Extract the "Quantization-aware benchmark defaults" section (the list referencing MMLU, GSM8K, ARC-Challenge, HumanEval, Winogrande, IFEval) into a reusable reference document named quantization-benchmarks.md and replace the full block in SKILL.md with a short pointer to that reference; ensure the skill logic still presents the recommendations to the user and asks which to include, and if the user already specified benchmarks it preserves their choice while mentioning any missed accuracy-sensitive benchmarks (MMLU, GSM8K, ARC-Challenge, HumanEval, Winogrande) in the prompt.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 110-116: Replace the per-algorithm mapping logic that reads
quantization.quant_algo and emits algorithm-specific flags with a single,
deterministic check: if hf_quant_config.json exists in the checkpoint directory
or config.json contains quantization_config with quant_method: "modelopt", then
add "--quantization modelopt" to deployment.extra_args (vLLM will auto-detect
FP8/NVFP4/W4A8_AWQ formats); otherwise emit no quantization flag. Update any
references to quantization.quant_algo, the table mapping entries, and the logic
that would add "--quantization modelopt_fp4" so they instead perform the single
modelopt presence check and set deployment.extra_args accordingly.
---
Nitpick comments:
In @.claude/skills/evaluation/SKILL.md:
- Line 3: Update the SKILL.md description entry for this skill to include
negative triggers clarifying when not to use it: append a "Do not use for"
clause after the existing trigger phrases (the description line containing
"Evaluate accuracy of quantized or unquantized LLMs..." and the positive
triggers like "evaluate model", "benchmark accuracy", "run MMLU", "evaluate
quantized model", "accuracy drop", "run nel"). List concise exclusions such as
"Do not use for: deploying models (use deployment skill), training/fine-tuning,
inference-only workflows, or model serving/ops tasks" so the skill activation is
unambiguous.
- Around line 229-289: Extract the "Monitoring Progress" section (the block
starting with "Monitoring Progress" that documents `nel status`, `nel logs`,
`nel info`, and SSH log inspection) into a new standalone skill named e.g.
"run-and-monitor" so it can be reused by other skills; create a new SKILL.md for
that skill containing the full procedures (check job status, stream logs, SSH
inspection, grep search) and update the original
.claude/skills/evaluation/SKILL.md by removing the extracted block and adding a
short cross-reference or link to the new "run-and-monitor" skill; ensure
references to the commands `nel status`, `nel logs`, and `nel info` remain
identical so examples and searchability are preserved.
- Around line 119-128: Extract the "Quantization-aware benchmark defaults"
section (the list referencing MMLU, GSM8K, ARC-Challenge, HumanEval, Winogrande,
IFEval) into a reusable reference document named quantization-benchmarks.md and
replace the full block in SKILL.md with a short pointer to that reference;
ensure the skill logic still presents the recommendations to the user and asks
which to include, and if the user already specified benchmarks it preserves
their choice while mentioning any missed accuracy-sensitive benchmarks (MMLU,
GSM8K, ARC-Challenge, HumanEval, Winogrande) in the prompt.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 506c0f6a-d3ea-4b11-9c74-7262cfb545e6
📒 Files selected for processing (6)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/references/model-card-research.md
- .claude/skills/evaluation/references/multi-node.md
- .claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json
- .claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json
- .markdownlint-cli2.yaml
✅ Files skipped from review due to trivial changes (4)
- .markdownlint-cli2.yaml
- .claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json
- .claude/skills/evaluation/references/model-card-research.md
- .claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json
Force-pushed from 0a4a607 to 27a9aa1.
cjluo-nv left a comment
Summary: Adds a new Claude Code skill (evaluation/) that guides users through LLM accuracy evaluation using NeMo Evaluator Launcher (NEL), with ModelOpt-specific quantization auto-detection and benchmark recommendations.
Issues Found:

1. [Correctness] Missing dependency not verifiable — `skills/common/workspace-management.md` does not exist on `main` or any visible branch (SKILL.md:19). The skill references `skills/common/workspace-management.md` when `MODELOPT_WORKSPACE_ROOT` is set, but the dependency PR #1107 that introduces common files has no corresponding branch (`kaix/agent-common` doesn't exist). If #1107 isn't merged first, this skill will fail at Step 0. Either:
   - Merge #1107 before this PR, or
   - Remove the workspace step until the dependency lands, or
   - Include `workspace-management.md` in this PR.
2. [Correctness] PR description says `evals/` directory but actual path is `tests/` (PR description vs actual files). The description's directory tree shows `evals/nemotron3-nano-bf16-reasoning.json`, but the actual files are under `tests/`. The deployment skill on `kaix/agent-deployment` also uses `tests/`. Minor inconsistency in the description only — the actual files are correct.
3. [Readability] Numbered list restart issue in Step 2 benchmarks (SKILL.md:71-77). The benchmark choices use `1.` through `5.`, but the overall Step 2 questions also use numbered list items. At line 71, the benchmarks list starts with `1.` immediately after `5. Benchmarks:`, creating a confusing nested numbered list that may render ambiguously in some markdown parsers. Consider using bullet points or lettered sub-items for the benchmark categories.
4. [Readability] `quant_algo` fallback guidance is vague (SKILL.md:119). "Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure" puts the AI agent in a position of guessing. Consider either enumerating known values exhaustively or providing a more actionable fallback (e.g., warn the user and ask them to specify).
5. [Tests] Test specs are comprehensive but untestable in CI (`tests/*.json`). The test JSON files describe expected interactive behaviors (web search, user questions, SSH log inspection) that can't be validated by any automated test. The PR says "N/A (skill evals provided separately)" — what is the plan for validating these? Are they run manually, or is there a skill-eval harness? Clarification would be helpful.
6. [Readability] SKILL.md line 81 refers to `--benchmarks` options but uses different casing than the list (SKILL.md:77 vs 85). The user-facing list shows "Math & Reasoning" but the CLI flag uses `math_reasoning`. The note at line 87 partially addresses this, but consider adding the CLI value next to each user-facing label (e.g., "3. Math & Reasoning (`math_reasoning`)") to reduce ambiguity.
7. [Correctness] Markdown lint rule disabling rationale is narrow (`.markdownlint-cli2.yaml`). MD029 and MD036 are disabled globally with comments saying "upstream NEL skill uses..." — but these rules affect the entire repo. If only the evaluation skill triggers these, consider using a per-directory `.markdownlint-cli2.yaml` override in `.claude/skills/evaluation/` instead of weakening global linting.

Suggestions:

- The `quantization-benchmarks.md` reference is well-structured and useful. Consider whether similar sensitivity data could be auto-generated from ModelOpt's own benchmark results in future.
- The `model-card-research.md` checklist is thorough. The `curl` over `wget` tip and `--no-cache-dir` for pip are good operational details.
- The deployment skill on `kaix/agent-deployment` uses `scripts/` for shell scripts — this skill has no scripts, which is fine, but worth noting the consistent pattern of `references/` + `tests/` across skills.

Overall Assessment: This is a well-structured documentation-only PR. The skill workflow is comprehensive and the ModelOpt-specific additions (quant auto-detection, benchmark recommendations) add clear value. The main concern is the unresolved dependency on PR #1107 for common files — the skill will break at Step 0 if that PR doesn't land first. The global markdownlint changes should ideally be scoped more narrowly. No Python code changes, so correctness risk is low.
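To illustrate the first issue above, a guard along these lines would let Step 0 degrade gracefully while #1107 is pending (purely hypothetical; the file path and variable name are taken from the review, and this behavior is not part of the PR):

```bash
# Hypothetical Step 0 guard: only follow the shared workspace reference if it exists.
if [ -n "${MODELOPT_WORKSPACE_ROOT:-}" ] && [ -f "skills/common/workspace-management.md" ]; then
  echo "Using workspace instructions from skills/common/workspace-management.md"
else
  echo "Skipping workspace setup: MODELOPT_WORKSPACE_ROOT is unset or the common reference is missing"
fi
```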
♻️ Duplicate comments (1)
.claude/skills/evaluation/SKILL.md (1)
102-119: ⚠️ Potential issue | 🔴 Critical — Remove incorrect vLLM flag mapping table.

Lines 112-117 document a per-algorithm flag mapping that includes `--quantization modelopt_fp4`, which does not exist in vLLM. According to the past review comment, web search results (citing vLLM docs), and context snippets from ModelOpt's codebase:

- vLLM provides a single unified flag: `--quantization modelopt`
- This flag auto-detects the quantization format (FP8, NVFP4, W4A8_AWQ, etc.) from the model's `hf_quant_config.json` or `config.json`
- There is no separate `--quantization modelopt_fp4` flag

The current documentation will cause the AI agent to generate invalid vLLM commands that will fail at runtime.

🔧 Proposed fix — replace lines 102-119 with:

**Auto-detect ModelOpt quantization format** (checkpoint paths only):

Check for `hf_quant_config.json` in the checkpoint directory:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
```

If found (or if `config.json` contains `quantization_config` with `quant_method: "modelopt"`), add `--quantization modelopt` to `deployment.extra_args`. vLLM auto-detects the specific quantization format (FP8, W4A8_AWQ, NVFP4, NVFP4_AWQ, etc.) from the config. No format-specific flags are needed.

If neither config is found, the checkpoint is unquantized — no quantization flag needed.

This removes the per-algorithm table (including the `modelopt_fp4` row) and aligns with the past review feedback and vLLM's unified ModelOpt quantization support.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 102 - 119: the documentation incorrectly maps specific ModelOpt algorithms to multiple vLLM flags (e.g., `--quantization modelopt_fp4`); instead, update the section that inspects `hf_quant_config.json` / `quantization.quant_algo` (with fallback to `config.json` → `quantization_config.quant_method`) to instruct adding a single unified flag in `deployment.extra_args`: `--quantization modelopt`, since vLLM/SGLang auto-detect the concrete format from the checkpoint; remove the per-algorithm table and replace it with guidance to use `--quantization modelopt` when `quant_method: "modelopt"` or `quantization.quant_algo` is present, otherwise omit the flag.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:

In @.claude/skills/evaluation/SKILL.md:
- Around line 102-119: same guidance as the inline prompt above — use the single `--quantization modelopt` flag and remove the per-algorithm table.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: abfe18ed-90df-4a26-94ef-79642a168e99

📥 Commits

Reviewing files that changed from the base of the PR and between 0a4a607de5360e4e848e9a691fcc090cc11b8e49 and 27a9aa12ac71d8a86b8f2fce5d8e7e07d57e691b.

📒 Files selected for processing (7)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/references/model-card-research.md
- .claude/skills/evaluation/references/multi-node.md
- .claude/skills/evaluation/references/quantization-benchmarks.md
- .claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json
- .claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json
- .markdownlint-cli2.yaml

✅ Files skipped from review due to trivial changes (4)
- .markdownlint-cli2.yaml
- .claude/skills/evaluation/references/quantization-benchmarks.md
- .claude/skills/evaluation/references/model-card-research.md
- .claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json

🚧 Files skipped from review as they are similar to previous changes (1)
- .claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json
Edwardf0t1 left a comment
Discussed offline on all comments - LGTM
why does it depend on remote_exec.sh?
5. Benchmarks:
   Allow for multiple choices in this question.
   1. Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...)
   2. Code Evaluation (like HumanEval, MBPP, and LiveCodeBench)
   3. Math & Reasoning (like AIME, GPQA, MATH-500, ...)
   4. Safety & Security (like Garak and Safety Harness)
   5. Multilingual (like MMATH, Global MMLU, MMLU-Prox)
Is it possible to let the agent take one custom yaml file for benchmarks?
Good point. I've added support for custom yaml file in Step 1. If the user already has a config file, the skill now skips config generation.
If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantization flag in `deployment.extra_args`:

| `quant_algo` | Flag to add |
For some quantized checkpoints, we need to set env vars like:
VLLM_NVFP4_GEMM_BACKEND=marlin VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm for Nemotron Super, will the skill capture this?
The skill currently only sets --quantization based on quant_algo. Model-specific env vars like VLLM_NVFP4_GEMM_BACKEND are not captured from the quant config since they come from the model card. But the skill does handle this already, just in a different step. In Step 3, references/model-card-research.md instructs:
- Deployment-specific extra_args for vLLM/SGLang (look for the vLLM/SGLang deployment command)
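As an illustration, the variables quoted in the question above would end up in the serving environment that Step 3 configures, roughly like this (whether they belong in the NEL config or the launch shell is deployment-specific, and the model path is a placeholder):

```bash
# Env vars quoted from the review comment above (values apply to Nemotron Super per that comment).
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm

# Illustrative local vLLM launch for a ModelOpt-quantized checkpoint (path is a placeholder).
vllm serve /models/nemotron-super-nvfp4 --quantization modelopt
```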
Add a Claude Code skill for evaluating LLM accuracy using NeMo Evaluator Launcher (NEL). Based on the upstream nel-assistant skill (commit f1fa073) with ModelOpt-specific additions:
- Auto-detect ModelOpt quantization format from hf_quant_config.json and set the correct vLLM/SGLang --quantization flag
- Quantization-aware benchmark defaults (recommend MMLU, GSM8K, ARC-Challenge for quantized models)
- Workspace management for multi-user environments (Step 0)
- Disable MD036/MD029 markdownlint rules for upstream NEL formatting

The skill guides users through NEL config generation, model card research, and evaluation execution (local and SLURM).

Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Force-pushed from 27a9aa1 to b8c45e8.
Actionable comments posted: 1
♻️ Duplicate comments (2)
.claude/skills/evaluation/SKILL.md (2)
189-190: ⚠️ Potential issue | 🟡 Minor — Security note should explicitly include both `pre_cmd` and `post_cmd`.

Line 189 currently warns about `pre_cmd` only, but enabling `NEMO_EVALUATOR_TRUST_PRE_CMD=1` affects trusted command execution behavior beyond that single key; call out both keys explicitly to reduce audit misses.

Suggested doc tweak:

-# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands):
+# If using pre_cmd or post_cmd (review both `pre_cmd` and `post_cmd` content before enabling — they run arbitrary commands):
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 189 - 190: update the security note that currently references only pre_cmd to explicitly call out both pre_cmd and post_cmd — warn that enabling this trust flag affects execution of both the pre_cmd and post_cmd hooks (i.e., both keys are treated as trusted arbitrary commands), and ensure the text names the exact symbols pre_cmd, post_cmd and the environment toggle NEMO_EVALUATOR_TRUST_PRE_CMD so auditors see all relevant identifiers.

112-117: ⚠️ Potential issue | 🔴 Critical — Remove invalid `modelopt_fp4` mapping and use the unified `modelopt` flag.

This mapping is still incorrect for vLLM: `--quantization modelopt_fp4` is not a valid mode; ModelOpt variants (including NVFP4) are handled via `--quantization modelopt`.

Verification question: In current vLLM docs, what are valid values for `--quantization` when loading NVIDIA ModelOpt checkpoints, and is `--quantization modelopt_fp4` supported?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 112 - 117, The mapping for `quant_algo` is incorrect: remove any references to `--quantization modelopt_fp4` and unify ModelOpt-related entries to use `--quantization modelopt`; update the table rows for `NVFP4`, `NVFP4_AWQ` (and any other ModelOpt variants) to list `--quantization modelopt` instead of `modelopt_fp4`, and ensure the generic guidance for other values still suggests `--quantization modelopt` or consulting vLLM/SGLang docs; look for the `quant_algo` table and the `--quantization modelopt_fp4` string to make the change.
🧹 Nitpick comments (1)
.claude/skills/evaluation/references/multi-node.md (1)
3-3: Fix pattern count inconsistency.

Line 3 says there are two patterns, but the doc also defines a third combined Pattern A+B. Please reword to avoid ambiguity when users select a setup.

Suggested doc tweak:
Suggested doc tweak
-There are two multi-node patterns. Ask the user which applies: +There are two base multi-node patterns plus one combined pattern. Ask the user which applies:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/references/multi-node.md at line 3, The opening sentence claiming "two multi-node patterns" is inconsistent with the later definitions that include "Pattern A", "Pattern B", and the combined "Pattern A+B"; update the line that currently says there are two patterns to either state there are three patterns (Pattern A, Pattern B, and Pattern A+B) or change wording to ask users which of the following patterns (Pattern A, Pattern B, or the combined Pattern A+B) applies to their setup so selection is unambiguous; ensure you reference the names "Pattern A", "Pattern B", and "Pattern A+B" in the revised sentence.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 99-100: The path-type detection in the docs only lists `/` and
`./` as local checkpoint prefixes, which misses common forms like `../`, `~/`,
and plain relative directories (e.g., `checkpoints/model`); update the guidance
around setting `deployment.checkpoint_path` vs `deployment.hf_model_handle` to
treat any path that starts with `/`, `./`, `../`, `~/`, or a bare relative
directory (no colon or slash indicating an HF handle) as a checkpoint path and
show examples, and state that HF handles should be of the form `org/model-name`
with `deployment.hf_model_handle` while `deployment.checkpoint_path` should be
used for the listed local path forms.
---
Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 189-190: Update the security note that currently references only
pre_cmd to explicitly call out both pre_cmd and post_cmd: change the sentence
referencing NEMO_EVALUATOR_TRUST_PRE_CMD to warn that enabling this trust flag
affects execution of both the pre_cmd and post_cmd hooks (i.e., both keys are
treated as trusted arbitrary commands), and ensure the text names the exact
symbols pre_cmd, post_cmd and the environment toggle
NEMO_EVALUATOR_TRUST_PRE_CMD so auditors see all relevant identifiers.
- Around line 112-117: The mapping for `quant_algo` is incorrect: remove any
references to `--quantization modelopt_fp4` and unify ModelOpt-related entries
to use `--quantization modelopt`; update the table rows for `NVFP4`, `NVFP4_AWQ`
(and any other ModelOpt variants) to list `--quantization modelopt` instead of
`modelopt_fp4`, and ensure the generic guidance for other values still suggests
`--quantization modelopt` or consulting vLLM/SGLang docs; look for the
`quant_algo` table and the `--quantization modelopt_fp4` string to make the
change.
---
Nitpick comments:
In @.claude/skills/evaluation/references/multi-node.md:
- Line 3: The opening sentence claiming "two multi-node patterns" is
inconsistent with the later definitions that include "Pattern A", "Pattern B",
and the combined "Pattern A+B"; update the line that currently says there are
two patterns to either state there are three patterns (Pattern A, Pattern B, and
Pattern A+B) or change wording to ask users which of the following patterns
(Pattern A, Pattern B, or the combined Pattern A+B) applies to their setup so
selection is unambiguous; ensure you reference the names "Pattern A", "Pattern
B", and "Pattern A+B" in the revised sentence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 97eaffba-9efb-45fb-a21b-7be4fb9cffef
📒 Files selected for processing (6)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/references/model-card-research.md
- .claude/skills/evaluation/references/multi-node.md
- .claude/skills/evaluation/references/quantization-benchmarks.md
- .claude/skills/evaluation/tests/evals.json
- .markdownlint-cli2.yaml
✅ Files skipped from review due to trivial changes (4)
- .markdownlint-cli2.yaml
- .claude/skills/evaluation/references/quantization-benchmarks.md
- .claude/skills/evaluation/references/model-card-research.md
- .claude/skills/evaluation/tests/evals.json
- Checkpoint path (starts with `/` or `./`) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
- HF handle (e.g., `org/model-name`) → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`
Path-type detection misses common local path forms.
Line 99 only treats / and ./ as local paths. ../..., ~/..., and plain relative dirs (e.g., checkpoints/model) will be misclassified as HF handles.
Suggested doc tweak
-- Checkpoint path (starts with `/` or `./`) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
-- HF handle (e.g., `org/model-name`) → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`
+- Checkpoint path (absolute, `./`, `../`, `~/`, or any path that exists locally) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
+- HF handle (e.g., `org/model-name`) only if the value is not an existing local path → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
- Checkpoint path (starts with `/` or `./`) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
- HF handle (e.g., `org/model-name`) → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`

After:
- Checkpoint path (absolute, `./`, `../`, `~/`, or any path that exists locally) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
- HF handle (e.g., `org/model-name`) only if the value is not an existing local path → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/evaluation/SKILL.md around lines 99 - 100, The path-type
detection in the docs only lists `/` and `./` as local checkpoint prefixes,
which misses common forms like `../`, `~/`, and plain relative directories
(e.g., `checkpoints/model`); update the guidance around setting
`deployment.checkpoint_path` vs `deployment.hf_model_handle` to treat any path
that starts with `/`, `./`, `../`, `~/`, or a bare relative directory (no colon
or slash indicating an HF handle) as a checkpoint path and show examples, and
state that HF handles should be of the form `org/model-name` with
`deployment.hf_model_handle` while `deployment.checkpoint_path` should be used
for the listed local path forms.
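A small shell sketch of the classification rule this suggestion describes (the variable name and the local-existence check are assumptions for illustration, not part of the PR):

```bash
MODEL_ARG="$1"  # e.g. /ckpts/llama-fp8, ./out/model, ../model, ~/model, or org/model-name

# Treat anything that exists locally, or that looks like a filesystem path,
# as a checkpoint; otherwise treat it as a Hugging Face handle.
if [ -e "$MODEL_ARG" ] || [[ "$MODEL_ARG" == /* || "$MODEL_ARG" == ./* || "$MODEL_ARG" == ../* || "$MODEL_ARG" == "~/"* ]]; then
  echo "deployment.checkpoint_path: $MODEL_ARG"
  echo "deployment.hf_model_handle: null"
else
  echo "deployment.hf_model_handle: $MODEL_ARG"
  echo "deployment.checkpoint_path: null"
fi
```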
What does this PR do?
Type of change: New feature
Add a Claude Code skill for evaluating LLM accuracy using NeMo Evaluator Launcher (NEL). Based on the upstream nel-assistant skill with ModelOpt-specific additions:
- Auto-detect ModelOpt quantization format from `hf_quant_config.json` (with `config.json` fallback) and set the correct vLLM/SGLang `--quantization` flag
- Supporting reference guides under `references/` for on-demand loading

Skill structure
The skill guides users through: NEL installation check → config generation via
`nel skills build-config` → model card research → parameter tuning → task selection → multi-node setup → interceptors → execution with dry-run/test/full modes.

Testing
Invoke in Claude Code:
Before your PR is "Ready for review"
CONTRIBUTING.md: ✅ (NEL skill attributed in frontmatter)

🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
Tests
Chores