Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
📝 Walkthrough

The PR introduces comprehensive remote execution infrastructure and post-training quantization (PTQ) workflows for ModelOpt. It adds a cluster configuration schema, shared environment detection and remote execution utilities written in Bash, detailed documentation for setting up local and remote environments, workspace management patterns, SLURM job handling, and a complete PTQ skill definition with test specifications and reference guides for launcher usage and unsupported model handling.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
✅ Passed checks (3 passed)
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1107      +/-   ##
==========================================
+ Coverage   70.16%   70.18%   +0.02%
==========================================
  Files         229      230       +1
  Lines       26008    26080      +72
==========================================
+ Hits        18248    18304      +56
- Misses       7760     7776      +16
==========================================
```

View full report in Codecov by Sentry.
Pull request overview
Adds a new Claude skill (ptq) to guide/execute NVIDIA ModelOpt post-training quantization workflows, including remote execution and SLURM-oriented operational guidance.
Changes:
- Introduces the `ptq` skill playbook and reference guides for unsupported models, SLURM setup, and remote execution.
- Adds a reusable `remote_exec.sh` library for SSH ControlMaster sessions, remote command execution, rsync syncing, and SLURM helpers.
- Documents multi-user workspace management and provides a sample `clusters.yaml` configuration.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| `.claude/skills/ptq/SKILL.md` | PTQ decision flow and key rules for supported vs unsupported models, plus environment selection (local/remote/SLURM). |
| `.claude/skills/ptq/references/unsupported-models.md` | Patterns and investigation steps for quantizing unsupported architectures. |
| `.claude/skills/ptq/references/slurm-setup.md` | SLURM container/job-script templates and operational guidance (smoke tests, monitoring). |
| `.claude/skills/ptq/references/remote-execution.md` | Remote execution workflow and cluster configuration guidance. |
| `.claude/skills/common/remote_exec.sh` | Bash utility library for persistent SSH, remote run/sync, environment detection, and SLURM helpers. |
| `.claude/skills/common/workspace-management.md` | Workspace creation/reuse guidance for multi-user (Slack bot) environments. |
| `.claude/clusters.yaml.example` | Example cluster configuration template. |
Add a Claude Code skill that guides the agent through post-training quantization (PTQ) of LLMs/VLMs using ModelOpt. Includes: - PTQ skill with decision tree for supported/unsupported models - Launcher integration for SLURM/Docker execution - Common environment setup (shared across future skills) - Remote execution and SLURM setup references (manual fallback) - Unsupported model monkey-patching guide - Workspace management for multi-user environments Signed-off-by: Meng Xin <mxin@nvidia.com>
Workspace-by-model-name is useful for single users too, not just multi-user Slack bot. Rewrite to cover both: ./workspaces/ for single-user, $MODELOPT_WORKSPACE_ROOT for multi-user. Signed-off-by: Meng Xin <mxin@nvidia.com>
Edwardf0t1 left a comment:
Thanks @mxinO for refactoring the ptq skills based on our discussions, and I think the PR is in good shape.
- Let's consider adding a `tests.json` with some basic cases (e.g., supported model on SLURM, unsupported VLM, format selection on Hopper vs Blackwell, smoke test failure handling).
- We can also add negative triggers to the frontmatter description, especially when deployment and evaluation skills are added. cc @kaix-nv
I think we can consider adding a `references/supported-models.md` as well. Currently it's inline in SKILL.md, and the `hf_ptq.py` workflow and flag reference is just a few lines in SKILL.md rather than a reference. Acceptable for now since it's short, but if it grows, we can extract it.
The review comment is anchored on this excerpt:

> # ModelOpt Post-Training Quantization
>
> Produce a quantized checkpoint from a pretrained model. **Read `examples/llm_ptq/README.md` first** — it has the support matrix, CLI flags, and accuracy guidance.
I think this works but means the agent loads the entire README rather than a focused GPU → format matrix. Consider extracting later, e.g., adding a references/format-selection.md.
I think it's better to refer to the README; otherwise, each time new models are added we'd need to update the skill. We can talk about this later, but for now let's keep it referring to the README.
Sounds good. It would be better to point to a specific section once we know exactly what we want Claude to read.
Actionable comments posted: 6
🧹 Nitpick comments (2)
.claude/skills/common/remote-execution.md (1)
112-112: Use `sbatch --parsable` for deterministic job ID extraction.

The grep-based approach can misparse output if SLURM includes multiple numbers. Use `--parsable` with `cut -d';' -f1` for reliable, machine-readable extraction that handles both single-cluster and federated SLURM environments.

Proposed doc fix:

```diff
-JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
+JOBID=$(remote_run "sbatch --parsable /remote/path/scripts/job_slurm.sh" | cut -d';' -f1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/common/remote-execution.md at line 112, The current JOBID extraction using remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep ... is brittle; change the sbatch invocation to use --parsable and extract the first field to get a deterministic job id (e.g., run remote_run "sbatch --parsable /remote/path/scripts/job_slurm.sh" and pipe to cut -d';' -f1) so JOBID reliably contains only the numeric job id even in federated SLURM setups; update the assignment of JOBID and any references to JOBID accordingly.

.claude/skills/common/slurm-setup.md (1)
58-59: Use `sbatch --parsable` for stable job ID capture.

Parsing the 4th whitespace token is brittle and depends on human-readable output format, which can break with localized SLURM installations or site-specific wrappers. The `--parsable` flag is the SLURM-recommended approach for scripting, outputting a structured format (job ID, or job ID;cluster name).

Proposed doc fix:

```diff
-JOBID=$(sbatch <script>.sh | awk '{print $4}')
+JOBID=$(sbatch --parsable <script>.sh | cut -d';' -f1)
 echo "Submitted job $JOBID"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/common/slurm-setup.md around lines 58 - 59, The job submission snippet uses brittle whitespace parsing to extract JOBID; replace the sbatch invocation with the parsable mode and assign its output directly to JOBID (e.g., use sbatch --parsable) and, if you need only the numeric ID, split on ';' or take the part before any cluster suffix; update the lines that set and echo the JOBID variable (JOBID=$(sbatch ...) and echo "Submitted job $JOBID") to use this robust approach.
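Both fixes rely on the same extraction step. A minimal, runnable sketch of that parsing logic, using simulated `sbatch --parsable` output since no SLURM cluster is assumed here (`parse_jobid` is an illustrative name, not part of the docs under review):

```shell
# Hypothetical helper: extract the numeric job ID from `sbatch --parsable`
# output, which is "12345" on a single cluster or "12345;clustername" on
# federated SLURM.
parse_jobid() {
  cut -d';' -f1 <<<"$1"
}

parse_jobid "12345;cluster1"   # prints 12345
parse_jobid "67890"            # prints 67890
```

The same one-liner handles both output shapes, which is why the reviews recommend it over grep or awk field guessing.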
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/common/environment-setup.md:
- Around line 18-31: The doc assumes .claude is in $PWD and mis-specifies SLURM
detection; update the recipe to instruct callers to resolve the
repository/project root (the same ancestor-walk used by your implementation)
before referencing .claude/clusters.yaml and invoking remote_load_cluster and
remote_check_ssh, and clarify that remote_detect_env should detect SLURM by
checking for sbatch (not both srun and sbatch) so PATH/partial-login-node cases
don't pick the wrong branch; also state that remote_load_cluster must be passed
the resolved cluster config path (not a relative .claude path) so subdirectory
invocation behaves identically to the implementation.
In @.claude/skills/common/remote_exec.sh:
- Around line 217-245: The code currently exports REMOTE_WORKSPACE without
validating it, which can result in accidental root (/) or empty paths used by
remote_sync_to/remote_sync_from/remote_ensure_workspace/remote_docker_run; after
parsing REMOTE_WORKSPACE (the REMOTE_WORKSPACE assignment) add a validation: if
REMOTE_WORKSPACE is empty or equals "/" (or otherwise looks invalid) then fail
fast (echo a clear error referencing the cluster name and return 1) so the
config is rejected rather than exported; ensure this validation happens before
the export of REMOTE_WORKSPACE and before printing "Loaded cluster..." so
consumers like remote_sync_to, remote_sync_from, remote_ensure_workspace and
remote_docker_run never receive an invalid workspace.
- Around line 66-82: The _parse_yaml_value function currently injects the
dot.path into an inlined Python string (keys = '$path'), which enables code
injection; change it to accept the file and path via Python argv (use sys.argv
for file and lookup keys) and print the resolved scalar, avoiding any string
interpolation of YAML inputs. Likewise remove uses of eval that build commands
from cluster_name, REMOTE_HOST, REMOTE_SSH_KEY and sync path variables: replace
all eval-based command construction with explicit arrays/argument lists for ssh
and rsync (construct subprocess/exec argument arrays or shell-safe arrays in
bash) and pass variables as quoted argv elements rather than interpolating them
into command strings so no untrusted config can execute locally. Ensure
references to the functions/places using eval (the sections building SSH
connection strings and rsync invocations) are updated to use safe argument
arrays and direct variable passing.
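The argv-passing fix described above can be sketched in miniature. `lookup_key` below is a hypothetical stand-in for `_parse_yaml_value` that only echoes the key, showing that the dotted path travels to Python as data (argv) rather than being interpolated into the program text:

```shell
# Hypothetical sketch: the dotted key path is passed via sys.argv, so a
# malicious config value cannot alter the Python source being executed.
lookup_key() {
  python3 -c 'import sys; print("requested key:", sys.argv[1])' "$1"
}

lookup_key "clusters.mycluster.workspace"
lookup_key "inj'; bad"   # quotes arrive as literal data, not code
```

A real implementation would load the YAML file inside the Python body and walk the keys from `sys.argv`; the point here is only the argv boundary.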
In @.claude/skills/common/workspace-management.md:
- Line 45: Replace the awkward phrase "Task needs output from a previous step
(e.g., eval needs the PTQ checkpoint)" with clearer wording: "Task requires
output from a previous step (e.g., eval requires the PTQ checkpoint)"; update
the string in the document where that exact text appears (search for "Task needs
output from a previous step") so both instances in the example use "requires"
for readability and consistency.
In @.claude/skills/ptq/references/launcher-guide.md:
- Around line 46-50: Clarify that the find command targeting
tools/launcher/local_experiments and matching */exported_model/*/config.json is
specific to local Docker runs (local-launcher) by adding a short note that this
lookup only applies to local Docker mode and to scope the host-path lookup
accordingly; also add a separate brief note describing where artifacts live for
remote SLURM runs (e.g., central experiment/artifact storage or node-specific
scratch paths) and suggest the user consult SLURM job logs or the remote
experiment directory rather than the local tools/launcher/local_experiments
path.
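Assuming the layout the prompt describes, the local-Docker artifact lookup can be sketched against a simulated experiment tree (directory and model names below are made up for illustration):

```shell
# Hypothetical sketch: simulate tools/launcher/local_experiments and find
# exported checkpoints by their config.json, as the launcher guide suggests
# for local Docker runs only.
root=$(mktemp -d)
mkdir -p "$root/exp1/exported_model/Llama-3-8B"
touch "$root/exp1/exported_model/Llama-3-8B/config.json"

find "$root" -path '*/exported_model/*/config.json'
```

For remote SLURM runs this host-path lookup does not apply; artifacts live in the remote experiment directory, per the note above.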
In @.claude/skills/ptq/SKILL.md:
- Line 12: The reference paths in the sentence "Read
`skills/common/environment-setup.md` and
`skills/common/workspace-management.md`." should be normalized to match
companion docs by prefixing with ".claude/"; update those two references to
".claude/skills/common/environment-setup.md" and
".claude/skills/common/workspace-management.md" (and scan the rest of
.claude/skills/ptq/SKILL.md for any other `skills/...` references and change
them to the ".claude/skills/..." style to keep consistency).
---
Nitpick comments:
In @.claude/skills/common/remote-execution.md:
- Line 112: The current JOBID extraction using remote_run "sbatch
/remote/path/scripts/job_slurm.sh" | grep ... is brittle; change the sbatch
invocation to use --parsable and extract the first field to get a deterministic
job id (e.g., run remote_run "sbatch --parsable
/remote/path/scripts/job_slurm.sh" and pipe to cut -d';' -f1) so JOBID reliably
contains only the numeric job id even in federated SLURM setups; update the
assignment of JOBID and any references to JOBID accordingly.
In @.claude/skills/common/slurm-setup.md:
- Around line 58-59: The job submission snippet uses brittle whitespace parsing
to extract JOBID; replace the sbatch invocation with the parsable mode and
assign its output directly to JOBID (e.g., use sbatch --parsable) and, if you
need only the numeric ID, split on ';' or take the part before any cluster
suffix; update the lines that set and echo the JOBID variable (JOBID=$(sbatch
...) and echo "Submitted job $JOBID") to use this robust approach.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 60724936-1b1c-4a85-a31d-a30327a23eed
📒 Files selected for processing (10)
- .claude/clusters.yaml.example
- .claude/skills/common/environment-setup.md
- .claude/skills/common/remote-execution.md
- .claude/skills/common/remote_exec.sh
- .claude/skills/common/slurm-setup.md
- .claude/skills/common/workspace-management.md
- .claude/skills/ptq/SKILL.md
- .claude/skills/ptq/references/launcher-guide.md
- .claude/skills/ptq/references/slurm-setup.md
- .claude/skills/ptq/references/unsupported-models.md
LGTM overall 👍 Left minor comments in lines.
Additional comments:
- We can add a `tests.json` for the ptq skills in a follow-up PR.
- I would consider further extracting the execution and monitoring parts into a separate `run-and-monitor` skill when we land deployment and evaluation skills, to avoid duplicating logic.
Actionable comments posted: 1
♻️ Duplicate comments (1)
.claude/skills/common/workspace-management.md (1)
45-45: ⚠️ Potential issue | 🟡 Minor: Use clearer wording for the reuse criterion.
Line 45 reads awkwardly; use “Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)” for clarity and consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/common/workspace-management.md at line 45, Replace the awkward phrase "Task needs output from a previous step (e.g., eval needs the PTQ checkpoint)" with the clearer wording "Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)" in the workspace-management text; find the sentence that mentions "Task needs output" and update both occurrences of "needs" to "requires" to ensure consistency and clearer phrasing.
🧹 Nitpick comments (2)
.claude/skills/ptq/references/unsupported-models.md (2)
78-84: Make the parameter-name inspection snippet robust for non-sharded checkpoints.

The current snippet assumes `model.safetensors.index.json` always exists. That breaks for single-file checkpoints.

Proposed doc snippet update:

```diff
-import json
-idx = json.load(open('<ckpt_path>/model.safetensors.index.json'))
-import re
-names = set(re.sub(r'\.\d+\.', '.N.', k) for k in idx['weight_map'])
-for n in sorted(names): print(n)
+import json, re, os
+from safetensors import safe_open
+
+index_path = "<ckpt_path>/model.safetensors.index.json"
+names = set()
+if os.path.exists(index_path):
+    idx = json.load(open(index_path))
+    names = set(idx["weight_map"].keys())
+else:
+    # fallback for single-file safetensors
+    with safe_open("<ckpt_path>/model.safetensors", framework="pt") as f:
+        names = set(f.keys())
+
+normalized = set(re.sub(r"\.\d+\.", ".N.", k) for k in names)
+for n in sorted(normalized):
+    print(n)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/ptq/references/unsupported-models.md around lines 78 - 84, The snippet assumes model.safetensors.index.json always exists; change it to first check for that file (use os.path.exists) and if present load idx = json.load(... ) and use idx['weight_map'], otherwise open the single-file checkpoint and extract parameter names (e.g., via safetensors.safe_open or safetensors.torch.load_file to list keys) and then apply the same regex normalization (re.sub(r'\.\d+\.', '.N.', k)) to produce names; update variable usage (idx, names) so both branches yield the same set for printing.
9-9: Fix the internal doc path to avoid a broken reference.

Line 9 points to `skills/common/workspace-management.md`, but this doc tree is under `.claude/skills/...`. Please align the path format used elsewhere in this PR so this reference resolves reliably.

Proposed doc fix:

```diff
-**Download first.** Follow `skills/common/workspace-management.md` to set up local and remote workspaces, sync ModelOpt source, and download the model on the target machine. This avoids downloading twice and gives access to README, custom modeling code, and tokenizer config.
+**Download first.** Follow `.claude/skills/common/workspace-management.md` to set up local and remote workspaces, sync ModelOpt source, and download the model on the target machine. This avoids downloading twice and provides access to README, custom modeling code, and tokenizer config.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/ptq/references/unsupported-models.md at line 9, Update the broken internal doc link in the sentence that currently references "skills/common/workspace-management.md" so it uses the same .claude/skills-relative path format used elsewhere in this PR; replace that target with ".claude/skills/common/workspace-management.md" (the same canonical path style as other references in this doc tree) to ensure the link resolves reliably.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/ptq/references/unsupported-models.md:
- Around line 235-243: Add explicit shape guards before the FP8 block dequant
reshape: verify that w.shape[-2] is divisible by s.shape[-2] and w.shape[-1] is
divisible by s.shape[-1] (the block_m/block_n computation for variables w, s,
block_m, block_n) and raise a clear ValueError (or AssertionError) that includes
the offending shapes if not divisible so we fail fast instead of performing an
invalid reshape; apply this check in the code path where scale.dim() == 3 before
computing block_m/block_n and reshaping/scaling param.data.
---
Duplicate comments:
In @.claude/skills/common/workspace-management.md:
- Line 45: Replace the awkward phrase "Task needs output from a previous step
(e.g., eval needs the PTQ checkpoint)" with the clearer wording "Task requires
output from a previous step (e.g., eval requires the PTQ checkpoint)" in the
workspace-management text; find the sentence that mentions "Task needs output"
and update both occurrences of "needs" to "requires" to ensure consistency and
clearer phrasing.
---
Nitpick comments:
In @.claude/skills/ptq/references/unsupported-models.md:
- Around line 78-84: The snippet assumes model.safetensors.index.json always
exists; change it to first check for that file (use os.path.exists) and if
present load idx = json.load(... ) and use idx['weight_map'], otherwise open the
single-file checkpoint and extract parameter names (e.g., via
safetensors.safe_open or safetensors.torch.load_file to list keys) and then
apply the same regex normalization (re.sub(r'\.\d+\.', '.N.', k)) to produce
names; update variable usage (idx, names) so both branches yield the same set
for printing.
- Line 9: Update the broken internal doc link in the sentence that currently
references "skills/common/workspace-management.md" so it uses the same
.claude/skills-relative path format used elsewhere in this PR; replace that
target with ".claude/skills/common/workspace-management.md" (the same canonical
path style as other references in this doc tree) to ensure the link resolves
reliably.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9107923f-5183-44df-8d92-439acfcc8e31
📒 Files selected for processing (5)
- .claude/skills/common/environment-setup.md
- .claude/skills/common/workspace-management.md
- .claude/skills/ptq/SKILL.md
- .claude/skills/ptq/references/unsupported-models.md
- .claude/skills/ptq/tests.json
✅ Files skipped from review due to trivial changes (3)
- .claude/skills/ptq/tests.json
- .claude/skills/common/environment-setup.md
- .claude/skills/ptq/SKILL.md
Actionable comments posted: 2
♻️ Duplicate comments (1)
.claude/skills/common/remote_exec.sh (1)
105-127: ⚠️ Potential issue | 🔴 Critical: Replace remaining `eval` command construction (still injectable).

Lines 155, 261, 348, 381, and 393 still execute interpolated command strings via `eval`. Because host/key/proxy/path values come from config or function args, this keeps a local command-injection path open.

🔧 Safer pattern (argv arrays, no `eval`):

```diff
-_ssh_base_opts() {
-  local opts="-o BatchMode=yes -o ConnectTimeout=15 -o StrictHostKeyChecking=accept-new"
-  local ctl_path
-  ctl_path="$(_ssh_control_path)"
-  opts+=" -o ControlPath='${ctl_path}'"
-  opts+=" -o ControlMaster=auto"
-  if [[ -n "${REMOTE_SSH_KEY:-}" ]]; then
-    opts+=" -i $REMOTE_SSH_KEY"
-  fi
-  if [[ -n "${REMOTE_SSH_PROXY:-}" ]]; then
-    opts+=" -o ProxyCommand='${REMOTE_SSH_PROXY}'"
-  fi
-  echo "$opts"
-}
-
-_ssh_base_cmd() {
-  echo "ssh $(_ssh_base_opts) ${REMOTE_USER}@${REMOTE_HOST}"
-}
+_ssh_base_argv() {
+  SSH_BASE_ARGV=(
+    -o BatchMode=yes
+    -o ConnectTimeout=15
+    -o StrictHostKeyChecking=accept-new
+    -o "ControlPath=$(_ssh_control_path)"
+    -o ControlMaster=auto
+  )
+  [[ -n "${REMOTE_SSH_KEY:-}" ]] && SSH_BASE_ARGV+=(-i "$REMOTE_SSH_KEY")
+  [[ -n "${REMOTE_SSH_PROXY:-}" ]] && SSH_BASE_ARGV+=(-o "ProxyCommand=${REMOTE_SSH_PROXY}")
+  SSH_BASE_ARGV+=("${REMOTE_USER}@${REMOTE_HOST}")
+}
@@
-  eval "ssh $opts -f -N ${REMOTE_USER}@${REMOTE_HOST}" 2>&1
+  ssh "${start_argv[@]}" -f -N 2>&1
@@
-  if result=$(eval "$(_ssh_base_cmd)" '"echo SSH_OK"' 2>&1); then
+  _ssh_base_argv
+  if result=$(ssh "${SSH_BASE_ARGV[@]}" "echo SSH_OK" 2>&1); then
@@
-  result=$(eval "$(_ssh_base_cmd)" "'echo $encoded | base64 -d | bash'" 2>&1) && rc=$? || rc=$?
+  _ssh_base_argv
+  result=$(ssh "${SSH_BASE_ARGV[@]}" "echo $encoded | base64 -d | bash" 2>&1) && rc=$? || rc=$?
@@
-  eval "$rsync_cmd"
+  rsync "${rsync_argv[@]}"
@@
-  eval "rsync -avz --progress -e \"ssh $(_ssh_base_opts)\" '${remote_src}/' '${local_path}/'"
+  rsync "${rsync_argv[@]}"
```

Verification script:

```bash
#!/bin/bash
# Verify no eval-based execution remains in this script.
rg -n '\beval\b|_ssh_base_cmd\b|rsync_cmd=' .claude/skills/common/remote_exec.sh
```

Expected: no `eval` usage for ssh/rsync execution paths.

Also applies to: 155-155, 261-261, 348-348, 371-381, 393-393
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/common/remote_exec.sh around lines 105 - 127, The script still builds and executes interpolated command strings via eval (notably around _ssh_base_cmd/_ssh_base_opts and places that set rsync_cmd), leaving command-injection risk; fix by eliminating eval and switching to POSIX/Bash argv arrays: change callers that currently do eval "$(_ssh_base_cmd) ...", instead build an array like local -a ssh_cmd=(ssh -o BatchMode=yes -o ConnectTimeout=15 -o StrictHostKeyChecking=accept-new) then conditionally append args with quoted variables (e.g. [[ -n $REMOTE_SSH_KEY ]] && ssh_cmd+=(-i "$REMOTE_SSH_KEY"); [[ -n $REMOTE_SSH_PROXY ]] && ssh_cmd+=(-o "ProxyCommand=$REMOTE_SSH_PROXY"); ssh_cmd+=("${REMOTE_USER}@${REMOTE_HOST}")), and execute with "${ssh_cmd[@]}" (similarly convert any rsync_cmd string to an array and call "${rsync_cmd[@]}"); update or add an array-returning helper (or rename _ssh_base_opts to produce/apply options as array pieces) and remove all eval usages so no interpolated strings are executed.
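The argv-array pattern recommended above can be shown in a self-contained sketch. The variable values are placeholders and no SSH connection is attempted; the assembled argument vector is only printed:

```shell
# Hypothetical sketch: collect ssh options in a bash array so values with
# spaces or shell metacharacters stay single argv elements (no eval needed).
build_ssh_argv() {
  SSH_ARGV=(-o BatchMode=yes -o ConnectTimeout=15)
  [[ -n "${REMOTE_SSH_KEY:-}" ]] && SSH_ARGV+=(-i "$REMOTE_SSH_KEY")
  [[ -n "${REMOTE_SSH_PROXY:-}" ]] && SSH_ARGV+=(-o "ProxyCommand=${REMOTE_SSH_PROXY}")
  SSH_ARGV+=("${REMOTE_USER}@${REMOTE_HOST}")
}

REMOTE_USER=alice
REMOTE_HOST=login01
REMOTE_SSH_KEY='/keys/id ed25519'   # the space survives as one argument

build_ssh_argv
printf '<%s>\n' "${SSH_ARGV[@]}"    # real use: ssh "${SSH_ARGV[@]}" <command>
```

Because each element is quoted once when the array is expanded, a key path or proxy command containing spaces or quotes never gets re-parsed by the shell, which is the injection the review flags.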
🧹 Nitpick comments (1)
.claude/skills/ptq/references/unsupported-models.md (1)
52-53: Pin the transformers install by commit SHA for reproducibility.

The current doc line installs from a moving `main` checkout, which makes troubleshooting non-reproducible. Recommend documenting install by commit SHA:

```diff
- - **Found** → install from that clone: `pip install /tmp/transformers-main --quiet`, then re-run `AutoConfig.from_pretrained()`.
+ - **Found** → install from that clone at a pinned commit, then re-run `AutoConfig.from_pretrained()`:
+   `cd /tmp/transformers-main && git rev-parse HEAD`
+   `pip install /tmp/transformers-main --quiet`
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/ptq/references/unsupported-models.md around lines 52 - 53, The doc currently tells users to install from a moving main checkout ("pip install /tmp/transformers-main --quiet") which harms reproducibility; update the instruction to pin the installed transformers to a specific commit SHA by checking out the repo to that SHA before installing, and update the surrounding guidance (the sentence referencing AutoConfig.from_pretrained() and the follow-up question about <ArchName>) to instruct using the pinned-commit install for reproducible debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/common/remote_exec.sh:
- Around line 217-239: REMOTE_WORKSPACE may contain a literal "~/" which is not
expanded and breaks remote_run(), remote_sync_to(), rsync and mounts; add the
same tilde handling used for REMOTE_SSH_KEY to expand "~/" to "${HOME}/..."
right after REMOTE_WORKSPACE is assigned (or alternatively reject values
beginning with "~/" with an error), and ensure references in remote_run and
remote_sync_to continue to use the expanded REMOTE_WORKSPACE value; update
validation that currently checks REMOTE_WORKSPACE to run after the expansion so
the non-root and non-empty checks operate on the expanded path.
In @.claude/skills/ptq/references/unsupported-models.md:
- Around line 14-15: Update the guidance around using --trust_remote_code
(referenced in the sentence about modeling_*.py and tokenization_*.py and the
tools hf_ptq.py, AutoConfig, AutoTokenizer, AutoModel and
trust_remote_code=True) to include a security warning: state that
--trust_remote_code executes arbitrary Python from the model repo, recommend
running it only in isolated or ephemeral environments (e.g., containers or VMs),
advise pinning to a specific trusted model revision/commit and verifying the
remote code before use, and suggest running with least privileges and network
restrictions where possible.
---
Duplicate comments:
In @.claude/skills/common/remote_exec.sh:
- Around line 105-127: The script still builds and executes interpolated command
strings via eval (notably around _ssh_base_cmd/_ssh_base_opts and places that
set rsync_cmd), leaving command-injection risk; fix by eliminating eval and
switching to POSIX/Bash argv arrays: change callers that currently do eval
"$(_ssh_base_cmd) ...", instead build an array like local -a ssh_cmd=(ssh -o
BatchMode=yes -o ConnectTimeout=15 -o StrictHostKeyChecking=accept-new) then
conditionally append args with quoted variables (e.g. [[ -n $REMOTE_SSH_KEY ]]
&& ssh_cmd+=(-i "$REMOTE_SSH_KEY"); [[ -n $REMOTE_SSH_PROXY ]] && ssh_cmd+=(-o
"ProxyCommand=$REMOTE_SSH_PROXY"); ssh_cmd+=("${REMOTE_USER}@${REMOTE_HOST}")),
and execute with "${ssh_cmd[@]}" (similarly convert any rsync_cmd string to an
array and call "${rsync_cmd[@]}"); update or add an array-returning helper (or
rename _ssh_base_opts to produce/apply options as array pieces) and remove all
eval usages so no interpolated strings are executed.
---
Nitpick comments:
In @.claude/skills/ptq/references/unsupported-models.md:
- Around line 52-53: The doc currently tells users to install from a moving main
checkout ("pip install /tmp/transformers-main --quiet") which harms
reproducibility; update the instruction to pin the installed transformers to a
specific commit SHA by checking out the repo to that SHA before installing, and
update the surrounding guidance (the sentence referencing
AutoConfig.from_pretrained() and the follow-up question about <ArchName>) to
instruct using the pinned-commit install for reproducible debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 251ab2f2-b085-471c-93b7-5fd5c7045659
📒 Files selected for processing (7)
- .claude/skills/common/remote-execution.md
- .claude/skills/common/remote_exec.sh
- .claude/skills/common/workspace-management.md
- .claude/skills/ptq/SKILL.md
- .claude/skills/ptq/references/launcher-guide.md
- .claude/skills/ptq/references/slurm-setup-ptq.md
- .claude/skills/ptq/references/unsupported-models.md
✅ Files skipped from review due to trivial changes (5)
- .claude/skills/ptq/references/slurm-setup-ptq.md
- .claude/skills/common/workspace-management.md
- .claude/skills/ptq/SKILL.md
- .claude/skills/common/remote-execution.md
- .claude/skills/ptq/references/launcher-guide.md
Quoted from `.claude/skills/common/remote_exec.sh`:

```bash
REMOTE_WORKSPACE="$(_parse_yaml_value "$config_file" "clusters.${cluster_name}.workspace")"
REMOTE_GPU_TYPE="$(_parse_yaml_value "$config_file" "clusters.${cluster_name}.gpu_type")"
REMOTE_CONTAINER_IMAGE="$(_parse_yaml_value "$config_file" "clusters.${cluster_name}.container_image")"
REMOTE_ENV_TYPE="$(_parse_yaml_value "$config_file" "clusters.${cluster_name}.env_type")"

# SLURM-specific
REMOTE_SLURM_ACCOUNT="$(_parse_yaml_value "$config_file" "clusters.${cluster_name}.slurm.default_account")"
REMOTE_SLURM_PARTITION="$(_parse_yaml_value "$config_file" "clusters.${cluster_name}.slurm.default_partition")"

# Expand ~ in ssh_key
if [[ "${REMOTE_SSH_KEY:-}" == "~/"* ]]; then
  REMOTE_SSH_KEY="${HOME}/${REMOTE_SSH_KEY#\~/}"
fi

# Validate required fields
if [[ -z "$REMOTE_HOST" ]]; then
  echo "ERROR: Cluster '$cluster_name' has no login_node defined." >&2
  return 1
fi
if [[ -z "${REMOTE_WORKSPACE:-}" || "$REMOTE_WORKSPACE" == "/" ]]; then
  echo "ERROR: Cluster '$cluster_name' must define a non-root workspace." >&2
  return 1
fi
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Find workspace values in example/docs that may still use '~'
rg -n 'workspace:\s*~/?' .claude README.md docs 2>/dev/null
```

Repository: NVIDIA/Model-Optimizer
Length of output: 48

🏁 Script executed:

```bash
# First, get file size and structure
wc -l .claude/skills/common/remote_exec.sh
```

Length of output: 106

🏁 Script executed:

```bash
# Read the section in question (lines 217-239) and surrounding context
sed -n '215,245p' .claude/skills/common/remote_exec.sh | cat -n
```

Length of output: 1816

🏁 Script executed:

```bash
# Read lines around 341-343 to see how REMOTE_WORKSPACE is used
sed -n '335,350p' .claude/skills/common/remote_exec.sh | cat -n
```

Length of output: 881

🏁 Script executed:

```bash
# Read lines around 369-370 to see how REMOTE_WORKSPACE is used
sed -n '360,380p' .claude/skills/common/remote_exec.sh | cat -n
```

Length of output: 932
**Add validation to reject `~` in workspace values, or expand them like `ssh_key`.**

`REMOTE_WORKSPACE` parsed from config can contain `~/...`, but unlike `REMOTE_SSH_KEY` (lines 227–228), it has no tilde expansion. When used in `remote_run()` (line 348, double-quoted) and `remote_sync_to()` (line 379, then wrapped in single quotes at line 388), the tilde is passed literally and won't expand remotely, causing `cd`, `rsync`, and mounts to fail.

Either expand `~` in workspace values to `$HOME` form (matching the `SSH_KEY` pattern), or reject them with validation:

```diff
 if [[ -z "${REMOTE_WORKSPACE:-}" || "$REMOTE_WORKSPACE" == "/" ]]; then
   echo "ERROR: Cluster '$cluster_name' must define a non-root workspace." >&2
   return 1
 fi
+if [[ "$REMOTE_WORKSPACE" == "~" || "$REMOTE_WORKSPACE" == "~/"* ]]; then
+  echo "ERROR: Cluster '$cluster_name' workspace must not use '~'. Use an absolute path or \$HOME/... form." >&2
+  return 1
+fi
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 227-227: Tilde does not expand in quotes. Use $HOME.
(SC2088)
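The SC2088 warning boils down to quoting: a tilde inside quotes is a literal character, not `$HOME`. A minimal illustration:

```shell
# SC2088 in miniature: a quoted tilde is not expanded by the shell.
quoted="~/data"          # stays the literal two characters "~/"
expanded="$HOME/data"    # what is usually intended
printf '%s\n%s\n' "$quoted" "$expanded"
```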
2. **Check for `modeling_*.py` or `tokenization_*.py`** — custom code shipped with the model. If found, **always use `--trust_remote_code`** with `hf_ptq.py`, and `trust_remote_code=True` in any custom scripts. Without it, `AutoConfig`, `AutoTokenizer`, and `AutoModel` will fail to resolve custom classes.
🧩 Analysis chain

🏁 Script executed:

```shell
cat -n .claude/skills/ptq/references/unsupported-models.md | head -40
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2380

🏁 Script executed:

```shell
wc -l .claude/skills/ptq/references/unsupported-models.md
```

Length of output: 121

🏁 Script executed:

```shell
grep -n -i "security\|isolation\|revision\|trust_remote_code\|sandbox" .claude/skills/ptq/references/unsupported-models.md
```

Length of output: 662

🏁 Script executed:

```shell
cat -n .claude/skills/ptq/references/unsupported-models.md | sed -n '1,30p'
```

Length of output: 1946

🏁 Script executed:

```shell
# Check the workspace-management.md reference mentioned in line 9 to see if security guidance is there
cat -n .claude/skills/common/workspace-management.md 2>/dev/null | head -50
```

Length of output: 1515

🏁 Script executed:

```shell
# Search for security-related content in the broader context
grep -r -n -i "trust_remote_code\|sandbox\|isolation" .claude/skills/ 2>/dev/null | head -20
```

Length of output: 1431
**Add security guidance for `--trust_remote_code` usage.**

Line 14 mandates `--trust_remote_code` but omits any warning that this executes arbitrary Python from model repositories. Add guidance to run only in isolated environments and pin a trusted model revision/commit.

Suggested patch:

```diff
 2. **Check for `modeling_*.py` or `tokenization_*.py`** — custom code shipped with the model. If found, **always use `--trust_remote_code`** with `hf_ptq.py`, and `trust_remote_code=True` in any custom scripts. Without it, `AutoConfig`, `AutoTokenizer`, and `AutoModel` will fail to resolve custom classes.
+   **Security note:** `trust_remote_code` executes model-repo Python. Use only in isolated environments and pin a trusted `revision`/commit.
```
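One way to act on that note before enabling the flag is to list the repo-shipped Python files so they can be read first (the helper name and layout assumptions are mine, not part of the skill):

```shell
# Hypothetical audit helper: show the files trust_remote_code would execute.
# Assumes a locally downloaded model snapshot with custom code at its root.
list_remote_code() {
  local model_dir="$1"
  find "$model_dir" -maxdepth 1 \
    \( -name 'modeling_*.py' -o -name 'tokenization_*.py' \) -print
}
```

Running it against a pinned snapshot and reviewing each listed file gives a concrete audit step to pair with the revision pinning the comment asks for.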
What does this PR do?
Type of change: New feature
Add a PTQ skill that supports:

**Non-launcher path:** configure `clusters.yaml` (or give the info to the agent directly), then say "quantize Qwen3-8B in xxx". The agent uses SSH ControlMaster to keep a persistent SSH session, so the remote server won't be abused by repeated connections and connection overhead is reduced significantly.

**Launcher path:** we currently only use the launcher for the (supported models + non-bare-metal) case.
Usage
Testing
Before your PR is "Ready for review"
CONTRIBUTING.md: N/AWill add tests later.
Additional Information
Summary by CodeRabbit
Documentation
Chores