add: HF PTQ support and modelopt_recipes mount in launcher #1089
Conversation
📝 Walkthrough

Adds a new HuggingFace PTQ Bash launcher and example job YAML, mounts model-optimizer recipes into Slurm/Docker executors, and updates the default Slurm container image tag.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Scheduler as Slurm/Docker Scheduler
    participant Job as Job Task (hf_ptq_local)
    participant Launcher as hf_ptq.sh
    participant Python as hf_ptq.py (in container)
    participant FS as Filesystem (HF model, modelopt_recipes, export)

    Scheduler->>Job: start task
    Job->>Launcher: execute common/hf_ptq/hf_ptq.sh
    Launcher->>Launcher: source utils, set env defaults, register traps
    Launcher->>FS: rely on modelopt_recipes bind-mounted into container
    Launcher->>Python: run PYTHONPATH=${HF_PTQ_DIR} python hf_ptq.py with args
    Python->>FS: read HF model, load recipes, run calibration
    Python->>FS: write exported model to EXPORT_PATH
    Python-->>Launcher: exit status
    Launcher->>Scheduler: report_result "PASS/FAIL"
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 4
🧹 Nitpick comments (1)
tools/launcher/core.py (1)
302-311: Consider adding an existence check for the `modelopt_recipes` directory.

The Docker executor derives `modelopt_recipes_src_path` from the parent of `modelopt_src_path`, but doesn't verify the directory exists. If `modelopt_recipes` is missing locally, the container will fail to start with a confusing mount error.

🛡️ Proposed defensive check

```diff
 modelopt_recipes_src_path = os.path.join(os.path.dirname(modelopt_src_path), "modelopt_recipes")
+if not os.path.isdir(modelopt_recipes_src_path):
+    raise FileNotFoundError(
+        f"modelopt_recipes directory not found at {modelopt_recipes_src_path}. "
+        "Ensure the Model-Optimizer submodule is fully initialized."
+    )
 exp_title_src = os.path.join(job_dir, experiment_title)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/launcher/core.py` around lines 302 - 311, The code builds modelopt_recipes_src_path and unconditionally adds a mount for modelopt_recipes_dst which will break container startup if the directory is missing; update the logic around modelopt_recipes_src_path/modelopt_recipes_dst to check os.path.exists(modelopt_recipes_src_path) (or create it with os.makedirs(..., exist_ok=True) if desired) and only append the f"{modelopt_recipes_src_path}:{modelopt_recipes_dst}" entry to container_mounts when the directory exists (or after creating it), leaving other mounts unchanged.
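Pulled out of the executor, the defensive check the reviewer proposes could look like the following standalone sketch. The function name and argument list are hypothetical (the real logic sits inline in `tools/launcher/core.py`); only the path derivation and the error message follow the review text.

```python
import os


def add_modelopt_recipes_mount(container_mounts, modelopt_src_path, modelopt_recipes_dst):
    """Append the modelopt_recipes bind mount, failing fast if the source dir is missing.

    Hypothetical helper illustrating the review suggestion; the launcher keeps
    this logic inline rather than in a function.
    """
    # Derive the recipes dir from the parent of the modelopt source path,
    # mirroring the line quoted in the review.
    recipes_src = os.path.join(os.path.dirname(modelopt_src_path), "modelopt_recipes")
    if not os.path.isdir(recipes_src):
        raise FileNotFoundError(
            f"modelopt_recipes directory not found at {recipes_src}. "
            "Ensure the Model-Optimizer submodule is fully initialized."
        )
    container_mounts.append(f"{recipes_src}:{modelopt_recipes_dst}")
    return container_mounts
```

With this shape, a missing checkout raises a clear `FileNotFoundError` before the container is started, instead of surfacing as an opaque mount error at startup.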
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/launcher/common/hf_ptq/hf_ptq.sh`:
- Line 19: The source command uses an unquoted variable expansion (`SCRIPT_DIR`), which can break when the path contains spaces; update the invocation in hf_ptq.sh to quote the expansion (use `"${SCRIPT_DIR}/../service_utils.sh"`) so word splitting is prevented and the script reliably sources service_utils.sh even if SCRIPT_DIR contains spaces.
- Around line 32-37: The command invoking hf_ptq.py in hf_ptq.sh uses unquoted
variable expansions (${HF_MODEL}, ${QFORMAT}, ${CALIB_SIZE}, ${EXPORT_PATH})
which can break when paths or values contain spaces; update the python
invocation to quote each expansion (e.g., "${HF_MODEL}", "${QFORMAT}",
"${CALIB_SIZE}", "${EXPORT_PATH}") while keeping the already-quoted "$@" so all
arguments are passed intact; modify the hf_ptq.sh call that runs hf_ptq.py to
use these quoted variables.
- Line 25: The script hf_ptq.sh currently runs `cd
modules/Model-Optimizer/examples/llm_ptq` without checking failure; update the
script to detect a failed cd and abort before running hf_ptq.py by testing the
exit status and printing a clear error message (e.g., "Failed to change
directory to modules/Model-Optimizer/examples/llm_ptq") and exiting non‑zero;
ensure this check occurs immediately after the cd so subsequent commands (like
invoking hf_ptq.py) never run in the wrong directory.
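The word-splitting failure mode behind the quoting comments above can be shown in a few self-contained lines (the path is made up; `service_utils.sh` need not exist since the script only counts arguments):

```shell
#!/usr/bin/env bash
# Demonstrates why the review asks for quoted expansions: an unquoted
# variable containing spaces is split into multiple words.
set -u

SCRIPT_DIR="/tmp/path with spaces"   # illustrative path containing spaces

count_args() { echo "$#"; }

# Unquoted: the expansion splits on whitespace, yielding three arguments.
unquoted=$(count_args ${SCRIPT_DIR}/service_utils.sh)

# Quoted: the whole path stays a single argument, as the review recommends.
quoted=$(count_args "${SCRIPT_DIR}/service_utils.sh")

echo "unquoted=${unquoted} quoted=${quoted}"
```

The same reasoning applies to `${HF_MODEL}`, `${QFORMAT}`, `${CALIB_SIZE}`, and `${EXPORT_PATH}` in the python invocation, and the `cd ... || exit` pattern covers the directory-change comment.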
In `@tools/launcher/slurm_config.py`:
- Line 56: The default container variable container currently points to the RC
image "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2"; either update this to the
stable image (e.g., "nvcr.io/nvidia/tensorrt-llm/release:1.0.0") or add a
concise inline comment next to the container assignment explaining exactly why
the 1.3.0rc2 release candidate is required (mention feature/bugfix and
compatibility constraints), and ensure the comment references the container
string and any related runtime dependency so reviewers can verify the rationale.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d1574a92-50f8-4a7f-b078-e06f67ccf17d
📥 Commits
Reviewing files that changed from the base of the PR and between c3a3975 and 6df6d3d38425beb841406c96eea0abe1d261029d.
📒 Files selected for processing (4)
- tools/launcher/common/hf_ptq/hf_ptq.sh
- tools/launcher/core.py
- tools/launcher/examples/Qwen/Qwen3-8B/hf_ptq_local.yaml
- tools/launcher/slurm_config.py
```bash
trap 'exit_handler' EXIT
###################################################################################################

cd modules/Model-Optimizer/examples/llm_ptq
```
Handle cd failure to avoid running commands in the wrong directory.
If the cd fails (e.g., directory doesn't exist), the script continues execution in the current directory, which could lead to running hf_ptq.py from an unexpected location or failing with a confusing error.
🔧 Proposed fix
```diff
-cd modules/Model-Optimizer/examples/llm_ptq
+cd modules/Model-Optimizer/examples/llm_ptq || exit 1
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 25-25: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1089      +/-   ##
==========================================
+ Coverage   70.18%   70.21%   +0.02%
==========================================
  Files         228      228
  Lines       25952    25952
==========================================
+ Hits        18215    18221       +6
+ Misses       7737     7731       -6
```

☔ View full report in Codecov by Sentry.
Force-pushed from 6df6d3d to e6f066f
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/launcher/core.py`:
- Around line 238-240: The path for modelopt_recipes_dst is computed using
os.path.dirname(slurm_config.modelopt_install_path) which can return an
incorrect parent when modelopt_install_path ends with a trailing slash;
normalize the path first by calling an appropriate normalization function (e.g.,
os.path.normpath or similar) on slurm_config.modelopt_install_path before
passing it to os.path.dirname so modelopt_recipes_dst is always the intended
parent directory (apply the same normalization wherever modelopt_install_path is
used to construct mount destinations such as the other occurrence around
modelopt_recipes).
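The trailing-slash pitfall this comment describes can be verified directly: `os.path.dirname` on a path ending in a separator returns the path itself rather than its parent, so normalizing first gives the intended mount destination (the install path below is illustrative):

```python
import os

# Hypothetical value of slurm_config.modelopt_install_path with a trailing slash.
raw = "/opt/modelopt/"

# Without normalization, dirname strips only the trailing slash: wrong parent.
assert os.path.dirname(raw) == "/opt/modelopt"

# Normalizing first yields the actual parent directory, as the review suggests.
assert os.path.dirname(os.path.normpath(raw)) == "/opt"
```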
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 16976c0a-e4ed-48a9-87f8-7436754c49ac
📥 Commits
Reviewing files that changed from the base of the PR and between 6df6d3d38425beb841406c96eea0abe1d261029d and e6f066f.
📒 Files selected for processing (4)
- tools/launcher/common/hf_ptq/hf_ptq.sh
- tools/launcher/core.py
- tools/launcher/examples/Qwen/Qwen3-8B/hf_ptq_local.yaml
- tools/launcher/slurm_config.py
✅ Files skipped from review due to trivial changes (1)
- tools/launcher/examples/Qwen/Qwen3-8B/hf_ptq_local.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- tools/launcher/slurm_config.py
- tools/launcher/common/hf_ptq/hf_ptq.sh
kevalmorabia97
left a comment
Left 2 minor comments. Otherwise LGTM
/ok to test f32d70d
Force-pushed from e6f066f to f32d70d
- Add common/hf_ptq/hf_ptq.sh script for running hf_ptq.py directly
- Add Qwen3-8B hf_ptq_local.yaml example config
- Mount modelopt_recipes alongside modelopt in both Slurm and Docker executors so modelopt.recipe imports work with the overlay
- Update default container to tensorrt-llm/release:1.3.0rc2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
- Quote variable expansion in hf_ptq.sh source command
- Use os.path.normpath() before os.path.dirname() to handle trailing slashes in modelopt_install_path
- Upgrade default container from 1.3.0rc2 to 1.3.0rc8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Force-pushed from f32d70d to 2d1b093
- Add common/hf_ptq/hf_ptq.sh script for running hf_ptq.py directly
- Add Qwen3-8B hf_ptq_local.yaml example config
- Mount modelopt_recipes alongside modelopt in both Slurm and Docker executors so modelopt.recipe imports work with the overlay
- Update default container to tensorrt-llm/release:1.3.0rc2

### What does this PR do?

Type of change: ?

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

- **New Features**
  - Added a Hugging Face PTQ workflow to run post-training quantization for models.
  - Added a local single-GPU pipeline for Qwen3-8B with fp8 quantization and export support.
  - Added a small command-line launcher to invoke the PTQ workflow with configurable model, quantization, calibration, and export options.
- **Chores**
  - Updated the default runtime container image to a newer release.
  - Included model-optimization recipe files in container mounts.

---------

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What does this PR do?
Type of change: ?
Usage
```python
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Chores