Conversation
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
…ntizer, NVFP4MSECalibrator (#849)

**Type of change:** ?

**Overview:** ?

```python
```

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

Release notes by coderabbit.ai:

* **New Features**
  * Added NVFP4StaticQuantizer for improved 4-bit quantization with enhanced precision control
  * Introduced NVFP4MSECalibrator with flexible candidate generation for calibration optimization
* **Improvements**
  * Optimized GPU kernels for Hopper+ graphics cards with better performance
  * Extended Triton support to broader GPU compatibility
  * Enhanced backward compatibility for restoring previously quantized models
* **Tests**
  * Added comprehensive test coverage for new quantizers and calibration methods

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
…FP4QTensor Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
📝 Walkthrough

This PR refactors the GPTQ quantization algorithm by renaming the prior `gptq_lite` mode to `gptq`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model as Model
    participant SeqCal as Sequential Calibrate
    participant Helper as GPTQHelper
    participant HessianPatch as Forward Patch
    participant WeightUpdate as Weight Updater
    SeqCal->>SeqCal: For each layer
    SeqCal->>Helper: Create GPTQHelper instance
    Helper->>HessianPatch: Patch layer forward
    SeqCal->>Model: Forward pass (collect Hessians)
    Model->>HessianPatch: Forward called
    HessianPatch->>Helper: Store activation tensors
    Helper->>Helper: Accumulate Hessian
    SeqCal->>Helper: Prepare inverse Hessian
    Helper->>Helper: Handle dead neurons + damping
    Helper->>WeightUpdate: For each weight block
    WeightUpdate->>WeightUpdate: Compute GPTQ delta
    WeightUpdate->>WeightUpdate: Quantize & propagate error
    WeightUpdate->>Model: Update weight
    Helper->>HessianPatch: Unpatch forward
    SeqCal->>SeqCal: Next layer
```
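The "Quantize & propagate error" step in the diagram is the core of the GPTQ weight update. The toy sketch below illustrates that inner loop with hypothetical names (`h_inv_cho` for the upper-triangular Cholesky factor of the inverse Hessian, `quantize` for any elementwise fake-quant); it is not the ModelOpt implementation.

```python
import torch

def gptq_update_columns(w: torch.Tensor, h_inv_cho: torch.Tensor, quantize) -> torch.Tensor:
    """Quantize columns left-to-right, pushing each column's quantization
    error onto the not-yet-quantized columns via the inverse-Hessian factor."""
    w = w.clone()
    for i in range(w.shape[1]):
        d = h_inv_cho[i, i]
        q = quantize(w[:, i])          # quantize the current column
        err = (w[:, i] - q) / d        # normalized quantization error
        # compensate the remaining columns so the total output error shrinks
        w[:, i + 1:] -= err.unsqueeze(1) * h_inv_cho[i, i + 1:].unsqueeze(0)
        w[:, i] = q
    return w
```

With an identity Hessian the compensation term vanishes and the loop reduces to plain round-to-nearest, which makes a convenient sanity check.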
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Important: Pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 3
🧹 Nitpick comments (2)
examples/gpt-oss/configs/sft_full.yaml (1)

Line 25: Unrelated change: disabling external reporting.

The change from `report_to: [trackio]` to `report_to: [none]` disables external tracking/reporting. This appears unrelated to the main PR objective of fixing the deprecated `torch_dtype` parameter. If this change was intentional, consider documenting the reason in the PR description or commit message.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/gpt-oss/configs/sft_full.yaml` at line 25, the PR unintentionally changed the logging/telemetry setting by replacing `report_to: [trackio]` with `report_to: [none]`; revert this unrelated change (restore `report_to: [trackio]`) unless telemetry was intentionally disabled. If intentional, add a short note to the PR description or commit message explaining why external reporting was turned off and reference the `report_to` setting so reviewers know it was deliberate.

modelopt/torch/quantization/model_calib.py (1)

Lines 1795-1809: Consider memory optimization for large weight matrices.

Line 1795 clones the entire weight for each block iteration (`wblk = self.weight.clone()`), and line 1801 quantizes the full matrix for each column. While this matches standard GPTQ, it can be memory-intensive for large models. The current implementation is functionally correct. If memory becomes a concern for very large layers, consider lazy quantization or partial matrix operations in future iterations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 1795-1809, the code clones and quantizes the entire `self.weight` inside the block loop (`wblk = self.weight.clone(); qdq = quantizer(wblk)`), which is memory-heavy; instead, operate on slices: clone only the block slice (e.g., `wblk = self.weight[:, block_start:block_end].clone()`), call `quantizer` on that slice, compute `qdq` and `err` for the local columns using `h_inv_cho_blk` and `errs`, and update `self.weight` only for the affected columns and the trailing `addmm_` using `h_inv[block_start:block_end, block_end:]` — keep the same variable names (`weight`, `wblk`, `quantizer`, `h_inv_cho_blk`, `errs`, `block_start`, `block_end`, `n_cols_blk`) so the logic is identical but memory use is reduced.
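A minimal sketch of the slice-based alternative the comment suggests, simplified to elementwise fake-quant only (no Hessian error propagation); `update_blocks_slicewise` and `quantizer` are illustrative names, not the ModelOpt API:

```python
import torch

def update_blocks_slicewise(weight: torch.Tensor, quantizer, block_size: int) -> torch.Tensor:
    """Fake-quantize `weight` one column block at a time, cloning only the
    current (rows, block) slice instead of the full matrix each iteration."""
    n_cols = weight.shape[1]
    for block_start in range(0, n_cols, block_size):
        block_end = min(block_start + block_size, n_cols)
        wblk = weight[:, block_start:block_end].clone()  # slice clone, not full clone
        weight[:, block_start:block_end] = quantizer(wblk)
    return weight
```

For an elementwise quantizer this matches quantizing the full matrix at once, while peak extra memory drops from O(rows × cols) to O(rows × block_size).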
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e1724464-f96a-4dbd-b311-4f2cec4b6a06
📒 Files selected for processing (8)
- examples/gpt-oss/configs/sft_full.yaml
- examples/gpt-oss/sft.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/mode.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/quantization/utils/core_utils.py
- tests/gpu/torch/quantization/test_gptq.py
💤 Files with no reviewable changes (1)
- modelopt/torch/quantization/model_quant.py
```diff
     "trust_remote_code": model_args.trust_remote_code,
     "attn_implementation": model_args.attn_implementation,
-    "torch_dtype": getattr(model_args, "dtype", "float32"),
+    "torch_dtype": getattr(model_args, "dtype", "bfloat16"),
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Description: Find all YAML config files that still use the deprecated torch_dtype key
# Search for torch_dtype in YAML config files
rg -n 'torch_dtype:' --glob 'examples/gpt-oss/configs/*.yaml'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 128

🏁 Script executed:

```shell
# Check the content of sft_lora.yaml and sft_full.yaml to see the full picture
find examples/gpt-oss/configs -type f -name "*.yaml" | head -20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 179

🏁 Script executed:

```shell
# Get the first 10 lines of the YAML config files to see structure
for file in $(find examples/gpt-oss/configs -type f -name "*.yaml"); do
  echo "=== $file ==="
  head -10 "$file"
done
```

Repository: NVIDIA/Model-Optimizer
Length of output: 884

🏁 Script executed:

```shell
# Search for all occurrences of "dtype" (both torch_dtype and dtype) in YAML files
rg -n '(torch_)?dtype:' --glob 'examples/gpt-oss/configs/*.yaml'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 185
Update examples/gpt-oss/configs/sft_lora.yaml to use the correct config key.

The Python code at line 75 reads `model_args.dtype` (with a fallback to `"bfloat16"`). However, `examples/gpt-oss/configs/sft_lora.yaml` provides `torch_dtype: bfloat16` instead of `dtype: bfloat16`. This means the config value is silently ignored and the hardcoded default is used instead.

Change line 4 in sft_lora.yaml from `torch_dtype: bfloat16` to `dtype: bfloat16` to match sft_full.yaml and ensure the configuration is properly applied.
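The mismatch is easy to reproduce with a stand-in for the parsed arguments; `SimpleNamespace` here merely illustrates how `getattr` behaves (the real objects are argument dataclasses, and the attribute values below are illustrative):

```python
from types import SimpleNamespace

# A config that only sets the deprecated key: getattr(model_args, "dtype", ...)
# never sees it, so the hardcoded default wins.
args_lora = SimpleNamespace(torch_dtype="float16")   # wrong key name, value ignored
args_full = SimpleNamespace(dtype="bfloat16")        # key the code actually reads

print(getattr(args_lora, "dtype", "bfloat16"))  # falls back to the default
print(getattr(args_full, "dtype", "float32"))   # picks up the config value
```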
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/gpt-oss/sft.py` at line 75, The config key mismatch causes
model_args.dtype to remain unset because examples/gpt-oss/configs/sft_lora.yaml
uses torch_dtype instead of dtype; update that YAML to use the key name expected
by the code. Edit examples/gpt-oss/configs/sft_lora.yaml and change the entry
torch_dtype: bfloat16 to dtype: bfloat16 (matching sft_full.yaml) so the value
is picked up by the code path that reads model_args.dtype in sft.py.
```python
    Note: input must be non-empty (batch_size > 0); a zero-sized input causes division by zero.
    """
    batch_size = input.shape[0]
    # Flatten to 2D (total_tokens, features) first, so batch_size counts tokens
    input_flat = input.reshape(-1, input.shape[-1]).t().float()
    batch_size = input_flat.shape[1]

    # Incremental averaging: scale down old hessian
    hessian *= n_samples / (n_samples + batch_size)
    n_samples += batch_size

    # Compute outer product: H += (2/n_samples) * X @ X^T
    # where X is the flattened input reshaped to (features, batch*seq)
    input_flat = input.reshape(-1, input.shape[-1]).t().float()
    scaled_input = math.sqrt(2 / n_samples) * input_flat
    hessian.add_((scaled_input @ scaled_input.t()).to(hessian.device))

    return hessian, n_samples
```
Handle zero-sized inputs to prevent division by zero.

The comment at line 1533 documents the precondition but doesn't enforce it. If `input` has zero elements along the batch/sequence dimensions, `batch_size` will be 0, causing division by zero at line 1540.

Consider adding an early return:

```diff
+    if input.numel() == 0:
+        return hessian, n_samples
     # Flatten to 2D (total_tokens, features) first, so batch_size counts tokens
     input_flat = input.reshape(-1, input.shape[-1]).t().float()
     batch_size = input_flat.shape[1]
```

This is especially relevant for MoE models where some experts may receive no tokens during calibration.
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```diff
     Note: input must be non-empty (batch_size > 0); a zero-sized input causes division by zero.
     """
-    batch_size = input.shape[0]
+    if input.numel() == 0:
+        return hessian, n_samples
     # Flatten to 2D (total_tokens, features) first, so batch_size counts tokens
     input_flat = input.reshape(-1, input.shape[-1]).t().float()
     batch_size = input_flat.shape[1]
     # Incremental averaging: scale down old hessian
     hessian *= n_samples / (n_samples + batch_size)
     n_samples += batch_size
     # Compute outer product: H += (2/n_samples) * X @ X^T
-    # where X is the flattened input reshaped to (features, batch*seq)
-    input_flat = input.reshape(-1, input.shape[-1]).t().float()
     scaled_input = math.sqrt(2 / n_samples) * input_flat
     hessian.add_((scaled_input @ scaled_input.t()).to(hessian.device))
     return hessian, n_samples
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/model_calib.py` around lines 1532 - 1547, The
code currently assumes non-empty input and can divide by zero when
batch_size==0; add an early guard in the function that computes/updates the
hessian (the block using input_flat, batch_size, hessian, n_samples) that, if
batch_size is 0 (i.e., input.reshape yields zero tokens), immediately returns
the unchanged hessian and n_samples; place this check after computing input_flat
and batch_size and before adjusting hessian/n_samples and the scaled_input
calculations to avoid the division by zero and unnecessary work.
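The incremental-averaging scheme, with the guard applied first, can be exercised end to end in isolation. The sketch below uses illustrative names and is not the ModelOpt function itself; the invariant is that after any number of batches, `hessian == (2 / n_samples) * X @ X.T` over all tokens seen so far.

```python
import math
import torch

def accumulate_hessian(hessian, n_samples, inp):
    """Running estimate of H = (2/N) * X @ X^T, updated one batch at a time.
    `inp` has shape (..., features); tokens are all leading dims flattened."""
    if inp.numel() == 0:            # guard: e.g. an MoE expert that got no tokens
        return hessian, n_samples
    x = inp.reshape(-1, inp.shape[-1]).t().float()   # (features, tokens)
    batch_size = x.shape[1]
    hessian *= n_samples / (n_samples + batch_size)  # rescale the old average
    n_samples += batch_size
    scaled = math.sqrt(2 / n_samples) * x
    hessian.add_(scaled @ scaled.t())
    return hessian, n_samples
```

Because each update rescales the old estimate by n/(n+b) before adding the new outer product, the result matches a single pass over the concatenated activations.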
```diff
+gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}
+for handle in gptq_handles.values():
+    handle.setup()
-            calib_func(layer, _layer_forward_loop, **calib_kwargs)
+print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
-        del layer_inputs
-        torch.cuda.empty_cache()
-    finally:
-        input_getter._unpatch_all_layers()
+with disabled_weight_quantizers(model):
+    forward_loop(model)
+
+for handle in gptq_handles.values():
+    handle.cleanup()
+
+print_rank_0("Updating weights using GPTQ algorithm...")
+for handle in gptq_handles.values():
+    handle.update_weights(block_size, percdamp)
+    handle.free()
```
Add exception safety for forward method cleanup.
If an exception occurs during Hessian collection (lines 1841-1842) or weight updates (lines 1848-1850), the patched forward methods won't be cleaned up, leaving the model in an inconsistent state.
🛡️ Suggested fix with try-finally

```diff
 gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}
 for handle in gptq_handles.values():
     handle.setup()

 print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
-with disabled_weight_quantizers(model):
-    forward_loop(model)
-
-for handle in gptq_handles.values():
-    handle.cleanup()
+try:
+    with disabled_weight_quantizers(model):
+        forward_loop(model)
+finally:
+    for handle in gptq_handles.values():
+        handle.cleanup()

 print_rank_0("Updating weights using GPTQ algorithm...")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/model_calib.py` around lines 1835 - 1850, Wrap
the Hessian computation and weight-update phases in try/finally blocks to ensure
patched forward methods are always cleaned up and resources freed: after
creating gptq_handles and calling handle.setup(), run forward_loop(model) inside
a try block and call each handle.cleanup() in the finally; similarly, when
running the update_weights loop, ensure any exception still triggers
handle.free() (and optionally cleanup) in a finally. Reference GPTQHelper.setup,
GPTQHelper.cleanup, GPTQHelper.update_weights, GPTQHelper.free, gptq_handles,
disabled_weight_quantizers, and forward_loop to locate and modify the code.
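The cleanup discipline the reviewer asks for is the standard patch/restore pattern. Below is a minimal stand-alone sketch; `ForwardPatcher` and `calibrate` are hypothetical stand-ins for GPTQHelper's setup/cleanup lifecycle, not the ModelOpt API:

```python
class ForwardPatcher:
    """Swap a module's forward in setup(); cleanup() must always restore it."""
    def __init__(self, module):
        self.module = module
        self._orig_forward = None

    def setup(self):
        self._orig_forward = self.module.forward
        orig = self._orig_forward
        def patched(*args, **kwargs):
            # ...collect activation statistics here...
            return orig(*args, **kwargs)
        self.module.forward = patched

    def cleanup(self):
        if self._orig_forward is not None:
            self.module.forward = self._orig_forward
            self._orig_forward = None

def calibrate(handles, forward_loop, model):
    for h in handles:
        h.setup()
    try:
        forward_loop(model)        # may raise mid-calibration
    finally:
        for h in handles:
            h.cleanup()            # forwards restored on every exit path
```

Even if the forward loop raises, the `finally` block guarantees every patched forward is restored, so the model is never left in an inconsistent state.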
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1130      +/-   ##
==========================================
- Coverage   70.18%   70.15%   -0.04%
==========================================
  Files         228      230       +2
  Lines       25952    26053     +101
==========================================
+ Hits        18215    18278      +63
- Misses       7737     7775      +38
```

☔ View full report in Codecov by Sentry.
What does this PR do?
Type of change: Bug fix
`torch_dtype` is deprecated and must be changed in the config yaml file to be detected. In addition, the default value is set to `bfloat16` instead of `float32`, as TRTLLM does not support float32 deployment.

Usage

```python
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
Improvements
Configuration

- Renamed quantization mode from `gptq_lite` to `gptq` with simplified parameter requirements.

Tests