
Bug fix 6012573#1130

Closed
sugunav14 wants to merge 41 commits into main from svelury/bug-fix-6012573

Conversation

Contributor

@sugunav14 sugunav14 commented Mar 28, 2026

What does this PR do?

Type of change: Bug fix

The torch_dtype key is deprecated; the config YAML must use dtype for the value to be detected. In addition, the default is changed from float32 to bfloat16, since TRTLLM does not support float32 deployment.

Usage

# Add a code snippet demonstrating how to use this
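The usage placeholder above was never filled in. As a hedged illustration (not from the PR itself), the fallback change can be sketched like this; `build_model_kwargs` and the attribute handling are assumptions for the example, only the `getattr(model_args, "dtype", "bfloat16")` pattern comes from the diff:

```python
from types import SimpleNamespace

def build_model_kwargs(model_args):
    # Fall back to bfloat16 (was float32) because TRTLLM does not
    # support float32 deployment; "dtype" is the new config key name.
    return {"torch_dtype": getattr(model_args, "dtype", "bfloat16")}

# Config provides the new key -> value is picked up.
print(build_model_kwargs(SimpleNamespace(dtype="float16")))  # {'torch_dtype': 'float16'}
# Config omits the key (or still uses the deprecated torch_dtype name) -> default applies.
print(build_model_kwargs(SimpleNamespace()))  # {'torch_dtype': 'bfloat16'}
```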

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Improvements

    • Refactored GPTQ quantization algorithm with enhanced sequential Hessian-based weight updates and improved blockwise precision handling.
  • Configuration

    • Renamed GPTQ algorithm configuration from gptq_lite to gptq with simplified parameter requirements.
    • Updated default model precision to bfloat16.
  • Tests

    • Added quantization roundtrip verification tests for export and dequantization workflows.

Fridah-nv and others added 30 commits March 25, 2026 17:51
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
…ntizer, NVFP4MSECalibrator (#849)


* **New Features**
  * Added NVFP4StaticQuantizer for improved 4-bit quantization with enhanced precision control
  * Introduced NVFP4MSECalibrator with flexible candidate generation for calibration optimization

* **Improvements**
  * Optimized GPU kernels for Hopper+ graphics cards with better performance
  * Extended Triton support to broader GPU compatibility
  * Enhanced backward compatibility for restoring previously quantized models

* **Tests**
  * Added comprehensive test coverage for new quantizers and calibration methods

---------

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
…FP4QTensor

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 requested review from a team as code owners March 28, 2026 06:03
@sugunav14 sugunav14 requested review from kevalmorabia97 and mxinO and removed request for mxinO March 28, 2026 06:03

coderabbitai bot commented Mar 28, 2026

📝 Walkthrough

This PR refactors the GPTQ quantization algorithm by renaming the prior gptq_lite API to gptq, updating configuration classes, mode descriptors, and implementation logic. Changes include removing the hessian_state_path field, updating model precision defaults, and replacing the calibration flow with a new sequential Hessian-based approach using forward patching and per-module helper classes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration & API Refactoring**<br>`modelopt/torch/quantization/config.py`, `modelopt/torch/quantization/mode.py` | Renamed `GPTQLiteConfig` to `GPTQConfig` with the `method` field changed from `"gptq_lite"` to `"gptq"`; made `percdamp` and `block_size` non-optional with defaults; removed the `hessian_state_path` field. Updated the corresponding mode descriptor class and imports. |
| **Core GPTQ Implementation**<br>`modelopt/torch/quantization/model_calib.py` | Replaced the `gptq_lite` implementation with a new `gptq` function using a `GPTQHelper` class for Hessian management, forward-method patching instead of global state, and column-wise weight updates. Updated `update_hessian` to flatten inputs as token counts and added `_print_mse_error` logging. |
| **Utilities & Configuration**<br>`modelopt/torch/quantization/utils/core_utils.py`, `modelopt/torch/quantization/model_quant.py` | Added a `disabled_weight_quantizers` context manager for temporarily disabling weight quantizers during calibration. Removed the manual `gptq_lite` algorithm override in the example flow. |
| **Examples & Configuration**<br>`examples/gpt-oss/configs/sft_full.yaml`, `examples/gpt-oss/sft.py` | Updated the config key from `torch_dtype` to `dtype` and changed `report_to` from `[trackio]` to `[none]`. Changed the default `torch_dtype` fallback from `"float32"` to `"bfloat16"`. |
| **Tests**<br>`tests/gpu/torch/quantization/test_gptq.py` | Refactored `test_gptq_updates` to use the new `gptq` API directly instead of manual Hessian construction; updated `test_update_hessian` assertions for flattened token counting; added `test_gptq_export_roundtrip` for NVFP4 weight export/dequantization verification; updated the `test_gptq_e2e_flow` configuration. |
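The `disabled_weight_quantizers` helper is only named in the summary above. A minimal sketch of what such a context manager could look like, with toy classes standing in for real quantized modules (everything except the helper's name is an assumption for illustration):

```python
from contextlib import contextmanager

class FakeQuantizer:
    def __init__(self):
        self.enabled = True

class FakeLinear:
    def __init__(self):
        self.weight_quantizer = FakeQuantizer()

@contextmanager
def disabled_weight_quantizers(model):
    # Hedged sketch of the helper named in the table above (interface assumed):
    # temporarily disable every weight quantizer, restoring state on exit.
    quantizers = [m.weight_quantizer for m in model if hasattr(m, "weight_quantizer")]
    saved = [q.enabled for q in quantizers]
    for q in quantizers:
        q.enabled = False
    try:
        yield model
    finally:
        for q, state in zip(quantizers, saved):
            q.enabled = state

model = [FakeLinear(), FakeLinear()]
with disabled_weight_quantizers(model):
    assert not any(m.weight_quantizer.enabled for m in model)
assert all(m.weight_quantizer.enabled for m in model)
```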

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model as Model
    participant SeqCal as Sequential Calibrate
    participant Helper as GPTQHelper
    participant HessianPatch as Forward Patch
    participant WeightUpdate as Weight Updater

    SeqCal->>SeqCal: For each layer
    SeqCal->>Helper: Create GPTQHelper instance
    Helper->>HessianPatch: Patch layer forward

    SeqCal->>Model: Forward pass (collect Hessians)
    Model->>HessianPatch: Forward called
    HessianPatch->>Helper: Store activation tensors
    Helper->>Helper: Accumulate Hessian

    SeqCal->>Helper: Prepare inverse Hessian
    Helper->>Helper: Handle dead neurons + damping

    Helper->>WeightUpdate: For each weight block
    WeightUpdate->>WeightUpdate: Compute GPTQ delta
    WeightUpdate->>WeightUpdate: Quantize & propagate error
    WeightUpdate->>Model: Update weight

    Helper->>HessianPatch: Unpatch forward
    SeqCal->>SeqCal: Next layer
```
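The "quantize & propagate error" step in the diagram can be illustrated with a minimal single-column NumPy sketch. This is not the PR's implementation (which is blockwise and uses the project's quantizers); `quantize` is a stand-in round-to-grid function, and `h_inv_cho` is the upper-triangular Cholesky factor of the inverse Hessian:

```python
import numpy as np

def gptq_quantize(weight, h_inv_cho, quantize):
    # Quantize columns left to right; after fixing a column, spread its
    # quantization error into the not-yet-quantized columns, weighted by
    # the corresponding row of the inverse-Hessian Cholesky factor.
    w = weight.astype(np.float64).copy()
    for i in range(w.shape[1]):
        d = h_inv_cho[i, i]
        q = quantize(w[:, i])
        err = (w[:, i] - q) / d
        w[:, i:] -= np.outer(err, h_inv_cho[i, i:])
        w[:, i] = q
    return w

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
H = np.eye(6)  # identity Hessian -> GPTQ reduces to plain rounding
h_inv_cho = np.linalg.cholesky(np.linalg.inv(H)).T  # upper-triangular factor
quarter_grid = lambda v: np.round(v * 4) / 4
assert np.allclose(gptq_quantize(W, h_inv_cho, quarter_grid), quarter_grid(W))
```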

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • GPTQ Lite implementation #555: Directly related as this PR refactors and renames the earlier "gptq_lite" implementation from #555 into the new "gptq" API, removing deprecated fields and updating the calibration logic.

Suggested reviewers

  • kaix-nv
  • RalphMao

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

| Check name | Status | Explanation / Resolution |
| --- | --- | --- |
| Security Anti-Patterns | ❌ Error | The test file contains a hardcoded `trust_remote_code=True` when loading a HuggingFace tokenizer, which should be configurable and default to `False`. Refactor to make `trust_remote_code` a configurable parameter defaulting to `False`, enabling it only when explicitly needed for specific test scenarios. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 76.00%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title "Bug fix 6012573" is too vague and only references a bug number. Replace it with a descriptive title such as "Update GPTQ implementation and change default dtype to bfloat16" or "Migrate from gptq_lite to gptq and update default precision". |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

@sugunav14 sugunav14 requested a review from realAsma March 28, 2026 06:03

github-actions bot commented Mar 28, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-03-28 18:42 UTC

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
examples/gpt-oss/configs/sft_full.yaml (1)

25-25: Unrelated change: disabling external reporting.

The change from report_to: [trackio] to report_to: [none] disables external tracking/reporting. This appears unrelated to the main PR objective of fixing the deprecated torch_dtype parameter. If this change was intentional, consider documenting the reason in the PR description or commit message.

modelopt/torch/quantization/model_calib.py (1)

1795-1809: Consider memory optimization for large weight matrices.

Line 1795 clones the entire weight for each block iteration (wblk = self.weight.clone()), and line 1801 quantizes the full matrix for each column. While this matches standard GPTQ, it can be memory-intensive for large models.

The current implementation is functionally correct. If memory becomes a concern for very large layers, consider lazy quantization or partial matrix operations in future iterations.


📥 Commits

Reviewing files that changed from the base of the PR and between 7f5fd65 and 86557bb.

📒 Files selected for processing (8)
  • examples/gpt-oss/configs/sft_full.yaml
  • examples/gpt-oss/sft.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/model_quant.py
  • modelopt/torch/quantization/utils/core_utils.py
  • tests/gpu/torch/quantization/test_gptq.py
💤 Files with no reviewable changes (1)
  • modelopt/torch/quantization/model_quant.py

```diff
         "trust_remote_code": model_args.trust_remote_code,
         "attn_implementation": model_args.attn_implementation,
-        "torch_dtype": getattr(model_args, "dtype", "float32"),
+        "torch_dtype": getattr(model_args, "dtype", "bfloat16"),
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

```bash
#!/bin/bash
# Find all YAML config files that still use the deprecated torch_dtype key
rg -n 'torch_dtype:' --glob 'examples/gpt-oss/configs/*.yaml'

# List the YAML config files
find examples/gpt-oss/configs -type f -name "*.yaml" | head -20

# Show the first 10 lines of each YAML config file to see its structure
for file in $(find examples/gpt-oss/configs -type f -name "*.yaml"); do
  echo "=== $file ==="
  head -10 "$file"
done

# Search for all occurrences of torch_dtype / dtype in YAML files
rg -n '(torch_)?dtype:' --glob 'examples/gpt-oss/configs/*.yaml'
```

Update examples/gpt-oss/configs/sft_lora.yaml to use the correct config key.

The Python code at line 75 reads model_args.dtype (with fallback to "bfloat16"). However, examples/gpt-oss/configs/sft_lora.yaml provides torch_dtype: bfloat16 instead of dtype: bfloat16. This means the config value is silently ignored and the hardcoded default is used instead.

Change line 4 in sft_lora.yaml from torch_dtype: bfloat16 to dtype: bfloat16 to match sft_full.yaml and ensure the configuration is properly applied.


Comment on lines +1532 to 1547 (new side of the hunk shown):

```python
    Note: input must be non-empty (batch_size > 0); a zero-sized input causes division by zero.
    """
    # Flatten to 2D (total_tokens, features) first, so batch_size counts tokens
    input_flat = input.reshape(-1, input.shape[-1]).t().float()
    batch_size = input_flat.shape[1]

    # Incremental averaging: scale down old hessian
    hessian *= n_samples / (n_samples + batch_size)
    n_samples += batch_size

    # Compute outer product: H += (2/n_samples) * X @ X^T
    scaled_input = math.sqrt(2 / n_samples) * input_flat
    hessian.add_((scaled_input @ scaled_input.t()).to(hessian.device))

    return hessian, n_samples
```
⚠️ Potential issue | 🟡 Minor

Handle zero-sized inputs to prevent division by zero.

The comment at line 1533 documents the precondition but doesn't enforce it. If input has zero elements along the batch/sequence dimensions, batch_size will be 0, causing division by zero at line 1540.

Consider adding an early return:

```diff
+    if input.numel() == 0:
+        return hessian, n_samples
     # Flatten to 2D (total_tokens, features) first, so batch_size counts tokens
     input_flat = input.reshape(-1, input.shape[-1]).t().float()
     batch_size = input_flat.shape[1]
```
This is especially relevant for MoE models where some experts may receive no tokens during calibration.

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    Note: input must be non-empty (batch_size > 0); a zero-sized input causes division by zero.
    """
    if input.numel() == 0:
        return hessian, n_samples
    # Flatten to 2D (total_tokens, features) first, so batch_size counts tokens
    input_flat = input.reshape(-1, input.shape[-1]).t().float()
    batch_size = input_flat.shape[1]
    # Incremental averaging: scale down old hessian
    hessian *= n_samples / (n_samples + batch_size)
    n_samples += batch_size
    # Compute outer product: H += (2/n_samples) * X @ X^T
    scaled_input = math.sqrt(2 / n_samples) * input_flat
    hessian.add_((scaled_input @ scaled_input.t()).to(hessian.device))
    return hessian, n_samples
```
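The incremental averaging discussed in this thread, including the zero-token guard suggested above, can be checked with a small NumPy sketch (a stand-in for the torch implementation; function name and shapes are assumptions): two incremental batches must yield the same Hessian as one combined batch, and an empty input must leave the state unchanged.

```python
import math
import numpy as np

def update_hessian(hessian, n_samples, inp):
    # Sketch of the incremental update: H_new = H_old * n/(n+b) + (2/(n+b)) X X^T,
    # where X is (features, tokens). The early return mirrors the suggested guard
    # for zero-token inputs (e.g. MoE experts that receive no tokens).
    if inp.size == 0:
        return hessian, n_samples
    x = inp.reshape(-1, inp.shape[-1]).T.astype(np.float64)  # (features, tokens)
    batch = x.shape[1]
    hessian = hessian * (n_samples / (n_samples + batch))
    n_samples += batch
    scaled = math.sqrt(2.0 / n_samples) * x
    return hessian + scaled @ scaled.T, n_samples

rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
h, n = update_hessian(np.zeros((8, 8)), 0, a)
h, n = update_hessian(h, n, b)
h_full, _ = update_hessian(np.zeros((8, 8)), 0, np.vstack([a, b]))
assert np.allclose(h, h_full)          # incremental == one-shot
h2, n2 = update_hessian(h, n, np.empty((0, 8)))
assert n2 == n and np.allclose(h2, h)  # empty input is a no-op
```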

Comment on lines +1835 to +1850
gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}
for handle in gptq_handles.values():
handle.setup()

calib_func(layer, _layer_forward_loop, **calib_kwargs)
print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")

del layer_inputs
torch.cuda.empty_cache()
finally:
input_getter._unpatch_all_layers()
with disabled_weight_quantizers(model):
forward_loop(model)

for handle in gptq_handles.values():
handle.cleanup()

print_rank_0("Updating weights using GPTQ algorithm...")
for handle in gptq_handles.values():
handle.update_weights(block_size, percdamp)
handle.free()
⚠️ Potential issue | 🟡 Minor

Add exception safety for forward method cleanup.

If an exception occurs during Hessian collection (lines 1841-1842) or weight updates (lines 1848-1850), the patched forward methods won't be cleaned up, leaving the model in an inconsistent state.

🛡️ Suggested fix with try-finally

```diff
     gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}
     for handle in gptq_handles.values():
         handle.setup()

     print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")

-    with disabled_weight_quantizers(model):
-        forward_loop(model)
-
-    for handle in gptq_handles.values():
-        handle.cleanup()
+    try:
+        with disabled_weight_quantizers(model):
+            forward_loop(model)
+    finally:
+        for handle in gptq_handles.values():
+            handle.cleanup()

     print_rank_0("Updating weights using GPTQ algorithm...")
```
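The same exception-safety pattern can also be phrased as a context manager. A hedged, self-contained sketch with a fake handle class (all names are assumptions) demonstrating that cleanup runs even when the calibration loop raises:

```python
from contextlib import contextmanager

class FakeHandle:
    """Stand-in for the GPTQHelper handles discussed above."""
    def __init__(self):
        self.cleaned = False
    def setup(self):
        pass
    def cleanup(self):
        self.cleaned = True

@contextmanager
def patched_forwards(handles):
    # Guarantee the patched forward methods are restored even if
    # calibration raises partway through the forward loop.
    for h in handles:
        h.setup()
    try:
        yield handles
    finally:
        for h in handles:
            h.cleanup()

handles = [FakeHandle(), FakeHandle()]
try:
    with patched_forwards(handles):
        raise RuntimeError("calibration failed mid-loop")
except RuntimeError:
    pass
assert all(h.cleaned for h in handles)  # forwards restored despite the error
```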

codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 20.49689% with 128 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.15%. Comparing base (ae965a9) to head (86557bb).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/quantization/model_calib.py | 16.90% | 118 Missing ⚠️ |
| modelopt/torch/quantization/utils/core_utils.py | 18.18% | 9 Missing ⚠️ |
| modelopt/torch/quantization/mode.py | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1130      +/-   ##
==========================================
- Coverage   70.18%   70.15%   -0.04%
==========================================
  Files         228      230       +2
  Lines       25952    26053     +101
==========================================
+ Hits        18215    18278      +63
- Misses       7737     7775      +38
```

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@sugunav14 sugunav14 closed this Mar 28, 2026
@sugunav14 sugunav14 deleted the svelury/bug-fix-6012573 branch March 28, 2026 18:41

Labels: none · Projects: none · 3 participants