
GPTQ official #853

Open
sugunav14 wants to merge 40 commits into main from svelury/gptq-official

Conversation

@sugunav14
Contributor

@sugunav14 sugunav14 commented Feb 4, 2026

What does this PR do?

Implements the official version of GPTQ with a decoder-level sequential calibration flow.
Ref: https://github.com/IST-DASLab/FP-Quant/tree/master

Type of change: New feature

Overview:

  1. Deletes the gptq_lite configuration. gptq() is a more generic implementation and resolves to the gptq_lite behavior when use_sequential is set to False.
  2. Introduces a GPTQHelper class to handle Hessian-collection patching/unpatching, blockwise weight updates, Hessian initialization, inverse-Hessian computation, etc.

Usage

python hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-8B \
    --qformat nvfp4_gptq \
    --kv_cache_qformat none \
    --dataset cnn_dailymail \
    --batch_size 32 \
    --calib_seq 512 \
    --calib_size 512 \
    --export_path exported_model

Testing

Measure perplexity and activation MSE on the following models

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a new NVFP4 GPTQ quantization option for CLI/configuration.
  • Improvements

    • Replaced GPTQ-lite with a full GPTQ calibration pipeline for more accurate, Hessian-aware blockwise updates and token-based sampling.
    • Added sequential layer-by-layer calibration support and automatic promotion of NVFP4 static quantizers.
    • Improved logging and timing for calibration runs.
  • Tests

    • Expanded GPTQ tests, including export/roundtrip validation for quantized weights.

@copy-pr-bot

copy-pr-bot bot commented Feb 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Replaces the GPTQ "lite" calibration with a full GPTQ implementation (new GPTQConfig, GPTQModeDescriptor, and a gptq(...) pipeline that collects Hessians and applies blockwise updates), adds an nvfp4_gptq CLI/example option, and updates tests and examples to the new API.

Changes

Cohort / File(s) Summary
CLI / Example
examples/llm_ptq/hf_ptq.py
Added nvfp4_gptq to quantization format choices and tidied blank-line formatting around export/low-memory flags.
Quantization Config
modelopt/torch/quantization/config.py
Replaced GPTQLiteConfig with GPTQConfig (method literal → "gptq"), removed hessian_state_path, retained percdamp and block_size.
Calibration Mode
modelopt/torch/quantization/mode.py
Registered GPTQModeDescriptor (was GPTQLite); switched config_class to GPTQConfig, bound _calib_func to gptq, and allowed "gptq" in sequential method checks.
Calibration Implementation
modelopt/torch/quantization/model_calib.py
Large refactor: removed GPTQ-lite helpers, added gptq(...) pipeline and GPTQHelper/handle-style logic; Hessian accumulation now flattens token dims, forward-patches per-layer Hessians, promotes NVFP4 static quantizers, performs damped inverse Hessian blockwise updates, and updated logging/timing.
Tests
tests/gpu/torch/quantization/test_gptq.py
Adjusted tests for flattened-token n_samples, switched to calling gptq(...), added export+dequantization roundtrip test, and updated expectations/imports.
Docs / Examples Update
modelopt/torch/quantization/model_quant.py
Doc example updated to use {"method": "gptq", "sequential": True} instead of the lite variant.
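The "damped inverse Hessian blockwise updates" summarized for model_calib.py above follow the standard GPTQ recipe: quantize one column at a time and fold the quantization error into the not-yet-quantized columns via the Cholesky factor of the damped inverse Hessian. A minimal pure-Python sketch for a single weight row, with hypothetical names (`quantize` and `h_inv_cho` stand in for the real quantizer and precomputed factor, not this PR's exact code):

```python
def gptq_block_update(w_row, h_inv_cho, quantize=round):
    # w_row: weights of one output row (mutated in place)
    # h_inv_cho: upper-triangular Cholesky factor of the damped inverse Hessian
    n = len(w_row)
    q_row = [0.0] * n
    for i in range(n):
        q = float(quantize(w_row[i]))
        q_row[i] = q
        err = (w_row[i] - q) / h_inv_cho[i][i]
        for j in range(i + 1, n):
            # propagate the quantization error to later (unquantized) columns
            w_row[j] -= err * h_inv_cho[i][j]
    return q_row
```

In the real implementation this runs per block of `block_size` columns with a fused kernel; the sketch only shows the error-propagation structure.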

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI / Caller
    participant Calib as gptq
    participant ForwardLoop as ForwardLoop
    participant Model as Model
    participant Quant as Quantizers
    participant Handle as GPTQHandle
    participant Logger as Logger

    CLI->>Calib: gptq(model, forward_loop, percdamp, block_size)
    Calib->>Calib: max_calibrate(model)
    Calib->>Quant: promote eligible quantizers to NVFP4StaticQuantizer
    Calib->>Model: patch module forwards for Hessian collection

    loop batches
        ForwardLoop->>Model: run forward batch
        Model->>Quant: quantize/dequantize activations
        Quant-->>Calib: provide activations
        Calib->>Handle: update_hessian(flattened_inputs)
    end

    Calib->>Handle: compute damped inverse Hessian per-block
    Handle->>Handle: apply blockwise weight updates
    Handle->>Logger: log per-module metrics and timing
    Calib-->>CLI: return (model weights updated)
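The update_hessian step in the loop above accumulates an incrementally averaged Hessian, roughly H ≈ (2/n) Σ x xᵀ over flattened token vectors, with the empty-batch guard the review asks for. A dependency-free sketch under those assumptions (plain lists instead of tensors; names are illustrative, not the PR's exact API):

```python
import math

def update_hessian(x_cols, hessian, n_samples):
    """Fold a new batch of flattened token vectors into the running Hessian average."""
    batch_size = len(x_cols)
    if batch_size == 0:  # guard: skip empty token batches entirely
        return hessian, n_samples
    # re-weight the running average before adding the new batch
    scale_old = n_samples / (n_samples + batch_size)
    hessian = [[h * scale_old for h in row] for row in hessian]
    n_samples += batch_size
    s = math.sqrt(2.0 / n_samples)
    for x in x_cols:
        sx = [s * v for v in x]
        for i, si in enumerate(sx):
            for j, sj in enumerate(sx):
                hessian[i][j] += si * sj  # rank-1 outer-product update
    return hessian, n_samples
```

The actual code performs the same update on GPU tensors inside the patched module forwards.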

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.73%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check | ❓ Inconclusive | The PR title "GPTQ official" is vague and does not clearly convey the main change. | Use a more descriptive title, such as "Replace GPTQ-Lite with official GPTQ implementation".
✅ Passed checks (2 passed)
Check name | Status | Explanation
Security Anti-Patterns | ✅ Passed | Comprehensive security audit confirms no critical security anti-patterns in modified files and no new non-permissive license dependencies.
Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch svelury/gptq-official

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 20.49689% with 128 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.18%. Comparing base (ae965a9) to head (068e8a9).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Missing
modelopt/torch/quantization/model_calib.py | 16.90% | 118 ⚠️
modelopt/torch/quantization/utils/core_utils.py | 18.18% | 9 ⚠️
modelopt/torch/quantization/mode.py | 75.00% | 1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #853      +/-   ##
==========================================
- Coverage   70.18%   70.18%   -0.01%     
==========================================
  Files         228      228              
  Lines       25952    25961       +9     
==========================================
+ Hits        18215    18221       +6     
- Misses       7737     7740       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.



@torch.no_grad()
def sequential_calibrate(
Contributor Author


@Fridah-nv reference for sequential calibration
cc @realAsma

@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch 2 times, most recently from b5f6073 to a8540d4 on February 24, 2026 00:17
@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch 4 times, most recently from 566c9ae to bdbeb03 on March 6, 2026 00:56
@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch 2 times, most recently from 0aa088c to 3f2d7c0 on March 20, 2026 00:34
@sugunav14 sugunav14 marked this pull request as ready for review March 21, 2026 00:12
@sugunav14 sugunav14 requested review from a team as code owners March 21, 2026 00:12
@sugunav14 sugunav14 requested a review from meenchen March 21, 2026 00:12
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (3)
tests/gpu/torch/quantization/test_gptq.py (1)

186-199: Use a GPTQ/static preset in this round-trip test.

NVFP4_DEFAULT_CFG leaves the weight quantizer dynamic, so _export_quantized_weight() stays on the existing _amax path instead of the new NVFP4StaticQuantizer.global_amax flow that official GPTQ relies on. This gives you a generic NVFP4 export check, but not coverage of the new feature path.

As per coding guidelines, "All test coverage checks in PRs must pass for new features and examples".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_gptq.py` around lines 186 - 199, Replace
the dynamic NVFP4 preset with the GPTQ/static NVFP4 preset so the quantizer
follows the NVFP4StaticQuantizer.global_amax export path used by GPTQ;
specifically change the quant_cfg assignment (currently mtq.NVFP4_DEFAULT_CFG)
to the static/GPTQ preset (e.g., mtq.NVFP4_GPTQ_STATIC_CFG or the project's
named GPTQ static preset) before calling mtq.quantize(model, quant_cfg, ...),
leaving the subsequent weight restore, update_hessian and
blockwise_weight_update calls intact so the test exercises the new
_export_quantized_weight global_amax flow.
modelopt/torch/quantization/model_calib.py (2)

2154-2154: Consider cleaner conditional synchronization.

The inline conditional torch.cuda.synchronize() if torch.cuda.is_available() else None works but the else None is a no-op that may confuse readers.

♻️ Suggested cleaner pattern
-    torch.cuda.synchronize() if torch.cuda.is_available() else None
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()

Apply the same pattern on line 2171.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` at line 2154, Replace the inline
conditional expression "torch.cuda.synchronize() if torch.cuda.is_available()
else None" with a clear conditional block using "if torch.cuda.is_available():
torch.cuda.synchronize()" to remove the no-op else and improve readability;
update both occurrences of the inline form (the one shown and the other
occurrence later in the file) so all cuda synchronizations use the explicit
if-check around torch.cuda.synchronize().

2106-2107: Potential attribute collision on module.name.

Assigning module.name = name could overwrite an existing name attribute if the module class defines one. Consider using a prefixed attribute name to avoid conflicts.

♻️ Suggested safer attribute name
-            module.name = name  # Attach name for easy access in hooks
+            module._gptq_name = name  # Attach name for easy access in hooks

Then update references in blockwise_weight_update call (line 2165) and the internal state key (line 2162).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 2106 - 2107, The
code assigns module.name = name which may collide with existing attributes;
change to a unique prefixed attribute (e.g. module._quant_name or
module._qo_name) wherever the module name is attached, then update all usages
that read that attribute—specifically the references inside
blockwise_weight_update and the internal state key lookup that currently expect
module.name—so they read the new prefixed attribute instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 935-959: The activation-MSE path assumes text inputs and a
tokenizer; guard it so it only runs when a tokenizer is available (and/or when
the calibration flow is text-only) by early-checking tokenizer is not None
before calling ActivationMSELogger.load_raw_text, materialize_raw_text, or
tokenize_raw_text and emit a clear warning/skipped-info when tokenizer is None;
specifically change the block using mse_input_path,
ActivationMSELogger.materialize_raw_text, ActivationMSELogger.tokenize_raw_text,
calib_dataloader, and args.calib_seq to skip MSE creation/processing if
tokenizer is None (and likewise protect the alternate block around lines
976-981) to avoid touching multimodal calib_dataloader outputs meant for
full_model when measuring activation MSE.
- Around line 688-692: The perplexity path currently calls
compute_perplexity(full_model, eval_data) which can fail for
multimodal/container wrappers; change the call to use the extracted language
model instead (compute_perplexity(language_model, eval_data)) when
args.eval_perplexity is true and tokenizer is present, ensuring you reference
the existing symbols compute_perplexity, language_model, full_model, and args so
the LM-specific forward is used for perplexity evaluation.

In `@modelopt/torch/quantization/metrics/activation_mse.py`:
- Line 577: The torch.load call that creates payload (payload = torch.load(path,
map_location="cpu", weights_only=False)) uses weights_only=False which permits
arbitrary code execution; change it to weights_only=True to restrict loading to
tensor/weight objects (i.e., payload = torch.load(path, map_location="cpu",
weights_only=True)), or if you must keep weights_only=False add a brief inline
comment next to this torch.load explaining why the file is trusted (e.g.,
"checkpoint is internally generated and validated") and the explicit reason it
is safe.

In `@modelopt/torch/quantization/metrics/perplexity.py`:
- Around line 19-22: Update the file-level attribution in the perplexity.py
module: replace the minimal "Ported from FP-Quant" docstring with a full
FP-Quant attribution block that includes a commit-pinned source URL (repo +
specific commit hash), the original FP-Quant copyright and license text (as in
their repository), and the required NVIDIA Apache-2.0 header used across this
project; keep the existing short description but prepend the full multi-line
header so the file contains the commit-pinned reference, original
copyright/license metadata, and the NVIDIA Apache-2.0 copyright/license notice.
- Around line 30-58: compute_perplexity currently proceeds on empty data which
leaves nll_running uninitialized as a tensor and later causes an AttributeError;
add an explicit guard at the top of compute_perplexity to detect an empty data
(e.g., if not data or len(data) == 0) and raise a ValueError with a clear
message that the evaluation dataset is empty (mentioning sequence_length/corpus
size if helpful); this ensures immediate, clear failure instead of the
downstream error involving nll_running/tokens_processed.

In `@modelopt/torch/quantization/mode.py`:
- Around line 509-518: The GPTQ mode currently defaults to non-sequential
because GPTQConfig still inherits use_sequential=False; update
GPTQModeDescriptor (and/or GPTQConfig) so GPTQ always runs sequentially: set the
mode's use_sequential to True (or override a property on GPTQModeDescriptor to
return True) or validate in GPTQModeDescriptor.__init__/config_class that
use_sequential is True and raise an error if not; ensure the handler for
_calib_func = gptq cannot be used when use_sequential is False so gptq is only
invoked under sequential_calibrate().

In `@modelopt/torch/quantization/triton/gptq_fused_kernel.py`:
- Around line 165-174: gptq_fused_block currently may mutate the caller's
w_block because calling .contiguous() doesn't copy an already-contiguous tensor;
ensure the wrapper truly clones inputs before the kernel writes to them by
replacing w_block = w_block.contiguous() (and similarly for amax_grouped and
h_inv_cho_blk if they must not be mutated) with explicit copies like w_block =
w_block.clone().contiguous() (and amax_grouped =
amax_grouped.clone().contiguous(), h_inv_cho_blk =
h_inv_cho_blk.clone().contiguous()) so the kernel (_gptq_fused_block_kernel)
operates on clones and only qw_block/err_block are written back, preserving the
non-mutating contract of gptq_fused_block.

---

Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Line 2154: Replace the inline conditional expression "torch.cuda.synchronize()
if torch.cuda.is_available() else None" with a clear conditional block using "if
torch.cuda.is_available(): torch.cuda.synchronize()" to remove the no-op else
and improve readability; update both occurrences of the inline form (the one
shown and the other occurrence later in the file) so all cuda synchronizations
use the explicit if-check around torch.cuda.synchronize().
- Around line 2106-2107: The code assigns module.name = name which may collide
with existing attributes; change to a unique prefixed attribute (e.g.
module._quant_name or module._qo_name) wherever the module name is attached,
then update all usages that read that attribute—specifically the references
inside blockwise_weight_update and the internal state key lookup that currently
expect module.name—so they read the new prefixed attribute instead.

In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 186-199: Replace the dynamic NVFP4 preset with the GPTQ/static
NVFP4 preset so the quantizer follows the NVFP4StaticQuantizer.global_amax
export path used by GPTQ; specifically change the quant_cfg assignment
(currently mtq.NVFP4_DEFAULT_CFG) to the static/GPTQ preset (e.g.,
mtq.NVFP4_GPTQ_STATIC_CFG or the project's named GPTQ static preset) before
calling mtq.quantize(model, quant_cfg, ...), leaving the subsequent weight
restore, update_hessian and blockwise_weight_update calls intact so the test
exercises the new _export_quantized_weight global_amax flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c14b6e0-c7f4-4f64-a587-56ad77719a45

📥 Commits

Reviewing files that changed from the base of the PR and between 839fa3d and b4acce9.

📒 Files selected for processing (10)
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/__init__.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/metrics/__init__.py
  • modelopt/torch/quantization/metrics/activation_mse.py
  • modelopt/torch/quantization/metrics/perplexity.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/triton/gptq_fused_kernel.py
  • tests/gpu/torch/quantization/test_gptq.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/gpu/torch/quantization/test_gptq.py (1)

280-281: ⚠️ Potential issue | 🟠 Major

Update the e2e test to the renamed GPTQ mode.

Line 281 still selects gptq_lite, but this PR wires the calibration path to gptq. As written, this test is targeting the removed mode name and will stop covering the new flow.

Suggested fix
-    quant_cfg["algorithm"] = "gptq_lite"
+    quant_cfg["algorithm"] = {"method": "gptq", "use_sequential": True}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_gptq.py` around lines 280 - 281, The test
sets quant_cfg["algorithm"] to the removed mode "gptq_lite"; update the test to
use the renamed mode by changing the algorithm value to "gptq" so the
calibration path wired to GPTQ is exercised (locate the assignment to
quant_cfg["algorithm"] in the test_gptq.py snippet and replace "gptq_lite" with
"gptq").
modelopt/torch/quantization/config.py (1)

1419-1431: ⚠️ Potential issue | 🟠 Major

Reject None and non-positive GPTQ parameters at config-parse time.

GPTQConfig accepts None for percdamp and block_size due to the type annotations float | None and int | None. However, the runtime gptq function requires concrete numeric values: percdamp is multiplied by a tensor in _prepare_hessian_inverse(), and block_size is used in range() in _blockwise_update(). When model_dump() is called on the config, it includes None values in the kwargs dict (Pydantic does not exclude None by default), which override the gptq function's default parameters and cause runtime failures.

Suggested fix
-    percdamp: float | None = ModeloptField(
+    percdamp: float = ModeloptField(
         default=0.01,
         gt=0.0,
         le=1.0,
         title="Percentage damping factor.",
         description="The percentage of average Hessian diagonal used for damping.",
     )
-    block_size: int | None = ModeloptField(
+    block_size: int = ModeloptField(
         default=128,
+        gt=0,
         title="Block size for GPTQ weight update.",
         description="""The block size for GPTQ weight update, which must be a multiple of the
         group_size used in the quantization.""",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 1419 - 1431, GPTQConfig
currently allows percdamp: float | None and block_size: int | None which lets
None leak into model_dump and override gptq defaults; change both annotations to
non-optional (float and int), remove the `| None`, set ModeloptField defaults
and validators to enforce gt=0 (e.g., percdamp ModeloptField(default=0.01,
gt=0.0, le=1.0) and block_size ModeloptField(default=128, gt=0)), and add a
small pydantic validator on GPTQConfig (or use ModeloptField) to raise on None
or non-positive values so model_dump never emits None for percdamp or block_size
and runtime functions like _prepare_hessian_inverse() and _blockwise_update()
always receive concrete positive numbers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1555-1568: The patched hessian_forward hook can leave
weight_quantizer disabled and the module forward patched if an exception occurs;
wrap the code that disables the quantizer, calls _forward_no_gptq_hessian, and
re-enables the quantizer in a try/finally so weight_quantizer.enable() always
runs, and ensure the hook/installed forward is restored or uninstalled in a
finally block as well (i.e., capture the original forward, install the patched
hessian_forward, call forward inside try, and in finally reassign the original
forward and call weight_quantizer.enable()). Apply the same try/finally pattern
to the analogous hook at the other location referenced (the block around lines
1824-1832) so both hessian_forward and the other patched forward used by
forward_loop always restore state on exceptions.
- Around line 1677-1687: The code must skip processing when batch_size is zero
to avoid the divide-by-zero in math.sqrt(2 / n_samples); in the block around
input_flat, batch_size, hessian, n_samples and scaled_input add a guard that if
batch_size == 0 you do not modify hessian or n_samples (i.e., continue/return
from this iteration) and only perform the incremental averaging and
outer-product update when batch_size > 0 so math.sqrt and the division are never
invoked with zero effective samples.
- Around line 1597-1614: Replace detecting dead columns from weight values with
detection from the Hessian diagonal inside _prepare_hessian_inverse: compute
dead_cols as torch.nonzero(torch.diag(h).eq(0)).unsqueeze(-1) (or equivalent)
using torch.diag(h), then zero out h rows/cols and set h[dead_cols, dead_cols]=1
and also zero the corresponding columns in self.weight so those weight columns
are treated as dead; keep the damping addition using percdamp as before and
ensure all references use the h-based dead column mask rather than
self.weight.eq(0).
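The dead-column handling and percdamp damping requested above can be sketched in plain Python (hypothetical names; `h` is the accumulated Hessian as nested lists, `weight` the layer's weight rows — a sketch of the intent, not the repository's tensor code):

```python
def prepare_hessian(h, weight, percdamp=0.01):
    """Detect dead columns from diag(H), zero matching weights, add damping."""
    d = len(h)
    dead = [i for i in range(d) if h[i][i] == 0.0]  # dead columns via diag(H) == 0
    for i in dead:
        for j in range(d):
            h[i][j] = 0.0  # zero out row i
            h[j][i] = 0.0  # and column i
        h[i][i] = 1.0      # keep H invertible
        for row in weight:
            row[i] = 0.0   # a dead input column makes the weight column dead too
    # damping: percdamp times the mean of the Hessian diagonal
    damp = percdamp * sum(h[i][i] for i in range(d)) / d
    for i in range(d):
        h[i][i] += damp
    return h, weight
```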

---

Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1419-1431: GPTQConfig currently allows percdamp: float | None and
block_size: int | None which lets None leak into model_dump and override gptq
defaults; change both annotations to non-optional (float and int), remove the `|
None`, set ModeloptField defaults and validators to enforce gt=0 (e.g., percdamp
ModeloptField(default=0.01, gt=0.0, le=1.0) and block_size
ModeloptField(default=128, gt=0)), and add a small pydantic validator on
GPTQConfig (or use ModeloptField) to raise on None or non-positive values so
model_dump never emits None for percdamp or block_size and runtime functions
like _prepare_hessian_inverse() and _blockwise_update() always receive concrete
positive numbers.

In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 280-281: The test sets quant_cfg["algorithm"] to the removed mode
"gptq_lite"; update the test to use the renamed mode by changing the algorithm
value to "gptq" so the calibration path wired to GPTQ is exercised (locate the
assignment to quant_cfg["algorithm"] in the test_gptq.py snippet and replace
"gptq_lite" with "gptq").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f42eadc8-d5b1-48e6-a3a1-4f8e887e5c46

📥 Commits

Reviewing files that changed from the base of the PR and between 76002c8 and db2e44c.

📒 Files selected for processing (4)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/model_calib.py
  • tests/gpu/torch/quantization/test_gptq.py

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-853/

Built to branch gh-pages at 2026-03-25 18:25 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/config.py (1)

1369-1382: ⚠️ Potential issue | 🟠 Major

Validate GPTQ block_size at config parse time.

block_size currently accepts 0/negative values and only fails later in GPTQ update loops. Add a positive bound here so invalid configs fail fast.

Suggested fix
 class GPTQConfig(QuantizeAlgorithmConfig):
@@
     block_size: int | None = ModeloptField(
         default=128,
+        gt=0,
         title="Block size for GPTQ weight update.",
         description="""The block size for GPTQ weight update, which must be a multiple of the
         group_size used in the quantization.""",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 1369 - 1382, The
block_size ModeloptField in modelopt/torch/quantization/config.py currently
allows zero/negative values and should be validated at parse time; update the
block_size field (the ModeloptField definition named block_size) to require a
positive integer (e.g., add gt=0 or ge=1) so invalid configs fail fast during
config parsing rather than later in GPTQ update loops.
♻️ Duplicate comments (2)
modelopt/torch/quantization/model_calib.py (2)

1679-1689: ⚠️ Potential issue | 🟠 Major

Guard zero-token batches before Hessian scaling.

batch_size == 0 still reaches math.sqrt(2 / n_samples) and can divide by zero on empty expert/token batches.

Suggested fix
 def update_hessian(input, hessian, n_samples):
@@
     input_flat = input.reshape(-1, input.shape[-1]).t().float()
     batch_size = input_flat.shape[1]
+    if batch_size == 0:
+        return hessian, n_samples
@@
     hessian *= n_samples / (n_samples + batch_size)
     n_samples += batch_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1679 - 1689, Guard
against empty token batches in the Hessian update: check batch_size (computed
from input_flat.shape[1]) immediately after flattening and skip the
scaling/averaging/Hessian update if batch_size == 0 to avoid computing
math.sqrt(2 / n_samples) or dividing by zero; ensure you do not update n_samples
or call hessian.add_() for empty batches (i.e., return/continue before the
"hessian *= n_samples / (n_samples + batch_size)" and subsequent lines in the
Hessian update block that reference n_samples, scaled_input, and hessian).

1555-1568: ⚠️ Potential issue | 🟠 Major

Make GPTQ patch lifecycle exception-safe.

If a patched forward or forward_loop(model) raises, patched forwards and quantizer state may leak into later execution.

Suggested fix
         def hessian_forward(self, input, *args, **kwargs):
@@
             self.weight_quantizer.disable()
-            out = self._forward_no_gptq_hessian(input, *args, **kwargs)
-            self.weight_quantizer.enable()
-            return out
+            try:
+                return self._forward_no_gptq_hessian(input, *args, **kwargs)
+            finally:
+                self.weight_quantizer.enable()
@@
-    gptq_handles = {name: GPTQHandle(m, name, offload_to_cpu=True) for name, m in quantized_layers}
-    for handle in gptq_handles.values():
-        handle.setup()
-
-    print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
-    forward_loop(model)
-
-    for handle in gptq_handles.values():
-        handle.cleanup()
+    gptq_handles = {name: GPTQHandle(m, name, offload_to_cpu=True) for name, m in quantized_layers}
+    installed_handles = []
+    try:
+        for handle in gptq_handles.values():
+            handle.setup()
+            installed_handles.append(handle)
+
+        print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
+        forward_loop(model)
+    finally:
+        for handle in installed_handles:
+            handle.cleanup()

Also applies to: 1817-1825

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1555 - 1568, The
patched hessian_forward must be made exception-safe: wrap the section that
updates the hessian and disables the weight quantizer + calls
_forward_no_gptq_hessian in a try/finally so weight_quantizer.enable() always
runs even if _forward_no_gptq_hessian raises, and in the same finally
clear/reset the temporary GPTQ patch state on gptq_handle (e.g., reset
gptq_handle.hessian and gptq_handle.n_samples or remove the handle reference) to
avoid leaking the patch into later execution; apply the identical try/finally
fix to the other occurrence around the same logic (the block at lines
1817-1825).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1369-1382: The block_size ModeloptField in
modelopt/torch/quantization/config.py currently allows zero/negative values and
should be validated at parse time; update the block_size field (the
ModeloptField definition named block_size) to require a positive integer (e.g.,
add gt=0 or ge=1) so invalid configs fail fast during config parsing rather than
later in GPTQ update loops.

---

Duplicate comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1679-1689: Guard against empty token batches in the Hessian
update: check batch_size (computed from input_flat.shape[1]) immediately after
flattening and skip the scaling/averaging/Hessian update if batch_size == 0 to
avoid computing math.sqrt(2 / n_samples) or dividing by zero; ensure you do not
update n_samples or call hessian.add_() for empty batches (i.e., return/continue
before the "hessian *= n_samples / (n_samples + batch_size)" and subsequent
lines in the Hessian update block that reference n_samples, scaled_input, and
hessian).
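The token-count Hessian update with the empty-batch guard described in these comments can be sketched as a standalone function (a hypothetical sketch of the logic, not ModelOpt's actual `update_hessian`):

```python
import math

import torch


def update_hessian(hessian, n_samples, inp):
    """Streaming GPTQ Hessian update: H ~ (2 / N) * sum_t x_t x_t^T.

    `inp` is (batch, seq_len, in_features); tokens are flattened so
    n_samples counts tokens, not batches. Hypothetical standalone
    sketch of the logic described in the review.
    """
    input_flat = inp.reshape(-1, inp.shape[-1]).t().float()  # (in_features, tokens)
    batch_size = input_flat.shape[1]
    if batch_size == 0:
        # Guard: an empty token batch must not touch H or the sample count.
        return hessian, n_samples
    n_new = n_samples + batch_size
    hessian *= n_samples / n_new            # rescale the old running average
    scaled = math.sqrt(2.0 / n_new) * input_flat
    hessian += scaled @ scaled.t()          # add the 2/N * X X^T contribution
    return hessian, n_new
```

With this normalization, feeding the calibration set in chunks yields the same Hessian as a single pass, and an empty batch leaves both the Hessian and the sample count untouched.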

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: af4360f3-56a7-42c0-88f1-e0bee76f10d3

📥 Commits

Reviewing files that changed from the base of the PR and between db2e44c and a1e7d72.

📒 Files selected for processing (4)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/model_quant.py
  • tests/gpu/torch/quantization/test_gptq.py
✅ Files skipped from review due to trivial changes (1)
  • modelopt/torch/quantization/model_quant.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/gpu/torch/quantization/test_gptq.py

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
modelopt/torch/quantization/model_calib.py (1)

1762-1795: Performance consideration: Full-weight quantization per column.

The blockwise update performs full-matrix quantization (quantizer(wblk)) for each column within the inner loop (line 1787). This is O(num_cols × quantizer_cost) per block, which can be expensive for large weight matrices.

This appears intentional per the docstring ("quantizes the full weight matrix via the quantizer and extracts the quantized column") and matches standard GPTQ implementations where each column's quantization depends on the current weight state after error compensation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1762 - 1795, The
inner loop in _blockwise_update repeatedly calls quantizer(wblk) per column
which is expensive; move the quantization call out of the per-column loop so you
compute qdq = quantizer(wblk) once per block (before the for i in
range(n_cols_blk) loop) and then reuse qdq[:, block_start + i] when setting
self.weight and computing err; keep the existing wblk updates (addr_) and final
self.weight.addmm_ call but ensure qdq is recomputed only when you intentionally
need a fresh quantization (e.g., if you change the algorithm to require
re-quantization mid-block), referencing symbols _blockwise_update, quantizer,
wblk, qdq, errs, addr_, and self.weight to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1762-1795: The inner loop in _blockwise_update repeatedly calls
quantizer(wblk) per column which is expensive; move the quantization call out of
the per-column loop so you compute qdq = quantizer(wblk) once per block (before
the for i in range(n_cols_blk) loop) and then reuse qdq[:, block_start + i] when
setting self.weight and computing err; keep the existing wblk updates (addr_)
and final self.weight.addmm_ call but ensure qdq is recomputed only when you
intentionally need a fresh quantization (e.g., if you change the algorithm to
require re-quantization mid-block), referencing symbols _blockwise_update,
quantizer, wblk, qdq, errs, addr_, and self.weight to locate the changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d366fe9b-14ea-48e0-b07e-a37f381e70e1

📥 Commits

Reviewing files that changed from the base of the PR and between a1e7d72 and 3c5ad90.

📒 Files selected for processing (3)
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/model_quant.py
  • tests/gpu/torch/quantization/test_gptq.py
💤 Files with no reviewable changes (1)
  • modelopt/torch/quantization/model_quant.py

Collaborator

@cjluo-nv cjluo-nv left a comment

Review — Efficiency & Correctness

Critical Efficiency Issue

Full-matrix quantization per column in _blockwise_update (model_calib.py):

for i in range(n_cols_blk):           # iterates every column in block
    qdq = quantizer(wblk)             # quantizes the ENTIRE weight matrix
    self.weight[:, block_start + i] = qdq[:, block_start + i]  # uses 1 column

This quantizes the full weight matrix once per column to extract a single column. For a weight with N columns, that's N full-matrix quantizations — O(N²) vs. the O(N) of the reference GPTQ implementation. I understand this may be needed because block quantizers (NVFP4) can't easily quantize individual columns, but this is likely the dominant bottleneck. Consider caching or at minimum documenting this cost trade-off.

Full weight clone per block (same area):

wblk = self.weight.clone()   # full matrix clone on every block iteration

Allocating O(rows × cols) per block. Only the slice [:, block_start:] is actually needed.

Correctness Issues

  1. hessian_forward — no try/finally around quantizer state toggle:

    self.weight_quantizer.disable()
    out = self._forward_no_gptq_hessian(input, *args, **kwargs)
    self.weight_quantizer.enable()

    If _forward_no_gptq_hessian raises, the weight quantizer is left permanently disabled. Wrap in try/finally.

  2. update_hessian semantic change — the old code counted n_samples by batch dimension (input.shape[0]); the new code counts by total tokens (batch × seq_len). The new behavior is more correct, but this silently changes Hessian normalization for any other callers of update_hessian (it's a public function). The tests are updated, but callers outside this PR may not be.

  3. Double forward pass: gptq() runs max_calibrate(model, forward_loop) (one full pass), then forward_loop(model) again for Hessians. During the Hessian pass, weight quantizers are disabled so downstream layers see FP weights. The two passes replay the same data — could they be combined into one?

  4. _blockwise_update quantized/FP mixing at block boundaries: wblk is cloned from self.weight at each block start, so it contains already-quantized values for prior blocks. When the block quantizer's group size spans a GPTQ block boundary, scale factors are computed from a mix of quantized and FP values. This is a known trade-off with block quantizers, but the docstring claims "standard GPTQ approach" — worth noting this is a GPTQ variant.

Memory Concern

All GPTQHelper instances (and their Hessians) are held simultaneously:

gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}

The old code deleted each Hessian after processing. For large models (e.g., 70B), holding all Hessians at once could OOM. Consider freeing incrementally:

for handle in gptq_handles.values():
    handle.update_weights(block_size, percdamp)
    handle.hessian = None  # free after use

Minor Issues

  1. Backward compatibility: "gptq_lite" method name is deleted with no migration/alias. Existing configs referencing it will break silently.

  2. Test verbosity (test_gptq_export_roundtrip): ~15 lines of print() for diff stats — consider using logging or just including key stats in the assertion message.

  3. _promote_nvfp4_static_quantizers uses magic tuples (2, 1) and (4, 3) for detection. These should be named constants or reference the quantizer class to avoid silent breakage if encoding changes.

  4. Commit history: Contains "tested, revert later", "remove later", "debug logs" commits — should be squashed before merge.
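
For comparison, the per-column GPTQ update the review measures against (quantize a column, divide the residual by the Cholesky diagonal, propagate the error to later columns) can be sketched as follows. This is a minimal illustration with a toy round-to-nearest quantizer, not the ModelOpt implementation under review:

```python
import torch


def rtn_quantize(w, scale=0.1):
    # Toy round-to-nearest quantizer, a stand-in for the real TensorQuantizer.
    return torch.round(w / scale) * scale


def gptq_blockwise(weight, hessian, block_size=4, percdamp=0.01, quantize=rtn_quantize):
    """Minimal sketch of the standard GPTQ blockwise update."""
    W = weight.clone().float()
    rows, cols = W.shape
    H = hessian.clone().float()
    # Dampen the Hessian diagonal for numerical stability.
    H += percdamp * torch.mean(torch.diag(H)) * torch.eye(cols)
    # Upper-triangular Cholesky factor of H^{-1}.
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True
    )
    Q = torch.zeros_like(W)
    for i1 in range(0, cols, block_size):
        i2 = min(i1 + block_size, cols)
        W1 = W[:, i1:i2].clone()            # clone only the active block
        Err1 = torch.zeros_like(W1)
        Hinv1 = Hinv[i1:i2, i1:i2]
        for i in range(i2 - i1):
            w, d = W1[:, i], Hinv1[i, i]
            q = quantize(w)
            Q[:, i1 + i] = q
            err = (w - q) / d
            # Compensate the remaining columns of this block for the error.
            W1[:, i:] -= err.unsqueeze(1) @ Hinv1[i, i:].unsqueeze(0)
            Err1[:, i] = err
        # Propagate the block's accumulated error to all later columns.
        W[:, i2:] -= Err1 @ Hinv[i1:i2, i2:]
    return Q
```

With an identity Hessian the off-diagonal propagation terms vanish and the result reduces to plain round-to-nearest, which is a useful sanity check.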


-    method: Literal["gptq_lite"] = ModeloptField("gptq_lite")
+    method: Literal["gptq"] = ModeloptField("gptq")
     percdamp: float | None = ModeloptField(
Contributor

What happens when percdamp is set to None?

Contributor Author

Usually, there is no use case for percdamp None. Thanks for the catch!

for i in range(n_cols_blk):
    w_ci = wblk[:, block_start + i]
    d = h_inv_cho_blk[i, i]
    qdq = quantizer(wblk)
Contributor

Will this qdq become the bottleneck since it is performing #col times for the whole tensor? Would it work if we only do qdq for the block?

Contributor Author

Yes, this is a bottleneck. Right now if we go through the TensorQuantizer it doesn't allow for QDQ'ing a slice of the weight tensor.

Fridah-nv and others added 3 commits March 25, 2026 17:51
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch from 8afc672 to 7d2a960 Compare March 25, 2026 18:01
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch from 7d2a960 to 19fc0c2 Compare March 25, 2026 18:13
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14
Contributor Author

@cjluo-nv

  1. Critical Efficiency Issue -> this is a limitation of the current TensorQuantizer and will be addressed in a follow up PR
  2. hessian_forward — no try/finally around quantizer state toggle -> addressed through helper context manager
  3. Double forward pass -> this is necessary as the first forward pass sets the scales while the second pass is used to collect hessians (which require scales to be set on input quantizers)
  4. _blockwise_update quantized/FP mixing at block boundaries -> Added a check to ensure block size is divisible by group size
  5. Memory Concern -> addressed
  6. Backward compatibility -> deleted all references to gptq_lite in the codebase.
  7. Test verbosity -> Addressed
  8. _promote_nvfp4_static_quantizers -> we use these tuples as is throughout many helper functions in the code base
  9. Commit history -> commits are squashed when merged
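
The guard mentioned in item 4 might look like the following (hypothetical names; this sketch combines the block/group divisibility check with the config-time positivity check the bot suggested):

```python
def validate_block_size(block_size: int, group_size: int) -> None:
    """Reject block sizes that would let a quantizer group span a GPTQ
    block boundary, where scale factors would mix quantized and FP columns.

    Hypothetical sketch of the check described above, not ModelOpt code.
    """
    if block_size <= 0:
        raise ValueError(f"block_size must be positive, got {block_size}")
    if block_size % group_size != 0:
        raise ValueError(
            f"block_size ({block_size}) must be a multiple of the quantizer "
            f"group size ({group_size})"
        )
```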

Collaborator

@cjluo-nv cjluo-nv left a comment

Approved with requested fixes

The overall refactor from gptq_lite to a proper GPTQ implementation with GPTQHelper looks good. Clean encapsulation, correct Hessian token-count fix, and good test coverage additions.

Please address the following before merging:

1. wblk = self.weight.clone() inside the block loop (_blockwise_update)

A full weight tensor clone is created at every block iteration. For a 4096×4096 layer with block_size=128, that's 32 full-matrix clones per layer. Consider cloning once before the block loop and tracking which columns are finalized vs. working, or document why the per-block clone is necessary if there's a correctness reason I'm missing.

2. hessian_forward — add try/finally around the original forward call

In setup(), the patched hessian_forward calls self._forward_no_gptq_hessian(input, ...) without a try/finally guard. If that call raises, the Hessian has already been updated with potentially partial/bad data but the exception will propagate without any cleanup of gptq_helper state. A lightweight try/finally (or at minimum a comment explaining why it's safe to skip) would make the code more robust.

Everything else — disabled_weight_quantizers context manager, _promote_nvfp4_static_quantizers, block/group size divisibility check, test updates — LGTM.

-class GPTQLiteConfig(QuantizeAlgorithmConfig):
-    """The config for GPTQ lite.
+class GPTQConfig(QuantizeAlgorithmConfig):
+    """The config for GPTQ quantization.
Contributor

Should use_sequential be part of the GPTQConfig?

finally:
    input_getter._unpatch_all_layers()
with disabled_weight_quantizers(model):
    forward_loop(model)
Contributor

I'm thinking about why we disable all weight quantizers during Hessian collection. In non-sequential mode, we applied input quantizers but not weight quantizers, the activations used to build the Hessian will not be the actual inference activations. Does this make sense to you? Any reason we want to keep non-sequential mode?
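
For reference, an exception-safe disabled_weight_quantizers context manager could be sketched like this (hypothetical; it assumes each quantized module exposes a weight_quantizer with enable()/disable(), as in the snippets above):

```python
from contextlib import contextmanager


@contextmanager
def disabled_weight_quantizers(model):
    """Temporarily disable all weight quantizers during Hessian collection.

    Hypothetical sketch: the finally block guarantees re-enabling even if
    the calibration forward pass raises, addressing the try/finally concern
    raised in this review.
    """
    toggled = []
    try:
        for module in model.modules():
            wq = getattr(module, "weight_quantizer", None)
            if wq is not None and getattr(wq, "is_enabled", True):
                wq.disable()
                toggled.append(wq)
        yield
    finally:
        for wq in toggled:
            wq.enable()
```

Because re-enabling happens in `finally`, a failure inside the `with` body cannot leave quantizers permanently disabled.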
