
GPTQ official #853

Open
sugunav14 wants to merge 40 commits into main from svelury/gptq-official

Conversation

@sugunav14
Contributor

@sugunav14 sugunav14 commented Feb 4, 2026

What does this PR do?

Implements the official version of GPTQ with a decoder-level sequential calibration flow.
Ref: https://github.com/IST-DASLab/FP-Quant/tree/master

Type of change: New feature

Overview:

  1. Deletes the gptq_lite configuration. gptq() is a more generic implementation and resolves to the gptq_lite behavior when use_sequential is set to False.
  2. Introduces a GPTQHelper class to handle Hessian-collection patching/unpatching, blockwise weight updates, Hessian initialization, inverse-Hessian computation, etc.

Usage

python hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-8B \
    --qformat nvfp4_gptq \
    --kv_cache_qformat none \
    --dataset cnn_dailymail \
    --batch_size 32 \
    --calib_seq 512 \
    --calib_size 512 \
    --export_path exported_model

Testing

Measure perplexity and activation MSE on the following models

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a new NVFP4 GPTQ quantization option for CLI/configuration.
  • Improvements

    • Replaced GPTQ-lite with a full GPTQ calibration pipeline for more accurate, Hessian-aware blockwise updates and token-based sampling.
    • Added sequential layer-by-layer calibration support and automatic promotion of NVFP4 static quantizers.
    • Improved logging and timing for calibration runs.
  • Tests

    • Expanded GPTQ tests, including export/roundtrip validation for quantized weights.

@copy-pr-bot

copy-pr-bot bot commented Feb 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Replaces the GPTQ "lite" calibration with a full GPTQ implementation (new GPTQConfig, GPTQModeDescriptor, and a gptq(...) pipeline that collects Hessians and applies blockwise updates), adds an nvfp4_gptq CLI/example option, and updates tests and examples to the new API.

Changes

Cohort / File(s) Summary
CLI / Example
examples/llm_ptq/hf_ptq.py
Added nvfp4_gptq to quantization format choices and tidied blank-line formatting around export/low-memory flags.
Quantization Config
modelopt/torch/quantization/config.py
Replaced GPTQLiteConfig with GPTQConfig (method literal → "gptq"), removed hessian_state_path, retained percdamp and block_size.
Calibration Mode
modelopt/torch/quantization/mode.py
Registered GPTQModeDescriptor (was GPTQLite); switched config_class to GPTQConfig, bound _calib_func to gptq, and allowed "gptq" in sequential method checks.
Calibration Implementation
modelopt/torch/quantization/model_calib.py
Large refactor: removed GPTQ-lite helpers, added gptq(...) pipeline and GPTQHelper/handle-style logic; Hessian accumulation now flattens token dims, forward-patches per-layer Hessians, promotes NVFP4 static quantizers, performs damped inverse Hessian blockwise updates, and updated logging/timing.
Tests
tests/gpu/torch/quantization/test_gptq.py
Adjusted tests for flattened-token n_samples, switched to calling gptq(...), added export+dequantization roundtrip test, and updated expectations/imports.
Docs / Examples Update
modelopt/torch/quantization/model_quant.py
Doc example updated to use {"method": "gptq", "sequential": True} instead of the lite variant.
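The "damped inverse Hessian blockwise updates" summarized for model_calib.py above follow the standard GPTQ recipe: quantize one column at a time and fold the quantization error into the not-yet-quantized columns via the Cholesky factor of the damped inverse Hessian. A minimal pure-Python sketch for a single weight row, with hypothetical names (`quantize` and `h_inv_cho` stand in for the real quantizer and precomputed factor, not this PR's exact code):

```python
def gptq_block_update(w_row, h_inv_cho, quantize=round):
    # w_row: weights of one output row (mutated in place)
    # h_inv_cho: upper-triangular Cholesky factor of the damped inverse Hessian
    n = len(w_row)
    q_row = [0.0] * n
    for i in range(n):
        q = float(quantize(w_row[i]))
        q_row[i] = q
        err = (w_row[i] - q) / h_inv_cho[i][i]
        for j in range(i + 1, n):
            # propagate the quantization error to later (unquantized) columns
            w_row[j] -= err * h_inv_cho[i][j]
    return q_row
```

In the real implementation this runs per block of `block_size` columns with a fused kernel; the sketch only shows the error-propagation structure.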

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI / Caller
    participant Calib as gptq
    participant ForwardLoop as ForwardLoop
    participant Model as Model
    participant Quant as Quantizers
    participant Handle as GPTQHandle
    participant Logger as Logger

    CLI->>Calib: gptq(model, forward_loop, percdamp, block_size)
    Calib->>Calib: max_calibrate(model)
    Calib->>Quant: promote eligible quantizers to NVFP4StaticQuantizer
    Calib->>Model: patch module forwards for Hessian collection

    loop batches
        ForwardLoop->>Model: run forward batch
        Model->>Quant: quantize/dequantize activations
        Quant-->>Calib: provide activations
        Calib->>Handle: update_hessian(flattened_inputs)
    end

    Calib->>Handle: compute damped inverse Hessian per-block
    Handle->>Handle: apply blockwise weight updates
    Handle->>Logger: log per-module metrics and timing
    Calib-->>CLI: return (model weights updated)
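The update_hessian step in the loop above accumulates an incrementally averaged Hessian, roughly H ≈ (2/n) Σ x xᵀ over flattened token vectors, with the empty-batch guard the review asks for. A dependency-free sketch under those assumptions (plain lists instead of tensors; names are illustrative, not the PR's exact API):

```python
import math

def update_hessian(x_cols, hessian, n_samples):
    """Fold a new batch of flattened token vectors into the running Hessian average."""
    batch_size = len(x_cols)
    if batch_size == 0:  # guard: skip empty token batches entirely
        return hessian, n_samples
    # re-weight the running average before adding the new batch
    scale_old = n_samples / (n_samples + batch_size)
    hessian = [[h * scale_old for h in row] for row in hessian]
    n_samples += batch_size
    s = math.sqrt(2.0 / n_samples)
    for x in x_cols:
        sx = [s * v for v in x]
        for i, si in enumerate(sx):
            for j, sj in enumerate(sx):
                hessian[i][j] += si * sj  # rank-1 outer-product update
    return hessian, n_samples
```

The actual code performs the same update on GPU tensors inside the patched module forwards.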

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.73%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check | ❓ Inconclusive | The PR title "GPTQ official" is vague and does not clearly convey the main change. | Use a more descriptive title, such as "Replace GPTQ-Lite with official GPTQ implementation".
✅ Passed checks (2 passed)
Check name | Status | Explanation
Security Anti-Patterns | ✅ Passed | Comprehensive security audit confirms no critical security anti-patterns in modified files and no new non-permissive license dependencies.
Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch svelury/gptq-official

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 20.49689% with 128 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.18%. Comparing base (ae965a9) to head (068e8a9).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Missing
modelopt/torch/quantization/model_calib.py | 16.90% | 118 ⚠️
modelopt/torch/quantization/utils/core_utils.py | 18.18% | 9 ⚠️
modelopt/torch/quantization/mode.py | 75.00% | 1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #853      +/-   ##
==========================================
- Coverage   70.18%   70.18%   -0.01%     
==========================================
  Files         228      228              
  Lines       25952    25961       +9     
==========================================
+ Hits        18215    18221       +6     
- Misses       7737     7740       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.



@torch.no_grad()
def sequential_calibrate(
Contributor Author


@Fridah-nv reference for sequential calibration
cc @realAsma

@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch 2 times, most recently from b5f6073 to a8540d4 on February 24, 2026 00:17
@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch 4 times, most recently from 566c9ae to bdbeb03 on March 6, 2026 00:56
@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch 2 times, most recently from 0aa088c to 3f2d7c0 on March 20, 2026 00:34
@sugunav14 sugunav14 marked this pull request as ready for review March 21, 2026 00:12
@sugunav14 sugunav14 requested review from a team as code owners March 21, 2026 00:12
@sugunav14 sugunav14 requested a review from meenchen March 21, 2026 00:12
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (3)
tests/gpu/torch/quantization/test_gptq.py (1)

186-199: Use a GPTQ/static preset in this round-trip test.

NVFP4_DEFAULT_CFG leaves the weight quantizer dynamic, so _export_quantized_weight() stays on the existing _amax path instead of the new NVFP4StaticQuantizer.global_amax flow that official GPTQ relies on. This gives you a generic NVFP4 export check, but not coverage of the new feature path.

As per coding guidelines, "All test coverage checks in PRs must pass for new features and examples".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_gptq.py` around lines 186 - 199, Replace
the dynamic NVFP4 preset with the GPTQ/static NVFP4 preset so the quantizer
follows the NVFP4StaticQuantizer.global_amax export path used by GPTQ;
specifically change the quant_cfg assignment (currently mtq.NVFP4_DEFAULT_CFG)
to the static/GPTQ preset (e.g., mtq.NVFP4_GPTQ_STATIC_CFG or the project's
named GPTQ static preset) before calling mtq.quantize(model, quant_cfg, ...),
leaving the subsequent weight restore, update_hessian and
blockwise_weight_update calls intact so the test exercises the new
_export_quantized_weight global_amax flow.
modelopt/torch/quantization/model_calib.py (2)

2154-2154: Consider cleaner conditional synchronization.

The inline conditional torch.cuda.synchronize() if torch.cuda.is_available() else None works but the else None is a no-op that may confuse readers.

♻️ Suggested cleaner pattern
-    torch.cuda.synchronize() if torch.cuda.is_available() else None
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()

Apply the same pattern on line 2171.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` at line 2154, Replace the inline
conditional expression "torch.cuda.synchronize() if torch.cuda.is_available()
else None" with a clear conditional block using "if torch.cuda.is_available():
torch.cuda.synchronize()" to remove the no-op else and improve readability;
update both occurrences of the inline form (the one shown and the other
occurrence later in the file) so all cuda synchronizations use the explicit
if-check around torch.cuda.synchronize().

2106-2107: Potential attribute collision on module.name.

Assigning module.name = name could overwrite an existing name attribute if the module class defines one. Consider using a prefixed attribute name to avoid conflicts.

♻️ Suggested safer attribute name
-            module.name = name  # Attach name for easy access in hooks
+            module._gptq_name = name  # Attach name for easy access in hooks

Then update references in blockwise_weight_update call (line 2165) and the internal state key (line 2162).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 2106 - 2107, The
code assigns module.name = name which may collide with existing attributes;
change to a unique prefixed attribute (e.g. module._quant_name or
module._qo_name) wherever the module name is attached, then update all usages
that read that attribute—specifically the references inside
blockwise_weight_update and the internal state key lookup that currently expect
module.name—so they read the new prefixed attribute instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 935-959: The activation-MSE path assumes text inputs and a
tokenizer; guard it so it only runs when a tokenizer is available (and/or when
the calibration flow is text-only) by early-checking tokenizer is not None
before calling ActivationMSELogger.load_raw_text, materialize_raw_text, or
tokenize_raw_text and emit a clear warning/skipped-info when tokenizer is None;
specifically change the block using mse_input_path,
ActivationMSELogger.materialize_raw_text, ActivationMSELogger.tokenize_raw_text,
calib_dataloader, and args.calib_seq to skip MSE creation/processing if
tokenizer is None (and likewise protect the alternate block around lines
976-981) to avoid touching multimodal calib_dataloader outputs meant for
full_model when measuring activation MSE.
- Around line 688-692: The perplexity path currently calls
compute_perplexity(full_model, eval_data) which can fail for
multimodal/container wrappers; change the call to use the extracted language
model instead (compute_perplexity(language_model, eval_data)) when
args.eval_perplexity is true and tokenizer is present, ensuring you reference
the existing symbols compute_perplexity, language_model, full_model, and args so
the LM-specific forward is used for perplexity evaluation.

In `@modelopt/torch/quantization/metrics/activation_mse.py`:
- Line 577: The torch.load call that creates payload (payload = torch.load(path,
map_location="cpu", weights_only=False)) uses weights_only=False which permits
arbitrary code execution; change it to weights_only=True to restrict loading to
tensor/weight objects (i.e., payload = torch.load(path, map_location="cpu",
weights_only=True)), or if you must keep weights_only=False add a brief inline
comment next to this torch.load explaining why the file is trusted (e.g.,
"checkpoint is internally generated and validated") and the explicit reason it
is safe.

In `@modelopt/torch/quantization/metrics/perplexity.py`:
- Around line 19-22: Update the file-level attribution in the perplexity.py
module: replace the minimal "Ported from FP-Quant" docstring with a full
FP-Quant attribution block that includes a commit-pinned source URL (repo +
specific commit hash), the original FP-Quant copyright and license text (as in
their repository), and the required NVIDIA Apache-2.0 header used across this
project; keep the existing short description but prepend the full multi-line
header so the file contains the commit-pinned reference, original
copyright/license metadata, and the NVIDIA Apache-2.0 copyright/license notice.
- Around line 30-58: compute_perplexity currently proceeds on empty data which
leaves nll_running uninitialized as a tensor and later causes an AttributeError;
add an explicit guard at the top of compute_perplexity to detect an empty data
(e.g., if not data or len(data) == 0) and raise a ValueError with a clear
message that the evaluation dataset is empty (mentioning sequence_length/corpus
size if helpful); this ensures immediate, clear failure instead of the
downstream error involving nll_running/tokens_processed.

In `@modelopt/torch/quantization/mode.py`:
- Around line 509-518: The GPTQ mode currently defaults to non-sequential
because GPTQConfig still inherits use_sequential=False; update
GPTQModeDescriptor (and/or GPTQConfig) so GPTQ always runs sequentially: set the
mode's use_sequential to True (or override a property on GPTQModeDescriptor to
return True) or validate in GPTQModeDescriptor.__init__/config_class that
use_sequential is True and raise an error if not; ensure the handler for
_calib_func = gptq cannot be used when use_sequential is False so gptq is only
invoked under sequential_calibrate().

In `@modelopt/torch/quantization/triton/gptq_fused_kernel.py`:
- Around line 165-174: gptq_fused_block currently may mutate the caller's
w_block because calling .contiguous() doesn't copy an already-contiguous tensor;
ensure the wrapper truly clones inputs before the kernel writes to them by
replacing w_block = w_block.contiguous() (and similarly for amax_grouped and
h_inv_cho_blk if they must not be mutated) with explicit copies like w_block =
w_block.clone().contiguous() (and amax_grouped =
amax_grouped.clone().contiguous(), h_inv_cho_blk =
h_inv_cho_blk.clone().contiguous()) so the kernel (_gptq_fused_block_kernel)
operates on clones and only qw_block/err_block are written back, preserving the
non-mutating contract of gptq_fused_block.

---

Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Line 2154: Replace the inline conditional expression "torch.cuda.synchronize()
if torch.cuda.is_available() else None" with a clear conditional block using "if
torch.cuda.is_available(): torch.cuda.synchronize()" to remove the no-op else
and improve readability; update both occurrences of the inline form (the one
shown and the other occurrence later in the file) so all cuda synchronizations
use the explicit if-check around torch.cuda.synchronize().
- Around line 2106-2107: The code assigns module.name = name which may collide
with existing attributes; change to a unique prefixed attribute (e.g.
module._quant_name or module._qo_name) wherever the module name is attached,
then update all usages that read that attribute—specifically the references
inside blockwise_weight_update and the internal state key lookup that currently
expect module.name—so they read the new prefixed attribute instead.

In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 186-199: Replace the dynamic NVFP4 preset with the GPTQ/static
NVFP4 preset so the quantizer follows the NVFP4StaticQuantizer.global_amax
export path used by GPTQ; specifically change the quant_cfg assignment
(currently mtq.NVFP4_DEFAULT_CFG) to the static/GPTQ preset (e.g.,
mtq.NVFP4_GPTQ_STATIC_CFG or the project's named GPTQ static preset) before
calling mtq.quantize(model, quant_cfg, ...), leaving the subsequent weight
restore, update_hessian and blockwise_weight_update calls intact so the test
exercises the new _export_quantized_weight global_amax flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c14b6e0-c7f4-4f64-a587-56ad77719a45

📥 Commits

Reviewing files that changed from the base of the PR and between 839fa3d and b4acce9.

📒 Files selected for processing (10)
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/__init__.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/metrics/__init__.py
  • modelopt/torch/quantization/metrics/activation_mse.py
  • modelopt/torch/quantization/metrics/perplexity.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/triton/gptq_fused_kernel.py
  • tests/gpu/torch/quantization/test_gptq.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/gpu/torch/quantization/test_gptq.py (1)

280-281: ⚠️ Potential issue | 🟠 Major

Update the e2e test to the renamed GPTQ mode.

Line 281 still selects gptq_lite, but this PR wires the calibration path to gptq. As written, this test is targeting the removed mode name and will stop covering the new flow.

Suggested fix
-    quant_cfg["algorithm"] = "gptq_lite"
+    quant_cfg["algorithm"] = {"method": "gptq", "use_sequential": True}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_gptq.py` around lines 280 - 281, The test
sets quant_cfg["algorithm"] to the removed mode "gptq_lite"; update the test to
use the renamed mode by changing the algorithm value to "gptq" so the
calibration path wired to GPTQ is exercised (locate the assignment to
quant_cfg["algorithm"] in the test_gptq.py snippet and replace "gptq_lite" with
"gptq").
modelopt/torch/quantization/config.py (1)

1419-1431: ⚠️ Potential issue | 🟠 Major

Reject None and non-positive GPTQ parameters at config-parse time.

GPTQConfig accepts None for percdamp and block_size due to the type annotations float | None and int | None. However, the runtime gptq function requires concrete numeric values: percdamp is multiplied by a tensor in _prepare_hessian_inverse(), and block_size is used in range() in _blockwise_update(). When model_dump() is called on the config, it includes None values in the kwargs dict (Pydantic does not exclude None by default), which override the gptq function's default parameters and cause runtime failures.

Suggested fix
-    percdamp: float | None = ModeloptField(
+    percdamp: float = ModeloptField(
         default=0.01,
         gt=0.0,
         le=1.0,
         title="Percentage damping factor.",
         description="The percentage of average Hessian diagonal used for damping.",
     )
-    block_size: int | None = ModeloptField(
+    block_size: int = ModeloptField(
         default=128,
+        gt=0,
         title="Block size for GPTQ weight update.",
         description="""The block size for GPTQ weight update, which must be a multiple of the
         group_size used in the quantization.""",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 1419 - 1431, GPTQConfig
currently allows percdamp: float | None and block_size: int | None which lets
None leak into model_dump and override gptq defaults; change both annotations to
non-optional (float and int), remove the `| None`, set ModeloptField defaults
and validators to enforce gt=0 (e.g., percdamp ModeloptField(default=0.01,
gt=0.0, le=1.0) and block_size ModeloptField(default=128, gt=0)), and add a
small pydantic validator on GPTQConfig (or use ModeloptField) to raise on None
or non-positive values so model_dump never emits None for percdamp or block_size
and runtime functions like _prepare_hessian_inverse() and _blockwise_update()
always receive concrete positive numbers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1555-1568: The patched hessian_forward hook can leave
weight_quantizer disabled and the module forward patched if an exception occurs;
wrap the code that disables the quantizer, calls _forward_no_gptq_hessian, and
re-enables the quantizer in a try/finally so weight_quantizer.enable() always
runs, and ensure the hook/installed forward is restored or uninstalled in a
finally block as well (i.e., capture the original forward, install the patched
hessian_forward, call forward inside try, and in finally reassign the original
forward and call weight_quantizer.enable()). Apply the same try/finally pattern
to the analogous hook at the other location referenced (the block around lines
1824-1832) so both hessian_forward and the other patched forward used by
forward_loop always restore state on exceptions.
- Around line 1677-1687: The code must skip processing when batch_size is zero
to avoid the divide-by-zero in math.sqrt(2 / n_samples); in the block around
input_flat, batch_size, hessian, n_samples and scaled_input add a guard that if
batch_size == 0 you do not modify hessian or n_samples (i.e., continue/return
from this iteration) and only perform the incremental averaging and
outer-product update when batch_size > 0 so math.sqrt and the division are never
invoked with zero effective samples.
- Around line 1597-1614: Replace detecting dead columns from weight values with
detection from the Hessian diagonal inside _prepare_hessian_inverse: compute
dead_cols as torch.nonzero(torch.diag(h).eq(0)).unsqueeze(-1) (or equivalent)
using torch.diag(h), then zero out h rows/cols and set h[dead_cols, dead_cols]=1
and also zero the corresponding columns in self.weight so those weight columns
are treated as dead; keep the damping addition using percdamp as before and
ensure all references use the h-based dead column mask rather than
self.weight.eq(0).
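The dead-column handling and percdamp damping requested above can be sketched in plain Python (hypothetical names; `h` is the accumulated Hessian as nested lists, `weight` the layer's weight rows — a sketch of the intent, not the repository's tensor code):

```python
def prepare_hessian(h, weight, percdamp=0.01):
    """Detect dead columns from diag(H), zero matching weights, add damping."""
    d = len(h)
    dead = [i for i in range(d) if h[i][i] == 0.0]  # dead columns via diag(H) == 0
    for i in dead:
        for j in range(d):
            h[i][j] = 0.0  # zero out row i
            h[j][i] = 0.0  # and column i
        h[i][i] = 1.0      # keep H invertible
        for row in weight:
            row[i] = 0.0   # a dead input column makes the weight column dead too
    # damping: percdamp times the mean of the Hessian diagonal
    damp = percdamp * sum(h[i][i] for i in range(d)) / d
    for i in range(d):
        h[i][i] += damp
    return h, weight
```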

---

Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1419-1431: GPTQConfig currently allows percdamp: float | None and
block_size: int | None which lets None leak into model_dump and override gptq
defaults; change both annotations to non-optional (float and int), remove the `|
None`, set ModeloptField defaults and validators to enforce gt=0 (e.g., percdamp
ModeloptField(default=0.01, gt=0.0, le=1.0) and block_size
ModeloptField(default=128, gt=0)), and add a small pydantic validator on
GPTQConfig (or use ModeloptField) to raise on None or non-positive values so
model_dump never emits None for percdamp or block_size and runtime functions
like _prepare_hessian_inverse() and _blockwise_update() always receive concrete
positive numbers.

In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 280-281: The test sets quant_cfg["algorithm"] to the removed mode
"gptq_lite"; update the test to use the renamed mode by changing the algorithm
value to "gptq" so the calibration path wired to GPTQ is exercised (locate the
assignment to quant_cfg["algorithm"] in the test_gptq.py snippet and replace
"gptq_lite" with "gptq").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f42eadc8-d5b1-48e6-a3a1-4f8e887e5c46

📥 Commits

Reviewing files that changed from the base of the PR and between 76002c8 and db2e44c.

📒 Files selected for processing (4)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/model_calib.py
  • tests/gpu/torch/quantization/test_gptq.py

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-853/

Built to branch gh-pages at 2026-03-25 18:25 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/config.py (1)

1369-1382: ⚠️ Potential issue | 🟠 Major

Validate GPTQ block_size at config parse time.

block_size currently accepts 0/negative values and only fails later in GPTQ update loops. Add a positive bound here so invalid configs fail fast.

Suggested fix
 class GPTQConfig(QuantizeAlgorithmConfig):
@@
     block_size: int | None = ModeloptField(
         default=128,
+        gt=0,
         title="Block size for GPTQ weight update.",
         description="""The block size for GPTQ weight update, which must be a multiple of the
         group_size used in the quantization.""",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 1369 - 1382, The
block_size ModeloptField in modelopt/torch/quantization/config.py currently
allows zero/negative values and should be validated at parse time; update the
block_size field (the ModeloptField definition named block_size) to require a
positive integer (e.g., add gt=0 or ge=1) so invalid configs fail fast during
config parsing rather than later in GPTQ update loops.
♻️ Duplicate comments (2)
modelopt/torch/quantization/model_calib.py (2)

1679-1689: ⚠️ Potential issue | 🟠 Major

Guard zero-token batches before Hessian scaling.

batch_size == 0 still reaches math.sqrt(2 / n_samples) and can divide by zero on empty expert/token batches.

Suggested fix
 def update_hessian(input, hessian, n_samples):
@@
     input_flat = input.reshape(-1, input.shape[-1]).t().float()
     batch_size = input_flat.shape[1]
+    if batch_size == 0:
+        return hessian, n_samples
@@
     hessian *= n_samples / (n_samples + batch_size)
     n_samples += batch_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1679 - 1689, Guard
against empty token batches in the Hessian update: check batch_size (computed
from input_flat.shape[1]) immediately after flattening and skip the
scaling/averaging/Hessian update if batch_size == 0 to avoid computing
math.sqrt(2 / n_samples) or dividing by zero; ensure you do not update n_samples
or call hessian.add_() for empty batches (i.e., return/continue before the
"hessian *= n_samples / (n_samples + batch_size)" and subsequent lines in the
Hessian update block that reference n_samples, scaled_input, and hessian).

1555-1568: ⚠️ Potential issue | 🟠 Major

Make GPTQ patch lifecycle exception-safe.

If a patched forward or forward_loop(model) raises, patched forwards and quantizer state may leak into later execution.

Suggested fix
         def hessian_forward(self, input, *args, **kwargs):
@@
             self.weight_quantizer.disable()
-            out = self._forward_no_gptq_hessian(input, *args, **kwargs)
-            self.weight_quantizer.enable()
-            return out
+            try:
+                return self._forward_no_gptq_hessian(input, *args, **kwargs)
+            finally:
+                self.weight_quantizer.enable()
@@
-    gptq_handles = {name: GPTQHandle(m, name, offload_to_cpu=True) for name, m in quantized_layers}
-    for handle in gptq_handles.values():
-        handle.setup()
-
-    print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
-    forward_loop(model)
-
-    for handle in gptq_handles.values():
-        handle.cleanup()
+    gptq_handles = {name: GPTQHandle(m, name, offload_to_cpu=True) for name, m in quantized_layers}
+    installed_handles = []
+    try:
+        for handle in gptq_handles.values():
+            handle.setup()
+            installed_handles.append(handle)
+
+        print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
+        forward_loop(model)
+    finally:
+        for handle in installed_handles:
+            handle.cleanup()

Also applies to: 1817-1825

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1555 - 1568, The
patched hessian_forward must be made exception-safe: wrap the section that
updates the hessian and disables the weight quantizer + calls
_forward_no_gptq_hessian in a try/finally so weight_quantizer.enable() always
runs even if _forward_no_gptq_hessian raises, and in the same finally
clear/reset the temporary GPTQ patch state on gptq_handle (e.g., reset
gptq_handle.hessian and gptq_handle.n_samples or remove the handle reference) to
avoid leaking the patch into later execution; apply the identical try/finally
fix to the other occurrence around the same logic (the block at lines
1817-1825).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1369-1382: The block_size ModeloptField in
modelopt/torch/quantization/config.py currently allows zero/negative values and
should be validated at parse time; update the block_size field (the
ModeloptField definition named block_size) to require a positive integer (e.g.,
add gt=0 or ge=1) so invalid configs fail fast during config parsing rather than
later in GPTQ update loops.

---

Duplicate comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1679-1689: Guard against empty token batches in the Hessian
update: check batch_size (computed from input_flat.shape[1]) immediately after
flattening and skip the scaling/averaging/Hessian update if batch_size == 0 to
avoid computing math.sqrt(2 / n_samples) or dividing by zero; ensure you do not
update n_samples or call hessian.add_() for empty batches (i.e., return/continue
before the "hessian *= n_samples / (n_samples + batch_size)" and subsequent
lines in the Hessian update block that reference n_samples, scaled_input, and
hessian).
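The token-count Hessian update with the empty-batch guard described in these comments can be sketched as a standalone function (a hypothetical sketch of the logic, not ModelOpt's actual `update_hessian`):

```python
import math

import torch


def update_hessian(hessian, n_samples, inp):
    """Streaming GPTQ Hessian update: H ~ (2 / N) * sum_t x_t x_t^T.

    `inp` is (batch, seq_len, in_features); tokens are flattened so
    n_samples counts tokens, not batches. Hypothetical standalone
    sketch of the logic described in the review.
    """
    input_flat = inp.reshape(-1, inp.shape[-1]).t().float()  # (in_features, tokens)
    batch_size = input_flat.shape[1]
    if batch_size == 0:
        # Guard: an empty token batch must not touch H or the sample count.
        return hessian, n_samples
    n_new = n_samples + batch_size
    hessian *= n_samples / n_new            # rescale the old running average
    scaled = math.sqrt(2.0 / n_new) * input_flat
    hessian += scaled @ scaled.t()          # add the 2/N * X X^T contribution
    return hessian, n_new
```

With this normalization, feeding the calibration set in chunks yields the same Hessian as a single pass, and an empty batch leaves both the Hessian and the sample count untouched.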

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: af4360f3-56a7-42c0-88f1-e0bee76f10d3

📥 Commits

Reviewing files that changed from the base of the PR and between db2e44c and a1e7d72.

📒 Files selected for processing (4)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/model_quant.py
  • tests/gpu/torch/quantization/test_gptq.py
✅ Files skipped from review due to trivial changes (1)
  • modelopt/torch/quantization/model_quant.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/gpu/torch/quantization/test_gptq.py

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
modelopt/torch/quantization/model_calib.py (1)

1762-1795: Performance consideration: Full-weight quantization per column.

The blockwise update performs full-matrix quantization (quantizer(wblk)) for each column within the inner loop (line 1787). This is O(num_cols × quantizer_cost) per block, which can be expensive for large weight matrices.

This appears intentional per the docstring ("quantizes the full weight matrix via the quantizer and extracts the quantized column") and matches standard GPTQ implementations where each column's quantization depends on the current weight state after error compensation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1762 - 1795, The
inner loop in _blockwise_update repeatedly calls quantizer(wblk) per column
which is expensive; move the quantization call out of the per-column loop so you
compute qdq = quantizer(wblk) once per block (before the for i in
range(n_cols_blk) loop) and then reuse qdq[:, block_start + i] when setting
self.weight and computing err; keep the existing wblk updates (addr_) and final
self.weight.addmm_ call but ensure qdq is recomputed only when you intentionally
need a fresh quantization (e.g., if you change the algorithm to require
re-quantization mid-block), referencing symbols _blockwise_update, quantizer,
wblk, qdq, errs, addr_, and self.weight to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1762-1795: The inner loop in _blockwise_update repeatedly calls
quantizer(wblk) per column which is expensive; move the quantization call out of
the per-column loop so you compute qdq = quantizer(wblk) once per block (before
the for i in range(n_cols_blk) loop) and then reuse qdq[:, block_start + i] when
setting self.weight and computing err; keep the existing wblk updates (addr_)
and final self.weight.addmm_ call but ensure qdq is recomputed only when you
intentionally need a fresh quantization (e.g., if you change the algorithm to
require re-quantization mid-block), referencing symbols _blockwise_update,
quantizer, wblk, qdq, errs, addr_, and self.weight to locate the changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d366fe9b-14ea-48e0-b07e-a37f381e70e1

📥 Commits

Reviewing files that changed from the base of the PR and between a1e7d72 and 3c5ad90.

📒 Files selected for processing (3)
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/model_quant.py
  • tests/gpu/torch/quantization/test_gptq.py
💤 Files with no reviewable changes (1)
  • modelopt/torch/quantization/model_quant.py

Collaborator

@cjluo-nv cjluo-nv left a comment

Review — Efficiency & Correctness

Critical Efficiency Issue

Full-matrix quantization per column in _blockwise_update (model_calib.py):

for i in range(n_cols_blk):           # iterates every column in block
    qdq = quantizer(wblk)             # quantizes the ENTIRE weight matrix
    self.weight[:, block_start + i] = qdq[:, block_start + i]  # uses 1 column

This quantizes the full weight matrix once per column to extract a single column. For a weight with N columns, that's N full-matrix quantizations — O(N²) vs. the O(N) of the reference GPTQ implementation. I understand this may be needed because block quantizers (NVFP4) can't easily quantize individual columns, but this is likely the dominant bottleneck. Consider caching or at minimum documenting this cost trade-off.

Full weight clone per block (same area):

wblk = self.weight.clone()   # full matrix clone on every block iteration

Allocating O(rows × cols) per block. Only the slice [:, block_start:] is actually needed.

Correctness Issues

  1. hessian_forward — no try/finally around quantizer state toggle:

    self.weight_quantizer.disable()
    out = self._forward_no_gptq_hessian(input, *args, **kwargs)
    self.weight_quantizer.enable()

    If _forward_no_gptq_hessian raises, the weight quantizer is left permanently disabled. Wrap in try/finally.

  2. update_hessian semantic change — the old code counted n_samples by batch dimension (input.shape[0]); the new code counts by total tokens (batch × seq_len). The new behavior is more correct, but this silently changes Hessian normalization for any other callers of update_hessian (it's a public function). The tests are updated, but callers outside this PR may not be.

  3. Double forward pass: gptq() runs max_calibrate(model, forward_loop) (one full pass), then forward_loop(model) again for Hessians. During the Hessian pass, weight quantizers are disabled so downstream layers see FP weights. The two passes replay the same data — could they be combined into one?

  4. _blockwise_update quantized/FP mixing at block boundaries: wblk is cloned from self.weight at each block start, so it contains already-quantized values for prior blocks. When the block quantizer's group size spans a GPTQ block boundary, scale factors are computed from a mix of quantized and FP values. This is a known trade-off with block quantizers, but the docstring claims "standard GPTQ approach" — worth noting this is a GPTQ variant.

Memory Concern

All GPTQHelper instances (and their Hessians) are held simultaneously:

gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}

The old code deleted each Hessian after processing. For large models (e.g., 70B), holding all Hessians at once could OOM. Consider freeing incrementally:

for handle in gptq_handles.values():
    handle.update_weights(block_size, percdamp)
    handle.hessian = None  # free after use

Minor Issues

  1. Backward compatibility: "gptq_lite" method name is deleted with no migration/alias. Existing configs referencing it will break silently.

  2. Test verbosity (test_gptq_export_roundtrip): ~15 lines of print() for diff stats — consider using logging or just including key stats in the assertion message.

  3. _promote_nvfp4_static_quantizers uses magic tuples (2, 1) and (4, 3) for detection. These should be named constants or reference the quantizer class to avoid silent breakage if encoding changes.

  4. Commit history: Contains "tested, revert later", "remove later", "debug logs" commits — should be squashed before merge.
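
For comparison, the per-column GPTQ update the review measures against (quantize a column, divide the residual by the Cholesky diagonal, propagate the error to later columns) can be sketched as follows. This is a minimal illustration with a toy round-to-nearest quantizer, not the ModelOpt implementation under review:

```python
import torch


def rtn_quantize(w, scale=0.1):
    # Toy round-to-nearest quantizer, a stand-in for the real TensorQuantizer.
    return torch.round(w / scale) * scale


def gptq_blockwise(weight, hessian, block_size=4, percdamp=0.01, quantize=rtn_quantize):
    """Minimal sketch of the standard GPTQ blockwise update."""
    W = weight.clone().float()
    rows, cols = W.shape
    H = hessian.clone().float()
    # Dampen the Hessian diagonal for numerical stability.
    H += percdamp * torch.mean(torch.diag(H)) * torch.eye(cols)
    # Upper-triangular Cholesky factor of H^{-1}.
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True
    )
    Q = torch.zeros_like(W)
    for i1 in range(0, cols, block_size):
        i2 = min(i1 + block_size, cols)
        W1 = W[:, i1:i2].clone()            # clone only the active block
        Err1 = torch.zeros_like(W1)
        Hinv1 = Hinv[i1:i2, i1:i2]
        for i in range(i2 - i1):
            w, d = W1[:, i], Hinv1[i, i]
            q = quantize(w)
            Q[:, i1 + i] = q
            err = (w - q) / d
            # Compensate the remaining columns of this block for the error.
            W1[:, i:] -= err.unsqueeze(1) @ Hinv1[i, i:].unsqueeze(0)
            Err1[:, i] = err
        # Propagate the block's accumulated error to all later columns.
        W[:, i2:] -= Err1 @ Hinv[i1:i2, i2:]
    return Q
```

With an identity Hessian the off-diagonal propagation terms vanish and the result reduces to plain round-to-nearest, which is a useful sanity check.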


-    method: Literal["gptq_lite"] = ModeloptField("gptq_lite")
+    method: Literal["gptq"] = ModeloptField("gptq")
     percdamp: float | None = ModeloptField(
Contributor

What happens when percdamp is set to None?

Contributor Author

Usually, there is no use case for percdamp None. Thanks for the catch!

for i in range(n_cols_blk):
    w_ci = wblk[:, block_start + i]
    d = h_inv_cho_blk[i, i]
    qdq = quantizer(wblk)
Contributor

Will this qdq become the bottleneck since it is performing #col times for the whole tensor? Would it work if we only do qdq for the block?

Contributor Author

Yes, this is a bottleneck. Right now if we go through the TensorQuantizer it doesn't allow for QDQ'ing a slice of the weight tensor.

Fridah-nv and others added 3 commits March 25, 2026 17:51
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch from 8afc672 to 7d2a960 Compare March 25, 2026 18:01
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 force-pushed the svelury/gptq-official branch from 7d2a960 to 19fc0c2 Compare March 25, 2026 18:13
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14
Contributor Author

@cjluo-nv

  1. Critical Efficiency Issue -> this is a limitation of the current TensorQuantizer and will be addressed in a follow up PR
  2. hessian_forward — no try/finally around quantizer state toggle -> addressed through helper context manager
  3. Double forward pass -> this is necessary as the first forward pass sets the scales while the second pass is used to collect hessians (which require scales to be set on input quantizers)
  4. _blockwise_update quantized/FP mixing at block boundaries -> Added a check to ensure block size is divisible by group size
  5. Memory Concern -> addressed
  6. Backward compatibility -> deleted all references to gptq_lite in the codebase.
  7. Test verbosity -> Addressed
  8. _promote_nvfp4_static_quantizers -> we use these tuples as is throughout many helper functions in the code base
  9. Commit history -> commits are squashed when merged
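
The guard mentioned in item 4 might look like the following (hypothetical names; this sketch combines the block/group divisibility check with the config-time positivity check the bot suggested):

```python
def validate_block_size(block_size: int, group_size: int) -> None:
    """Reject block sizes that would let a quantizer group span a GPTQ
    block boundary, where scale factors would mix quantized and FP columns.

    Hypothetical sketch of the check described above, not ModelOpt code.
    """
    if block_size <= 0:
        raise ValueError(f"block_size must be positive, got {block_size}")
    if block_size % group_size != 0:
        raise ValueError(
            f"block_size ({block_size}) must be a multiple of the quantizer "
            f"group size ({group_size})"
        )
```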

Collaborator

@cjluo-nv cjluo-nv left a comment

Approved with requested fixes

The overall refactor from gptq_lite to a proper GPTQ implementation with GPTQHelper looks good. Clean encapsulation, correct Hessian token-count fix, and good test coverage additions.

Please address the following before merging:

1. wblk = self.weight.clone() inside the block loop (_blockwise_update)

A full weight tensor clone is created at every block iteration. For a 4096×4096 layer with block_size=128, that's 32 full-matrix clones per layer. Consider cloning once before the block loop and tracking which columns are finalized vs. working, or document why the per-block clone is necessary if there's a correctness reason I'm missing.

2. hessian_forward — add try/finally around the original forward call

In setup(), the patched hessian_forward calls self._forward_no_gptq_hessian(input, ...) without a try/finally guard. If that call raises, the Hessian has already been updated with potentially partial/bad data but the exception will propagate without any cleanup of gptq_helper state. A lightweight try/finally (or at minimum a comment explaining why it's safe to skip) would make the code more robust.

Everything else — disabled_weight_quantizers context manager, _promote_nvfp4_static_quantizers, block/group size divisibility check, test updates — LGTM.

-class GPTQLiteConfig(QuantizeAlgorithmConfig):
-    """The config for GPTQ lite.
+class GPTQConfig(QuantizeAlgorithmConfig):
+    """The config for GPTQ quantization.
Contributor

Should use_sequential be part of the GPTQConfig?

finally:
    input_getter._unpatch_all_layers()
with disabled_weight_quantizers(model):
    forward_loop(model)
Contributor

I'm thinking about why we disable all weight quantizers during Hessian collection. In non-sequential mode, we applied input quantizers but not weight quantizers, the activations used to build the Hessian will not be the actual inference activations. Does this make sense to you? Any reason we want to keep non-sequential mode?
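
For reference, an exception-safe disabled_weight_quantizers context manager could be sketched like this (hypothetical; it assumes each quantized module exposes a weight_quantizer with enable()/disable(), as in the snippets above):

```python
from contextlib import contextmanager


@contextmanager
def disabled_weight_quantizers(model):
    """Temporarily disable all weight quantizers during Hessian collection.

    Hypothetical sketch: the finally block guarantees re-enabling even if
    the calibration forward pass raises, addressing the try/finally concern
    raised in this review.
    """
    toggled = []
    try:
        for module in model.modules():
            wq = getattr(module, "weight_quantizer", None)
            if wq is not None and getattr(wq, "is_enabled", True):
                wq.disable()
                toggled.append(wq)
        yield
    finally:
        for wq in toggled:
            wq.enable()
```

Because re-enabling happens in `finally`, a failure inside the `with` body cannot leave quantizers permanently disabled.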
