Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #555      +/-   ##
==========================================
- Coverage   74.02%   73.58%   -0.45%
==========================================
  Files         192      192
  Lines       19664    19812     +148
==========================================
+ Hits        14557    14578      +21
- Misses       5107     5234     +127
```

☔ View full report in Codecov by Sentry.
```python
gt=0.0,
le=1.0,
title="Percentage damping factor.",
description="The percentage of average Hessian diagonal used for damping.",
```
If you have a reference from the original paper about what these are, could you also share the link?
```python
batch_size = input.shape[0]

# Incremental averaging: scale down old hessian
hessian *= n_samples / (n_samples + batch_size)
```
what's the dtype of hessian? Do we need to upcast to fp32 for this division?
hessian is defaulted to fp32 during initialization, and the result of the division is float.
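For context, the incremental averaging being discussed can be sketched in NumPy (an illustrative stand-in for the PR's `update_hessian`; the 2D input shape and float64 accumulation here are assumptions, not the actual implementation):

```python
import numpy as np

def update_hessian(x, hessian, n_samples):
    """Running average of H = (2/N) * X^T X over calibration batches.

    x: (batch, features) activation samples for one batch.
    hessian: (features, features) running estimate, kept in a wide dtype
    so the rescaling and accumulation stay numerically safe.
    """
    batch_size = x.shape[0]
    # Incremental averaging: scale down the old hessian so that old and
    # new samples end up equally weighted.
    hessian = hessian * (n_samples / (n_samples + batch_size))
    n_samples += batch_size
    # H += (2/n_samples) * x^T x, folded in as (sqrt(2/n) * x)^T (sqrt(2/n) * x)
    xs = np.sqrt(2.0 / n_samples) * x.astype(np.float64)
    hessian = hessian + xs.T @ xs
    return hessian, n_samples
```

Two incremental updates over batches X1 and X2 give exactly `(2/N) * X^T X` for the concatenated X, which is the invariant the `n_samples / (n_samples + batch_size)` rescaling maintains.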
```python
hessian, n_samples = update_hessian(input[0], state["hessian"], state["n_samples"])
hessian_state[module.name] = {"hessian": hessian, "n_samples": n_samples}
torch.cuda.empty_cache()
gc.collect()
```
do we have to do gc.collect() here? It's going to be very slow
```python
# Phase 1: Collect statistics for quantizers
enable_stats_collection(model)
max_calibrate(model, forward_loop)
```
do you need forward_loop here? Is this for weight amax calib only?
```python
state = hessian_state[module.name]
hessian = state["hessian"].to(module.weight.device)
blockwise_weight_update(module, hessian, block_size, percdamp)
torch.cuda.empty_cache()
```
maybe you can del the hessian after applying blockwise_weight_update?
```python
hessian_state_path: str | None = ModeloptField(
    default=None,
    title="Path to the Hessian state file.",
    description="The path to the Hessian state file.",
```
Maybe state: if the path exists, we load the Hessian from the file instead of recomputing it.
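A pure-Python sketch of the suggested behavior (`load_or_init_state` and its pickle-based persistence are hypothetical helpers for illustration; the real code would presumably use `torch.save`/`torch.load` on tensor state):

```python
import os
import pickle

def load_or_init_state(hessian_state_path, module_names):
    """If a saved Hessian state exists at the given path, load it instead of
    recomputing; otherwise start from an empty per-module state."""
    if hessian_state_path is not None and os.path.exists(hessian_state_path):
        with open(hessian_state_path, "rb") as f:
            return pickle.load(f), True  # loaded: caller can skip collection
    state = {name: {"hessian": None, "n_samples": 0} for name in module_names}
    return state, False

def save_state(hessian_state_path, state):
    """Persist the collected state so a later run can reuse it."""
    with open(hessian_state_path, "wb") as f:
        pickle.dump(state, f)
```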
> GPTQ lite does not perform sequential quantization of layers. This means that the updated activations are not used to process the next layer.
Can you estimate how much effort is needed if we need to add this constraint? I am thinking if we can have a quick test to see what's the accuracy impact.
This will be addressed in a followup PR
```python
block_size: int | None = ModeloptField(
    default=128,
    title="Block size for GPTQ weight update.",
    description="The block size for GPTQ weight update.",
)
```
This should be a multiple of the block_size used in quantization. We should explain that in the description as well.
```python
gt=0.0,
le=1.0,
title="Percentage damping factor.",
description="The percentage of average Hessian diagonal used for damping.",
```
Could you also add some instructions here, so users know the impact of increasing or decreasing this parameter?
```python
tensor_mapping = {}
for name, module in model.named_modules():
    if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
        in_features = module.weight.shape[1]
```
Can we use module.weight.shape[-1] instead, in case of a 3D weight?
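A quick illustration of why `shape[-1]` is the safer accessor (NumPy used here as a stand-in for torch tensors):

```python
import numpy as np

w2d = np.zeros((8, 16))      # standard Linear weight: (out_features, in_features)
w3d = np.zeros((4, 8, 16))   # e.g. a stacked/expert weight with a leading dim

# shape[1] gives the wrong answer for the 3D case; shape[-1] works for both
assert w2d.shape[1] == w2d.shape[-1] == 16
assert w3d.shape[1] == 8 and w3d.shape[-1] == 16
```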
```python
for name, module in model.named_modules():
    if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
        module.input_quantizer.reset_amax()
        module.output_quantizer.reset_amax()
```
Do you know how much accuracy is impacted if we don't recalibrate the input quantizer?
📝 Walkthrough

This pull request introduces the GPTQ-Lite quantization algorithm, comprising a new configuration class, calibration mode descriptor, Hessian-based weight update utilities, GPU memory monitoring, and comprehensive tests. The implementation enables post-training quantization using blockwise weight updates with damped Hessian information.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Model as Model
    participant ForwardLoop as ForwardLoop
    participant HessianCollector as Hessian<br/>Collector
    participant HessianPrep as Hessian Inverse<br/>Prep
    participant BlockQuantizer as Block<br/>Quantizer
    participant UpdatedModel as Updated<br/>Model
    ForwardLoop->>HessianCollector: Input samples
    HessianCollector->>HessianCollector: update_hessian(input)
    HessianCollector->>HessianPrep: Accumulated Hessian
    HessianPrep->>HessianPrep: prepare_hessian_inverse<br/>(damping, fallback)
    HessianPrep->>BlockQuantizer: Damped H_inv
    Model->>BlockQuantizer: Quantized weights
    BlockQuantizer->>BlockQuantizer: quantize_block<br/>(error propagation)
    BlockQuantizer->>UpdatedModel: Updated weights
    UpdatedModel-->>Model: In-place update
```
Estimated code review effort: 🎯 4 (Complex), ⏱️ ~50 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/config.py`:
- Around line 1235-1240: Fix the typo in the description for the ModeloptField
named hessian_state_path in config.py: change "insteaed" to "instead" inside the
multi-line description string so the text reads "If hessian path exists, we load
from hessian file instead of recomputing them." This touches the
hessian_state_path field declaration using ModeloptField.
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1302-1303: The call to print_relative_mse_error uses module.name
which may not exist and can raise AttributeError when blockwise_weight_update is
invoked directly; update blockwise_weight_update to derive a safe name (e.g.,
use getattr(module, "name", None) or fallback to module.__class__.__name__) and
pass that safe name into print_relative_mse_error (or only call
print_relative_mse_error when a name exists) so the function no longer assumes
module.name is always set; reference blockwise_weight_update,
print_relative_mse_error, and gptq_lite when making this change.
- Around line 1245-1265: The loop repeatedly calls quantizer(full_weight)
causing N full-tensor quantizations; move the quantization out of the inner
group loop or quantize only the required slice: either compute quantized_full =
quantizer(full_weight) once before iterating group_start (if quantization does
not depend on error propagation), or replace quantizer(full_weight) inside the
loop with quantizer(full_weight[:, block_start:block_end]) /
quantizer(block_weight) to only quantize the columns used for quantized_cols;
update uses of quantized_full/quantized_cols accordingly and ensure
block_weight/full_weight updates still reflect the chosen quantization strategy.
- Around line 1196-1203: The indexing into Hessian h using zero_cols is broken
because torch.nonzero returns a 2D tensor for a 1D mask; change how zero_cols is
computed so it is a 1D index tensor usable for advanced indexing (e.g., use
torch.nonzero(mask, as_tuple=True)[0] or torch.where(mask)[0] or .view(-1) on
the result). Update the line that builds zero_cols (which currently uses
torch.nonzero(weight.eq(0).all(dim=0))) so subsequent operations h[zero_cols,
:], h[:, zero_cols], and h[zero_cols, zero_cols] receive a 1D index tensor.
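The indexing pitfall can be seen with a NumPy stand-in (`np.nonzero(mask)[0]` plays the role of `torch.nonzero(mask, as_tuple=True)[0]` / `torch.where(mask)[0]`):

```python
import numpy as np

weight = np.array([[1.0, 0.0, 2.0],
                   [3.0, 0.0, 4.0]])
mask = (weight == 0).all(axis=0)   # True for all-zero columns

# np.nonzero on a 1D mask returns a 1-tuple; taking [0] yields the flat
# 1D index array that advanced indexing expects.
zero_cols = np.nonzero(mask)[0]
assert zero_cols.tolist() == [1]

# 1D indices make the row/column zeroing behave as intended
h = np.ones((3, 3))
h[zero_cols, :] = 0.0
h[:, zero_cols] = 0.0
h[zero_cols, zero_cols] = 1.0
```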
In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 118-121: The test_gptq_updates setup creates a Linear with shape
(1, dim) but assigns a model_weight of shape (16,16); fix by constructing the
torch.nn.Linear with out_features and in_features that match model_weight (use
model_weight.shape[0] for out_features and model_weight.shape[1] or dim for
in_features) so the model.weight.data assignment is shape-compatible; update the
instantiation of the model in test_gptq_updates and keep the subsequent
.to("cuda") and original_weight clone logic unchanged.
- Around line 176-178: The test is mutating the shared configuration dict
(quant_cfg) by setting quant_cfg["algorithm"] = "gptq_lite", which can alter
imported constants like mtq.NVFP4_DEFAULT_CFG; instead, make a local copy (e.g.,
via copy.deepcopy(quant_cfg) or dict(quant_cfg)) and modify that copy before use
(refer to the quant_cfg variable near model.eval()), so the original shared
constant is never changed.
🧹 Nitpick comments (5)
modelopt/torch/quantization/model_calib.py (4)
1158-1180: Potential numerical issue with Hessian update when `n_samples=0`.

When `n_samples=0` initially, the scaling factor `n_samples / (n_samples + batch_size)` equals `0`, which is correct. However, the comment on line 1174 says "H += (2/n_samples) * X @ X^T" but the actual formula uses `sqrt(2 / n_samples)` after incrementing `n_samples`. This is correct for the running average formula, but the comment is misleading. Also, when starting fresh, if the initial `n_samples` is 0 and `batch_size` is small, the scaling `sqrt(2 / n_samples)` after increment could be large. This is mathematically correct for the incremental averaging scheme.

📝 Suggested documentation fix

```diff
-    # Compute outer product: H += (2/n_samples) * X @ X^T
-    # where X is the flattened input reshaped to (features, batch*seq)
+    # Compute outer product contribution using running average:
+    # H += (sqrt(2/n_samples) * X) @ (sqrt(2/n_samples) * X)^T
+    # where X is the flattened input reshaped to (features, batch*seq)
```
1336-1343: Hardcoded memory threshold (0.65) should be configurable or documented.

The threshold `0.65` for deciding CPU vs GPU storage is a magic number that may not be optimal for all use cases. Consider making this configurable or at least documenting the rationale.

📝 Suggested improvement

```diff
+# Memory threshold for offloading Hessians to CPU (65% GPU usage)
+_HESSIAN_GPU_MEM_THRESHOLD = 0.65
+
 def initialize_hessian_state(tensor_mapping):
     """Initialize hessian state with zeros."""
     for name, (shape, device) in tensor_mapping.items():
         # Use CPU if GPU memory is tight
-        target_device = "cpu" if get_gpu_mem_fraction(device) > 0.65 else device
+        target_device = "cpu" if get_gpu_mem_fraction(device) > _HESSIAN_GPU_MEM_THRESHOLD else device
```
1424-1430: Duplicate exception handling for `save_hessian_state`.

The `save_hessian_state` function already has its own try-except block (lines 1366-1378) that handles exceptions and prints messages. Wrapping the call in another try-except (lines 1426-1430) results in duplicate error handling and potentially confusing double-printed messages.

♻️ Proposed fix

```diff
     # Save if configured
     if save_hessians:
-        try:
-            save_hessian_state(hessian_state_path)
-        except Exception as e:
-            print_rank_0(f"Error saving hessian state: {e}")
-            print_rank_0("Continuing execution...")
+        save_hessian_state(hessian_state_path)
```
1446-1448: Calling `torch.cuda.empty_cache()` in a loop can degrade performance.

`torch.cuda.empty_cache()` is expensive and calling it on every iteration can significantly slow down quantization. Consider moving it outside the loop or calling it periodically.

♻️ Proposed optimization

```diff
 # Perform blockwise weight updates
 for name, module in tqdm(quantized_modules, desc="Quantizing layers"):
     state = hessian_state[module.name]
     hessian = state["hessian"].to(module.weight.device)
     blockwise_weight_update(module, hessian, block_size, percdamp)
     # Delete hessian state to free memory
     del hessian_state[module.name]
-    torch.cuda.empty_cache()
+
+# Clear CUDA cache once after all layers are processed
+torch.cuda.empty_cache()
```

Alternatively, if memory is critical, call it every N iterations:

```python
if (idx + 1) % 10 == 0:
    torch.cuda.empty_cache()
```

tests/gpu/torch/quantization/test_gptq.py (1)
191-198: Consider using logging instead of print statements in tests.

Print statements in tests can clutter test output. Consider using pytest's `capsys` fixture or proper logging if output verification is needed, or remove them if they're only for debugging.
```python
for group_start in range(0, block_size, group_size):
    group_end = min(group_start + group_size, block_size)
    group_cols = slice(group_start, group_end)
    # Get current column and its Hessian inverse diagonal
    weight_col = block_weight[:, group_cols]
    hinv_diag = torch.diag(block_hinv[group_cols, group_cols])

    # Quantize using the full weight, then extract the columns we need
    quantized_full = quantizer(full_weight)
    quantized_cols = quantized_full[:, block_start + group_start : block_start + group_end]
    quantized_block[:, group_cols] = quantized_cols

    # Compute quantization error and loss
    error = (weight_col - quantized_cols) / hinv_diag
    losses[:, group_cols] = (weight_col - quantized_cols) ** 2 / (hinv_diag**2) / 2
    errors[:, group_cols] = error

    # Propagate error to remaining columns in block
    block_weight[:, group_start:] -= error @ block_hinv[group_start:group_end, group_start:]
    full_weight[:, block_start:block_end] = block_weight
```
Performance concern: Repeated full-weight quantization inside loop.
The quantizer(full_weight) call at line 1253 is inside the loop and quantizes the entire weight tensor for every column. For a weight matrix with N columns, this results in N full quantizations when only one column's quantized values are used per iteration. This is computationally expensive.
💡 Suggested optimization approach
Consider either:
- Quantizing the full weight once before the loop if the quantizer doesn't depend on the error-propagated weights
- Or if error propagation affects quantization, quantize only the needed columns per iteration
```diff
+# Quantize full weight once if error propagation doesn't affect quantization results
+# (Note: This may need adjustment based on quantizer behavior)
 for group_start in range(0, block_size, group_size):
     group_end = min(group_start + group_size, block_size)
     group_cols = slice(group_start, group_end)
     # Get current column and its Hessian inverse diagonal
     weight_col = block_weight[:, group_cols]
     hinv_diag = torch.diag(block_hinv[group_cols, group_cols])
-    # Quantize using the full weight, then extract the columns we need
-    quantized_full = quantizer(full_weight)
-    quantized_cols = quantized_full[:, block_start + group_start : block_start + group_end]
+    # Quantize only the current columns (if quantizer supports it)
+    # or cache full quantization outside loop
```

🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/model_calib.py` around lines 1245 - 1265, The
loop repeatedly calls quantizer(full_weight) causing N full-tensor
quantizations; move the quantization out of the inner group loop or quantize
only the required slice: either compute quantized_full = quantizer(full_weight)
once before iterating group_start (if quantization does not depend on error
propagation), or replace quantizer(full_weight) inside the loop with
quantizer(full_weight[:, block_start:block_end]) / quantizer(block_weight) to
only quantize the columns used for quantized_cols; update uses of
quantized_full/quantized_cols accordingly and ensure block_weight/full_weight
updates still reflect the chosen quantization strategy.
```python
hinv_diag = torch.diag(block_hinv[group_cols, group_cols])

# Quantize using the full weight, then extract the columns we need
quantized_full = quantizer(full_weight)
```
so, for NVFP4, are global scales and local scales dynamically computed, and could they change over the iterations?
yeah they could, but support was recently added for static scales, so we don't have to compute them dynamically if we use that config.
How does a user know to use static or dynamic scales with NVFP4? Is this selection automated somewhere?
I think we can add a note somewhere for the user to ensure they use static scales instead of dynamic scales for GPTQ. This is usually specified in the config like this:

```python
NVFP4_WEIGHT_ACT_MSE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "static", "scale_bits": (4, 3)},
            "axis": None,
            "enable": False,
        },
        "*input_quantizer": {
            "enable": False,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": "gptq_lite",
}
```
```python
try:
    h = torch.cholesky_inverse(torch.linalg.cholesky(h))
    h_inv = torch.linalg.cholesky(h, upper=True)
except (RuntimeError, torch.linalg.LinAlgError):
    print("Warning: Hessian is not positive definite, using identity matrix")
```
have we seen this before? Should we throw an exception?
Yeah, I have seen this issue pop up a couple of times while running tests. The official implementation does a similar handling of the exception too.
@sugunav14 I remember GPTQ paper adding a small diagonal term to make the hessian invertible.
@realAsma Do you mean the percdamp factor?
I would just keep the implementation very close to the original!
Right now I haven't made any changes to the original implementation except a log for the user. The original implementation also has this fallback option.
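For reference, the damping-plus-fallback flow under discussion looks roughly like this (a NumPy sketch of the GPTQ-style `percdamp` damping and identity fallback; not the PR's actual torch code, which uses `torch.cholesky_inverse`):

```python
import numpy as np

def damped_hessian_inverse(h, percdamp=0.01):
    """Add percdamp * mean(diag(H)) to the diagonal (the GPTQ damping term),
    then invert via Cholesky; fall back to identity if H still is not
    positive definite, mirroring the reference implementation's behavior."""
    n = h.shape[0]
    h = h + percdamp * np.mean(np.diag(h)) * np.eye(n)
    try:
        L = np.linalg.cholesky(h)   # raises LinAlgError if not positive definite
        l_inv = np.linalg.inv(L)
        return l_inv.T @ l_inv      # (L L^T)^-1 = L^-T L^-1
    except np.linalg.LinAlgError:
        print("Warning: Hessian is not positive definite, using identity matrix")
        return np.eye(n)
```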
modelopt/torch/utils/perf.py
Outdated
```python
}


def get_gpu_mem_fraction(device="cuda:0"):
```
nit: get_used_gpu_mem_fraction?
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
## What does this PR do?

**Type of change:** New feature

**Overview:** Adds support for the GPTQ algorithm. This PR implements a modified version of the official GPTQ algorithm; the key difference is that updated activations from each layer are not used for hessian computation.

## Usage

Modify the "algorithm" field in quant_cfg to "gptq_lite".
Note: Does not currently work with AWQ.

```python
# Add a code snippet demonstrating how to use this
```

## Testing

- [x] Added unit tests to test helper functions + e2e flow
- [x] Perplexity and GPQA results

| Model | Qformat | Perplexity wikitext2 | GPQA |
|-------------|------------------------|------------------------|------|
| Qwen3-8B | INT4 weight only (modelopt + amax/7) no GPTQ | 10.75 | n/a |
| Qwen3-8B | INT4 weight only (modelopt + amax/7) | **10.56** | 0.388 |
| Qwen3-8B | INT4 weight only + FP-Quant hessians + amax/7.5 | 10.25 | **0.449** |
| Qwen3-8B | INT4 weight only (FP-Quant) | 10.24 | 0.46 |
| Qwen3-8B | NVFP4 static weight only | 10.25 | n/a |
| Qwen3-8B | NVFP4 static weight only no GPTQ | 10.25 | n/a |
| Qwen3-0.6B | NVFP4 static weight only | **22.75** | n/a |
| Qwen3-0.6B | NVFP4 dynamic weight only | 23.50 | n/a |
| Qwen3-0.6B | NVFP4 static weight only with FP-Quant hessians | 22.0 | n/a |
| Qwen3-0.6B | NVFP4 static weight only no GPTQ | 24.25 | n/a |

Conclusions from results

- Perplexity remains the same or shows improvement with the Modelopt implementation. The magnitude of improvement is smaller in modelopt compared to FP-Quant.
- GPQA shows no improvement with modelopt, but shows improvement with FP-Quant.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Summary by CodeRabbit

Release Notes

* **New Features**
  * GPTQ Lite quantization mode now available for efficient model calibration
  * GPU memory usage monitoring utility added
  * Quantization configuration extended to support complex nested structures and lists
* **Tests**
  * Comprehensive test coverage added for GPTQ quantization workflows

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
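The usage snippet in the description is left as a placeholder; a hedged sketch of what enabling the algorithm might look like, as a config fragment (assuming the standard `modelopt.torch.quantization` entry point `mtq.quantize`, an existing preset such as `NVFP4_DEFAULT_CFG`, and a user-supplied `forward_loop` that feeds calibration batches):

```python
import copy

import modelopt.torch.quantization as mtq

# Copy the preset so the shared constant is not mutated
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
quant_cfg["algorithm"] = "gptq_lite"

# forward_loop runs calibration data through the model
model = mtq.quantize(model, quant_cfg, forward_loop)
```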
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>