
GPTQ Lite implementation #555

Merged
sugunav14 merged 19 commits into main from svelury/gptq-lite
Jan 30, 2026
Conversation


@sugunav14 sugunav14 commented Nov 13, 2025

What does this PR do?

Type of change: New feature

Overview: Adds support for the GPTQ algorithm. This PR implements a modified version of the official GPTQ algorithm; the key difference is that updated activations from each layer are not used for Hessian computation.

Usage

Set the "algorithm" field in quant_cfg to "gptq_lite".

Note: Does not currently work with AWQ

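A minimal usage sketch based on the description above (hedged: `NVFP4_DEFAULT_CFG` as the base config and the `model`/`forward_loop` objects are illustrative placeholders; the only step this PR adds is the `"gptq_lite"` algorithm selection):

```python
import copy

import modelopt.torch.quantization as mtq

# Copy the shared config first so the imported constant is not mutated
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
quant_cfg["algorithm"] = "gptq_lite"

# `model` is the torch model to quantize; `forward_loop` feeds calibration data
model = mtq.quantize(model, quant_cfg, forward_loop)
```

For GPTQ Lite the weight quantizer should use static rather than dynamic scales, as discussed in the review thread below.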

Testing

  • Added unit tests to test helper functions + e2e flow
  • Perplexity and GPQA results
| Model | Qformat | Perplexity wikitext2 | GPQA |
|---|---|---|---|
| Qwen3-8B | INT4 weight only (modelopt + amax/7) no GPTQ | 10.75 | n/a |
| Qwen3-8B | INT4 weight only (modelopt + amax/7) | 10.56 | 0.388 |
| Qwen3-8B | INT4 weight only + FP-Quant hessians + amax/7.5 | 10.25 | 0.449 |
| Qwen3-8B | INT4 weight only (FP-Quant) | 10.24 | 0.46 |
| Qwen3-8B | NVFP4 static weight only | 10.25 | n/a |
| Qwen3-8B | NVFP4 static weight only no GPTQ | 10.25 | n/a |
| Qwen3-0.6B | NVFP4 static weight only | 22.75 | n/a |
| Qwen3-0.6B | NVFP4 dynamic weight only | 23.50 | n/a |
| Qwen3-0.6B | NVFP4 static weight only with FP-Quant hessians | 22.0 | n/a |
| Qwen3-0.6B | NVFP4 static weight only no GPTQ | 24.25 | n/a |

Conclusions from results

  • Perplexity remains the same or improves with the ModelOpt implementation, though the improvement is smaller than with FP-Quant
  • GPQA shows no improvement with ModelOpt, but does improve with FP-Quant

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • GPTQ Lite quantization mode now available for efficient model calibration
    • GPU memory usage monitoring utility added
    • Quantization configuration extended to support complex nested structures and lists
  • Tests

    • Comprehensive test coverage added for GPTQ quantization workflows


@sugunav14 sugunav14 requested review from a team as code owners November 13, 2025 22:44
@sugunav14 sugunav14 requested a review from RalphMao November 13, 2025 22:44
@sugunav14 sugunav14 marked this pull request as draft November 13, 2025 22:44
@copy-pr-bot

copy-pr-bot bot commented Nov 13, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@codecov

codecov bot commented Nov 13, 2025

Codecov Report

❌ Patch coverage is 14.76510% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.58%. Comparing base (2c73de0) to head (0627fb3).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/quantization/model_calib.py 6.81% 123 Missing ⚠️
modelopt/torch/utils/perf.py 20.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #555      +/-   ##
==========================================
- Coverage   74.02%   73.58%   -0.45%     
==========================================
  Files         192      192              
  Lines       19664    19812     +148     
==========================================
+ Hits        14557    14578      +21     
- Misses       5107     5234     +127     

☔ View full report in Codecov by Sentry.

@sugunav14 sugunav14 self-assigned this Nov 15, 2025
@sugunav14 sugunav14 marked this pull request as ready for review November 17, 2025 23:59
gt=0.0,
le=1.0,
title="Percentage damping factor.",
description="The percentage of average Hessian diagonal used for damping.",
Collaborator

if you have a reference from the original paper about what these are, could you also share the link too?

batch_size = input.shape[0]

# Incremental averaging: scale down old hessian
hessian *= n_samples / (n_samples + batch_size)
Collaborator

what's the dtype of hessian? Do we need to up cast to fp32 for this division?

Contributor Author

hessian is defaulted to fp32 during initialization. For the division part the result is float.
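Putting the thread above together, a self-contained sketch of the incremental Hessian update, with the accumulator kept in fp32 (the signature is inferred from the snippets in this PR, not the merged source):

```python
import torch


def update_hessian(x: torch.Tensor, hessian: torch.Tensor, n_samples: int):
    """Running average of H ~ (2/n) * X @ X.T over calibration batches (sketch)."""
    batch_size = x.shape[0]
    # Flatten batch/sequence dims so columns are tokens: (features, tokens), fp32
    x = x.reshape(-1, x.shape[-1]).t().to(torch.float32)
    # Incremental averaging: scale down the old Hessian by n / (n + b)
    hessian = hessian * (n_samples / (n_samples + batch_size))
    n_samples += batch_size
    # Add the new contribution H += (2/n) * X @ X.T via a sqrt-scaled X
    x = x * (2.0 / n_samples) ** 0.5
    return hessian + x @ x.t(), n_samples
```

A quick sanity check of the averaging weights: feeding the same batch twice leaves the estimate unchanged.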

hessian, n_samples = update_hessian(input[0], state["hessian"], state["n_samples"])
hessian_state[module.name] = {"hessian": hessian, "n_samples": n_samples}
torch.cuda.empty_cache()
gc.collect()
Collaborator

do we have to do gc.collect() here? It's going to be very slow


# Phase 1: Collect statistics for quantizers
enable_stats_collection(model)
max_calibrate(model, forward_loop)
Collaborator

do you need forward_loop here? Is this for weight amax calib only?

state = hessian_state[module.name]
hessian = state["hessian"].to(module.weight.device)
blockwise_weight_update(module, hessian, block_size, percdamp)
torch.cuda.empty_cache()
Collaborator

maybe you can del the hessian after applying blockwise_weight_update?

hessian_state_path: str | None = ModeloptField(
default=None,
title="Path to the Hessian state file.",
description="The path to the Hessian state file.",
Collaborator

Maybe state: if the path exists, we load the hessian from the path instead of re-computing them.

Comment on lines +1119 to +1120
GPTQ lite does not perform sequential quantization of layers. This means that the updated
activations are not used to process the next layer.
Contributor

Can you estimate how much effort is needed if we need to add this constraint? I am thinking if we can have a quick test to see what's the accuracy impact.

Contributor Author

This will be addressed in a followup PR

Comment on lines +1135 to +1139
block_size: int | None = ModeloptField(
default=128,
title="Block size for GPTQ weight update.",
description="The block size for GPTQ weight update.",
)
Contributor

This should be the multiple of block_size used in quantization. We should explain it in the description as well.

gt=0.0,
le=1.0,
title="Percentage damping factor.",
description="The percentage of average Hessian diagonal used for damping.",
Contributor

Could you also add some instructions here, so users can know what's the impact of increasing/decreasing this parameter?
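As a rough guide to the parameter's effect: GPTQ-style damping adds percdamp × mean(diag(H)) to the Hessian diagonal before inversion, so larger values improve numerical robustness but weaken the error correction, while smaller values track the Hessian more closely. A hedged sketch of this step, with the identity fallback discussed later in this review (not the merged code):

```python
import torch


def prepare_hessian_inverse(h: torch.Tensor, percdamp: float = 0.01) -> torch.Tensor:
    """Upper Cholesky factor of the damped inverse Hessian (sketch)."""
    h = h.clone().to(torch.float32)
    # Damping: add percdamp * mean(diag(H)) to the diagonal
    idx = torch.arange(h.shape[0], device=h.device)
    h[idx, idx] += percdamp * torch.mean(torch.diag(h))
    try:
        h_inv = torch.cholesky_inverse(torch.linalg.cholesky(h))
        return torch.linalg.cholesky(h_inv, upper=True)
    except (RuntimeError, torch.linalg.LinAlgError):
        # Fallback: use the identity when H is not positive definite
        return torch.eye(h.shape[0], dtype=h.dtype, device=h.device)
```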

tensor_mapping = {}
for name, module in model.named_modules():
if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
in_features = module.weight.shape[1]
Contributor

Can we use module.weight.shape[-1] instead incase of 3D weight?

Comment on lines +1269 to +1272
for name, module in model.named_modules():
if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
module.input_quantizer.reset_amax()
module.output_quantizer.reset_amax()
Contributor

Do you know how much accuracy is impacted if we don't recalibrate the input quantizer??

@sugunav14 sugunav14 requested a review from a team as a code owner January 14, 2026 00:47
@coderabbitai
Contributor

coderabbitai bot commented Jan 17, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This pull request introduces the GPTQ-Lite quantization algorithm, comprising a new configuration class, calibration mode descriptor, Hessian-based weight update utilities, GPU memory monitoring, and comprehensive tests. The implementation enables post-training quantization using blockwise weight updates with damped Hessian information.

Changes

Cohort / File(s) Summary
Configuration and Mode Setup
modelopt/torch/quantization/config.py, modelopt/torch/quantization/mode.py
Added GPTQLiteConfig class with fields for damping, block size, and Hessian state persistence. Expanded QuantizeQuantCfgType to support nested lists and dicts of quantizer configs. Added GPTQLiteModeDescriptor to register the GPTQ-Lite calibration path.
Core GPTQ Implementation
modelopt/torch/quantization/model_calib.py
Introduced six new functions: print_relative_mse_error (logs Hessian-weighted MSE), update_hessian (incremental Hessian calculation), prepare_hessian_inverse (damped inverse with dead-neuron handling), quantize_block (error-propagated block quantization), blockwise_weight_update (orchestrates per-layer updates), and gptq_lite (main calibration pipeline with Hessian state management and progress logging).
GPU Memory Utility
modelopt/torch/utils/perf.py
Added get_gpu_mem_fraction() function to compute and export current GPU memory utilization ratio.
Tests
tests/gpu/torch/quantization/test_gptq.py
Added three parameterized test functions: test_update_hessian (shape and accumulation correctness), test_gptq_updates (blockwise quantization behavior with random and uniform weights), and test_gptq_e2e_flow (end-to-end quantization with multiple quantization configurations and generation validation).

Sequence Diagram

sequenceDiagram
    participant Model as Model
    participant ForwardLoop as ForwardLoop
    participant HessianCollector as Hessian<br/>Collector
    participant HessianPrep as Hessian Inverse<br/>Prep
    participant BlockQuantizer as Block<br/>Quantizer
    participant UpdatedModel as Updated<br/>Model

    ForwardLoop->>HessianCollector: Input samples
    HessianCollector->>HessianCollector: update_hessian(input)
    HessianCollector->>HessianPrep: Accumulated Hessian
    HessianPrep->>HessianPrep: prepare_hessian_inverse<br/>(damping, fallback)
    HessianPrep->>BlockQuantizer: Damped H_inv
    Model->>BlockQuantizer: Quantized weights
    BlockQuantizer->>BlockQuantizer: quantize_block<br/>(error propagation)
    BlockQuantizer->>UpdatedModel: Updated weights
    UpdatedModel-->>Model: In-place update
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main feature being added: GPTQ Lite implementation, which matches the PR's primary objective of adding support for a 'gptq_lite' quantization algorithm.
Docstring Coverage ✅ Passed Docstring coverage is 89.47% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.




Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/config.py`:
- Around line 1235-1240: Fix the typo in the description for the ModeloptField
named hessian_state_path in config.py: change "insteaed" to "instead" inside the
multi-line description string so the text reads "If hessian path exists, we load
from hessian file instead of recomputing them." This touches the
hessian_state_path field declaration using ModeloptField.

In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1302-1303: The call to print_relative_mse_error uses module.name
which may not exist and can raise AttributeError when blockwise_weight_update is
invoked directly; update blockwise_weight_update to derive a safe name (e.g.,
use getattr(module, "name", None) or fallback to module.__class__.__name__) and
pass that safe name into print_relative_mse_error (or only call
print_relative_mse_error when a name exists) so the function no longer assumes
module.name is always set; reference blockwise_weight_update,
print_relative_mse_error, and gptq_lite when making this change.
- Around line 1245-1265: The loop repeatedly calls quantizer(full_weight)
causing N full-tensor quantizations; move the quantization out of the inner
group loop or quantize only the required slice: either compute quantized_full =
quantizer(full_weight) once before iterating group_start (if quantization does
not depend on error propagation), or replace quantizer(full_weight) inside the
loop with quantizer(full_weight[:, block_start:block_end]) /
quantizer(block_weight) to only quantize the columns used for quantized_cols;
update uses of quantized_full/quantized_cols accordingly and ensure
block_weight/full_weight updates still reflect the chosen quantization strategy.
- Around line 1196-1203: The indexing into Hessian h using zero_cols is broken
because torch.nonzero returns a 2D tensor for a 1D mask; change how zero_cols is
computed so it is a 1D index tensor usable for advanced indexing (e.g., use
torch.nonzero(mask, as_tuple=True)[0] or torch.where(mask)[0] or .view(-1) on
the result). Update the line that builds zero_cols (which currently uses
torch.nonzero(weight.eq(0).all(dim=0))) so subsequent operations h[zero_cols,
:], h[:, zero_cols], and h[zero_cols, zero_cols] receive a 1D index tensor.

In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 118-121: The test_gptq_updates setup creates a Linear with shape
(1, dim) but assigns a model_weight of shape (16,16); fix by constructing the
torch.nn.Linear with out_features and in_features that match model_weight (use
model_weight.shape[0] for out_features and model_weight.shape[1] or dim for
in_features) so the model.weight.data assignment is shape-compatible; update the
instantiation of the model in test_gptq_updates and keep the subsequent
.to("cuda") and original_weight clone logic unchanged.
- Around line 176-178: The test is mutating the shared configuration dict
(quant_cfg) by setting quant_cfg["algorithm"] = "gptq_lite", which can alter
imported constants like mtq.NVFP4_DEFAULT_CFG; instead, make a local copy (e.g.,
via copy.deepcopy(quant_cfg) or dict(quant_cfg)) and modify that copy before use
(refer to the quant_cfg variable near model.eval()), so the original shared
constant is never changed.
🧹 Nitpick comments (5)
modelopt/torch/quantization/model_calib.py (4)

1158-1180: Potential numerical issue with Hessian update when n_samples=0.

When n_samples=0 initially, the scaling factor n_samples / (n_samples + batch_size) equals 0, which is correct. However, the comment on line 1174 says "H += (2/n_samples) * X @ X^T" but the actual formula uses sqrt(2 / n_samples) after incrementing n_samples. This is correct for the running average formula, but the comment is misleading.

Also, when starting fresh, if the initial n_samples is 0 and batch_size is small, the scaling sqrt(2 / n_samples) after increment could be large. This is mathematically correct for the incremental averaging scheme.

📝 Suggested documentation fix
-    # Compute outer product: H += (2/n_samples) * X @ X^T
-    # where X is the flattened input reshaped to (features, batch*seq)
+    # Compute outer product contribution using running average:
+    # H += (sqrt(2/n_samples) * X) @ (sqrt(2/n_samples) * X)^T
+    # where X is the flattened input reshaped to (features, batch*seq)

1336-1343: Hardcoded memory threshold (0.65) should be configurable or documented.

The threshold 0.65 for deciding CPU vs GPU storage is a magic number that may not be optimal for all use cases. Consider making this configurable or at least documenting the rationale.

📝 Suggested improvement
+# Memory threshold for offloading Hessians to CPU (65% GPU usage)
+_HESSIAN_GPU_MEM_THRESHOLD = 0.65
+
 def initialize_hessian_state(tensor_mapping):
     """Initialize hessian state with zeros."""
     for name, (shape, device) in tensor_mapping.items():
         # Use CPU if GPU memory is tight
-        target_device = "cpu" if get_gpu_mem_fraction(device) > 0.65 else device
+        target_device = "cpu" if get_gpu_mem_fraction(device) > _HESSIAN_GPU_MEM_THRESHOLD else device

1424-1430: Duplicate exception handling for save_hessian_state.

The save_hessian_state function already has its own try-except block (lines 1366-1378) that handles exceptions and prints messages. Wrapping the call in another try-except (lines 1426-1430) results in duplicate error handling and potentially confusing double-printed messages.

♻️ Proposed fix
     # Save if configured
     if save_hessians:
-        try:
-            save_hessian_state(hessian_state_path)
-        except Exception as e:
-            print_rank_0(f"Error saving hessian state: {e}")
-            print_rank_0("Continuing execution...")
+        save_hessian_state(hessian_state_path)

1446-1448: Calling torch.cuda.empty_cache() in a loop can degrade performance.

torch.cuda.empty_cache() is expensive and calling it on every iteration can significantly slow down quantization. Consider moving it outside the loop or calling it periodically.

♻️ Proposed optimization
     # Perform blockwise weight updates
     for name, module in tqdm(quantized_modules, desc="Quantizing layers"):
         state = hessian_state[module.name]
         hessian = state["hessian"].to(module.weight.device)
         blockwise_weight_update(module, hessian, block_size, percdamp)
         # Delete hessian state to free memory
         del hessian_state[module.name]
-        torch.cuda.empty_cache()
+
+    # Clear CUDA cache once after all layers are processed
+    torch.cuda.empty_cache()

Alternatively, if memory is critical, call it every N iterations:

if (idx + 1) % 10 == 0:
    torch.cuda.empty_cache()
tests/gpu/torch/quantization/test_gptq.py (1)

191-198: Consider using logging instead of print statements in tests.

Print statements in tests can clutter test output. Consider using pytest's capsys fixture or proper logging if output verification is needed, or remove them if they're only for debugging.

Comment on lines +1245 to +1265
for group_start in range(0, block_size, group_size):
group_end = min(group_start + group_size, block_size)
group_cols = slice(group_start, group_end)
# Get current column and its Hessian inverse diagonal
weight_col = block_weight[:, group_cols]
hinv_diag = torch.diag(block_hinv[group_cols, group_cols])

# Quantize using the full weight, then extract the columns we need
quantized_full = quantizer(full_weight)
quantized_cols = quantized_full[:, block_start + group_start : block_start + group_end]
quantized_block[:, group_cols] = quantized_cols

# Compute quantization error and loss
error = (weight_col - quantized_cols) / hinv_diag
losses[:, group_cols] = (weight_col - quantized_cols) ** 2 / (hinv_diag**2) / 2
errors[:, group_cols] = error

# Propagate error to remaining columns in block
block_weight[:, group_start:] -= error @ block_hinv[group_start:group_end, group_start:]
full_weight[:, block_start:block_end] = block_weight

Contributor

⚠️ Potential issue | 🟠 Major

Performance concern: Repeated full-weight quantization inside loop.

The quantizer(full_weight) call at line 1253 is inside the loop and quantizes the entire weight tensor for every column. For a weight matrix with N columns, this results in N full quantizations when only one column's quantized values are used per iteration. This is computationally expensive.

💡 Suggested optimization approach

Consider either:

  1. Quantizing the full weight once before the loop if the quantizer doesn't depend on the error-propagated weights
  2. Or if error propagation affects quantization, quantize only the needed columns per iteration
+    # Quantize full weight once if error propagation doesn't affect quantization results
+    # (Note: This may need adjustment based on quantizer behavior)
     for group_start in range(0, block_size, group_size):
         group_end = min(group_start + group_size, block_size)
         group_cols = slice(group_start, group_end)
         # Get current column and its Hessian inverse diagonal
         weight_col = block_weight[:, group_cols]
         hinv_diag = torch.diag(block_hinv[group_cols, group_cols])
 
-        # Quantize using the full weight, then extract the columns we need
-        quantized_full = quantizer(full_weight)
-        quantized_cols = quantized_full[:, block_start + group_start : block_start + group_end]
+        # Quantize only the current columns (if quantizer supports it)
+        # or cache full quantization outside loop
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/model_calib.py` around lines 1245 - 1265, The
loop repeatedly calls quantizer(full_weight) causing N full-tensor
quantizations; move the quantization out of the inner group loop or quantize
only the required slice: either compute quantized_full = quantizer(full_weight)
once before iterating group_start (if quantization does not depend on error
propagation), or replace quantizer(full_weight) inside the loop with
quantizer(full_weight[:, block_start:block_end]) / quantizer(block_weight) to
only quantize the columns used for quantized_cols; update uses of
quantized_full/quantized_cols accordingly and ensure block_weight/full_weight
updates still reflect the chosen quantization strategy.
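In isolation, the error-propagation loop from the diff above can be sketched with a toy rounding quantizer (shapes and the `group_size` handling are simplified; the real code quantizes through the module's weight_quantizer):

```python
import torch


def quantize_block(weight: torch.Tensor, hinv: torch.Tensor, quantizer, group_size: int = 1):
    """Columnwise GPTQ update for one block (sketch): quantize a group of
    columns, then fold the normalized error into the remaining columns."""
    w = weight.clone()
    q = torch.zeros_like(w)
    for start in range(0, w.shape[1], group_size):
        end = min(start + group_size, w.shape[1])
        hinv_diag = torch.diag(hinv[start:end, start:end])
        q[:, start:end] = quantizer(w[:, start:end])
        # Quantization error, normalized by the Hessian-inverse diagonal
        err = (w[:, start:end] - q[:, start:end]) / hinv_diag
        # Propagate into the not-yet-quantized columns
        w[:, start:] -= err @ hinv[start:end, start:]
    return q


# Toy usage: an identity Hessian factor means no cross-column propagation,
# so the result reduces to plain elementwise rounding
w = torch.tensor([[0.2, 1.7], [2.4, -0.6]])
q = quantize_block(w, torch.eye(2), torch.round)
```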

hinv_diag = torch.diag(block_hinv[group_cols, group_cols])

# Quantize using the full weight, then extract the columns we need
quantized_full = quantizer(full_weight)
Contributor

so, for nvfp4, are global scales and local scales dynamically computed, and could they change over the iterations?

Contributor Author

yeah they could, but recently support was added for static scales so we don't have to compute it dynamically if we use that config.

How does a user know to use static or dynamic scales with NVFP4? Is this selection automated somewhere?

Contributor Author

I think we can add it as a note somewhere for the user to ensure they use static scales instead of dynamic scales for GPTQ. This is usually specified in the config like this

NVFP4_WEIGHT_ACT_MSE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "static", "scale_bits": (4, 3)},
            "axis": None,
            "enable": False,
        },
        "*input_quantizer": {
            "enable": False,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": "gptq_lite",
}

h = torch.cholesky_inverse(torch.linalg.cholesky(h))
h_inv = torch.linalg.cholesky(h, upper=True)
except (RuntimeError, torch.linalg.LinAlgError):
print("Warning: Hessian is not positive definite, using identity matrix")
Collaborator

have we seen this before? Should we throw an exception?

Contributor Author

Yeah, I have seen this issue pop up a couple of times while running tests. The official implementation does a similar handling of the exception too.

Contributor

@sugunav14 I remember GPTQ paper adding a small diagonal term to make the hessian invertible.

Contributor Author

@realAsma Do you mean the percdamp factor?

I would just keep the implementation very close to the original!

Contributor Author

Right now I haven't really made any changes to the original implementation except for a log for the user. The original implementation also has this fallback option

}


def get_gpu_mem_fraction(device="cuda:0"):
Collaborator

nit: get_used_gpu_mem_fraction?
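Whatever the final name, the helper can be sketched with `torch.cuda.mem_get_info` (the CPU fallback returning 0.0 is an illustrative assumption, not part of this PR):

```python
import torch


def get_gpu_mem_fraction(device: str = "cuda:0") -> float:
    """Fraction of the device's total memory currently in use (sketch)."""
    if not torch.cuda.is_available():
        return 0.0  # no GPU visible: report nothing in use
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return 1.0 - free_bytes / total_bytes
```

In this PR the returned fraction gates Hessian placement: above a 0.65 threshold, Hessian states are kept on CPU instead of the GPU.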

Collaborator

@cjluo-nv cjluo-nv left a comment

Overall LGTM. I have not reviewed the algorithm details. Maybe @meenchen and @realAsma can help double checking.

Contributor

@realAsma realAsma left a comment

Should we remove the memory optimizations? In my understanding they won't be needed after we implement the sequential per-block GPTQ; we can keep the code a bit cleaner without them.
I am not particular about this though.

Contributor

@realAsma realAsma left a comment

Looks great!

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 enabled auto-merge (squash) January 28, 2026 05:22
@sugunav14 sugunav14 merged commit 02c5f29 into main Jan 30, 2026
40 of 42 checks passed
@sugunav14 sugunav14 deleted the svelury/gptq-lite branch January 30, 2026 02:15
danielkorzekwa pushed a commit that referenced this pull request Feb 17, 2026
## What does this PR do?

**Type of change:** New feature <!-- Use one of the following: Bug fix,
new feature, new example, new tests, documentation. -->

**Overview:** Adds support for GPTQ algorithm. This PR implements a
modified version of the official GPTQ algorithm; the key difference is
that updated activations from each layer are not used for hessian
computation

## Usage
<!-- You can potentially add a usage example below. -->
Modify "algorithm" field in quant_cfg to "gptq_lite".

Note: Does not currently work with AWQ

```python
# Add a code snippet demonstrating how to use this
```

## Testing
<!-- Mention how have you tested your change if applicable. -->

- [x] Added unit tests to test helper functions + e2e flow
- [x] Perplexity and GPQA results

| Model       | Qformat               | Perplexity wikitext2 | GPQA |
|-------------|------------------------|------------------------|------|
| Qwen3-8B | INT4 weight only (modelopt + amax/7) no GPTQ | 10.75 | n/a
|
| Qwen3-8B | INT4 weight only (modelopt + amax/7) | **10.56** | 0.388 |
| Qwen3-8B | INT4 weight only + FP-Quant hessians + amax/7.5 | 10.25 |
**0.449** |
| Qwen3-8B | INT4 weight only (FP-Quant) | 10.24 | 0.46 |
| Qwen3-8B | NVFP4 static weight only | 10.25 | n/a |
| Qwen3-8B | NVFP4 static weight only no GPTQ | 10.25 | n/a |
| Qwen3-0.6B | NVFP4 static weight only | **22.75** | n/a |
| Qwen3-0.6B    | NVFP4 dynamic weight only |  23.50          | n/a   |
| Qwen3-0.6B | NVFP4 static weight only with FP-Quant hessians | 22.0 |
n/a |
| Qwen3-0.6B | NVFP4 static weight only no GPTQ | 24.25 | n/a |

Conclusions from results
- Perplexity matches or improves with the ModelOpt implementation, though the improvement is smaller than with FP-Quant.
- GPQA shows no improvement with ModelOpt, but does improve with FP-Quant.
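
The wikitext2 numbers above are standard perplexity, i.e. the exponential of the mean token-level negative log-likelihood. A minimal numpy helper (not part of this PR, shown only to make the metric concrete):

```python
import numpy as np

def perplexity(logits, targets):
    """Perplexity = exp(mean token NLL).

    logits: (T, V) unnormalized scores; targets: (T,) token ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)          # stable softmax
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))
```

Sanity check: uniform logits over a vocabulary of size V give perplexity exactly V.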



## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->

- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes <!--- If No, explain why.
-->
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No
- **Did you update
[Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**:
No <!--- Only for new features, API changes, critical bug fixes or bw
breaking changes. -->

## Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* GPTQ Lite quantization mode now available for efficient model
calibration
  * GPU memory usage monitoring utility added
* Quantization configuration extended to support complex nested
structures and lists

* **Tests**
  * Comprehensive test coverage added for GPTQ quantization workflows


<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
danielkorzekwa pushed a commit that referenced this pull request Mar 4, 2026
@coderabbitai coderabbitai bot mentioned this pull request Mar 21, 2026
@coderabbitai coderabbitai bot mentioned this pull request Mar 28, 2026