Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
> Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable in the CodeRabbit settings.
📝 Walkthrough

Replaces the GPTQ "lite" calibration with a full GPTQ implementation.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant CLI as CLI / Caller
    participant Calib as gptq
    participant ForwardLoop as ForwardLoop
    participant Model as Model
    participant Quant as Quantizers
    participant Handle as GPTQHandle
    participant Logger as Logger
    CLI->>Calib: gptq(model, forward_loop, percdamp, block_size)
    Calib->>Calib: max_calibrate(model)
    Calib->>Quant: promote eligible quantizers to NVFP4StaticQuantizer
    Calib->>Model: patch module forwards for Hessian collection
    loop batches
        ForwardLoop->>Model: run forward batch
        Model->>Quant: quantize/dequantize activations
        Quant-->>Calib: provide activations
        Calib->>Handle: update_hessian(flattened_inputs)
    end
    Calib->>Handle: compute damped inverse Hessian per-block
    Handle->>Handle: apply blockwise weight updates
    Handle->>Logger: log per-module metrics and timing
    Calib-->>CLI: return (model weights updated)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 warning, 1 inconclusive)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Codecov Report

❌ Patch coverage. Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #853   +/-   ##
==========================================
- Coverage   70.18%   70.18%   -0.01%
==========================================
  Files         228      228
  Lines       25952    25961       +9
==========================================
+ Hits        18215    18221       +6
- Misses       7737     7740       +3
```

☔ View full report in Codecov by Sentry.
```python
@torch.no_grad()
def sequential_calibrate(
```
@Fridah-nv reference for sequential calibration
cc @realAsma
b5f6073 to a8540d4 (compare)

566c9ae to bdbeb03 (compare)

0aa088c to 3f2d7c0 (compare)
Actionable comments posted: 7
🧹 Nitpick comments (3)
tests/gpu/torch/quantization/test_gptq.py (1)
186-199: Use a GPTQ/static preset in this round-trip test.

`NVFP4_DEFAULT_CFG` leaves the weight quantizer dynamic, so `_export_quantized_weight()` stays on the existing `_amax` path instead of the new `NVFP4StaticQuantizer.global_amax` flow that official GPTQ relies on. This gives you a generic NVFP4 export check, but not coverage of the new feature path. As per coding guidelines, "All test coverage checks in PRs must pass for new features and examples".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/gpu/torch/quantization/test_gptq.py` around lines 186-199: Replace the dynamic NVFP4 preset with the GPTQ/static NVFP4 preset so the quantizer follows the `NVFP4StaticQuantizer.global_amax` export path used by GPTQ; specifically change the `quant_cfg` assignment (currently `mtq.NVFP4_DEFAULT_CFG`) to the static/GPTQ preset (e.g., `mtq.NVFP4_GPTQ_STATIC_CFG` or the project's named GPTQ static preset) before calling `mtq.quantize(model, quant_cfg, ...)`, leaving the subsequent weight restore, `update_hessian`, and `blockwise_weight_update` calls intact so the test exercises the new `_export_quantized_weight` `global_amax` flow.

modelopt/torch/quantization/model_calib.py (2)
2154: Consider cleaner conditional synchronization.

The inline conditional `torch.cuda.synchronize() if torch.cuda.is_available() else None` works, but the `else None` is a no-op that may confuse readers.

♻️ Suggested cleaner pattern

```diff
- torch.cuda.synchronize() if torch.cuda.is_available() else None
+ if torch.cuda.is_available():
+     torch.cuda.synchronize()
```

Apply the same pattern on line 2171.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` at line 2154, Replace the inline conditional expression "torch.cuda.synchronize() if torch.cuda.is_available() else None" with a clear conditional block using "if torch.cuda.is_available(): torch.cuda.synchronize()" to remove the no-op else and improve readability; update both occurrences of the inline form (the one shown and the other occurrence later in the file) so all cuda synchronizations use the explicit if-check around torch.cuda.synchronize().
2106-2107: Potential attribute collision on `module.name`.

Assigning `module.name = name` could overwrite an existing `name` attribute if the module class defines one. Consider using a prefixed attribute name to avoid conflicts.

♻️ Suggested safer attribute name

```diff
- module.name = name  # Attach name for easy access in hooks
+ module._gptq_name = name  # Attach name for easy access in hooks
```

Then update references in the `blockwise_weight_update` call (line 2165) and the internal state key (line 2162).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 2106 - 2107, The code assigns module.name = name which may collide with existing attributes; change to a unique prefixed attribute (e.g. module._quant_name or module._qo_name) wherever the module name is attached, then update all usages that read that attribute—specifically the references inside blockwise_weight_update and the internal state key lookup that currently expect module.name—so they read the new prefixed attribute instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 935-959: The activation-MSE path assumes text inputs and a
tokenizer; guard it so it only runs when a tokenizer is available (and/or when
the calibration flow is text-only) by early-checking tokenizer is not None
before calling ActivationMSELogger.load_raw_text, materialize_raw_text, or
tokenize_raw_text and emit a clear warning/skipped-info when tokenizer is None;
specifically change the block using mse_input_path,
ActivationMSELogger.materialize_raw_text, ActivationMSELogger.tokenize_raw_text,
calib_dataloader, and args.calib_seq to skip MSE creation/processing if
tokenizer is None (and likewise protect the alternate block around lines
976-981) to avoid touching multimodal calib_dataloader outputs meant for
full_model when measuring activation MSE.
- Around line 688-692: The perplexity path currently calls
compute_perplexity(full_model, eval_data) which can fail for
multimodal/container wrappers; change the call to use the extracted language
model instead (compute_perplexity(language_model, eval_data)) when
args.eval_perplexity is true and tokenizer is present, ensuring you reference
the existing symbols compute_perplexity, language_model, full_model, and args so
the LM-specific forward is used for perplexity evaluation.
In `@modelopt/torch/quantization/metrics/activation_mse.py`:
- Line 577: The torch.load call that creates payload (payload = torch.load(path,
map_location="cpu", weights_only=False)) uses weights_only=False which permits
arbitrary code execution; change it to weights_only=True to restrict loading to
tensor/weight objects (i.e., payload = torch.load(path, map_location="cpu",
weights_only=True)), or if you must keep weights_only=False add a brief inline
comment next to this torch.load explaining why the file is trusted (e.g.,
"checkpoint is internally generated and validated") and the explicit reason it
is safe.
In `@modelopt/torch/quantization/metrics/perplexity.py`:
- Around line 19-22: Update the file-level attribution in the perplexity.py
module: replace the minimal "Ported from FP-Quant" docstring with a full
FP-Quant attribution block that includes a commit-pinned source URL (repo +
specific commit hash), the original FP-Quant copyright and license text (as in
their repository), and the required NVIDIA Apache-2.0 header used across this
project; keep the existing short description but prepend the full multi-line
header so the file contains the commit-pinned reference, original
copyright/license metadata, and the NVIDIA Apache-2.0 copyright/license notice.
- Around line 30-58: compute_perplexity currently proceeds on empty data which
leaves nll_running uninitialized as a tensor and later causes an AttributeError;
add an explicit guard at the top of compute_perplexity to detect an empty data
(e.g., if not data or len(data) == 0) and raise a ValueError with a clear
message that the evaluation dataset is empty (mentioning sequence_length/corpus
size if helpful); this ensures immediate, clear failure instead of the
downstream error involving nll_running/tokens_processed.
In `@modelopt/torch/quantization/mode.py`:
- Around line 509-518: The GPTQ mode currently defaults to non-sequential
because GPTQConfig still inherits use_sequential=False; update
GPTQModeDescriptor (and/or GPTQConfig) so GPTQ always runs sequentially: set the
mode's use_sequential to True (or override a property on GPTQModeDescriptor to
return True) or validate in GPTQModeDescriptor.__init__/config_class that
use_sequential is True and raise an error if not; ensure the handler for
_calib_func = gptq cannot be used when use_sequential is False so gptq is only
invoked under sequential_calibrate().
In `@modelopt/torch/quantization/triton/gptq_fused_kernel.py`:
- Around line 165-174: gptq_fused_block currently may mutate the caller's
w_block because calling .contiguous() doesn't copy an already-contiguous tensor;
ensure the wrapper truly clones inputs before the kernel writes to them by
replacing w_block = w_block.contiguous() (and similarly for amax_grouped and
h_inv_cho_blk if they must not be mutated) with explicit copies like w_block =
w_block.clone().contiguous() (and amax_grouped =
amax_grouped.clone().contiguous(), h_inv_cho_blk =
h_inv_cho_blk.clone().contiguous()) so the kernel (_gptq_fused_block_kernel)
operates on clones and only qw_block/err_block are written back, preserving the
non-mutating contract of gptq_fused_block.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8c14b6e0-c7f4-4f64-a587-56ad77719a45
📒 Files selected for processing (10)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/__init__.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/metrics/__init__.py
- modelopt/torch/quantization/metrics/activation_mse.py
- modelopt/torch/quantization/metrics/perplexity.py
- modelopt/torch/quantization/mode.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/triton/gptq_fused_kernel.py
- tests/gpu/torch/quantization/test_gptq.py
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/gpu/torch/quantization/test_gptq.py (1)
280-281: ⚠️ Potential issue | 🟠 Major

Update the e2e test to the renamed GPTQ mode.

Line 281 still selects `gptq_lite`, but this PR wires the calibration path to `gptq`. As written, this test is targeting the removed mode name and will stop covering the new flow.

Suggested fix

```diff
- quant_cfg["algorithm"] = "gptq_lite"
+ quant_cfg["algorithm"] = {"method": "gptq", "use_sequential": True}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/gpu/torch/quantization/test_gptq.py` around lines 280-281: The test sets `quant_cfg["algorithm"]` to the removed mode "gptq_lite"; update the test to use the renamed mode by changing the algorithm value to "gptq" so the calibration path wired to GPTQ is exercised (locate the assignment to `quant_cfg["algorithm"]` in the test_gptq.py snippet and replace "gptq_lite" with "gptq").

modelopt/torch/quantization/config.py (1)
1419-1431: ⚠️ Potential issue | 🟠 Major

Reject `None` and non-positive GPTQ parameters at config-parse time.

`GPTQConfig` accepts `None` for `percdamp` and `block_size` due to the type annotations `float | None` and `int | None`. However, the runtime `gptq` function requires concrete numeric values: `percdamp` is multiplied by a tensor in `_prepare_hessian_inverse()`, and `block_size` is used in `range()` in `_blockwise_update()`. When `model_dump()` is called on the config, it includes `None` values in the kwargs dict (Pydantic does not exclude `None` by default), which override the `gptq` function's default parameters and cause runtime failures.

Suggested fix

```diff
-    percdamp: float | None = ModeloptField(
+    percdamp: float = ModeloptField(
         default=0.01,
         gt=0.0,
         le=1.0,
         title="Percentage damping factor.",
         description="The percentage of average Hessian diagonal used for damping.",
     )
-    block_size: int | None = ModeloptField(
+    block_size: int = ModeloptField(
         default=128,
+        gt=0,
         title="Block size for GPTQ weight update.",
         description="""The block size for GPTQ weight update, which must be a multiple of
         the group_size used in the quantization.""",
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/config.py` around lines 1419 - 1431, GPTQConfig currently allows percdamp: float | None and block_size: int | None which lets None leak into model_dump and override gptq defaults; change both annotations to non-optional (float and int), remove the `| None`, set ModeloptField defaults and validators to enforce gt=0 (e.g., percdamp ModeloptField(default=0.01, gt=0.0, le=1.0) and block_size ModeloptField(default=128, gt=0)), and add a small pydantic validator on GPTQConfig (or use ModeloptField) to raise on None or non-positive values so model_dump never emits None for percdamp or block_size and runtime functions like _prepare_hessian_inverse() and _blockwise_update() always receive concrete positive numbers.
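The fail-fast constraint the comment asks for can be sketched without pydantic. The class below is an illustrative stand-in (not the real `GPTQConfig`/`ModeloptField`): it enforces the same bounds (`0 < percdamp <= 1`, `block_size > 0`, no `None`) at construction time, so bad values surface immediately rather than deep inside the update loops:

```python
from dataclasses import dataclass

@dataclass
class GPTQParams:
    """Illustrative stand-in for GPTQConfig's numeric fields."""
    percdamp: float = 0.01
    block_size: int = 128

    def __post_init__(self):
        # mirror the suggested bounds: 0 < percdamp <= 1, block_size > 0, no None
        if self.percdamp is None or not (0.0 < self.percdamp <= 1.0):
            raise ValueError(f"percdamp must be in (0, 1], got {self.percdamp!r}")
        if self.block_size is None or self.block_size <= 0:
            raise ValueError(f"block_size must be a positive int, got {self.block_size!r}")

GPTQParams()                      # defaults pass
try:
    GPTQParams(block_size=0)      # fails fast at construction, not inside range()
except ValueError as e:
    print(e)
```

With pydantic, the equivalent is the `gt=0` / `le=1.0` field constraints shown in the suggested diff, which also stop `None` from leaking through `model_dump()`.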
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1555-1568: The patched hessian_forward hook can leave
weight_quantizer disabled and the module forward patched if an exception occurs;
wrap the code that disables the quantizer, calls _forward_no_gptq_hessian, and
re-enables the quantizer in a try/finally so weight_quantizer.enable() always
runs, and ensure the hook/installed forward is restored or uninstalled in a
finally block as well (i.e., capture the original forward, install the patched
hessian_forward, call forward inside try, and in finally reassign the original
forward and call weight_quantizer.enable()). Apply the same try/finally pattern
to the analogous hook at the other location referenced (the block around lines
1824-1832) so both hessian_forward and the other patched forward used by
forward_loop always restore state on exceptions.
- Around line 1677-1687: The code must skip processing when batch_size is zero
to avoid the divide-by-zero in math.sqrt(2 / n_samples); in the block around
input_flat, batch_size, hessian, n_samples and scaled_input add a guard that if
batch_size == 0 you do not modify hessian or n_samples (i.e., continue/return
from this iteration) and only perform the incremental averaging and
outer-product update when batch_size > 0 so math.sqrt and the division are never
invoked with zero effective samples.
- Around line 1597-1614: Replace detecting dead columns from weight values with
detection from the Hessian diagonal inside _prepare_hessian_inverse: compute
dead_cols as torch.nonzero(torch.diag(h).eq(0)).unsqueeze(-1) (or equivalent)
using torch.diag(h), then zero out h rows/cols and set h[dead_cols, dead_cols]=1
and also zero the corresponding columns in self.weight so those weight columns
are treated as dead; keep the damping addition using percdamp as before and
ensure all references use the h-based dead column mask rather than
self.weight.eq(0).
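The Hessian-diagonal dead-column handling described in the last item can be sketched with NumPy (illustrative only; the real `_prepare_hessian_inverse` operates on torch tensors, and the helper name below is hypothetical):

```python
import numpy as np

def mask_dead_columns(h, weight, percdamp=0.01):
    """Detect dead columns from the Hessian diagonal, neutralize them, then damp."""
    h = h.copy()
    weight = weight.copy()
    diag = np.diag(h).copy()
    dead = diag == 0                  # columns that never received activations
    idx = np.where(dead)[0]
    h[idx, :] = 0.0
    h[:, idx] = 0.0
    h[idx, idx] = 1.0                 # keep H invertible for the inverse/Cholesky step
    weight[:, idx] = 0.0              # treat those weight columns as dead
    damp = percdamp * diag.mean()     # damping addition using percdamp, as before
    h[np.diag_indices_from(h)] += damp
    return h, weight

h = np.diag([2.0, 0.0, 4.0])          # column 1 saw no activations
w = np.ones((2, 3))
h2, w2 = mask_dead_columns(h, w)      # h2[1, 1] == 1 + damp, w2[:, 1] zeroed
```

The key point of the comment is that deadness is a property of the Hessian diagonal (no activation energy), not of the weight values, so the mask must come from `diag(H)` rather than `weight.eq(0)`.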
---
Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1419-1431: GPTQConfig currently allows percdamp: float | None and
block_size: int | None which lets None leak into model_dump and override gptq
defaults; change both annotations to non-optional (float and int), remove the `|
None`, set ModeloptField defaults and validators to enforce gt=0 (e.g., percdamp
ModeloptField(default=0.01, gt=0.0, le=1.0) and block_size
ModeloptField(default=128, gt=0)), and add a small pydantic validator on
GPTQConfig (or use ModeloptField) to raise on None or non-positive values so
model_dump never emits None for percdamp or block_size and runtime functions
like _prepare_hessian_inverse() and _blockwise_update() always receive concrete
positive numbers.
In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 280-281: The test sets quant_cfg["algorithm"] to the removed mode
"gptq_lite"; update the test to use the renamed mode by changing the algorithm
value to "gptq" so the calibration path wired to GPTQ is exercised (locate the
assignment to quant_cfg["algorithm"] in the test_gptq.py snippet and replace
"gptq_lite" with "gptq").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f42eadc8-d5b1-48e6-a3a1-4f8e887e5c46
📒 Files selected for processing (4)
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/mode.py
- modelopt/torch/quantization/model_calib.py
- tests/gpu/torch/quantization/test_gptq.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/config.py (1)
1369-1382: ⚠️ Potential issue | 🟠 Major

Validate GPTQ `block_size` at config parse time.

`block_size` currently accepts 0/negative values and only fails later in GPTQ update loops. Add a positive bound here so invalid configs fail fast.

Suggested fix

```diff
 class GPTQConfig(QuantizeAlgorithmConfig):
 @@
     block_size: int | None = ModeloptField(
         default=128,
+        gt=0,
         title="Block size for GPTQ weight update.",
         description="""The block size for GPTQ weight update, which must be a multiple of
         the group_size used in the quantization.""",
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/config.py` around lines 1369 - 1382, The block_size ModeloptField in modelopt/torch/quantization/config.py currently allows zero/negative values and should be validated at parse time; update the block_size field (the ModeloptField definition named block_size) to require a positive integer (e.g., add gt=0 or ge=1) so invalid configs fail fast during config parsing rather than later in GPTQ update loops.
♻️ Duplicate comments (2)
modelopt/torch/quantization/model_calib.py (2)
1679-1689: ⚠️ Potential issue | 🟠 Major

Guard zero-token batches before Hessian scaling.

`batch_size == 0` still reaches `math.sqrt(2 / n_samples)` and can divide by zero on empty expert/token batches.

Suggested fix

```diff
 def update_hessian(input, hessian, n_samples):
 @@
     input_flat = input.reshape(-1, input.shape[-1]).t().float()
     batch_size = input_flat.shape[1]
+    if batch_size == 0:
+        return hessian, n_samples
 @@
     hessian *= n_samples / (n_samples + batch_size)
     n_samples += batch_size
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 1679 - 1689, Guard against empty token batches in the Hessian update: check batch_size (computed from input_flat.shape[1]) immediately after flattening and skip the scaling/averaging/Hessian update if batch_size == 0 to avoid computing math.sqrt(2 / n_samples) or dividing by zero; ensure you do not update n_samples or call hessian.add_() for empty batches (i.e., return/continue before the "hessian *= n_samples / (n_samples + batch_size)" and subsequent lines in the Hessian update block that reference n_samples, scaled_input, and hessian).
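A NumPy sketch of the guarded running-average update, mirroring the diff above (names and the `2 * X Xᵀ` accumulation convention are assumptions taken from the surrounding discussion, not the modelopt signatures):

```python
import math
import numpy as np

def update_hessian(x, hessian, n_samples):
    """Running average of 2 * X X^T over token batches; skips empty batches."""
    x_flat = x.reshape(-1, x.shape[-1]).T.astype(np.float64)  # (features, tokens)
    batch = x_flat.shape[1]
    if batch == 0:
        return hessian, n_samples      # guard: avoid sqrt(2 / 0) below
    hessian *= n_samples / (n_samples + batch)
    n_samples += batch
    scaled = math.sqrt(2.0 / n_samples) * x_flat
    hessian += scaled @ scaled.T
    return hessian, n_samples

h = np.zeros((3, 3))
n = 0
h, n = update_hessian(np.ones((2, 4, 3)), h, n)   # 8 tokens accumulated
h, n = update_hessian(np.empty((0, 4, 3)), h, n)  # empty batch: no-op, no crash
```

Without the guard, the empty batch would still rescale `hessian` and hit `sqrt(2 / n_samples)` with unchanged state, and with `n_samples == 0` it divides by zero outright.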
1555-1568: ⚠️ Potential issue | 🟠 Major

Make GPTQ patch lifecycle exception-safe.

If a patched forward or `forward_loop(model)` raises, patched forwards and quantizer state may leak into later execution.

Suggested fix

```diff
 def hessian_forward(self, input, *args, **kwargs):
 @@
     self.weight_quantizer.disable()
-    out = self._forward_no_gptq_hessian(input, *args, **kwargs)
-    self.weight_quantizer.enable()
-    return out
+    try:
+        return self._forward_no_gptq_hessian(input, *args, **kwargs)
+    finally:
+        self.weight_quantizer.enable()
 @@
-    gptq_handles = {name: GPTQHandle(m, name, offload_to_cpu=True) for name, m in quantized_layers}
-    for handle in gptq_handles.values():
-        handle.setup()
-
-    print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
-    forward_loop(model)
-
-    for handle in gptq_handles.values():
-        handle.cleanup()
+    gptq_handles = {name: GPTQHandle(m, name, offload_to_cpu=True) for name, m in quantized_layers}
+    installed_handles = []
+    try:
+        for handle in gptq_handles.values():
+            handle.setup()
+            installed_handles.append(handle)
+
+        print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
+        forward_loop(model)
+    finally:
+        for handle in installed_handles:
+            handle.cleanup()
```

Also applies to: 1817-1825
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 1555 - 1568, The patched hessian_forward must be made exception-safe: wrap the section that updates the hessian and disables the weight quantizer + calls _forward_no_gptq_hessian in a try/finally so weight_quantizer.enable() always runs even if _forward_no_gptq_hessian raises, and in the same finally clear/reset the temporary GPTQ patch state on gptq_handle (e.g., reset gptq_handle.hessian and gptq_handle.n_samples or remove the handle reference) to avoid leaking the patch into later execution; apply the identical try/finally fix to the other occurrence around the same logic (the block at lines 1817-1825).
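The restore-on-exception pattern the prompt asks for can be shown with a minimal stand-in (toy classes, not the modelopt ones):

```python
class ToyQuantizer:
    def __init__(self):
        self.enabled = True
    def disable(self):
        self.enabled = False
    def enable(self):
        self.enabled = True

class ToyModule:
    def __init__(self):
        self.weight_quantizer = ToyQuantizer()
    def _forward_no_hessian(self, x):
        raise RuntimeError("simulated failure during a calibration batch")
    def hessian_forward(self, x):
        self.weight_quantizer.disable()
        try:
            return self._forward_no_hessian(x)
        finally:
            self.weight_quantizer.enable()  # always restored, even on failure

m = ToyModule()
try:
    m.hessian_forward(0)
except RuntimeError:
    pass
print(m.weight_quantizer.enabled)  # True: quantizer state did not leak
```

Without the `try/finally`, the exception would leave `enabled == False`, and every later forward pass would silently skip weight quantization, which is exactly the leak the review flags.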
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 1369-1382: The block_size ModeloptField in
modelopt/torch/quantization/config.py currently allows zero/negative values and
should be validated at parse time; update the block_size field (the
ModeloptField definition named block_size) to require a positive integer (e.g.,
add gt=0 or ge=1) so invalid configs fail fast during config parsing rather than
later in GPTQ update loops.
---
Duplicate comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1679-1689: Guard against empty token batches in the Hessian
update: check batch_size (computed from input_flat.shape[1]) immediately after
flattening and skip the scaling/averaging/Hessian update if batch_size == 0 to
avoid computing math.sqrt(2 / n_samples) or dividing by zero; ensure you do not
update n_samples or call hessian.add_() for empty batches (i.e., return/continue
before the "hessian *= n_samples / (n_samples + batch_size)" and subsequent
lines in the Hessian update block that reference n_samples, scaled_input, and
hessian).
- Around line 1555-1568: The patched hessian_forward must be made
exception-safe: wrap the section that updates the hessian and disables the
weight quantizer + calls _forward_no_gptq_hessian in a try/finally so
weight_quantizer.enable() always runs even if _forward_no_gptq_hessian raises,
and in the same finally clear/reset the temporary GPTQ patch state on
gptq_handle (e.g., reset gptq_handle.hessian and gptq_handle.n_samples or remove
the handle reference) to avoid leaking the patch into later execution; apply the
identical try/finally fix to the other occurrence around the same logic (the
block at lines 1817-1825).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: af4360f3-56a7-42c0-88f1-e0bee76f10d3
📒 Files selected for processing (4)
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/model_quant.py
- tests/gpu/torch/quantization/test_gptq.py
✅ Files skipped from review due to trivial changes (1)
- modelopt/torch/quantization/model_quant.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/gpu/torch/quantization/test_gptq.py
🧹 Nitpick comments (1)
modelopt/torch/quantization/model_calib.py (1)
1762-1795: Performance consideration: full-weight quantization per column.

The blockwise update performs full-matrix quantization (`quantizer(wblk)`) for each column within the inner loop (line 1787). This is O(num_cols × quantizer_cost) per block, which can be expensive for large weight matrices.

This appears intentional per the docstring ("quantizes the full weight matrix via the quantizer and extracts the quantized column") and matches standard GPTQ implementations where each column's quantization depends on the current weight state after error compensation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 1762 - 1795, The inner loop in _blockwise_update repeatedly calls quantizer(wblk) per column which is expensive; move the quantization call out of the per-column loop so you compute qdq = quantizer(wblk) once per block (before the for i in range(n_cols_blk) loop) and then reuse qdq[:, block_start + i] when setting self.weight and computing err; keep the existing wblk updates (addr_) and final self.weight.addmm_ call but ensure qdq is recomputed only when you intentionally need a fresh quantization (e.g., if you change the algorithm to require re-quantization mid-block), referencing symbols _blockwise_update, quantizer, wblk, qdq, errs, addr_, and self.weight to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1762-1795: The inner loop in _blockwise_update repeatedly calls
quantizer(wblk) per column which is expensive; move the quantization call out of
the per-column loop so you compute qdq = quantizer(wblk) once per block (before
the for i in range(n_cols_blk) loop) and then reuse qdq[:, block_start + i] when
setting self.weight and computing err; keep the existing wblk updates (addr_)
and final self.weight.addmm_ call but ensure qdq is recomputed only when you
intentionally need a fresh quantization (e.g., if you change the algorithm to
require re-quantization mid-block), referencing symbols _blockwise_update,
quantizer, wblk, qdq, errs, addr_, and self.weight to locate the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d366fe9b-14ea-48e0-b07e-a37f381e70e1
📒 Files selected for processing (3)
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/model_quant.py
- tests/gpu/torch/quantization/test_gptq.py
💤 Files with no reviewable changes (1)
- modelopt/torch/quantization/model_quant.py
cjluo-nv
left a comment
Review — Efficiency & Correctness
Critical Efficiency Issue
Full-matrix quantization per column in `_blockwise_update` (model_calib.py):

```python
for i in range(n_cols_blk):  # iterates every column in block
    qdq = quantizer(wblk)  # quantizes the ENTIRE weight matrix
    self.weight[:, block_start + i] = qdq[:, block_start + i]  # uses 1 column
```

This quantizes the full weight matrix once per column to extract a single column. For a weight with N columns, that's N full-matrix quantizations — O(N²) vs. the O(N) of the reference GPTQ implementation. I understand this may be needed because block quantizers (NVFP4) can't easily quantize individual columns, but this is likely the dominant bottleneck. Consider caching or at minimum documenting this cost trade-off.
Full weight clone per block (same area):

```python
wblk = self.weight.clone()  # full matrix clone on every block iteration
```

Allocating O(rows × cols) per block. Only the slice `[:, block_start:]` is actually needed.
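The cost gap can be made concrete with a toy comparison. The sketch below uses a simple elementwise round-to-step quantizer, which is an assumption: an NVFP4 block quantizer cannot be applied to an arbitrary slice, which is exactly why the PR quantizes the full matrix. For a sliceable quantizer, the two approaches agree while the call counts diverge:

```python
import numpy as np

calls = {"full": 0, "block": 0}

def quantize(w, key, step=0.25):
    """Toy elementwise quantizer: round to the nearest multiple of step."""
    calls[key] += 1
    return np.round(w / step) * step

w = np.random.default_rng(0).normal(size=(8, 16))
block_start, n_cols_blk = 4, 4

# Current approach: quantize the full matrix once per column -> n_cols_blk calls
out_full = w.copy()
for i in range(n_cols_blk):
    qdq = quantize(out_full, "full")            # quantizes all 8x16 entries
    out_full[:, block_start + i] = qdq[:, block_start + i]

# For a sliceable quantizer, one call on the block slice suffices
out_block = w.copy()
sl = slice(block_start, block_start + n_cols_blk)
out_block[:, sl] = quantize(out_block[:, sl], "block")

assert np.allclose(out_full, out_block)         # identical result here...
print(calls)                                    # ...at 4x the calls: {'full': 4, 'block': 1}
```

For a real block quantizer the two paths would not match near block boundaries (scales are computed over groups), so the design choice in the PR trades compute for respecting the quantizer's group structure.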
Correctness Issues
- `hessian_forward` — no try/finally around quantizer state toggle:

  ```python
  self.weight_quantizer.disable()
  out = self._forward_no_gptq_hessian(input, *args, **kwargs)
  self.weight_quantizer.enable()
  ```

  If `_forward_no_gptq_hessian` raises, the weight quantizer is left permanently disabled. Wrap in `try/finally`.

- `update_hessian` semantic change — the old code counted `n_samples` by batch dimension (`input.shape[0]`); the new code counts by total tokens (batch × seq_len). The new behavior is more correct, but this silently changes Hessian normalization for any other callers of `update_hessian` (it's a public function). The tests are updated, but callers outside this PR may not be.

- Double forward pass — `gptq()` runs `max_calibrate(model, forward_loop)` (one full pass), then `forward_loop(model)` again for Hessians. During the Hessian pass, weight quantizers are disabled so downstream layers see FP weights. The two passes replay the same data — could they be combined into one?

- `_blockwise_update` quantized/FP mixing at block boundaries — `wblk` is cloned from `self.weight` at each block start, so it contains already-quantized values for prior blocks. When the block quantizer's group size spans a GPTQ block boundary, scale factors are computed from a mix of quantized and FP values. This is a known trade-off with block quantizers, but the docstring claims "standard GPTQ approach" — worth noting this is a GPTQ variant.
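The normalization change in the second point above can be illustrated with toy numbers (the `(2/n) · X Xᵀ` convention is assumed from the PR's update rule):

```python
import numpy as np

x = np.ones((2, 3, 4))          # batch=2, seq_len=3, features=4
x_flat = x.reshape(-1, x.shape[-1])

n_old = x.shape[0]              # old semantics: count batches -> 2
n_new = x_flat.shape[0]         # new semantics: count tokens  -> 6

# With H accumulated as the sum of x x^T over tokens, the (2/n)-scaled
# Hessians differ by a factor of seq_len between the two conventions:
h_sum = x_flat.T @ x_flat
h_old = (2 / n_old) * h_sum
h_new = (2 / n_new) * h_sum
print(h_old[0, 0], h_new[0, 0])
```

Token counting gives a true per-token average, but any external caller that fed batch-dim counts into `update_hessian` now gets a Hessian scaled by `seq_len` relative to before, hence the compatibility concern.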
Memory Concern
All `GPTQHelper` instances (and their Hessians) are held simultaneously:

```python
gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}
```

The old code deleted each Hessian after processing. For large models (e.g., 70B), holding all Hessians at once could OOM. Consider freeing incrementally:

```python
for handle in gptq_handles.values():
    handle.update_weights(block_size, percdamp)
    handle.hessian = None  # free after use
```
-
Backward compatibility:
"gptq_lite"method name is deleted with no migration/alias. Existing configs referencing it will break silently. -
Test verbosity (
test_gptq_export_roundtrip): ~15 lines ofprint()for diff stats — consider usingloggingor just including key stats in the assertion message. -
_promote_nvfp4_static_quantizersuses magic tuples(2, 1)and(4, 3)for detection. These should be named constants or reference the quantizer class to avoid silent breakage if encoding changes. -
Commit history: Contains "tested, revert later", "remove later", "debug logs" commits — should be squashed before merge.
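The named-constant suggestion could look like the sketch below. The constant names and the tuple comparison are assumptions for illustration; the real fix should reference the quantizer class rather than hard-coded tuples:

```python
# Hypothetical constants; actual detection should reference the quantizer class.
FP4_E2M1_NUM_BITS = (2, 1)   # (exponent_bits, mantissa_bits) of NVFP4 weights
FP8_E4M3_NUM_BITS = (4, 3)   # format of the per-block scale factors

def looks_like_nvfp4(weight_num_bits, scale_num_bits) -> bool:
    # replaces inline magic tuples with self-documenting names
    return weight_num_bits == FP4_E2M1_NUM_BITS and scale_num_bits == FP8_E4M3_NUM_BITS
```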
```diff
-    method: Literal["gptq_lite"] = ModeloptField("gptq_lite")
+    method: Literal["gptq"] = ModeloptField("gptq")
+    percdamp: float | None = ModeloptField(
```
What happens when percdamp is set to None?
Usually there is no use case for percdamp being None. Thanks for the catch!
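For reference, the standard GPTQ damping that percdamp controls adds a fraction of the mean diagonal to the Hessian before inversion, which keeps the Cholesky factorization numerically stable. A minimal sketch (names are illustrative):

```python
import torch

def damp_hessian(hessian: torch.Tensor, percdamp: float = 0.01) -> torch.Tensor:
    # add percdamp * mean(diag(H)) to the diagonal before inverting H
    damp = percdamp * torch.diag(hessian).mean()
    idx = torch.arange(hessian.shape[0])
    hessian[idx, idx] += damp
    return hessian
```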
    for i in range(n_cols_blk):
        w_ci = wblk[:, block_start + i]
        d = h_inv_cho_blk[i, i]
        qdq = quantizer(wblk)
Will this qdq become a bottleneck, since it is performed #cols times over the whole tensor? Would it work if we only did qdq for the block?
Yes, this is a bottleneck. Right now, going through the TensorQuantizer doesn't allow QDQ'ing a slice of the weight tensor.
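If the quantizer did accept slices, the inner loop could fake-quantize only the current block's columns. A toy illustration with a stand-in quantizer (this is not the TensorQuantizer API; both functions are hypothetical):

```python
import torch

def toy_fake_quantize(w: torch.Tensor, n_levels: int = 16) -> torch.Tensor:
    # stand-in for TensorQuantizer: symmetric round-to-nearest fake quantization
    scale = w.abs().amax() / (n_levels // 2 - 1) + 1e-12
    return torch.round(w / scale) * scale

def qdq_per_block(w: torch.Tensor, block_size: int) -> torch.Tensor:
    out = w.clone()
    for block_start in range(0, w.shape[1], block_size):
        blk = out[:, block_start:block_start + block_size]
        # QDQ only this block instead of the full tensor on every column step
        out[:, block_start:block_start + block_size] = toy_fake_quantize(blk)
    return out
```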
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
cjluo-nv left a comment
Approved with requested fixes
The overall refactor from gptq_lite to a proper GPTQ implementation with GPTQHelper looks good. Clean encapsulation, correct Hessian token-count fix, and good test coverage additions.
Please address the following before merging:
1. wblk = self.weight.clone() inside the block loop (_blockwise_update)
A full weight tensor clone is created at every block iteration. For a 4096×4096 layer with block_size=128, that's 32 full-matrix clones per layer. Consider cloning once before the block loop and tracking which columns are finalized vs. working, or document why the per-block clone is necessary if there's a correctness reason I'm missing.
2. hessian_forward — add try/finally around the original forward call
In setup(), the patched hessian_forward calls self._forward_no_gptq_hessian(input, ...) without a try/finally guard. If that call raises, the Hessian has already been updated with potentially partial/bad data but the exception will propagate without any cleanup of gptq_helper state. A lightweight try/finally (or at minimum a comment explaining why it's safe to skip) would make the code more robust.
Everything else — disabled_weight_quantizers context manager, _promote_nvfp4_static_quantizers, block/group size divisibility check, test updates — LGTM.
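Item 1 could be addressed by cloning once and finalizing columns in place. Below is a sketch of the standard GPTQ block loop under that scheme; it is a simplification of the PR's `_blockwise_update` with an injectable per-column quantizer, so the names and structure are assumptions, not the PR's code:

```python
import torch

def blockwise_update(weight, h_inv_cho, block_size, quantize):
    # clone ONCE; columns left of the current block are already finalized in place
    w = weight.clone()
    cols = w.shape[1]
    for block_start in range(0, cols, block_size):
        block_end = min(block_start + block_size, cols)
        err = torch.zeros(w.shape[0], block_end - block_start)
        for i in range(block_start, block_end):
            d = h_inv_cho[i, i]
            q = quantize(w[:, i])
            e = (w[:, i] - q) / d
            w[:, i] = q
            # propagate the error to the remaining columns inside this block
            w[:, i + 1:block_end] -= torch.outer(e, h_inv_cho[i, i + 1:block_end])
            err[:, i - block_start] = e
        # propagate the accumulated block error to all later columns
        w[:, block_end:] -= err @ h_inv_cho[block_start:block_end, block_end:]
    return w
```

With an identity quantizer the error terms vanish and the weights pass through unchanged, which makes a convenient sanity check.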
```diff
-class GPTQLiteConfig(QuantizeAlgorithmConfig):
-    """The config for GPTQ lite.
+class GPTQConfig(QuantizeAlgorithmConfig):
+    """The config for GPTQ quantization.
```
Should use_sequential be part of the GPTQConfig?
    finally:
        input_getter._unpatch_all_layers()

    with disabled_weight_quantizers(model):
        forward_loop(model)
I'm thinking about why we disable all weight quantizers during Hessian collection. In non-sequential mode, since we apply input quantizers but not weight quantizers, the activations used to build the Hessian will not match the actual inference activations. Does this make sense to you? Any reason we want to keep non-sequential mode?
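For context, the `disabled_weight_quantizers` helper discussed here presumably follows the usual toggle-and-restore pattern. The sketch below is illustrative; the attribute names (`weight_quantizer`, `enabled`) are assumptions, not the actual modelopt API:

```python
from contextlib import contextmanager

@contextmanager
def disabled_weight_quantizers(model):
    # collect weight quantizers that are currently enabled, disable them,
    # and guarantee restoration even if the forward loop raises
    toggled = []
    for module in model.modules():
        q = getattr(module, "weight_quantizer", None)
        if q is not None and getattr(q, "enabled", False):
            q.enabled = False
            toggled.append(q)
    try:
        yield
    finally:
        for q in toggled:
            q.enabled = True
```

The try/finally is what makes this safer than the bare enable/disable pair flagged in the review above.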
What does this PR do?
Implements the official version of GPTQ with a decoder-level sequential calibration flow.
Ref: https://github.com/IST-DASLab/FP-Quant/tree/master
Type of change: New feature
Overview:
Usage
Testing
Measure perplexity and activation MSE on the following models
Before your PR is "Ready for review"
Additional Information