Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #555      +/-   ##
==========================================
- Coverage   74.02%   73.58%   -0.45%
==========================================
  Files         192      192
  Lines       19664    19812     +148
==========================================
+ Hits        14557    14578      +21
- Misses       5107     5234     +127
```

☔ View full report in Codecov by Sentry.
```python
gt=0.0,
le=1.0,
title="Percentage damping factor.",
description="The percentage of average Hessian diagonal used for damping.",
```
If you have a reference from the original paper about what these are, could you also share the link?
```python
batch_size = input.shape[0]

# Incremental averaging: scale down old hessian
hessian *= n_samples / (n_samples + batch_size)
```
what's the dtype of hessian? Do we need to upcast to fp32 for this division?
hessian is defaulted to fp32 during initialization, and the result of the division is float.
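For context, the incremental averaging being discussed can be sketched in NumPy (an illustrative stand-in for the PR's `update_hessian`; the 2D input shape and float64 accumulation here are assumptions, not the actual implementation):

```python
import numpy as np

def update_hessian(x, hessian, n_samples):
    """Running average of H = (2/N) * X^T X over calibration batches.

    x: (batch, features) activation samples for one batch.
    hessian: (features, features) running estimate, kept in a wide dtype
    so the rescaling and accumulation stay numerically safe.
    """
    batch_size = x.shape[0]
    # Incremental averaging: scale down the old hessian so that old and
    # new samples end up equally weighted.
    hessian = hessian * (n_samples / (n_samples + batch_size))
    n_samples += batch_size
    # H += (2/n_samples) * x^T x, folded in as (sqrt(2/n) * x)^T (sqrt(2/n) * x)
    xs = np.sqrt(2.0 / n_samples) * x.astype(np.float64)
    hessian = hessian + xs.T @ xs
    return hessian, n_samples
```

Two incremental updates over batches X1 and X2 give exactly `(2/N) * X^T X` for the concatenated X, which is the invariant the `n_samples / (n_samples + batch_size)` rescaling maintains.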
```python
hessian, n_samples = update_hessian(input[0], state["hessian"], state["n_samples"])
hessian_state[module.name] = {"hessian": hessian, "n_samples": n_samples}
torch.cuda.empty_cache()
gc.collect()
```
do we have to do gc.collect() here? It's going to be very slow
```python
# Phase 1: Collect statistics for quantizers
enable_stats_collection(model)
max_calibrate(model, forward_loop)
```
do you need forward_loop here? Is this for weight amax calib only?
```python
state = hessian_state[module.name]
hessian = state["hessian"].to(module.weight.device)
blockwise_weight_update(module, hessian, block_size, percdamp)
torch.cuda.empty_cache()
```
maybe you can del the hessian after applying blockwise_weight_update?
```python
hessian_state_path: str | None = ModeloptField(
    default=None,
    title="Path to the Hessian state file.",
    description="The path to the Hessian state file.",
```
Maybe state: if the path exists, we load the Hessian from the file instead of recomputing it.
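A pure-Python sketch of the suggested behavior (`load_or_init_state` and its pickle-based persistence are hypothetical helpers for illustration; the real code would presumably use `torch.save`/`torch.load` on tensor state):

```python
import os
import pickle

def load_or_init_state(hessian_state_path, module_names):
    """If a saved Hessian state exists at the given path, load it instead of
    recomputing; otherwise start from an empty per-module state."""
    if hessian_state_path is not None and os.path.exists(hessian_state_path):
        with open(hessian_state_path, "rb") as f:
            return pickle.load(f), True  # loaded: caller can skip collection
    state = {name: {"hessian": None, "n_samples": 0} for name in module_names}
    return state, False

def save_state(hessian_state_path, state):
    """Persist the collected state so a later run can reuse it."""
    with open(hessian_state_path, "wb") as f:
        pickle.dump(state, f)
```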
> GPTQ lite does not perform sequential quantization of layers. This means that the updated activations are not used to process the next layer.
Can you estimate how much effort is needed if we need to add this constraint? I am thinking if we can have a quick test to see what's the accuracy impact.
This will be addressed in a followup PR
```python
block_size: int | None = ModeloptField(
    default=128,
    title="Block size for GPTQ weight update.",
    description="The block size for GPTQ weight update.",
)
```
This should be a multiple of the block_size used in quantization. We should explain that in the description as well.
```python
gt=0.0,
le=1.0,
title="Percentage damping factor.",
description="The percentage of average Hessian diagonal used for damping.",
```
Could you also add some instructions here, so users know the impact of increasing or decreasing this parameter?
```python
tensor_mapping = {}
for name, module in model.named_modules():
    if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
        in_features = module.weight.shape[1]
```
Can we use module.weight.shape[-1] instead, in case of a 3D weight?
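A quick illustration of why `shape[-1]` is the safer accessor (NumPy used here as a stand-in for torch tensors):

```python
import numpy as np

w2d = np.zeros((8, 16))      # standard Linear weight: (out_features, in_features)
w3d = np.zeros((4, 8, 16))   # e.g. a stacked/expert weight with a leading dim

# shape[1] gives the wrong answer for the 3D case; shape[-1] works for both
assert w2d.shape[1] == w2d.shape[-1] == 16
assert w3d.shape[1] == 8 and w3d.shape[-1] == 16
```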
```python
for name, module in model.named_modules():
    if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
        module.input_quantizer.reset_amax()
        module.output_quantizer.reset_amax()
```
Do you know how much accuracy is impacted if we don't recalibrate the input quantizer?
📝 Walkthrough

This pull request introduces the GPTQ-Lite quantization algorithm, comprising a new configuration class, calibration mode descriptor, Hessian-based weight update utilities, GPU memory monitoring, and comprehensive tests. The implementation enables post-training quantization using blockwise weight updates with damped Hessian information.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Model as Model
    participant ForwardLoop as ForwardLoop
    participant HessianCollector as Hessian<br/>Collector
    participant HessianPrep as Hessian Inverse<br/>Prep
    participant BlockQuantizer as Block<br/>Quantizer
    participant UpdatedModel as Updated<br/>Model
    ForwardLoop->>HessianCollector: Input samples
    HessianCollector->>HessianCollector: update_hessian(input)
    HessianCollector->>HessianPrep: Accumulated Hessian
    HessianPrep->>HessianPrep: prepare_hessian_inverse<br/>(damping, fallback)
    HessianPrep->>BlockQuantizer: Damped H_inv
    Model->>BlockQuantizer: Quantized weights
    BlockQuantizer->>BlockQuantizer: quantize_block<br/>(error propagation)
    BlockQuantizer->>UpdatedModel: Updated weights
    UpdatedModel-->>Model: In-place update
```
Estimated code review effort: 🎯 4 (Complex), ⏱️ ~50 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/config.py`:
- Around line 1235-1240: Fix the typo in the description for the ModeloptField
named hessian_state_path in config.py: change "insteaed" to "instead" inside the
multi-line description string so the text reads "If hessian path exists, we load
from hessian file instead of recomputing them." This touches the
hessian_state_path field declaration using ModeloptField.
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1302-1303: The call to print_relative_mse_error uses module.name
which may not exist and can raise AttributeError when blockwise_weight_update is
invoked directly; update blockwise_weight_update to derive a safe name (e.g.,
use getattr(module, "name", None) or fallback to module.__class__.__name__) and
pass that safe name into print_relative_mse_error (or only call
print_relative_mse_error when a name exists) so the function no longer assumes
module.name is always set; reference blockwise_weight_update,
print_relative_mse_error, and gptq_lite when making this change.
- Around line 1245-1265: The loop repeatedly calls quantizer(full_weight)
causing N full-tensor quantizations; move the quantization out of the inner
group loop or quantize only the required slice: either compute quantized_full =
quantizer(full_weight) once before iterating group_start (if quantization does
not depend on error propagation), or replace quantizer(full_weight) inside the
loop with quantizer(full_weight[:, block_start:block_end]) /
quantizer(block_weight) to only quantize the columns used for quantized_cols;
update uses of quantized_full/quantized_cols accordingly and ensure
block_weight/full_weight updates still reflect the chosen quantization strategy.
- Around line 1196-1203: The indexing into Hessian h using zero_cols is broken
because torch.nonzero returns a 2D tensor for a 1D mask; change how zero_cols is
computed so it is a 1D index tensor usable for advanced indexing (e.g., use
torch.nonzero(mask, as_tuple=True)[0] or torch.where(mask)[0] or .view(-1) on
the result). Update the line that builds zero_cols (which currently uses
torch.nonzero(weight.eq(0).all(dim=0))) so subsequent operations h[zero_cols,
:], h[:, zero_cols], and h[zero_cols, zero_cols] receive a 1D index tensor.
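The indexing pitfall can be seen with a NumPy stand-in (`np.nonzero(mask)[0]` plays the role of `torch.nonzero(mask, as_tuple=True)[0]` / `torch.where(mask)[0]`):

```python
import numpy as np

weight = np.array([[1.0, 0.0, 2.0],
                   [3.0, 0.0, 4.0]])
mask = (weight == 0).all(axis=0)   # True for all-zero columns

# np.nonzero on a 1D mask returns a 1-tuple; taking [0] yields the flat
# 1D index array that advanced indexing expects.
zero_cols = np.nonzero(mask)[0]
assert zero_cols.tolist() == [1]

# 1D indices make the row/column zeroing behave as intended
h = np.ones((3, 3))
h[zero_cols, :] = 0.0
h[:, zero_cols] = 0.0
h[zero_cols, zero_cols] = 1.0
```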
In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 118-121: The test_gptq_updates setup creates a Linear with shape
(1, dim) but assigns a model_weight of shape (16,16); fix by constructing the
torch.nn.Linear with out_features and in_features that match model_weight (use
model_weight.shape[0] for out_features and model_weight.shape[1] or dim for
in_features) so the model.weight.data assignment is shape-compatible; update the
instantiation of the model in test_gptq_updates and keep the subsequent
.to("cuda") and original_weight clone logic unchanged.
- Around line 176-178: The test is mutating the shared configuration dict
(quant_cfg) by setting quant_cfg["algorithm"] = "gptq_lite", which can alter
imported constants like mtq.NVFP4_DEFAULT_CFG; instead, make a local copy (e.g.,
via copy.deepcopy(quant_cfg) or dict(quant_cfg)) and modify that copy before use
(refer to the quant_cfg variable near model.eval()), so the original shared
constant is never changed.
🧹 Nitpick comments (5)
modelopt/torch/quantization/model_calib.py (4)
1158-1180: Potential numerical issue with Hessian update when `n_samples=0`.

When `n_samples=0` initially, the scaling factor `n_samples / (n_samples + batch_size)` equals `0`, which is correct. However, the comment on line 1174 says "H += (2/n_samples) * X @ X^T" but the actual formula uses `sqrt(2 / n_samples)` after incrementing `n_samples`. This is correct for the running average formula, but the comment is misleading. Also, when starting fresh, if the initial `n_samples` is 0 and `batch_size` is small, the scaling `sqrt(2 / n_samples)` after increment could be large. This is mathematically correct for the incremental averaging scheme.

📝 Suggested documentation fix

```diff
-    # Compute outer product: H += (2/n_samples) * X @ X^T
-    # where X is the flattened input reshaped to (features, batch*seq)
+    # Compute outer product contribution using running average:
+    # H += (sqrt(2/n_samples) * X) @ (sqrt(2/n_samples) * X)^T
+    # where X is the flattened input reshaped to (features, batch*seq)
```
1336-1343: Hardcoded memory threshold (0.65) should be configurable or documented.

The threshold `0.65` for deciding CPU vs GPU storage is a magic number that may not be optimal for all use cases. Consider making this configurable or at least documenting the rationale.

📝 Suggested improvement

```diff
+# Memory threshold for offloading Hessians to CPU (65% GPU usage)
+_HESSIAN_GPU_MEM_THRESHOLD = 0.65
+
 def initialize_hessian_state(tensor_mapping):
     """Initialize hessian state with zeros."""
     for name, (shape, device) in tensor_mapping.items():
         # Use CPU if GPU memory is tight
-        target_device = "cpu" if get_gpu_mem_fraction(device) > 0.65 else device
+        target_device = "cpu" if get_gpu_mem_fraction(device) > _HESSIAN_GPU_MEM_THRESHOLD else device
```
1424-1430: Duplicate exception handling for `save_hessian_state`.

The `save_hessian_state` function already has its own try-except block (lines 1366-1378) that handles exceptions and prints messages. Wrapping the call in another try-except (lines 1426-1430) results in duplicate error handling and potentially confusing double-printed messages.

♻️ Proposed fix

```diff
     # Save if configured
     if save_hessians:
-        try:
-            save_hessian_state(hessian_state_path)
-        except Exception as e:
-            print_rank_0(f"Error saving hessian state: {e}")
-            print_rank_0("Continuing execution...")
+        save_hessian_state(hessian_state_path)
```
1446-1448: Calling `torch.cuda.empty_cache()` in a loop can degrade performance.

`torch.cuda.empty_cache()` is expensive and calling it on every iteration can significantly slow down quantization. Consider moving it outside the loop or calling it periodically.

♻️ Proposed optimization

```diff
 # Perform blockwise weight updates
 for name, module in tqdm(quantized_modules, desc="Quantizing layers"):
     state = hessian_state[module.name]
     hessian = state["hessian"].to(module.weight.device)
     blockwise_weight_update(module, hessian, block_size, percdamp)
     # Delete hessian state to free memory
     del hessian_state[module.name]
-    torch.cuda.empty_cache()
+
+# Clear CUDA cache once after all layers are processed
+torch.cuda.empty_cache()
```

Alternatively, if memory is critical, call it every N iterations:

```python
if (idx + 1) % 10 == 0:
    torch.cuda.empty_cache()
```

tests/gpu/torch/quantization/test_gptq.py (1)
191-198: Consider using logging instead of print statements in tests.

Print statements in tests can clutter test output. Consider using pytest's `capsys` fixture or proper logging if output verification is needed, or remove them if they're only for debugging.
```python
for group_start in range(0, block_size, group_size):
    group_end = min(group_start + group_size, block_size)
    group_cols = slice(group_start, group_end)
    # Get current column and its Hessian inverse diagonal
    weight_col = block_weight[:, group_cols]
    hinv_diag = torch.diag(block_hinv[group_cols, group_cols])

    # Quantize using the full weight, then extract the columns we need
    quantized_full = quantizer(full_weight)
    quantized_cols = quantized_full[:, block_start + group_start : block_start + group_end]
    quantized_block[:, group_cols] = quantized_cols

    # Compute quantization error and loss
    error = (weight_col - quantized_cols) / hinv_diag
    losses[:, group_cols] = (weight_col - quantized_cols) ** 2 / (hinv_diag**2) / 2
    errors[:, group_cols] = error

    # Propagate error to remaining columns in block
    block_weight[:, group_start:] -= error @ block_hinv[group_start:group_end, group_start:]
    full_weight[:, block_start:block_end] = block_weight
```
Performance concern: Repeated full-weight quantization inside loop.
The quantizer(full_weight) call at line 1253 is inside the loop and quantizes the entire weight tensor for every column. For a weight matrix with N columns, this results in N full quantizations when only one column's quantized values are used per iteration. This is computationally expensive.
💡 Suggested optimization approach
Consider either:
- Quantizing the full weight once before the loop if the quantizer doesn't depend on the error-propagated weights
- Or if error propagation affects quantization, quantize only the needed columns per iteration
```diff
+# Quantize full weight once if error propagation doesn't affect quantization results
+# (Note: This may need adjustment based on quantizer behavior)
 for group_start in range(0, block_size, group_size):
     group_end = min(group_start + group_size, block_size)
     group_cols = slice(group_start, group_end)
     # Get current column and its Hessian inverse diagonal
     weight_col = block_weight[:, group_cols]
     hinv_diag = torch.diag(block_hinv[group_cols, group_cols])
-    # Quantize using the full weight, then extract the columns we need
-    quantized_full = quantizer(full_weight)
-    quantized_cols = quantized_full[:, block_start + group_start : block_start + group_end]
+    # Quantize only the current columns (if quantizer supports it)
+    # or cache full quantization outside loop
```

🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/model_calib.py` around lines 1245 - 1265, The
loop repeatedly calls quantizer(full_weight) causing N full-tensor
quantizations; move the quantization out of the inner group loop or quantize
only the required slice: either compute quantized_full = quantizer(full_weight)
once before iterating group_start (if quantization does not depend on error
propagation), or replace quantizer(full_weight) inside the loop with
quantizer(full_weight[:, block_start:block_end]) / quantizer(block_weight) to
only quantize the columns used for quantized_cols; update uses of
quantized_full/quantized_cols accordingly and ensure block_weight/full_weight
updates still reflect the chosen quantization strategy.
```python
hinv_diag = torch.diag(block_hinv[group_cols, group_cols])

# Quantize using the full weight, then extract the columns we need
quantized_full = quantizer(full_weight)
```
so, for NVFP4, are global scales and local scales dynamically computed, and could they change over the iterations?
yeah they could, but support was recently added for static scales, so we don't have to compute them dynamically if we use that config.
How does a user know to use static or dynamic scales with NVFP4? Is this selection automated somewhere?
I think we can add a note somewhere for the user to ensure they use static scales instead of dynamic scales for GPTQ. This is usually specified in the config like this:

```python
NVFP4_WEIGHT_ACT_MSE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "static", "scale_bits": (4, 3)},
            "axis": None,
            "enable": False,
        },
        "*input_quantizer": {
            "enable": False,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": "gptq_lite",
}
```
```python
try:
    h = torch.cholesky_inverse(torch.linalg.cholesky(h))
    h_inv = torch.linalg.cholesky(h, upper=True)
except (RuntimeError, torch.linalg.LinAlgError):
    print("Warning: Hessian is not positive definite, using identity matrix")
```
have we seen this before? Should we throw an exception?
Yeah, I have seen this issue pop up a couple of times while running tests. The official implementation does a similar handling of the exception too.
@sugunav14 I remember GPTQ paper adding a small diagonal term to make the hessian invertible.
@realAsma Do you mean the percdamp factor?
I would just keep the implementation very close to the original!
Right now I haven't made any changes to the original implementation except a log for the user. The original implementation also has this fallback option.
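For reference, the damping-plus-fallback flow under discussion looks roughly like this (a NumPy sketch of the GPTQ-style `percdamp` damping and identity fallback; not the PR's actual torch code, which uses `torch.cholesky_inverse`):

```python
import numpy as np

def damped_hessian_inverse(h, percdamp=0.01):
    """Add percdamp * mean(diag(H)) to the diagonal (the GPTQ damping term),
    then invert via Cholesky; fall back to identity if H still is not
    positive definite, mirroring the reference implementation's behavior."""
    n = h.shape[0]
    h = h + percdamp * np.mean(np.diag(h)) * np.eye(n)
    try:
        L = np.linalg.cholesky(h)   # raises LinAlgError if not positive definite
        l_inv = np.linalg.inv(L)
        return l_inv.T @ l_inv      # (L L^T)^-1 = L^-T L^-1
    except np.linalg.LinAlgError:
        print("Warning: Hessian is not positive definite, using identity matrix")
        return np.eye(n)
```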
modelopt/torch/utils/perf.py
Outdated
```python
}


def get_gpu_mem_fraction(device="cuda:0"):
```
nit: get_used_gpu_mem_fraction?
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
## What does this PR do?

**Type of change:** New feature

**Overview:** Adds support for the GPTQ algorithm. This PR implements a modified version of the official GPTQ algorithm; the key difference is that updated activations from each layer are not used for hessian computation.

## Usage

Modify the "algorithm" field in quant_cfg to "gptq_lite".
Note: Does not currently work with AWQ.

```python
# Add a code snippet demonstrating how to use this
```

## Testing

- [x] Added unit tests to test helper functions + e2e flow
- [x] Perplexity and GPQA results

| Model | Qformat | Perplexity wikitext2 | GPQA |
|-------------|------------------------|------------------------|------|
| Qwen3-8B | INT4 weight only (modelopt + amax/7) no GPTQ | 10.75 | n/a |
| Qwen3-8B | INT4 weight only (modelopt + amax/7) | **10.56** | 0.388 |
| Qwen3-8B | INT4 weight only + FP-Quant hessians + amax/7.5 | 10.25 | **0.449** |
| Qwen3-8B | INT4 weight only (FP-Quant) | 10.24 | 0.46 |
| Qwen3-8B | NVFP4 static weight only | 10.25 | n/a |
| Qwen3-8B | NVFP4 static weight only no GPTQ | 10.25 | n/a |
| Qwen3-0.6B | NVFP4 static weight only | **22.75** | n/a |
| Qwen3-0.6B | NVFP4 dynamic weight only | 23.50 | n/a |
| Qwen3-0.6B | NVFP4 static weight only with FP-Quant hessians | 22.0 | n/a |
| Qwen3-0.6B | NVFP4 static weight only no GPTQ | 24.25 | n/a |

Conclusions from results

- Perplexity remains the same or shows improvement with the Modelopt implementation. The magnitude of improvement is smaller in modelopt compared to FP-Quant.
- GPQA shows no improvement with modelopt, but shows improvement with FP-Quant.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Summary by CodeRabbit

Release Notes

* **New Features**
  * GPTQ Lite quantization mode now available for efficient model calibration
  * GPU memory usage monitoring utility added
  * Quantization configuration extended to support complex nested structures and lists
* **Tests**
  * Comprehensive test coverage added for GPTQ quantization workflows

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
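The usage snippet in the description is left as a placeholder; a hedged sketch of what enabling the algorithm might look like, as a config fragment (assuming the standard `modelopt.torch.quantization` entry point `mtq.quantize`, an existing preset such as `NVFP4_DEFAULT_CFG`, and a user-supplied `forward_loop` that feeds calibration batches):

```python
import copy

import modelopt.torch.quantization as mtq

# Copy the preset so the shared constant is not mutated
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
quant_cfg["algorithm"] = "gptq_lite"

# forward_loop runs calibration data through the model
model = mtq.quantize(model, quant_cfg, forward_loop)
```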
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>