
feat: new quantizer for vllm #239

Merged
llcnt merged 6 commits into main from feat/quantizer_vllm
Aug 13, 2025

Conversation

Collaborator

@llcnt llcnt commented Jul 8, 2025

Description

This PR is the little sister of this PR in pruna_pro.
This PR in pruna adds some flexibility around hqq quantization and saving:

  • a new parameter patch_for_inference is added to allow the user to quantize in 4 bits without patching the layers (because patching can make them un-loadable);
  • a new parameter default_to_hf allows the user to select a specific implementation of hqq quantization (previously only a try:...except was done, and no freedom was given to the user). This is important in particular because vllm expects a model that has been saved with the 'transformers' structure, not the AutoHQQModel one.
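The two flags above boil down to explicit dispatch in place of the old try:...except. A minimal sketch of that selection logic; the helper names (select_hqq_backend, should_patch_layers) are illustrative, not pruna's actual API:

```python
# Hypothetical sketch of the dispatch implied by the two new hyperparameters.
# Only the flag names (patch_for_inference, default_to_hf) come from the PR.

def select_hqq_backend(default_to_hf: bool) -> str:
    """Pick the quantization implementation explicitly instead of try/except.

    vllm expects a checkpoint saved with the 'transformers' structure,
    so default_to_hf=True forces the HF-native path.
    """
    return "transformers" if default_to_hf else "auto_hqq"


def should_patch_layers(patch_for_inference: bool) -> bool:
    """Patching 4-bit layers can make them un-loadable, so it is now opt-in."""
    return patch_for_inference
```

With both flags left at their defaults, the behavior is intended to match the previous code exactly.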

Related Issue

Fixes #(issue number)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Unit tests pass locally.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

@llcnt llcnt marked this pull request as ready for review July 8, 2025 17:19
@llcnt llcnt requested review from johnrachwan123 and sharpenb July 8, 2025 17:19
@github-actions

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Jul 19, 2025
Member

@sharpenb sharpenb left a comment

This is functional code so it is approved. Some comments that would be important to consider before merging:

  • Simplify/Clarify the interface in this PR.
  • Test in this PR the compatibility of the newly introduced configs with the compatible algorithms in HQQ.
  • Add a step in the roadmap corresponding to a refactoring of the quantization code, aiming at homogenizing the quant UX between quantizers and the quant features (e.g. save/load), making the implementation of custom quants easier.

Comment thread src/pruna/algorithms/quantization/hqq.py Outdated
Member

When is it important for the user to bypass AutoHQQHFModel?

Member

Could we maybe find a way to avoid having this additional hyperparameter?

Collaborator Author

In the context of vllm, we need to bypass AutoHQQ and use AutoModel (because the former produces a qmodel.pt file while vllm needs .safetensors).
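A small sketch of the save-format distinction being discussed; expected_artifact is a hypothetical helper, and only the file names (qmodel.pt from hqq's AutoHQQ path, model.safetensors from transformers' save_pretrained) are taken from the thread:

```python
# Illustrative only: contrasts the two serialization formats.
# Neither function below is part of pruna, hqq, or transformers.

def expected_artifact(backend: str) -> str:
    """Return the weight file each save path would typically produce."""
    if backend == "auto_hqq":
        # hqq's AutoHQQ saving serializes weights into a single qmodel.pt,
        # which vllm cannot load
        return "qmodel.pt"
    # transformers' save_pretrained writes *.safetensors files,
    # which is the layout vllm expects
    return "model.safetensors"
```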

Collaborator Author

I didn't find a clean way to avoid the additional hyperparameter. If you have any idea I will be happy to add it :)
In general, do you remember why we still have this try...except with AutoHQQ and AutoModel in hqq.py? Why didn't we have from the beginning a hyperparameter to let the user select which implementation they want to use?

Comment thread src/pruna/algorithms/quantization/hqq.py Outdated
Comment thread src/pruna/algorithms/quantization/hqq.py Outdated
@github-actions github-actions bot removed the stale label Jul 21, 2025
@llcnt
Collaborator Author

llcnt commented Jul 21, 2025

This is functional code so it is approved. Some comments that would be important to consider before merging:

* Simplify/Clarify the interface in this PR

* Test in this PR the compatibility of the newly introduced configs with the compatible algorithms in HQQ.

* Add a step in the roadmap corresponding to a refactoring of the quantization code, aiming at homogenizing the quant UX between quantizers and the quant features (e.g. save/load), making the implementation of custom quants easier.

I re-ran locally all unit tests related to hqq, and they are valid (the two hyperparameters introduced have default values that yield exactly the previous behavior) :)

@github-actions

github-actions bot commented Aug 1, 2025

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Aug 1, 2025
@johnrachwan123
Member

bugbot run

cursor[bot]

This comment was marked as outdated.

Member

@johnrachwan123 johnrachwan123 left a comment

LGTM

@github-actions github-actions bot removed the stale label Aug 6, 2025
@llcnt llcnt requested a review from sharpenb August 6, 2025 15:36
Member

@sharpenb sharpenb left a comment

Reapproving

@llcnt llcnt force-pushed the feat/quantizer_vllm branch from c7bdc35 to 2129f8c on August 8, 2025
cursor[bot]

This comment was marked as outdated.

@llcnt llcnt force-pushed the feat/quantizer_vllm branch from 3bc5fd7 to 00ec609 on August 13, 2025
```diff
- return model
+ smashed_model = pipeline.working_model if hasattr(pipeline, "working_model") else pipeline
+ move_to_device(smashed_model, smash_config["device"])
+ return smashed_model
```

Bug: Quantized Model Retrieval Fails Post-Context Exit

After the ModelContext exits, the code attempts to retrieve the quantized model via pipeline.working_model. However, the context manager's __exit__ method reassigns the working_model to a specific pipeline attribute (e.g., transformer, unet) and then deletes pipeline.working_model. This causes hasattr(pipeline, "working_model") to return False, resulting in the unquantized pipeline object being returned instead of the quantized model.

Furthermore, moving the move_to_device call outside the context manager, after safe_memory_cleanup() (which includes torch.cuda.empty_cache()) in __exit__, can lead to device inconsistencies. The original approach of returning the model object was correct as the context manager automatically updates it and handles device placement.
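The pitfall described above can be reproduced with a toy context manager; ModelContext and Pipeline here are stand-ins for pruna's classes, kept only to show why the hasattr check misses after exit:

```python
# Minimal repro of the bug: working_model is deleted in __exit__,
# so looking it up afterwards falls back to the un-quantized pipeline.

class Pipeline:
    """Toy stand-in for a diffusers-style pipeline object."""
    pass


class ModelContext:
    """Toy stand-in for the real context manager described in the report."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def __enter__(self):
        # Inside the context, the (to-be-)quantized model is exposed here.
        self.pipeline.working_model = "quantized-model"
        return self.pipeline

    def __exit__(self, *exc):
        # __exit__ reassigns working_model to a concrete attribute
        # (e.g. transformer, unet) and deletes the temporary handle.
        self.pipeline.transformer = self.pipeline.working_model
        del self.pipeline.working_model


pipeline = Pipeline()
with ModelContext(pipeline):
    pass

# After exit, hasattr(...) is False, so this returns the pipeline object
# instead of the quantized model -- exactly the reported bug.
result = pipeline.working_model if hasattr(pipeline, "working_model") else pipeline
```

This is why returning the model object updated by the context manager, rather than re-fetching working_model after exit, was the safer pattern.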


@llcnt llcnt merged commit 751c0c8 into main Aug 13, 2025
7 checks passed