
feat: new quantizer for vllm #239

Merged
llcnt merged 6 commits into main from feat/quantizer_vllm
Aug 13, 2025

Conversation

Collaborator

@llcnt llcnt commented Jul 8, 2025

Description

This PR is the little sister of this PR in pruna_pro.
This PR in pruna adds some flexibility around hqq quantization and saving:

  • a new parameter patch_for_inference is added to allow the user to quantize in 4 bits without patching the layers (because patching can make them un-loadable);
  • a new parameter default_to_hf allows the user to select a specific implementation of hqq quantization (previously only a try:...except was done, and no freedom was given to the user). This is important in particular because vllm expects a model that has been saved with the 'transformers' structure, not the AutoHQQModel one.
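The two flags above boil down to explicit dispatch in place of the old try:...except. A minimal sketch of that selection logic; the helper names (select_hqq_backend, should_patch_layers) are illustrative, not pruna's actual API:

```python
# Hypothetical sketch of the dispatch implied by the two new hyperparameters.
# Only the flag names (patch_for_inference, default_to_hf) come from the PR.

def select_hqq_backend(default_to_hf: bool) -> str:
    """Pick the quantization implementation explicitly instead of try/except.

    vllm expects a checkpoint saved with the 'transformers' structure,
    so default_to_hf=True forces the HF-native path.
    """
    return "transformers" if default_to_hf else "auto_hqq"


def should_patch_layers(patch_for_inference: bool) -> bool:
    """Patching 4-bit layers can make them un-loadable, so it is now opt-in."""
    return patch_for_inference
```

With both flags left at their defaults, the behavior is intended to match the previous code exactly.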

Related Issue

Fixes #(issue number)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Unit tests pass locally.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

@llcnt llcnt marked this pull request as ready for review July 8, 2025 17:19
@llcnt llcnt requested review from johnrachwan123 and sharpenb July 8, 2025 17:19
@github-actions

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Jul 19, 2025
Member

@sharpenb sharpenb left a comment

This is functional code so it is approved. Some comments that would be important to consider before merging:

  • Simplify/Clarify the interface in this PR.
  • Test in this PR the compatibility of the newly introduced configs with the compatible algorithms in HQQ.
  • Add a step in the roadmap corresponding to a refactoring of the quantization code, aiming at homogenizing the quant UX between quantizers and the quant features (e.g. save/load), making the implementation of custom quants easier.

Comment thread src/pruna/algorithms/quantization/hqq.py Outdated
Member

When is it important for the user to bypass AutoHQQHFModel?

Member

Could we maybe find a way to avoid having this additional hyperparameter?

Collaborator Author

In the context of vllm, we need to bypass AutoHQQ and use AutoModel (because the former produces a qmodel.pt file while vllm needs .safetensors).
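A small sketch of the save-format distinction being discussed; expected_artifact is a hypothetical helper, and only the file names (qmodel.pt from hqq's AutoHQQ path, model.safetensors from transformers' save_pretrained) are taken from the thread:

```python
# Illustrative only: contrasts the two serialization formats.
# Neither function below is part of pruna, hqq, or transformers.

def expected_artifact(backend: str) -> str:
    """Return the weight file each save path would typically produce."""
    if backend == "auto_hqq":
        # hqq's AutoHQQ saving serializes weights into a single qmodel.pt,
        # which vllm cannot load
        return "qmodel.pt"
    # transformers' save_pretrained writes *.safetensors files,
    # which is the layout vllm expects
    return "model.safetensors"
```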

Collaborator Author

I didn't find a clean way to avoid the additional hyperparameter. If you have any idea I will be happy to add it :)
In general, do you remember why we still have this try...except with AutoHQQ and AutoModel in hqq.py? Why didn't we have from the beginning a hyperparameter to let the user select which implementation they want to use?

Comment thread src/pruna/algorithms/quantization/hqq.py Outdated
Comment thread src/pruna/algorithms/quantization/hqq.py Outdated
@github-actions github-actions bot removed the stale label Jul 21, 2025
@llcnt
Collaborator Author

llcnt commented Jul 21, 2025

This is functional code so it is approved. Some comments that would be important to consider before merging:

* Simplify/Clarify the interface in this PR

* Test in this PR the compatibility of the newly introduced configs with the compatible algorithms in HQQ.

* Add a step in the roadmap corresponding to a refactoring of the quantization code, aiming at homogenizing the quant UX between quantizers and the quant features (e.g. save/load), making the implementation of custom quants easier.

I re-ran locally all unit tests related to hqq, and they are valid (the two hyperparameters introduced have default values that yield exactly the previous behavior) :)

@github-actions

github-actions bot commented Aug 1, 2025

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Aug 1, 2025
@johnrachwan123
Member

bugbot run

cursor[bot]

This comment was marked as outdated.

Member

@johnrachwan123 johnrachwan123 left a comment

LGTM

@github-actions github-actions bot removed the stale label Aug 6, 2025
@llcnt llcnt requested a review from sharpenb August 6, 2025 15:36
Member

@sharpenb sharpenb left a comment

Reapproving

@llcnt llcnt force-pushed the feat/quantizer_vllm branch from c7bdc35 to 2129f8c on August 8, 2025
cursor[bot]

This comment was marked as outdated.

@llcnt llcnt force-pushed the feat/quantizer_vllm branch from 3bc5fd7 to 00ec609 on August 13, 2025
```diff
- return model
+ smashed_model = pipeline.working_model if hasattr(pipeline, "working_model") else pipeline
+ move_to_device(smashed_model, smash_config["device"])
+ return smashed_model
```

Bug: Quantized Model Retrieval Fails Post-Context Exit

After the ModelContext exits, the code attempts to retrieve the quantized model via pipeline.working_model. However, the context manager's __exit__ method reassigns the working_model to a specific pipeline attribute (e.g., transformer, unet) and then deletes pipeline.working_model. This causes hasattr(pipeline, "working_model") to return False, resulting in the unquantized pipeline object being returned instead of the quantized model.

Furthermore, moving the move_to_device call outside the context manager, after safe_memory_cleanup() (which includes torch.cuda.empty_cache()) in __exit__, can lead to device inconsistencies. The original approach of returning the model object was correct as the context manager automatically updates it and handles device placement.
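The pitfall described above can be reproduced with a toy context manager; ModelContext and Pipeline here are stand-ins for pruna's classes, kept only to show why the hasattr check misses after exit:

```python
# Minimal repro of the bug: working_model is deleted in __exit__,
# so looking it up afterwards falls back to the un-quantized pipeline.

class Pipeline:
    """Toy stand-in for a diffusers-style pipeline object."""
    pass


class ModelContext:
    """Toy stand-in for the real context manager described in the report."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def __enter__(self):
        # Inside the context, the (to-be-)quantized model is exposed here.
        self.pipeline.working_model = "quantized-model"
        return self.pipeline

    def __exit__(self, *exc):
        # __exit__ reassigns working_model to a concrete attribute
        # (e.g. transformer, unet) and deletes the temporary handle.
        self.pipeline.transformer = self.pipeline.working_model
        del self.pipeline.working_model


pipeline = Pipeline()
with ModelContext(pipeline):
    pass

# After exit, hasattr(...) is False, so this returns the pipeline object
# instead of the quantized model -- exactly the reported bug.
result = pipeline.working_model if hasattr(pipeline, "working_model") else pipeline
```

This is why returning the model object updated by the context manager, rather than re-fetching working_model after exit, was the safer pattern.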


@llcnt llcnt merged commit 751c0c8 into main Aug 13, 2025
7 checks passed