Conversation
This PR has been inactive for 10 days and is now marked as stale.
sharpenb left a comment
This is functional code, so it is approved. Some comments that would be important to consider before merging:
- Simplify/Clarify the interface in this PR.
- Test in this PR the compatibility of the newly introduced configs with the compatible algorithms in HQQ.
- Add a step to the roadmap for refactoring the quantization code, aiming to homogenize the quantization UX across quantizers and quantization features (e.g. save/load), making it easier to implement custom quants.
When is it important for the user to bypass AutoHQQHFModel?
Could we maybe find a way to avoid having this additional hyperparameter?
In the context of vllm, we need to bypass AutoHQQ and use AutoModel (because the former produces a qmodel.pt while vllm needs .safetensors)
I didn't find a clean way to avoid the additional hyperparameter. If you have any idea I will be happy to add it :)
In general, do you remember why we still have this try...except with AutoHQQ and AutoModel in hqq.py? Why didn't we have, from the beginning, a hyperparameter to let the user select which type of implementation they want to use?
I re-ran locally all unit tests related to hqq, and they are valid (the 2 introduced hyperparameters have default values that yield exactly the previous behavior) :)
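The try...except versus explicit-flag pattern discussed above can be sketched as follows. Every function name here is a hypothetical stand-in for illustration, not pruna's actual API: the old behavior silently falls back between the two loading paths, while a default_to_hf-style flag lets the user force the 'transformers' path explicitly.

```python
# Hedged sketch; all names below are hypothetical stand-ins, not pruna's API.

def load_with_autohqq(path):
    # stand-in for the AutoHQQHFModel path (expects a qmodel.pt layout)
    raise FileNotFoundError(f"{path}: no qmodel.pt found")

def load_with_automodel(path):
    # stand-in for the transformers AutoModel path (.safetensors layout,
    # the structure vllm expects)
    return f"hf-model:{path}"

def load_quantized(path, default_to_hf=False):
    if default_to_hf:
        # new: explicit user choice, no silent fallback
        return load_with_automodel(path)
    # old behavior, kept as the default: try AutoHQQ first, then fall back
    try:
        return load_with_autohqq(path)
    except Exception:
        return load_with_automodel(path)

print(load_quantized("ckpt", default_to_hf=True))  # hf-model:ckpt
print(load_quantized("ckpt"))                      # falls back: hf-model:ckpt
```

Since the flag defaults to False, the sketch keeps the exact previous fallback behavior unless the user opts in, which matches the claim that the default values reproduce the previous code.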
bugbot run
Force-pushed c7bdc35 to 2129f8c
Force-pushed 3bc5fd7 to 00ec609
- return model
+ smashed_model = pipeline.working_model if hasattr(pipeline, "working_model") else pipeline
+ move_to_device(smashed_model, smash_config["device"])
+ return smashed_model
Bug: Quantized Model Retrieval Fails Post-Context Exit
After the ModelContext exits, the code attempts to retrieve the quantized model via pipeline.working_model. However, the context manager's __exit__ method reassigns the working_model to a specific pipeline attribute (e.g., transformer, unet) and then deletes pipeline.working_model. This causes hasattr(pipeline, "working_model") to return False, resulting in the unquantized pipeline object being returned instead of the quantized model.
Furthermore, moving the move_to_device call outside the context manager, after safe_memory_cleanup() (which includes torch.cuda.empty_cache()) in __exit__, can lead to device inconsistencies. The original approach of returning the model object was correct as the context manager automatically updates it and handles device placement.
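The failure mode can be reproduced with a toy sketch. The class and attribute names here are illustrative assumptions, not pruna's actual implementation; the point is only that a context manager which writes working_model back to its real attribute and then deletes the handle makes the hasattr check fail after exit.

```python
# Toy reproduction; names are illustrative, not pruna's real classes.
class ModelContext:
    def __init__(self, pipeline, attr="transformer"):
        self.pipeline = pipeline
        self.attr = attr

    def __enter__(self):
        # expose the inner model for in-place modification
        self.pipeline.working_model = getattr(self.pipeline, self.attr)
        return self.pipeline

    def __exit__(self, exc_type, exc, tb):
        # reassign the (possibly quantized) model to its real attribute...
        setattr(self.pipeline, self.attr, self.pipeline.working_model)
        # ...then delete the temporary handle
        del self.pipeline.working_model
        return False

class Pipeline:
    transformer = "original"

pipeline = Pipeline()
with ModelContext(pipeline):
    pipeline.working_model = "quantized"

# After exit, working_model no longer exists, so the fallback branch
# returns the whole (unquantized-looking) pipeline object:
result = pipeline.working_model if hasattr(pipeline, "working_model") else pipeline
assert result is pipeline                    # not the quantized model
assert pipeline.transformer == "quantized"   # the quantized model lives here
```

This is why returning the model object and letting the context manager handle reassignment and device placement avoids the bug.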
Description
This PR is the little sister of this PR in pruna_pro.
This PR in pruna adds some flexibility around hqq quantization and saving:
- patch_for_inference is added to allow the user to quantize in 4 bits without patching the layers (because patching can make them un-loadable);
- default_to_hf allows the user to select a specific implementation of hqq quantization (previously only a try...except was done, and no freedom was given to the user). This is important in particular because vllm expects a model that has been saved with the 'transformers' structure, not the AutoHQQModel one.
Related Issue
Fixes #(issue number)
Type of Change
How Has This Been Tested?
Unit tests pass locally.
Checklist
Additional Notes