feat: add janus support for quantization+torch.compile combo(s)#145
Conversation
fdf809e to 1a6c0a5
sharpenb left a comment
Very cool to have it working for Janus! The PR description is clear. Could we add a small benchmark for this model? My main points are about how to improve the code factorization to avoid redundant work in the future.
gsprochette left a comment
Awesome work, can't wait to have this working for Janus in Pruna! I also hope that the HF PR adding generate functions to LlamaGen models lands soon so we can remove the Generator code. For the rest of the code I left a few comments, nothing breaking, only trying to improve already good code and make some double checks here and there :)
1a6c0a5 to 92b69ab
2421573 to 4d8daad
Bug: Model Loading Fails Due to Incorrect `lm_head` Handling
When loading Janus-like models, the load_hqq_model function attempts to add a dummy lm_head. This process introduces several issues:
- The lm_head is created with hardcoded (1024, 1024) dimensions, which may not match the model's actual hidden size, leading to shape mismatches.
- Its randomly initialized weights are incorrectly added to the model's state dictionary as a nested dictionary under the "lm_head" key, instead of flattening its parameters (e.g., "lm_head.weight", "lm_head.bias"). This will cause loading failures.
- The function loads qmodel.pt, modifies its contents with these problematic lm_head weights, and then overwrites the original file in-place. This is an unexpected and potentially harmful side effect for a load operation.
src/pruna/engine/load.py#L359-L365
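A hedged sketch of the fix suggested by the report (pure Python; `add_dummy_lm_head` and the key names are illustrative, not the actual Pruna code): derive the shape from the model config and register flat parameter keys rather than a nested dict.

```python
# Illustrative sketch only -- not the actual load_hqq_model code.
# Two fixes implied by the report: (1) take dimensions from the model
# config instead of hardcoding (1024, 1024), and (2) store the weights
# under flat keys like "lm_head.weight" so state-dict loading works.
def add_dummy_lm_head(state_dict, hidden_size, vocab_size):
    dummy_weight = [[0.0] * hidden_size for _ in range(vocab_size)]
    state_dict["lm_head.weight"] = dummy_weight  # flat key, not a nested dict
    return state_dict
```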
gsprochette left a comment
Looks good to me, there's just this expected_quantized_model_path that could be re-used to make the code a bit cleaner, see comment. If Bertrand is satisfied with your answers this is ready to go for me, thanks for the updates :)
sharpenb left a comment
Thanks for the details! I left some comments. Happy to discuss them if needed :)
4d8daad to a9d3d84
I have edited the main comment, and I am fixing the code (it will appear in the next push) to enable speedups with torch==2.7.
35bb705 to 3341a90
Step 3. and Step 6. prepare inputs in some way. Could we factorize these in one step?
Nice catch! Indeed step 6 only depends on the previous step 3; I merged them and put step 6 into the sub-function "self.prepare_inputs_tokens" :)
sharpenb
left a comment
Thanks for tackling the comments! I left a couple more but I think that it should be good to go then
It is unclear why step 5 prepares logit processors while step 4 already used the user-passed processor. Could we also merge all processor preparation steps?
You are definitely right ;) I merged all processing related to logits_processor into a single sub-function.
(ps: it may be unclear from the function name self.model._get_logits_processor, but this function uses the function from transformers that merges the user-defined logits_processor with the logits_processor defined in the generation_config. I have added a small comment in the function to highlight this.)
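The merging behaviour described above can be sketched in isolation (a minimal illustration of the idea behind merging user-passed processors with config-derived ones; the classes and helper names here are hypothetical, not the transformers API):

```python
# Illustrative sketch: processors built from the generation config are
# combined with user-passed ones into a single list, applied in order.
class TemperatureProcessor:
    def __init__(self, temperature):
        self.temperature = temperature

    def __call__(self, logits):
        return [x / self.temperature for x in logits]


def merge_logits_processors(config_processors, user_processors):
    # config-derived processors first, then the user's
    return list(config_processors) + list(user_processors)


def apply_processors(processors, logits):
    for proc in processors:
        logits = proc(logits)
    return logits
```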
ae2d1f8 to 7e6603f
Description
The goal of this PR is to decrease the memory impact and the latency of the Janus (pro-7b) model.
These models are based on LlamaGen and compute tokens in a latent space in an autoregressive fashion, thanks to an attribute (here called `language_model`) defined as a Llama model. There is currently no standardization regarding LlamaGen AR models, so this PR is exclusively dedicated to Janus models (that are compatible with the `transformers` package). But we expect in the near future (see this thread) that LlamaGen AR models will have similar `.generate()` functions.
The idea of the code change is:
- `JanusGenerator` (similar to the `TransformersGenerator` we already had, and that is renamed `CausalLMGenerator`);
- … a `diffusers` pipeline, but more tricky as there is no (yet) pipeline for Janus);
The above points can be extended to other LLM quantizers. However, testing the code and adapting all save/load functions is time consuming. I leave this work for a future PR.
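As a rough illustration of the autoregressive loop described above (a library-free sketch; `step_fn`, the names, and the greedy decoding are hypothetical simplifications of what a `.generate()` function does, not the actual Janus code):

```python
# Minimal sketch of an autoregressive token loop in a latent token space.
# step_fn is a stand-in for the language_model forward pass: it maps the
# tokens generated so far to logits over the next token.
def generate_tokens(step_fn, bos_token, max_new_tokens):
    tokens = [bos_token]
    for _ in range(max_new_tokens):
        logits = step_fn(tokens)
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        tokens.append(next_tok)
    return tokens
```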
Related Issue
Fixes #(issue number)
Type of Change
How Has This Been Tested?
When quantized with hqq4bits and combined with torch.compile, we can obtain a ~x3 speedup.
I have added 2 unit tests, and provided (below) a notebook for reproducing the results.
Edit: The notebook works well for torch==2.5.1. For torch==2.7, major changes have been introduced in torch dynamo, so we have to slightly adapt the smash_config:

```
smash_config['torch_compile_fullgraph'] = False
smash_config['torch_compile_mode'] = 'default'
smash_config['torch_compile_backend'] = 'inductor'
```

Otherwise the compilation step takes very long and is reapplied at each step with torch==2.7.
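For reference, the adapted settings can be collected in one place (a sketch; the `smash_config` keys are copied from the comment above, and the commented-out call shows what I believe are the equivalent raw `torch.compile` flags):

```python
# torch==2.7-friendly compile settings, as dictated above.
smash_config = {
    "torch_compile_fullgraph": False,   # avoid full-graph capture requirements
    "torch_compile_mode": "default",
    "torch_compile_backend": "inductor",
}

# Presumed equivalent raw call (assuming `model` is a torch.nn.Module):
# compiled = torch.compile(model, fullgraph=False, mode="default", backend="inductor")
```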
Checklist
Additional Notes