This PR has been inactive for 10 days and is now marked as stale.
bugbot run
Cursor Bugbot has reviewed your changes and found 4 potential issues.
sharpenb
left a comment
The overall structure is clear.
- I did not check whether the detailed functions could be factorized differently for more compact code.
gsprochette
left a comment
Super nice feature! I have basically no comments on the content, but I left comments on the form. I left close to no comments on the form of benchmark_config since it is imported; I get the value of keeping it as is.
I do have 4 questions/suggestions about the general structure of the code:
- ray is not declared as a dependency; should it be in an extra, e.g. vllm? Could we import it inside import_algorithm_packages by isolating everything except the `MoeKernelTuner(PrunaAlgorithmBase)` class in a utils.py and importing that only in the import_algorithm_packages method?
- Should the tuning be done again if the model is loaded on a setup with a different triton version? If so, we can use the reapply saving function and check at the beginning of apply whether the artifact already exists and matches the setup, in which case we skip the tuning, but still tune otherwise.
- The _apply method is very long. Below are some suggestions for splitting/simplifying it to make it more readable and also more type-friendly.
- The moe_kernel_tuner.py file is very long, the utils split in question 1. would also make this lighter, WDYT?
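The lazy-import suggestion in the first bullet could be sketched roughly as follows; aside from `ray` itself, all names here (the helper, the class, the error message, the `pruna[vllm]` extra) are illustrative stand-ins, not the actual pruna API:

```python
import importlib
from types import SimpleNamespace


def lazy_import(module_name: str):
    """Resolve an optional heavy dependency only when the algorithm runs,
    so the package stays importable without the extra installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for MoE kernel tuning; "
            f"install the corresponding extra (e.g. pip install pruna[vllm])."
        ) from err


class MoeKernelTuner:  # hypothetical stand-in for the real class
    def import_algorithm_packages(self) -> SimpleNamespace:
        # ray (and any tuning utils built on it) is imported here,
        # not at module load time.
        return SimpleNamespace(ray=lazy_import("ray"))
```

With this shape, everything ray-dependent can live in a utils.py that is itself only imported inside import_algorithm_packages, keeping the top-level module import cheap and dependency-free.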
For (i), in the _apply method, I think the code should be made clearer. Currently most of the logic is a series of if..else blocks checking whether we are in the general is_moe_lm case or the HunyuanImage3ForCausalMM exception, and extracting the hyperparameters: nb_experts, topk, intermediate_size, hidden_size, shard_intermediate_size.
I think it would be clearer if:
- in (i): check for HunyuanImage3ForCausalMM -> call _extract_hunyuan_dimensions, whose output is nb_experts, shard_intermediate_size, hidden_size and topk; in the general case, call _extract_transformers_moe_dimensions, which has the same output
- in each of these functions, get the config and make an actual typing check so we know the attributes exist. The docstring of these functions, or the comment in (i) where these values are collected, can explain what the different variables represent in the MoE operations
This PR has been inactive for 10 days and is now marked as stale.
gsprochette
left a comment
It's a lot clearer to me now. I just have minor style refinements, plus I realized two of my earlier comments were wrong, so we need to correct those changes.
I agree with you about the test comment: if we can easily add some form of test adapted to this, then let's do it; if not, let's just leave it the way it is currently.
Thanks a lot for all the work, and sorry about the review delay 🙃
gsprochette
left a comment
Looks perfect! I left some comments, mostly just praising the changes, plus a couple in the tests for clarity; nothing blocking :) Thanks a lot for all the work :)
Description
This PR is inspired by the vLLM benchmarks (the benchmark_config fn is copied from here) and enables one to tune the MoE (triton) kernel used in vLLM.
This new algorithm MoeKernelTuner does not modify the model. It generates a tuned configuration that is saved in:
- kernels (the lib will make use of the optimized config);
- moe_kernel_tuned_configs in the model directory (to be later re-used without waiting for tuning, when loading the model with pruna).

The core modifications are in: … ray for parallelization; (iv) the best configurations are saved in the hf and vllm caches (so that after smashing, the hf cache and vllm cache are already populated with optimal configs that the user can use), and in the pruna cache (similar to what we do with save_before_apply).

Related Issue
Fixes #(issue number)
Type of Change
How Has This Been Tested?
Checklist
Additional Notes
A notebook for testing with vLLM is available here. On H100 with qwen3Coder-30B, latency goes from 6.43 ms (before tuning) to 5.83 ms (after tuning) while using vllm.
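At its core, this kind of kernel tuning reduces to a grid search over candidate configurations, timing each and keeping the fastest. The sketch below mirrors that shape with a toy latency model standing in for a real triton kernel benchmark; the config keys echo the style of vLLM's MoE kernel configs, but the values and the cost function are purely illustrative:

```python
import itertools

# Hypothetical search space in the style of MoE triton kernel configs
# (block sizes, warp counts); values here are illustrative only.
SEARCH_SPACE = [
    {"BLOCK_SIZE_M": m, "BLOCK_SIZE_N": n, "num_warps": w}
    for m, n, w in itertools.product([16, 64], [64, 128], [4, 8])
]


def tune(benchmark_fn, search_space):
    """Return the config with the lowest measured latency.

    benchmark_fn(config) -> latency in seconds. In the real algorithm
    this launches the triton MoE kernel with the given config and times
    it (the PR parallelizes these measurements with ray)."""
    return min(search_space, key=benchmark_fn)


def fake_latency(cfg):
    # Toy cost model: pretend BLOCK_SIZE_M=64, BLOCK_SIZE_N=128 and few
    # warps are optimal, so the search has a unique minimum.
    return (abs(cfg["BLOCK_SIZE_M"] - 64)
            + abs(cfg["BLOCK_SIZE_N"] - 128)
            + 0.1 * cfg["num_warps"])


best = tune(fake_latency, SEARCH_SPACE)
```

The winning config is then what gets serialized into the kernels / model-directory / vllm caches described above, so later loads skip the search entirely.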