
feat: moe kernel tuning #482

Merged: llcnt merged 35 commits into main from feat/moe_kernel_tuning on Mar 18, 2026
Conversation

@llcnt (Collaborator) commented Dec 23, 2025

Description

This PR is inspired by vLLM's benchmarks (the benchmark_config function is copied from there) and enables tuning of the Triton MoE kernel used in vLLM.
The new MoeKernelTuner algorithm does not modify the model. It generates a tuned configuration that is saved in:

  • the vLLM configs folder (so that subsequent use of the model on the same GPU makes vLLM pick up the optimized config);
  • the RedHatAI kernel folder in the HF cache (so that the MoE kernels from the kernels library use the optimized config);
  • a moe_kernel_tuned_configs folder in the model directory (so it can be re-used later, without re-tuning, when loading the model with pruna).
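For context, vLLM keys its fused-MoE config files by kernel shape and device (roughly E=&lt;experts&gt;,N=&lt;shard intermediate size&gt;,device_name=&lt;gpu&gt;.json). The helper below is a hypothetical sketch of fanning one tuned config out to several cache directories; the function name and exact file-name format are assumptions, not code from this PR:

```python
import json
from pathlib import Path

def save_tuned_config(best_configs: dict, num_experts: int,
                      shard_intermediate_size: int, device_name: str,
                      target_dirs: list) -> list:
    """Write one tuned MoE config (best kernel params per batch size M)
    to every target cache directory, using a vLLM-style file name."""
    # vLLM-style naming: E = number of experts, N = shard intermediate size.
    filename = (f"E={num_experts},N={shard_intermediate_size},"
                f"device_name={device_name}.json")
    written = []
    for cache_dir in target_dirs:
        cache_dir = Path(cache_dir)
        cache_dir.mkdir(parents=True, exist_ok=True)
        path = cache_dir / filename
        path.write_text(json.dumps(best_configs, indent=2))
        written.append(path)
    return written
```

Writing the same file to the vLLM cache, the HF kernels cache, and the model directory is what lets each consumer find the config without re-tuning.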

The core modifications are in:

  • the new moe_kernel_tuner.py file: (i) it does not modify the model, so it is compatible with every other algorithm before/after; (ii) the user can select the dtypes as well as the size of the parameter grid search; (iii) the kernel is tuned for batch sizes (i.e. the input dimension M) from 1 to 8192, using Ray for parallelization; (iv) the best configurations are saved in the HF and vLLM caches (so that after smashing, both caches are already populated with optimal configs ready for use) and in the pruna cache (similar to what we do with save_before_apply);
  • the save_artifacts.py file (we move the tuned config from the pruna cache to the save path);
  • the load_artifacts.py file (which re-saves the tuned config into the vLLM/HF caches when loading a smashed model).
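The tuning described above is, at its core, a grid search per batch size M: benchmark every candidate kernel config and keep the fastest. A dependency-free sketch of that loop (the real implementation uses vLLM's benchmark_config and Ray; benchmark_fn here is a stand-in timing callback, and all names are illustrative):

```python
def tune_moe_kernel(batch_sizes, candidate_configs, benchmark_fn):
    """For each batch size M, time every candidate config and keep the best.

    benchmark_fn(m, config) -> latency in ms (stand-in for vLLM's
    benchmark_config). Returns a dict mapping M -> best config, which is
    the shape of the JSON that ends up in the caches.
    """
    best = {}
    for m in batch_sizes:
        # Time each candidate config at this batch size.
        timings = {i: benchmark_fn(m, cfg)
                   for i, cfg in enumerate(candidate_configs)}
        best_idx = min(timings, key=timings.get)
        best[m] = candidate_configs[best_idx]
    return best
```

In the PR, each batch size's benchmark runs as a Ray task so the sweep over M from 1 to 8192 parallelizes across workers.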

Related Issue

Fixes #(issue number)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

A notebook for testing with vLLM is available here. On an H100 with Qwen3-Coder-30B, latency drops from 6.43 ms (before tuning) to 5.83 ms (after tuning) when using vLLM.

@llcnt llcnt force-pushed the feat/moe_kernel_tuning branch from 78c6657 to 5764274 Compare December 23, 2025 17:02
github-actions bot commented Jan 6, 2026

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Jan 6, 2026
@llcnt llcnt removed the stale label Jan 7, 2026
@llcnt llcnt marked this pull request as ready for review January 7, 2026 14:21
Comment thread src/pruna/engine/model_checks.py Outdated
Comment thread src/pruna/engine/load.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/config/smash_config.py Outdated
@github-actions

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Jan 18, 2026
@llcnt llcnt mentioned this pull request Jan 19, 2026
10 tasks
@github-actions github-actions bot closed this Jan 25, 2026
@llcnt llcnt removed the stale label Jan 26, 2026
@llcnt llcnt reopened this Jan 26, 2026
@llcnt llcnt force-pushed the feat/moe_kernel_tuning branch 3 times, most recently from 89b9bca to e779dbb Compare February 9, 2026 15:09
@llcnt (Collaborator, Author) commented Feb 9, 2026

bugbot run


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.


Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/engine/save_artifacts.py Outdated
@sharpenb (Member) left a comment


The overall structure is clear.

  • I did not check whether the detailed functions could be factorized differently for more compact code.

Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py
@gsprochette (Collaborator) left a comment


Super nice feature! I have basically no comments on the content, but left some on the form. I left close to no comments on the form of benchmark_config since it is imported; I get the value of keeping it as is.

I do have 4 questions/suggestions about the general structure of the code:

  1. ray is not declared as a dependency; should it be in an extra, e.g. vllm? Could we import it inside import_algorithm_packages by isolating everything except the MoeKernelTuner(PrunaAlgorithmBase) class in a utils.py, and importing that only in the import_algorithm_packages method?
  2. Should the tuning be done again if the model is loaded on a setup with a different Triton version? If so, we can use the reapply saving function and check at the beginning of apply whether the artifact already exists and matches the setup, in which case we skip the tuning, but still tune otherwise.
  3. The _apply method is very long. Below are some suggestions for splitting/simplifying it to make it more readable and also more type-friendly.
  4. The moe_kernel_tuner.py file is very long; the utils split from question 1 would also make this lighter, WDYT?

Regarding (3): in the _apply method, I think the code should be made clearer. Currently most of the logic is a series of if..else blocks checking whether we are in the general is_moe_lm case or the HunyuanImage3ForCausalMM exception, and extracting the hyperparameters nb_experts, topk, intermediate_size, hidden_size, and shard_intermediate_size.
I think it would be clearer if:

  • the check for HunyuanImage3ForCausalMM called _extract_hunyuan_dimensions, whose output is nb_experts, shard_intermediate_size, hidden_size, and topk, and the general case called _extract_transformers_moe_dimensions with the same output;
  • each of these functions fetched the config and performed an actual typing check, so we know the attributes exist. The docstrings of these functions (or a comment at the call site) can explain what these variables represent in the MoE operations.

Comment thread tests/algorithms/testers/moe_kernel_tuner.py
Comment thread pyproject.toml Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/engine/load_artifacts.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
@llcnt llcnt requested a review from gsprochette February 18, 2026 11:10
github-actions bot commented Mar 1, 2026

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Mar 1, 2026
@gsprochette gsprochette removed the stale label Mar 2, 2026
@llcnt llcnt force-pushed the feat/moe_kernel_tuning branch from f684372 to f5217fc Compare March 17, 2026 10:29
@gsprochette (Collaborator) left a comment


It's a lot clearer to me now. I just have minor style refinements, and two of my earlier comments turned out to be wrong, so we need to correct those changes.

I agree with you about the test comment: if we can easily add some form of test adapted to this, let's do it; if not, let's just leave it the way it is.

Thanks a lot for all the work, and sorry about the review delay 🙃

Comment thread tests/algorithms/testers/moe_kernel_tuner.py
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py Outdated
Comment thread src/pruna/algorithms/moe_kernel_tuner.py
Comment thread src/pruna/algorithms/utils/moe_kernel_tuner.py Outdated
Comment thread src/pruna/engine/load_artifacts.py
@llcnt llcnt requested a review from gsprochette March 17, 2026 18:16
@gsprochette (Collaborator) left a comment


Looks perfect! I left some comments, mostly just liking the changes, plus a couple in the tests for clarity; nothing blocking :) Thanks a lot for all the work :)

Comment thread src/pruna/algorithms/moe_kernel_tuner.py
Comment thread src/pruna/algorithms/moe_kernel_tuner.py
Comment thread src/pruna/algorithms/reduce_noe.py
Comment thread tests/algorithms/testers/base_tester.py
Comment thread tests/algorithms/testers/moe_kernel_tuner.py Outdated
Comment thread tests/algorithms/testers/moe_kernel_tuner.py Outdated
Comment thread tests/algorithms/testers/moe_kernel_tuner.py
@llcnt llcnt merged commit 45968e6 into main Mar 18, 2026
7 checks passed