This PR has been inactive for 10 days and is now marked as stale.
bugbot run
Cursor Bugbot has reviewed your changes and found 4 potential issues.
sharpenb
left a comment
The overall structure is clear.
- I did not check whether the detailed functions could be factorized differently for more compact code.
gsprochette
left a comment
Super nice feature! I have basically no comments on the content, but I left comments on the form. I left close to no comments on the form of benchmark_config since it is imported; I get the value of keeping it as is.
I do have 4 questions/suggestions about the general structure of the code:
- ray is not declared as a dependency; should it be in an extra, e.g. vllm? Could we import it inside import_algorithm_packages by isolating everything except the `MoeKernelTuner(PrunaAlgorithmBase)` class in a utils.py and importing that only in the import_algorithm_packages method?
- Should the tuning be done again if the model is loaded on a setup with a different triton version? If so, we can use the reapply saving function and check at the beginning of apply whether the artifact already exists and matches the setup, in which case we skip the tuning, but still tune otherwise.
- The _apply method is very long. Below are some suggestions for splitting/simplifying it to make it more readable and also more type-friendly.
- The moe_kernel_tuner.py file is very long, the utils split in question 1. would also make this lighter, WDYT?
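The lazy-import suggestion in the first bullet could be sketched roughly as follows; aside from `ray` itself, all names here (the helper, the class, the error message, the `pruna[vllm]` extra) are illustrative stand-ins, not the actual pruna API:

```python
import importlib
from types import SimpleNamespace


def lazy_import(module_name: str):
    """Resolve an optional heavy dependency only when the algorithm runs,
    so the package stays importable without the extra installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for MoE kernel tuning; "
            f"install the corresponding extra (e.g. pip install pruna[vllm])."
        ) from err


class MoeKernelTuner:  # hypothetical stand-in for the real class
    def import_algorithm_packages(self) -> SimpleNamespace:
        # ray (and any tuning utils built on it) is imported here,
        # not at module load time.
        return SimpleNamespace(ray=lazy_import("ray"))
```

With this shape, everything ray-dependent can live in a utils.py that is itself only imported inside import_algorithm_packages, keeping the top-level module import cheap and dependency-free.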
For (i), in the _apply method, I think the code should be made clearer. Currently most of the logic is a series of if..else blocks checking whether we are in the general is_moe_lm case or the HunyuanImage3ForCausalMM exception, and extracting the hyperparameters: nb_experts, topk, intermediate_size, hidden_size, shard_intermediate_size.
I think it would be clearer if:
- in (i): check for HunyuanImage3ForCausalMM -> call _extract_hunyuan_dimensions, whose output is nb_experts, shard_intermediate_size, hidden_size and topk; in the general case, call _extract_transformers_moe_dimensions, which has the same output
- in each of these functions, get the config and make an actual typing check so we know the attributes exist. The docstring of these functions, or the comment in (i) where these values are collected, can explain what the different variables represent in the MoE operations
This PR has been inactive for 10 days and is now marked as stale.
gsprochette
left a comment
It's a lot clearer to me now. I just have minor style refinements, plus I realized two of my earlier comments were wrong, so we need to correct those changes.
I agree with you about the test comment: if we can easily add some form of test adapted to this, then let's do it; if not, let's just leave it the way it is currently.
Thanks a lot for all the work, and sorry about the review delay 🙃
gsprochette
left a comment
Looks perfect! I left some comments, mostly just praising the changes, plus a couple in the tests for clarity; nothing blocking :) Thanks a lot for all the work :)
Description
This PR is inspired by the vLLM benchmarks (the benchmark_config fn is copied from here) and enables one to tune the MoE (triton) kernel used in vLLM.
This new algorithm MoeKernelTuner does not modify the model. It generates a tuned configuration that is saved in:
- kernels (the lib will make use of the optimized config);
- moe_kernel_tuned_configs in the model directory (to be later re-used without waiting for tuning, when loading the model with pruna).

The core modifications are in: … ray for parallelization; (iv) the best configurations are saved in the hf and vllm caches (so that after smashing, the hf cache and vllm cache are already populated with optimal configs that the user can use), and in the pruna cache (similar to what we do with save_before_apply).

Related Issue
Fixes #(issue number)
Type of Change
How Has This Been Tested?
Checklist
Additional Notes
A notebook for testing with vLLM is available here. On H100 with qwen3Coder-30B, latency goes from 6.43 ms (before tuning) to 5.83 ms (after tuning) while using vllm.
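At its core, this kind of kernel tuning reduces to a grid search over candidate configurations, timing each and keeping the fastest. The sketch below mirrors that shape with a toy latency model standing in for a real triton kernel benchmark; the config keys echo the style of vLLM's MoE kernel configs, but the values and the cost function are purely illustrative:

```python
import itertools

# Hypothetical search space in the style of MoE triton kernel configs
# (block sizes, warp counts); values here are illustrative only.
SEARCH_SPACE = [
    {"BLOCK_SIZE_M": m, "BLOCK_SIZE_N": n, "num_warps": w}
    for m, n, w in itertools.product([16, 64], [64, 128], [4, 8])
]


def tune(benchmark_fn, search_space):
    """Return the config with the lowest measured latency.

    benchmark_fn(config) -> latency in seconds. In the real algorithm
    this launches the triton MoE kernel with the given config and times
    it (the PR parallelizes these measurements with ray)."""
    return min(search_space, key=benchmark_fn)


def fake_latency(cfg):
    # Toy cost model: pretend BLOCK_SIZE_M=64, BLOCK_SIZE_N=128 and few
    # warps are optimal, so the search has a unique minimum.
    return (abs(cfg["BLOCK_SIZE_M"] - 64)
            + abs(cfg["BLOCK_SIZE_N"] - 128)
            + 0.1 * cfg["num_warps"])


best = tune(fake_latency, SEARCH_SPACE)
```

The winning config is then what gets serialized into the kernels / model-directory / vllm caches described above, so later loads skip the search entirely.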