GPU autoscheduling with Mullapudi2016: the reference implementation #7787
Conversation
(force-pushed from db5703d to 8683250)
Thanks for this! IIRC the original GPU version of this autoscheduler was what we charitably describe as "research code", and was never fit for production.
(force-pushed from 4bfdf3f to f195efa)
Hi @abadams,
Thank you for reviewing the code, and dotting the i's and crossing the t's. I concur that GPU scheduling is an experimental feature and should be highlighted as such in the `user_warning`. Could you please show me where to warn the user?
I am also open to an additional option `bool ArchParams::emit_gpu_schedules = false;`, parsable in the generator command-line interface. Though, I highly doubt anyone would go through the hassle of setting `target=host-cuda-cuda_capability_??` just to disable the GPU auto-scheduler.
My primary goal is to get this PR upstreamed, so that everybody can benefit from the auto-scheduler comparison and other studies. The generated `demo.schedule.h` can be sub-optimal; we all expect end users to tweak it for their use cases.
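For reference, the proposed opt-out flag could look something like this minimal sketch. The struct name and the key=value parsing are hypothetical and only illustrate the suggestion above; Halide's actual generator-parameter machinery is not shown.

```cpp
#include <string>

// Hypothetical sketch of the proposed opt-out flag: parse a key=value
// argument such as "emit_gpu_schedules=true" from the generator command
// line. The flag name mirrors the suggestion above.
struct ArchParams {
    bool emit_gpu_schedules = false;  // proposed default: GPU schedules off
};

// Returns true if the argument was recognized as this flag.
bool parse_flag(const std::string &arg, ArchParams &params) {
    const std::string key = "emit_gpu_schedules=";
    if (arg.rfind(key, 0) != 0) {
        return false;  // not our flag (e.g. a target string)
    }
    params.emit_gpu_schedules = (arg.substr(key.size()) == "true");
    return true;
}
```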
As this is an attempted reconstruction of his GPU autoscheduler, I should probably tag @ravi-teja-mullapudi to see if this looks sane, because this will affect how people cite and compare to his work in the future.
Several bot failures with:
(force-pushed from ba7257c to 02199ca)
Done removing the offending line. I also rebased the changes on top of

Update: perhaps we need a separate PR to check for unused variables in the CMake configs:

```diff
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 47e90864d..83ded47a1 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -587,6 +587,8 @@ target_compile_options(
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-function>
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-macros>
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-parameter>
+        $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-variable>
+        $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-const-variable>
         $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat-pedantic>
         $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat>
```
(force-pushed from 02199ca to f366498)
@steven-johnson and @abadams, thank you for testing the PR on the CI. Yes, the failure is triggered by the CMake build option. There are two types of generator failures:
But yeah, we should have a better exception-handling mechanism for this actionable error. I need help to improve the user experience. Another generator failure:
(force-pushed from f366498 to 65a793c)
Updated to main branch to fix OSX WebGPU failures |
(force-pushed from 9bd065d to 3dcb5d4)
Update: The GPU scheduling extension for Mullapudi2016 passes all Buildbot tests except for
@abadams Yeah, I agree the Buildbot CI jobs ensure production-quality auto-schedulers, which was not the original goal of Mullapudi2016's GPU extensions. I will switch this PR to a draft and work on issue 2 later next week.
(force-pushed from 68b338c to c2f2017)
Re: last-mile changes.
@mcourteaux Done. I would appreciate it if you could review the latest changes again.
Re: intermixing correctness test cases and performance regression tests in the test/autoscheduler/ folder.
It's not really a matter of performance benchmarking. The tests simply check that the autoscheduler doesn't do something wildly wrong, by allowing a 2x performance hit compared to the manual schedule. If it's slower than half the speed, it's considered a correctness issue.
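The 2x acceptance criterion described above amounts to a simple ratio check. A minimal sketch, with an illustrative function name and timings (this is not the actual test harness):

```cpp
// Illustrative version of the correctness threshold described above: the
// auto-scheduled pipeline passes if it is no more than `factor` times
// slower than the manually scheduled baseline.
bool within_slowdown(double auto_time_ms, double manual_time_ms,
                     double factor = 2.0) {
    return auto_time_ms <= factor * manual_time_ms;
}
```

For example, an auto-scheduled run at 1.8 ms against a 1.0 ms manual baseline would pass, while 2.5 ms would be flagged as a correctness failure.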
If you already have schedules available for a follow-up PR, which you feel are out-of-scope for this PR, I guess we can solve this next PR.
I see. So Halide developers treat performance regression tests as part of the correctness tests. Thanks for the reminder.
Yes, I do prefer to write regression tests (with manual GPU schedules) in the next PR, and to stick to the following road map to ensure small-scale and frequent PRs and merges:
- (this PR) Experimental GPU schedule support, with correctness tests;
- Enable experimental GPU features for the pipelines in `apps/*/`; manually tune the `last_level_cache_size` estimates for complex pipelines;
- Write manual GPU schedules for the test cases at `test/autoscheduler/mullapudi2016/*.cpp`;
- Resolve the `reorder` bug having reduction axes sandwiched between `gpu_blocks`;
- (Optional) Resolve the `apps/iir_blur/` error having vectorize width larger than the channel count (= 3);
- (Optional) Resolve other unknown issues not anticipated by the original Mullapudi paper, e.g. `CodeGen_NVPTX` does not support more than 4 `gpu_blocks`, needing `fuse()` to fit the four-`gpu_blocks` limit imposed by the hardware.
By small-scale PRs, I mean those having less than ~1,000 words of accumulated code-review comments and conversations. Ideally, this avoids losing momentum whenever new reviewers join the conversation.
-Antony
mcourteaux
left a comment
Testing: LGTM.
AutoScheduler: Not reviewed in depth, but the code changes look fine, style-wise.
alexreinking
left a comment
I have some style nitpicks, but my biggest concern is with the "experimental" gating -- I would like to see simpler checks at the key algorithmic points that boil down to "are we scheduling for GPU or not?" without redundantly (and inconsistently!) checking that the experimental flag is enabled.
I also think fixing the naming issues in the test code is important.
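One way to read this suggestion, as a sketch with hypothetical names (the actual autoscheduler code uses its own types): collapse the scattered flag checks into a single predicate, so every algorithmic branch asks the same question.

```cpp
// Hypothetical sketch of the review suggestion above: one predicate
// answering "are we scheduling for GPU?" so each algorithmic branch does
// not separately (and possibly inconsistently) consult the experimental
// flag and the target.
struct SchedulingContext {
    bool target_has_gpu_feature = false;   // e.g. target=host-cuda
    bool experimental_gpu_schedule = false;  // opt-in experimental gating

    bool scheduling_for_gpu() const {
        return target_has_gpu_feature && experimental_gpu_schedule;
    }
};
```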
@antonysigma - I really want to get this merged, so I made the changes I requested myself. Please take a look and review my changes. If we're both happy, we can land this.
alexreinking
left a comment
Approved pending 👍 from @antonysigma
@alexreinking Thank you very much for initiating the code-style changes. The buildbot failed after the changes; how can I assist you in getting all tests to pass? I also noticed that the Halide PR process doesn't enforce a linear git history via per-PR "squash and rebase" merges. It doesn't impact me much, but it's good to know about the PR policy here. Thanks.
I think the failures are due to some new nightly LLVM 20 regression, not to my changes. I'll investigate further. We do enforce a linear history by squashing.
I'm investigating the llvm regression |
Merged in main with LLVM fixes. Can merge this when green |
One flaky performance test. Merging! |
Rationale:

- To compare the GPU auto-scheduling performance of `Mullapudi2016` against `Li2018` and `Anderson2021`.
- To reduce the following claims to practice, quoting the original Mullapudi2016 article:
- ... the `Mullapudi2016` and `Sioutas2020` algorithms, according to the findings in the Anderson2021 paper:

Change summary:
- Reverse-engineer the GPU scheduling feature as stated in Section 5.4 of Mullapudi's article:

  Mullapudi, Adams, Sharlet, Ragan-Kelley, Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Transactions on Graphics, 35(4), Article 83, pp. 1–11. https://doi.org/10.1145/2897824.2925952
- When `target=cuda` is detected in the code generator command-line arguments, intercept all `vectorize` and `parallel` scheduling calls requested by the auto-vectorization and auto-parallelization algorithms with the class `GPUTilingDedup` for deferred execution.
- Implement the class `GPUTilingDedup` to ensure all Halide GPU schedule calls are idempotent: no matter how many times the Stage is `vectorize`d, `reorder`ed, `parallel`ed, and then `reorder`ed again, the `reorder` and `gpu_threads()` schedules are called exactly once.
- Also, intercept all `split` and `reorder` scheduling calls issued by Mullapudi's auto-splitting algorithm.
- Implement the class `GPUTileHelper` to enforce an atomic transaction of the GPU schedules. If the current stage is `compute_root`, mark all auto-split inner dimensions as `gpu_threads` and the outer dimensions as `gpu_blocks`. If the Stage is `compute_at` another Stage, mark all `vectorize` dimensions as `gpu_threads`.
- If auto-splitting of the current stage does not result in any tile, implement a rudimentary tiling with tile size = vector_length x parallel_factor.
- If Mullapudi does not call any split, vectorize, or parallel schedules, assume a scalar reduction routine and implement it on the GPU via `single_thread`.

cc'ed @aekul, @jrk, @abadams.
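The idempotency guarantee described for `GPUTilingDedup` can be illustrated with a small stand-alone sketch. This is a conceptual model only, not the PR's actual class: scheduling directives are recorded and deduplicated instead of being applied immediately, then applied exactly once on flush.

```cpp
#include <set>
#include <string>
#include <vector>

// Conceptual model of the deferred, idempotent scheduling described above:
// repeated requests for the same directive collapse into one entry, and
// flush() applies the recorded schedule exactly once.
class DedupSketch {
    std::set<std::string> pending_;     // unique directives, e.g. "vectorize(x)"
    std::vector<std::string> applied_;  // what a flush actually emits
    bool flushed_ = false;

public:
    void request(const std::string &directive) {
        if (!flushed_) {
            pending_.insert(directive);  // repeats are absorbed here
        }
    }

    // Apply each recorded directive exactly once, no matter how many times
    // it was requested or how many times flush() itself is called.
    const std::vector<std::string> &flush() {
        if (!flushed_) {
            applied_.assign(pending_.begin(), pending_.end());
            flushed_ = true;
        }
        return applied_;
    }
};
```

In this model, requesting `reorder(x,y)` twice and `vectorize(x)` once still yields exactly two applied directives, mirroring the "called exactly once" property claimed for the real class.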
See also: #7491