
Conversation

@antonysigma (Contributor)

Rationale:

  1. To compare the GPU auto-scheduling performance of Mullapudi2016 against Li2018 and Anderson2021.

  2. To reduce the following claims to practice, quoting the original Mullapudi2016 article:

Portability to Different Architectures: GPU Performance: The inlining, tiling, and grouping processes are otherwise similar to the CPU case. Groups resulting from merging are mapped to CUDA kernels by designating the outer tile loops as GPU block grid dimensions and the inner tile loops as GPU thread block dimensions. All intermediate buffers within a group are allocated in GPU shared memory.

  3. To implement the so-called "single level tiling only" limitation in the Mullapudi2016 and Sioutas2020 algorithms, according to the findings in the Anderson2021 paper:

[Mullapudi et al] develops an automatic scheduling technique using a heuristic cost model and a greedy stage grouping algorithm... but its search space is smaller compared to ours among other reasons because it only supports a single level of tiling, and as we discuss in Section 6.2, this excludes a number of high performance schedules.


Change summary:

Reverse engineer the GPU scheduling feature as stated in Section 5.4 of Mullapudi's article:

Mullapudi, Adams, Sharlet, Ragan-Kelley, Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Transactions on Graphics, 35(4), Article 83, pp. 1–11. https://doi.org/10.1145/2897824.2925952

When target=cuda is detected in the code generator command-line arguments, intercept all vectorize and parallel scheduling calls requested by the auto-vectorization and auto-parallelization algorithms with the class GPUTilingDedup for deferred execution.

Implement the class GPUTilingDedup to ensure all Halide gpu schedule calls are idempotent: no matter how many times the Stage is vectorized, reordered, parallelized, and then reordered again, the reorder and gpu_threads() schedules are called exactly once.

Also, intercept all split and reorder scheduling calls issued by Mullapudi's auto-splitting algorithm.
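
For illustration, here is a minimal C++ sketch of the deferred-execution idea; the class name GPUTilingDedup is from this PR, but the member names and signatures below are hypothetical and simplified, not the actual interface:

#include "Halide.h"
#include <map>
#include <optional>
#include <vector>

// Hypothetical, simplified sketch: record the scheduling requests made by the
// auto-vectorization/auto-parallelization passes, and emit the corresponding
// gpu schedule exactly once.
class GPUTilingDedupSketch {
    Halide::Stage &stage;
    std::map<std::string, Halide::Expr> vectorize_factors;    // requested vectorize calls, by var name
    std::optional<std::vector<Halide::VarOrRVar>> ordering;   // most recent reorder request
    bool applied = false;

public:
    explicit GPUTilingDedupSketch(Halide::Stage &s) : stage(s) {}

    // Defer instead of scheduling immediately.
    void record_vectorize(const Halide::VarOrRVar &v, const Halide::Expr &factor) {
        vectorize_factors.emplace(v.name(), factor);
    }
    void record_reorder(const std::vector<Halide::VarOrRVar> &vars) {
        ordering = vars;
    }

    // Idempotent flush: however many times the record_* methods were called,
    // reorder and the gpu mapping are applied exactly once.
    void apply() {
        if (applied) {
            return;
        }
        applied = true;
        if (ordering) {
            stage.reorder(*ordering);
        }
        // ...split the recorded dimensions and mark them gpu_blocks()/gpu_threads() here...
    }
};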

Implement the class GPUTileHelper to enforce an atomic transaction of the gpu schedules. If the current stage is compute_root, mark all auto-split inner dimensions as gpu_threads and outer dimensions as gpu_blocks. If the Stage is compute_at another Stage, mark all vectorized dimensions as gpu_threads.
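
For a compute_root stage, the resulting schedule is roughly equivalent to the following hand-written Halide schedule (a sketch only; the variable names and tile sizes are illustrative, not taken from the PR):

#include "Halide.h"

void example_compute_root_schedule() {
    using namespace Halide;
    Func f{"f"};
    Var x{"x"}, y{"y"}, xo{"xo"}, xi{"xi"}, yo{"yo"}, yi{"yi"};
    f(x, y) = x + y;

    f.compute_root()
        .split(x, xo, xi, 32)    // auto-split by Mullapudi's tiling pass
        .split(y, yo, yi, 8)
        .reorder(xi, yi, xo, yo)
        .gpu_blocks(xo, yo)      // outer tile loops -> the CUDA block grid
        .gpu_threads(xi, yi);    // inner tile loops -> the CUDA thread block
}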

If auto-splitting of the current stage does not result in any tile, implement a rudimentary tiling with tile size = vector_length x parallel_factor (see the sketch below).

If Mullapudi's algorithm does not call any split, vectorize, or parallel schedules, assume a scalar reduction routine and implement it on the GPU via single_thread.
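
Both fallbacks map onto plain Halide scheduling calls, roughly as sketched below (tile sizes and function names are made up for illustration):

#include "Halide.h"

void example_fallback_schedules() {
    using namespace Halide;

    // Case 1: the auto-splitter produced no tile, so fall back to a
    // rudimentary tile of size vector_length x parallel_factor.
    const int vector_length = 32;
    const int parallel_factor = 8;
    Func g{"g"};
    Var x{"x"}, y{"y"}, xo{"xo"}, yo{"yo"}, xi{"xi"}, yi{"yi"};
    g(x, y) = x * y;
    g.compute_root()
        .tile(x, y, xo, yo, xi, yi, vector_length, parallel_factor)
        .gpu_blocks(xo, yo)
        .gpu_threads(xi, yi);

    // Case 2: no split/vectorize/parallel was requested at all, so treat the
    // stage as a scalar reduction and run it in a single GPU thread.
    Func total{"total"};
    RDom r(0, 100);
    total() = 0;
    total() += r.x;
    total.compute_root().gpu_single_thread();
}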

cc'ed @aekul, @jrk, @abadams.

See also: #7491

@steven-johnson requested a review from @abadams on August 21, 2023 18:24
@abadams (Member) commented Aug 22, 2023

Thanks for this! IIRC the original GPU version of this autoscheduler was what we charitably describe as "research code", and was never fit for production.

@antonysigma force-pushed the mullapudi2016-gpu branch 3 times, most recently from 4bfdf3f to f195efa on August 22, 2023 16:46
@antonysigma (Contributor, Author) left a comment


Thanks for this! IIRC the original GPU version of this autoscheduler was what we charitably describe as "research code", and was never fit for production.

Hi @abadams ,

Thank you for reviewing the code, and dotting the i's and crossing the t's. I concur that GPU scheduling is an experimental feature and should be highlighted as such in a user_warning. Could you please show me where to warn the user?

I am also open to an additional option, bool ArchParams::emit_gpu_schedules = false;, parsable from the generator command-line interface. Though I highly doubt anyone would go through the hassle of setting target=host-cuda-cuda_capability_?? just to disable the GPU auto-scheduler.

My primary goal is to get this PR upstreamed, so that everybody can benefit from the auto-scheduler comparison and other studies. The generated demo.schedule.h can be sub-optimal; we expect end users to tweak it for their use cases.

@abadams (Member) commented Aug 22, 2023

As this is an attempted reconstruction of his GPU autoscheduler, I should probably tag @ravi-teja-mullapudi to see if this looks sane, because this will affect how people cite and compare to his work in the future.

@steven-johnson (Contributor)

Several bot failures with:

/home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-32-linux-make/halide-source/src/autoschedulers/mullapudi2016/AutoSchedule.cpp:2830:21: error: unused variable ‘types’ [-Werror=unused-variable]

@antonysigma (Contributor, Author) commented Aug 23, 2023

Several bot failures with:

/home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-32-linux-make/halide-source/src/autoschedulers/mullapudi2016/AutoSchedule.cpp:2830:21: error: unused variable ‘types’ [-Werror=unused-variable]

Done removing the offending line. I also rebased the changes on top of main.

Update: perhaps we need a separate PR to check for unused variables in the CMake configs:

diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 47e90864d..83ded47a1 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -587,6 +587,8 @@ target_compile_options(
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-function>
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-macros>
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-parameter>
+        $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-variable>
+        $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-const-variable>
 
         $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat-pedantic>
         $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat>

@antonysigma (Contributor, Author) commented Aug 24, 2023

@steven-johnson and @abadams, thank you for testing the PR on the CI. Yes, the failure is triggered by the CMake build option -DHalide_TARGET=host-[metal|gpu]. I didn't know we could do that. I like the feature. I will reproduce it on my machine.

There are two types of generator failures:

Functions that are compute_at() a gpu_block() loop must specify the innermost gpu_block() loop for that Func. It stems from the over-estimated "L2 cache size per thread" machine parameter; the value should have been ~70 kB instead of 16 MB. The original paper describes this as a limitation, not a bug.

But yeah, we should have a better exception handling mechanism for this actionable error. I need help to improve the user experience.
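
For reference, a hedged sketch of how the cache-size estimate could be overridden when invoking the autoscheduler from C++; the plugin library name and parameter values below are illustrative, and the exact plumbing may differ:

#include "Halide.h"

void autoschedule_with_smaller_cache(Halide::Pipeline &p, const Halide::Target &target) {
    // Load the Mullapudi2016 autoscheduler plugin (library name is illustrative).
    Halide::load_plugin("autoschedule_mullapudi2016");

    // Shrink the last-level cache estimate to roughly the per-block shared
    // memory budget (~70 kB) instead of a CPU-oriented 16 MB.
    Halide::AutoschedulerParams params{
        "Mullapudi2016",
        {{"last_level_cache_size", "70000"},
         {"parallelism", "128"}}};

    p.apply_autoscheduler(target, params);
}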

Another generator failure: Functions that are compute_at() a gpu_block() loop cannot have their own gpu_block() loops. It happens in scalar reduction stages scheduled in compute_at. Resolving the bug...

@steven-johnson (Contributor)

Updated to main branch to fix OSX WebGPU failures

@antonysigma force-pushed the mullapudi2016-gpu branch 3 times, most recently from 9bd065d to 3dcb5d4 on August 25, 2023 03:04
@antonysigma (Contributor, Author) commented Aug 25, 2023

Update: The GPU scheduling extension for Mullapudi2016 passes all Buildbot tests except for autograd_grad.generator and local_laplacian_generator.

  1. autograd_grad passes the Buildbot tests, but the unnamed Var x triggers a basic_string::_M_construct == null error on LLVM 16 and an !name.empty() error on LLVM 18.

  2. local_laplacian_generator triggers a subtle !name.empty() exception in the Halide IR.

@abadams Yeah, I agree that the Buildbot CI jobs ensure production-quality auto-schedulers, which was not the original goal of Mullapudi2016's GPU extensions. I will switch this PR to a draft and work on issue 2 later next week.

@antonysigma (Contributor, Author) left a comment


Re: Last mile changes.

@mcourteaux Done. I would appreciate it if you could review the latest changes again.

Re: intermixing correctness test cases and performance regression tests in the test/autoscheduler/ folder.

It's not really a matter of performance benchmarking. The tests simply check that the autoscheduler doesn't do something wildly wrong, by allowing a 2x performance hit compared to the manual schedule. If it's slower than half the speed, it's considered a correctness issue.

If you already have schedules available for a follow-up PR, which you feel are out of scope for this PR, I guess we can solve this in the next PR.

I see. So Halide developers treat performance regression tests as part of the correctness tests. Thanks for the reminder.

Yes, I do prefer to write regression tests (with manual GPU schedules) in the next PR, and to stick to the following road map to ensure small-scale and frequent PRs and merges:

  1. (this PR) Experimental GPU schedule support, with correctness tests;
  2. Enable experimental GPU features for the pipelines in apps/*/; manually tune the last_level_cache_size estimates for complex pipelines;
  3. Write manual GPU schedules for the test cases at test/autoscheduler/mullapudi2016/*.cpp;
  4. Resolve the reorder bug where reduction axes are sandwiched between gpu_blocks;
  5. (Optional) Resolve the apps/iir_blur/ error where the vectorized width is larger than the channel count (=3);
  6. (Optional) Resolve other unknown issues not anticipated by the original Mullapudi paper, e.g. CodeGen_NVPTX does not support more than 4 gpu_blocks, so fuse() is needed to fit the gpu_blocks limit imposed by the hardware.

By small-scale PRs, I mean those with fewer than ~1,000 words of accumulated code-review comments and conversation. Ideally, this avoids losing momentum whenever new reviewers join the chat.

-Antony

@mcourteaux (Contributor) left a comment


Testing: LGTM.
AutoScheduler: Not reviewed in depth, but the code changes look fine, style-wise.

@alexreinking self-requested a review on June 9, 2025 17:12
@alexreinking (Member) left a comment


I have some style nitpicks, but my biggest concern is with the "experimental" gating -- I would like to see simpler checks at the key algorithmic points that boil down to "are we scheduling for GPU or not?" without redundantly (and inconsistently!) checking that the experimental flag is enabled.

I also think fixing the naming issues in the test code is important.

@alexreinking self-assigned this on Jun 10, 2025
@alexreinking (Member)

@antonysigma - I really want to get this merged, so I made the changes I requested myself. Please take a look and review my changes. If we're both happy, we can land this.

@alexreinking (Member) left a comment


Approved pending 👍 from @antonysigma

@antonysigma (Contributor, Author)

I really want to get this merged, so I made the changes I requested myself. Please take a look and review my changes. If we're both happy, we can land this.

@alexreinking Thank you very much for initiating the code-style changes. The buildbot failed after the changes. How can I assist you in getting all tests to pass?

I also noticed the Halide project doesn't enforce a linear git history via "squash and rebase" merges per PR. It doesn't impact me much, but it's good to know about the PR policy here. Thanks.

@alexreinking (Member)

I think the failures are due to some new nightly LLVM 20 regression, not to my changes. I'll investigate further.

We do enforce a linear history by squashing.

@abadams (Member) commented Jun 10, 2025

I'm investigating the llvm regression

@alexreinking (Member)

Merged in main with LLVM fixes. Can merge this when green

@alexreinking (Member)

One flaky performance test. Merging!

@alexreinking merged commit 5c8549c into halide:main on Jun 12, 2025
14 of 15 checks passed
@antonysigma deleted the mullapudi2016-gpu branch on June 13, 2025 17:28

Labels: dev_meeting (Topic to be discussed at the next dev meeting)
