Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Force-pushed from 8ac3214 to 61a05ee.
@dotnet/jit-contrib PTAL, simple change with nice diffs (-122kb for the benchmarks.pgo collection, -0.13% TP for the same collection). The logic has plenty of opportunities to optimize further, e.g. using AVX in the loop. I didn't change that here because it requires aligning the data to 32/64 bytes (the remainder can then be handled with overlapping stores), so I am leaving it for future follow-ups. I was mostly interested in removing loops by allowing up to 6*64=384 bytes to be zeroed directly with AVX-512, where previously we switched to the loop for >96 bytes.
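To make the non-loop idea concrete, here is a minimal sketch of unrolled zeroing with an overlapping final store. It is not the JIT's actual codegen: `zeroUnrolled` and `kMaxUnrolledStores` are hypothetical names, `memset` of `width` bytes stands in for a single XMM/YMM/ZMM store, and the cap of six stores is an assumption read off the 6*64=384 and 96-byte (6*16) figures above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed cap on straight-line stores (hypothetical constant, inferred from
// the 6*64=384 and 6*16=96 figures in the comment above).
constexpr std::size_t kMaxUnrolledStores = 6;

// Zeroes `size` bytes at `dst` using stores of `width` bytes each, where
// width 16/32/64 stands in for an XMM/YMM/ZMM store.
// Returns the number of stores issued, or 0 if the block does not fit the
// unrolled path and the caller would have to fall back to the loop.
std::size_t zeroUnrolled(std::uint8_t* dst, std::size_t size, std::size_t width)
{
    assert(width == 16 || width == 32 || width == 64);

    if (size < width || size > kMaxUnrolledStores * width)
    {
        return 0; // too small for one full store, or too large to unroll
    }

    std::size_t stores = 0;
    std::size_t offset = 0;

    // Full-width stores while a whole vector still fits.
    for (; offset + width <= size; offset += width)
    {
        std::memset(dst + offset, 0, width); // models one vector store
        ++stores;
    }

    // Remainder: one more full-width store, shifted back so that it ends
    // exactly at dst + size. It overlaps bytes that are already zero.
    if (offset < size)
    {
        std::memset(dst + size - width, 0, width);
        ++stores;
    }

    return stores;
}
```

With these assumptions, a 384-byte block takes six 64-byte stores, a 100-byte block takes two (the second one covering bytes 36..99), and anything above `6 * width` returns 0 to signal the loop path.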
Extends #32538 to use AVX-512 (and AVX1) to zero locals on the non-loop path. I am going to slightly refactor it to use AVX in the loop path too, but later; this seems to be low-hanging fruit with nice diffs.
Diff example:
(Apparently this collection has no AVX-512, but it still looks better.)