Optimize stackalloc zeroing via BLK #83255
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details

Let's see if this works: I just insert GT_BLK (basically, Unsafe.InitBlockUnaligned) after CEE_LOCALLOC in the importer and rely on that for zeroing. GT_BLK has its own logic to unroll or emit a memset call.

[Benchmark]
public void Stackalloc40() { byte* ptr = stackalloc byte[40]; Consume(ptr); }
[Benchmark]
public void Stackalloc50() { byte* ptr = stackalloc byte[50]; Consume(ptr); }
[Benchmark]
public void Stackalloc64() { byte* ptr = stackalloc byte[64]; Consume(ptr); }
[Benchmark]
public void Stackalloc100() { byte* ptr = stackalloc byte[100]; Consume(ptr); }
[Benchmark]
public void Stackalloc128() { byte* ptr = stackalloc byte[128]; Consume(ptr); }
[Benchmark]
public void Stackalloc150() { byte* ptr = stackalloc byte[150]; Consume(ptr); }
[Benchmark]
public void Stackalloc256() { byte* ptr = stackalloc byte[256]; Consume(ptr); }
[Benchmark]
public void Stackalloc512() { byte* ptr = stackalloc byte[512]; Consume(ptr); }
[Benchmark]
public void Stackalloc1024() { byte* ptr = stackalloc byte[1024]; Consume(ptr); }
[Benchmark]
public void Stackalloc4096() { byte* ptr = stackalloc byte[4096]; Consume(ptr); }
[Benchmark]
public void Stackalloc8192() { byte* ptr = stackalloc byte[8192]; Consume(ptr); }
[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(byte* ptr)
{
}
For comparison, what does the graph look like if you don't do that?
It looks like we might want to revise our heuristics. For example, here is what Clang/LLVM does: for Zen 4 (AMD 7xxx) it unrolls up to 512 bytes (AVX-512): https://godbolt.org/z/PxvoE4P9r. For a generic CPU without AVX it unrolls up to 128 bytes: https://godbolt.org/z/b4vd13PMz. Our threshold is hard-coded to 128 (and 256 for ARM64).
NOTE: AFAIR, some (most?) libs in BCL use
Yes, everything in the shared framework: runtime/src/libraries/Directory.Build.targets, line 209 (at a923c64).
cc @anthonycanino (in case you're interested in adjusting the BLK unroll heuristic for AVX-512)
Updated. Fixed via #83274 |
Is stackalloc different to locals? .NET 7 will partially unroll and loop the local zeroing (though currently it only uses xmm, as below):

vxorps xmm4, xmm4
mov rax, -0x2340
vmovdqa xmmword ptr [rbp+rax-60H], xmm4
vmovdqa xmmword ptr [rbp+rax-50H], xmm4
vmovdqa xmmword ptr [rbp+rax-40H], xmm4
add rax, 48
jne SHORT -5 instr

rather than this weird thing:

G_M000_IG03:
push 0
push 0
dec rcx
jne SHORT G_M000_IG03 ;; slow loop (zeroing 16 bytes at once)
i.e. should stackalloc be part of locals? |
Good question. We do that in some cases, e.g.:

void Test(bool cond)
{
    if (cond)
    {
        // rarely taken condition
        var p = stackalloc byte[128];
        Consume(p);
    }
    else
    {
        Console.WriteLine();
    }
}

Codegen:

; Method Program:Test(bool):this
G_M34929_IG01: ;; offset=0000H
4881ECA8000000 sub rsp, 168
C5D857E4 vxorps xmm4, xmm4
C5F97F642420 vmovdqa xmmword ptr [rsp+20H], xmm4
C5F97F642430 vmovdqa xmmword ptr [rsp+30H], xmm4
48B8A0FFFFFFFFFFFFFF mov rax, -96
C5F97FA404A0000000 vmovdqa xmmword ptr [rsp+rax+A0H], xmm4
C5F97FA404B0000000 vmovdqa xmmword ptr [rsp+rax+B0H], xmm4
C5F97FA404C0000000 vmovdqa xmmword ptr [rsp+rax+C0H], xmm4
4883C030 add rax, 48
75DF jne SHORT -5 instr
48B878563412F0DEBC9A mov rax, 0x9ABCDEF012345678
48898424A0000000 mov qword ptr [rsp+A0H], rax
;; size=84 bbWeight=1 PerfScore 13.33
G_M34929_IG02: ;; offset=0054H
84D2 test dl, dl
742D je SHORT G_M34929_IG06
;; size=4 bbWeight=1 PerfScore 1.25
G_M34929_IG03: ;; offset=0058H
488D4C2420 lea rcx, [rsp+20H]
FF15150A7100 call [Program:Consume(ulong)]
48B978563412F0DEBC9A mov rcx, 0x9ABCDEF012345678
48398C24A0000000 cmp qword ptr [rsp+A0H], rcx
7405 je SHORT G_M34929_IG04
E824CA4B5F call CORINFO_HELP_FAIL_FAST
;; size=36 bbWeight=0.50 PerfScore 3.88
G_M34929_IG04: ;; offset=007CH
90 nop
;; size=1 bbWeight=0.50 PerfScore 0.12
G_M34929_IG05: ;; offset=007DH
4881C4A8000000 add rsp, 168
C3 ret
;; size=8 bbWeight=0.50 PerfScore 0.62
G_M34929_IG06: ;; offset=0085H
FF152DA69000 call [System.Console:WriteLine()]
48B978563412F0DEBC9A mov rcx, 0x9ABCDEF012345678
48398C24A0000000 cmp qword ptr [rsp+A0H], rcx
7405 je SHORT G_M34929_IG07
E8FCC94B5F call CORINFO_HELP_FAIL_FAST
;; size=31 bbWeight=0.50 PerfScore 3.62
G_M34929_IG07: ;; offset=00A4H
90 nop
;; size=1 bbWeight=0.50 PerfScore 0.12
G_M34929_IG08: ;; offset=00A5H
4881C4A8000000 add rsp, 168
C3 ret
;; size=8 bbWeight=0.50 PerfScore 0.62
; Total bytes of code: 173

Also, here we don't do stack probing.
@jakobbotsch @BruceForstall @dotnet/jit-contrib PTAL. I inject a BLK node in Lower for all stackalloc nodes (
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
Closes #63500
Let's insert GT_BLK after GT_LCLHEAP, relying on the former to perform the zeroing.
Codegen example:
Main:
PR:
For large constants, this PR switches to call memset, while current Main's impl will still be doing that double-push loop.

Benchmark
Core i7 8700K
Ryzen 7950X
NOTE: 32 bytes and lower are handled separately, so there are no differences for them.