Add intrinsic for SpanHelpers.Char.IndexOfAny on AArch64 #73788
SwapnilGaikwad wants to merge 1 commit into dotnet:main
Tagging subscribers to this area: @dotnet/area-system-memory
// So the bit position in 'matches' corresponds to the element offset.
if (matches == 0)
combinedVector = (Vector128.Equals(values0, search) | Vector128.Equals(values1, search)).AsByte();
if (!VectorContainsMatch(combinedVector))
Does this need a helper like this? Other methods appear to achieve the same thing with, e.g., combinedVector.AsByte().ExtractMostSignificantBits() == 0. Is that not feasible here, or does it not perform well? For example:
runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Lines 655 to 656 in 3e0a5ad
The helper emits a shorter instruction sequence when detecting a match, and consequently performs better.
With VectorContainsMatch:
...
umaxp v19.16b, v18.16b, v18.16b
umov x7, v19.d[0]
cbnz x7, G_M000_IG17
...
With ExtractMostSignificantBits:
...
ldr q18, [@RWD00]
and v18.16b, v16.16b, v18.16b
ldr q17, [@RWD16]
ushl v16.16b, v18.16b, v17.16b
movi v17.4s, #0x00
ext v17.16b, v16.16b, v17.16b, #8
addv b17, v17.8b
umov w0, v17.b[0]
lsl w0, w0, #8
addv b16, v16.8b
umov w1, v16.b[0]
orr w1, w0, w1
cbz w1, G_M000_IG08
...
RWD00 dq 8080808080808080h, 8080808080808080h
RWD16 dq 00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h
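To make the comparison concrete, here is a scalar Python model of the two reductions. It is only a sketch, not the runtime code: `compare_mask`, `umaxp_low64` and `extract_msb` are hypothetical helper names that mimic the instructions above on a list of 16 bytes. Both reduce the comparison mask to an integer that is nonzero exactly when some lane matched; the sequences differ only in how many instructions the reduction costs.

```python
def compare_mask(chars, a, b):
    """Byte view of (Equals(v, a) | Equals(v, b)) for 8 UTF-16 code units."""
    assert len(chars) == 8
    out = []
    for c in chars:
        lane = 0xFF if c in (a, b) else 0x00
        out += [lane, lane]  # each 16-bit lane is two identical bytes
    return out


def umaxp_low64(mask):
    """Low 64 bits of `umaxp v.16b, m.16b, m.16b`: max of adjacent byte pairs."""
    pairs = [max(mask[i], mask[i + 1]) for i in range(0, 16, 2)]
    return int.from_bytes(bytes(pairs), "little")


def extract_msb(mask):
    """Model of ExtractMostSignificantBits: bit i is the top bit of byte i."""
    return sum(((b >> 7) & 1) << i for i, b in enumerate(mask))


text = [ord(ch) for ch in "abcdefgh"]
hit = compare_mask(text, ord("d"), ord("z"))   # 'd' is element 3
miss = compare_mask(text, ord("y"), ord("z"))  # no element matches

# Both reductions agree on "is there any match?"
assert (umaxp_low64(hit) != 0) == (extract_msb(hit) != 0)
assert (umaxp_low64(miss) != 0) == (extract_msb(miss) != 0)
```

In this model the pairwise maximum yields one byte per char in a single step, while the most-significant-bit mask yields one bit per byte.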
On an Altra system (not configured for benchmarking):
| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|---------------------- |----------- |---------------------------------------------------------------------------------------------------------- |----- |----------:|---------:|---------:|----------:|----------:|----------:|------:|---------------- |----------:|------------:|
| IndexOfAnyTwoValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 94.40 ns | 0.073 ns | 0.068 ns | 94.42 ns | 94.19 ns | 94.46 ns | 1.62 | Slower | - | NA |
| IndexOfAnyTwoValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 32.40 ns | 0.019 ns | 0.015 ns | 32.39 ns | 32.38 ns | 32.43 ns | 0.56 | Faster | - | NA |
| IndexOfAnyTwoValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 58.32 ns | 0.021 ns | 0.019 ns | 58.33 ns | 58.29 ns | 58.36 ns | 1.00 | Base | - | NA |
| | | | | | | | | | | | | | |
| IndexOfAnyThreeValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 109.25 ns | 0.328 ns | 0.307 ns | 109.40 ns | 108.48 ns | 109.51 ns | 1.58 | Slower | - | NA |
| IndexOfAnyThreeValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 43.68 ns | 0.028 ns | 0.026 ns | 43.68 ns | 43.64 ns | 43.74 ns | 0.63 | Faster | - | NA |
| IndexOfAnyThreeValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 69.29 ns | 0.086 ns | 0.080 ns | 69.30 ns | 69.13 ns | 69.40 ns | 1.00 | Base | - | NA |
| | | | | | | | | | | | | | |
| IndexOfAnyFourValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 129.95 ns | 0.024 ns | 0.022 ns | 129.94 ns | 129.92 ns | 129.99 ns | 1.53 | Slower | - | NA |
| IndexOfAnyFourValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 57.56 ns | 0.066 ns | 0.062 ns | 57.58 ns | 57.49 ns | 57.64 ns | 0.68 | Faster | - | NA |
| IndexOfAnyFourValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 84.85 ns | 0.109 ns | 0.102 ns | 84.87 ns | 84.61 ns | 84.99 ns | 1.00 | Base | - | NA |
I need to confirm these benchmark numbers on a better-configured system; @adamsitnik can probably help.
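As a sanity check on the table, the Ratio column can be reproduced from the Mean column. This is a small sketch assuming Ratio is roughly the mean divided by the mean of the `unchecked_main` baseline row; the dictionary layout and the `ratios` helper are illustrative, and the numbers are copied from the table.

```python
# Mean times in nanoseconds, copied from the benchmark table (Size = 512).
means = {
    "IndexOfAnyTwoValues": {
        "Extract_MSB": 94.40, "VectorContainsMatch": 32.40, "unchecked_main": 58.32},
    "IndexOfAnyThreeValues": {
        "Extract_MSB": 109.25, "VectorContainsMatch": 43.68, "unchecked_main": 69.29},
    "IndexOfAnyFourValues": {
        "Extract_MSB": 129.95, "VectorContainsMatch": 57.56, "unchecked_main": 84.85},
}


def ratios(row, baseline="unchecked_main"):
    """Mean of each build divided by the baseline mean, to two decimals."""
    return {build: round(mean / row[baseline], 2) for build, mean in row.items()}


for name, row in means.items():
    print(name, ratios(row))
```

Under that assumption, the VectorContainsMatch build takes between 0.56 and 0.68 of the baseline time, while the ExtractMostSignificantBits build takes between 1.53 and 1.62 of it.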
Full Assembly:
VectorContainsMatch
; Assembly listing for method SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 4 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
G_M000_IG02: ;; offset=0008H
AA1F03E4 mov x4, xzr
2A0303E5 mov w5, w3
93407C63 sxtw x3, w3
D1002063 sub x3, x3, #8
F100007F cmp x3, #0
540003AB blt G_M000_IG05
G_M000_IG03: ;; offset=0020H
AA0303E5 mov x5, x3
14000033 b G_M000_IG14
align [0 bytes for IG07]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG04: ;; offset=0028H
D37FF883 lsl x3, x4, #1
8B030003 add x3, x0, x3
79400066 ldrh w6, [x3]
53003C27 uxth w7, w1
6B0600FF cmp w7, w6
54000580 beq G_M000_IG13
53003C48 uxth w8, w2
6B06011F cmp w8, w6
54000520 beq G_M000_IG13
79400466 ldrh w6, [x3,#2]
6B0600FF cmp w7, w6
54000480 beq G_M000_IG12
6B06011F cmp w8, w6
54000440 beq G_M000_IG12
79400866 ldrh w6, [x3,#4]
6B0600FF cmp w7, w6
540003A0 beq G_M000_IG11
6B06011F cmp w8, w6
54000360 beq G_M000_IG11
79400C66 ldrh w6, [x3,#6]
6B0600FF cmp w7, w6
540002C0 beq G_M000_IG10
6B06011F cmp w8, w6
54000280 beq G_M000_IG10
91001084 add x4, x4, #4
D10010A5 sub x5, x5, #4
G_M000_IG05: ;; offset=0090H
F10010BF cmp x5, #4
54FFFCA2 bhs G_M000_IG04
G_M000_IG06: ;; offset=0098H
B4000185 cbz x5, G_M000_IG08
53003C27 uxth w7, w1
G_M000_IG07: ;; offset=00A0H
D37FF886 lsl x6, x4, #1
78666806 ldrh w6, [x0, x6]
6B0600FF cmp w7, w6
54000200 beq G_M000_IG13
53003C48 uxth w8, w2
6B06011F cmp w8, w6
540001A0 beq G_M000_IG13
91000484 add x4, x4, #1
D10004A5 sub x5, x5, #1
B5FFFEE5 cbnz x5, G_M000_IG07
G_M000_IG08: ;; offset=00C8H
12800000 movn w0, #0
G_M000_IG09: ;; offset=00CCH
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
G_M000_IG10: ;; offset=00D4H
11000C84 add w4, w4, #3
14000025 b G_M000_IG18
align [0 bytes for IG15]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG11: ;; offset=00DCH
11000884 add w4, w4, #2
14000023 b G_M000_IG18
G_M000_IG12: ;; offset=00E4H
11000484 add w4, w4, #1
14000021 b G_M000_IG18
G_M000_IG13: ;; offset=00ECH
14000020 b G_M000_IG18
G_M000_IG14: ;; offset=00F0H
53003C27 uxth w7, w1
4E020CF0 dup v16.8h, w7
53003C48 uxth w8, w2
4E020D11 dup v17.8h, w8
B4000185 cbz x5, G_M000_IG16
G_M000_IG15: ;; offset=0104H
D37FF881 lsl x1, x4, #1
3CE16812 ldr q18, [x0, x1]
6E728E13 cmeq v19.8h, v16.8h, v18.8h
6E728E32 cmeq v18.8h, v17.8h, v18.8h
4EB21E72 orr v18.8h, v19.8h, v18.8h
6E32A653 umaxp v19.16b, v18.16b, v18.16b
4E083E67 umov x7, v19.d[0]
B50001A7 cbnz x7, G_M000_IG17
91002084 add x4, x4, #8
EB0400BF cmp x5, x4
54FFFEC8 bhi G_M000_IG15
G_M000_IG16: ;; offset=0130H
D37FF8A4 lsl x4, x5, #1
3CE46812 ldr q18, [x0, x4]
AA0503E4 mov x4, x5
6E728E10 cmeq v16.8h, v16.8h, v18.8h
6E728E31 cmeq v17.8h, v17.8h, v18.8h
4EB11E12 orr v18.8h, v16.8h, v17.8h
6E32A650 umaxp v16.16b, v18.16b, v18.16b
4E083E00 umov x0, v16.d[0]
B4FFFBC0 cbz x0, G_M000_IG08
G_M000_IG17: ;; offset=0154H
6E32A650 umaxp v16.16b, v18.16b, v18.16b
4E083E00 umov x0, v16.d[0]
DAC00000 rbit x0, x0
DAC01000 clz x0, x0
13037C00 asr w0, w0, #3
0B040004 add w4, w0, w4
G_M000_IG18: ;; offset=016CH
2A0403E0 mov w0, w4
G_M000_IG19: ;; offset=0170H
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
; Total bytes of code 376
ExtractMostSignificantBits
; Assembly listing for method SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 7 single block inlinees; 1 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
G_M000_IG02: ;; offset=0008H
AA1F03E4 mov x4, xzr
2A0303E5 mov w5, w3
93407C63 sxtw x3, w3
D1002063 sub x3, x3, #8
F100007F cmp x3, #0
540003AB blt G_M000_IG05
G_M000_IG03: ;; offset=0020H
AA0303E5 mov x5, x3
14000033 b G_M000_IG14
align [0 bytes for IG07]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG04: ;; offset=0028H
D37FF883 lsl x3, x4, #1
8B030003 add x3, x0, x3
79400066 ldrh w6, [x3]
53003C27 uxth w7, w1
6B0600FF cmp w7, w6
54000580 beq G_M000_IG13
53003C48 uxth w8, w2
6B06011F cmp w8, w6
54000520 beq G_M000_IG13
79400466 ldrh w6, [x3,#2]
6B0600FF cmp w7, w6
54000480 beq G_M000_IG12
6B06011F cmp w8, w6
54000440 beq G_M000_IG12
79400866 ldrh w6, [x3,#4]
6B0600FF cmp w7, w6
540003A0 beq G_M000_IG11
6B06011F cmp w8, w6
54000360 beq G_M000_IG11
79400C66 ldrh w6, [x3,#6]
6B0600FF cmp w7, w6
540002C0 beq G_M000_IG10
6B06011F cmp w8, w6
54000280 beq G_M000_IG10
91001084 add x4, x4, #4
D10010A5 sub x5, x5, #4
G_M000_IG05: ;; offset=0090H
F10010BF cmp x5, #4
54FFFCA2 bhs G_M000_IG04
G_M000_IG06: ;; offset=0098H
B4000185 cbz x5, G_M000_IG08
53003C27 uxth w7, w1
G_M000_IG07: ;; offset=00A0H
D37FF886 lsl x6, x4, #1
78666806 ldrh w6, [x0, x6]
6B0600FF cmp w7, w6
54000200 beq G_M000_IG13
53003C48 uxth w8, w2
6B06011F cmp w8, w6
540001A0 beq G_M000_IG13
91000484 add x4, x4, #1
D10004A5 sub x5, x5, #1
B5FFFEE5 cbnz x5, G_M000_IG07
G_M000_IG08: ;; offset=00C8H
12800000 movn w0, #0
G_M000_IG09: ;; offset=00CCH
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
G_M000_IG10: ;; offset=00D4H
11000C84 add w4, w4, #3
14000039 b G_M000_IG18
align [0 bytes for IG15]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG11: ;; offset=00DCH
11000884 add w4, w4, #2
14000037 b G_M000_IG18
G_M000_IG12: ;; offset=00E4H
11000484 add w4, w4, #1
14000035 b G_M000_IG18
G_M000_IG13: ;; offset=00ECH
14000034 b G_M000_IG18
G_M000_IG14: ;; offset=00F0H
53003C27 uxth w7, w1
4E020CF0 dup v16.8h, w7
53003C48 uxth w8, w2
4E020D11 dup v17.8h, w8
B40002C5 cbz x5, G_M000_IG16
9C000672 ldr q18, [@RWD00]
G_M000_IG15: ;; offset=0108H
D37FF881 lsl x1, x4, #1
3CE16813 ldr q19, [x0, x1]
6E738E14 cmeq v20.8h, v16.8h, v19.8h
6E738E33 cmeq v19.8h, v17.8h, v19.8h
4EB31E93 orr v19.8h, v20.8h, v19.8h
4E321E73 and v19.16b, v19.16b, v18.16b
9C000614 ldr q20, [@RWD16]
6E344673 ushl v19.16b, v19.16b, v20.16b
4F000414 movi v20.4s, #0x00
6E144274 ext v20.16b, v19.16b, v20.16b, #8
0E31BA94 addv b20, v20.8b
0E013E81 umov w1, v20.b[0]
53185C21 lsl w1, w1, #8
0E31BA73 addv b19, v19.8b
0E013E62 umov w2, v19.b[0]
2A020021 orr w1, w1, w2
350002E1 cbnz w1, G_M000_IG17
91002084 add x4, x4, #8
EB0400BF cmp x5, x4
54FFFDA8 bhi G_M000_IG15
G_M000_IG16: ;; offset=0158H
D37FF8A4 lsl x4, x5, #1
3CE46812 ldr q18, [x0, x4]
AA0503E4 mov x4, x5
6E728E10 cmeq v16.8h, v16.8h, v18.8h
6E728E32 cmeq v18.8h, v17.8h, v18.8h
4EB21E10 orr v16.8h, v16.8h, v18.8h
9C000312 ldr q18, [@RWD00]
4E321E12 and v18.16b, v16.16b, v18.16b
9C000351 ldr q17, [@RWD16]
6E314650 ushl v16.16b, v18.16b, v17.16b
4F000411 movi v17.4s, #0x00
6E114211 ext v17.16b, v16.16b, v17.16b, #8
0E31BA31 addv b17, v17.8b
0E013E20 umov w0, v17.b[0]
53185C00 lsl w0, w0, #8
0E31BA10 addv b16, v16.8b
0E013E01 umov w1, v16.b[0]
2A010001 orr w1, w0, w1
34FFF941 cbz w1, G_M000_IG08
G_M000_IG17: ;; offset=01A4H
5AC00020 rbit w0, w1
5AC01000 clz w0, w0
2A0003E0 mov w0, w0
D341FC00 lsr x0, x0, #1
8B000084 add x4, x4, x0
17FFFFCD b G_M000_IG13
G_M000_IG18: ;; offset=01BCH
2A0403E0 mov w0, w4
G_M000_IG19: ;; offset=01C0H
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
RWD00 dq 8080808080808080h, 8080808080808080h
RWD16 dq 00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h
; Total bytes of code 456
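The two listings also recover the element index differently in IG17: `rbit` + `clz` + `asr #3` for the umaxp result, and `rbit` + `clz` + `lsr #1` for the most-significant-bit mask. A scalar Python sketch of that arithmetic follows. It is a model only; `trailing_zeros`, `index_from_umaxp` and `index_from_msb` are hypothetical names, and the inputs are built by hand rather than taken from real vectors.

```python
def trailing_zeros(x):
    """Model of rbit followed by clz on a nonzero integer."""
    assert x != 0
    return (x & -x).bit_length() - 1


def index_from_umaxp(low64):
    # After umaxp each char occupies one byte (8 bits), hence the shift by 3.
    return trailing_zeros(low64) >> 3


def index_from_msb(mask16):
    # ExtractMostSignificantBits on bytes gives 2 bits per char, hence the shift by 1.
    return trailing_zeros(mask16) >> 1


# A match at element k sets byte k of the umaxp result and bits 2k, 2k+1 of the MSB mask.
for k in range(8):
    assert index_from_umaxp(0xFF << (8 * k)) == k
    assert index_from_msb(0b11 << (2 * k)) == k
```

With several matches in the vector, both forms return the lowest matching element, because only the least significant set bit is consulted.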
Is this PR needed given #73469?
#73469 has been merged, so should this be closed now?
I'm going to close this as it doesn't appear to be actionable now. @SwapnilGaikwad, if there are specific pieces that should be ported over, can you open a new PR for that? Thanks!
I slightly mistimed the updates. It seems we can squeeze out some more performance at the cost of readability.