Add intrinsic for SpanHelpers.Char.IndexOfAny on AArch64 by SwapnilGaikwad · Pull Request #73788 · dotnet/runtime

SwapnilGaikwad · 2022-08-11T16:49:07Z

No description provided.

ghost · 2022-08-11T16:49:21Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author:	SwapnilGaikwad
Assignees:	-
Labels:	`area-System.Memory`
Milestone:	-

stephentoub · 2022-08-11T18:55:31Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

-                        // So the bit position in 'matches' corresponds to the element offset.
-                        if (matches == 0)
+                        combinedVector = (Vector128.Equals(values0, search) | Vector128.Equals(values1, search)).AsByte();
+                        if (!VectorContainsMatch(combinedVector))


Does this need a helper like this? Other methods appear to achieve the same thing with, e.g., combinedVector.AsByte().ExtractMostSignificantBits() == 0. Is that not feasible here, or does it not perform well? For example:

runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

Lines 655 to 656 in 3e0a5ad

    uint matches = Vector128.Equals(values, search).AsByte().ExtractMostSignificantBits();
    if (matches == 0)

The helper emits a better instruction sequence when checking for a match, and consequently performs better.

with VectorContainsMatch:

    ...
    umaxp v19.16b, v18.16b, v18.16b
    umov x7, v19.d[0]
    cbnz x7, G_M000_IG17
    ...

with ExtractMostSignificantBits:

    ...
    ldr q18, [@RWD00]
    and v18.16b, v16.16b, v18.16b
    ldr q17, [@RWD16]
    ushl v16.16b, v18.16b, v17.16b
    movi v17.4s, #0x00
    ext v17.16b, v16.16b, v17.16b, #8
    addv b17, v17.8b
    umov w0, v17.b[0]
    lsl w0, w0, #8
    addv b16, v16.8b
    umov w1, v16.b[0]
    orr w1, w0, w1
    cbz w1, G_M000_IG08
    ...
    RWD00 dq 8080808080808080h, 8080808080808080h
    RWD16 dq 00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h
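The two instruction sequences above can be sketched in C# roughly as follows. This is a minimal illustration, not the PR's actual helper: `MatchSketch`, `ContainsMatch`, and `FirstMatchIndex` are hypothetical names, and the Arm64 path assumes the `umaxp`/`umov` pair is produced by `AdvSimd.Arm64.MaxPairwise` followed by a scalar read. On hardware without AdvSimd the sketch falls back to `ExtractMostSignificantBits`, so it runs anywhere .NET 7 or later does.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical sketch; names are illustrative and not taken from the PR.
public static class MatchSketch
{
    // Returns true if any 16-bit lane of a comparison result is non-zero.
    public static bool ContainsMatch(Vector128<ushort> compareResult)
    {
        Vector128<byte> bytes = compareResult.AsByte();
        if (AdvSimd.Arm64.IsSupported)
        {
            // Pairwise max folds the 16 bytes into the low 8 bytes,
            // so one 64-bit read is enough to see whether anything matched.
            Vector128<byte> folded = AdvSimd.Arm64.MaxPairwise(bytes, bytes);
            return folded.AsUInt64().ToScalar() != 0;
        }

        // Portable fallback: build a 16-bit mask from the top bit of each byte.
        return bytes.ExtractMostSignificantBits() != 0;
    }

    // Returns the index of the first matching 16-bit lane, or -1 if none matched.
    public static int FirstMatchIndex(Vector128<ushort> compareResult)
    {
        Vector128<byte> bytes = compareResult.AsByte();
        if (AdvSimd.Arm64.IsSupported)
        {
            Vector128<byte> folded = AdvSimd.Arm64.MaxPairwise(bytes, bytes);
            ulong lanes = folded.AsUInt64().ToScalar();
            // After folding, each char lane occupies 8 bits of 'lanes'.
            return lanes == 0 ? -1 : BitOperations.TrailingZeroCount(lanes) >> 3;
        }

        uint mask = bytes.ExtractMostSignificantBits();
        // Each char lane contributes two adjacent bits to the byte mask.
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask) >> 1;
    }
}
```

The `>> 3` in the Arm64 path corresponds to the `rbit`/`clz`/`asr #3` sequence in the listing: each lane is one byte wide after folding, so the bit position divided by eight is the element offset.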

On an Altra machine (not configured for benchmarking):

| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|:--|:--|:--|:--|--:|--:|--:|--:|--:|--:|--:|:--|--:|--:|
| IndexOfAnyTwoValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 94.40 ns | 0.073 ns | 0.068 ns | 94.42 ns | 94.19 ns | 94.46 ns | 1.62 | Slower | - | NA |
| IndexOfAnyTwoValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 32.40 ns | 0.019 ns | 0.015 ns | 32.39 ns | 32.38 ns | 32.43 ns | 0.56 | Faster | - | NA |
| IndexOfAnyTwoValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 58.32 ns | 0.021 ns | 0.019 ns | 58.33 ns | 58.29 ns | 58.36 ns | 1.00 | Base | - | NA |
| IndexOfAnyThreeValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 109.25 ns | 0.328 ns | 0.307 ns | 109.40 ns | 108.48 ns | 109.51 ns | 1.58 | Slower | - | NA |
| IndexOfAnyThreeValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 43.68 ns | 0.028 ns | 0.026 ns | 43.68 ns | 43.64 ns | 43.74 ns | 0.63 | Faster | - | NA |
| IndexOfAnyThreeValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 69.29 ns | 0.086 ns | 0.080 ns | 69.30 ns | 69.13 ns | 69.40 ns | 1.00 | Base | - | NA |
| IndexOfAnyFourValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 129.95 ns | 0.024 ns | 0.022 ns | 129.94 ns | 129.92 ns | 129.99 ns | 1.53 | Slower | - | NA |
| IndexOfAnyFourValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 57.56 ns | 0.066 ns | 0.062 ns | 57.58 ns | 57.49 ns | 57.64 ns | 0.68 | Faster | - | NA |
| IndexOfAnyFourValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 84.85 ns | 0.109 ns | 0.102 ns | 84.87 ns | 84.61 ns | 84.99 ns | 1.00 | Base | - | NA |

I need to confirm these benchmark numbers on a better-configured system; @adamsitnik can probably help.

Full Assembly:

VectorContainsMatch

    ; Assembly listing for method SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
    ; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
    ; optimized code
    ; fp based frame
    ; fully interruptible
    ; No PGO data
    ; 0 inlinees with PGO data; 6 single block inlinees; 4 inlinees without PGO data

    G_M000_IG01:  ;; offset=0000H
        A9BF7BFD  stp fp, lr, [sp,#-16]!
        910003FD  mov fp, sp

    G_M000_IG02:  ;; offset=0008H
        AA1F03E4  mov x4, xzr
        2A0303E5  mov w5, w3
        93407C63  sxtw x3, w3
        D1002063  sub x3, x3, #8
        F100007F  cmp x3, #0
        540003AB  blt G_M000_IG05

    G_M000_IG03:  ;; offset=0020H
        AA0303E5  mov x5, x3
        14000033  b G_M000_IG14
        align [0 bytes for IG07]
        align [0 bytes]
        align [0 bytes]
        align [0 bytes]

    G_M000_IG04:  ;; offset=0028H
        D37FF883  lsl x3, x4, #1
        8B030003  add x3, x0, x3
        79400066  ldrh w6, [x3]
        53003C27  uxth w7, w1
        6B0600FF  cmp w7, w6
        54000580  beq G_M000_IG13
        53003C48  uxth w8, w2
        6B06011F  cmp w8, w6
        54000520  beq G_M000_IG13
        79400466  ldrh w6, [x3,#2]
        6B0600FF  cmp w7, w6
        54000480  beq G_M000_IG12
        6B06011F  cmp w8, w6
        54000440  beq G_M000_IG12
        79400866  ldrh w6, [x3,#4]
        6B0600FF  cmp w7, w6
        540003A0  beq G_M000_IG11
        6B06011F  cmp w8, w6
        54000360  beq G_M000_IG11
        79400C66  ldrh w6, [x3,#6]
        6B0600FF  cmp w7, w6
        540002C0  beq G_M000_IG10
        6B06011F  cmp w8, w6
        54000280  beq G_M000_IG10
        91001084  add x4, x4, #4
        D10010A5  sub x5, x5, #4

    G_M000_IG05:  ;; offset=0090H
        F10010BF  cmp x5, #4
        54FFFCA2  bhs G_M000_IG04

    G_M000_IG06:  ;; offset=0098H
        B4000185  cbz x5, G_M000_IG08
        53003C27  uxth w7, w1

    G_M000_IG07:  ;; offset=00A0H
        D37FF886  lsl x6, x4, #1
        78666806  ldrh w6, [x0, x6]
        6B0600FF  cmp w7, w6
        54000200  beq G_M000_IG13
        53003C48  uxth w8, w2
        6B06011F  cmp w8, w6
        540001A0  beq G_M000_IG13
        91000484  add x4, x4, #1
        D10004A5  sub x5, x5, #1
        B5FFFEE5  cbnz x5, G_M000_IG07

    G_M000_IG08:  ;; offset=00C8H
        12800000  movn w0, #0

    G_M000_IG09:  ;; offset=00CCH
        A8C17BFD  ldp fp, lr, [sp],#16
        D65F03C0  ret lr

    G_M000_IG10:  ;; offset=00D4H
        11000C84  add w4, w4, #3
        14000025  b G_M000_IG18
        align [0 bytes for IG15]
        align [0 bytes]
        align [0 bytes]
        align [0 bytes]

    G_M000_IG11:  ;; offset=00DCH
        11000884  add w4, w4, #2
        14000023  b G_M000_IG18

    G_M000_IG12:  ;; offset=00E4H
        11000484  add w4, w4, #1
        14000021  b G_M000_IG18

    G_M000_IG13:  ;; offset=00ECH
        14000020  b G_M000_IG18

    G_M000_IG14:  ;; offset=00F0H
        53003C27  uxth w7, w1
        4E020CF0  dup v16.8h, w7
        53003C48  uxth w8, w2
        4E020D11  dup v17.8h, w8
        B4000185  cbz x5, G_M000_IG16

    G_M000_IG15:  ;; offset=0104H
        D37FF881  lsl x1, x4, #1
        3CE16812  ldr q18, [x0, x1]
        6E728E13  cmeq v19.8h, v16.8h, v18.8h
        6E728E32  cmeq v18.8h, v17.8h, v18.8h
        4EB21E72  orr v18.8h, v19.8h, v18.8h
        6E32A653  umaxp v19.16b, v18.16b, v18.16b
        4E083E67  umov x7, v19.d[0]
        B50001A7  cbnz x7, G_M000_IG17
        91002084  add x4, x4, #8
        EB0400BF  cmp x5, x4
        54FFFEC8  bhi G_M000_IG15

    G_M000_IG16:  ;; offset=0130H
        D37FF8A4  lsl x4, x5, #1
        3CE46812  ldr q18, [x0, x4]
        AA0503E4  mov x4, x5
        6E728E10  cmeq v16.8h, v16.8h, v18.8h
        6E728E31  cmeq v17.8h, v17.8h, v18.8h
        4EB11E12  orr v18.8h, v16.8h, v17.8h
        6E32A650  umaxp v16.16b, v18.16b, v18.16b
        4E083E00  umov x0, v16.d[0]
        B4FFFBC0  cbz x0, G_M000_IG08

    G_M000_IG17:  ;; offset=0154H
        6E32A650  umaxp v16.16b, v18.16b, v18.16b
        4E083E00  umov x0, v16.d[0]
        DAC00000  rbit x0, x0
        DAC01000  clz x0, x0
        13037C00  asr w0, w0, #3
        0B040004  add w4, w0, w4

    G_M000_IG18:  ;; offset=016CH
        2A0403E0  mov w0, w4

    G_M000_IG19:  ;; offset=0170H
        A8C17BFD  ldp fp, lr, [sp],#16
        D65F03C0  ret lr

    ; Total bytes of code 376

ExtractMostSignificantBits

    ; Assembly listing for method SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
    ; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
    ; optimized code
    ; fp based frame
    ; fully interruptible
    ; No PGO data
    ; 0 inlinees with PGO data; 7 single block inlinees; 1 inlinees without PGO data

    G_M000_IG01:  ;; offset=0000H
        A9BF7BFD  stp fp, lr, [sp,#-16]!
        910003FD  mov fp, sp

    G_M000_IG02:  ;; offset=0008H
        AA1F03E4  mov x4, xzr
        2A0303E5  mov w5, w3
        93407C63  sxtw x3, w3
        D1002063  sub x3, x3, #8
        F100007F  cmp x3, #0
        540003AB  blt G_M000_IG05

    G_M000_IG03:  ;; offset=0020H
        AA0303E5  mov x5, x3
        14000033  b G_M000_IG14
        align [0 bytes for IG07]
        align [0 bytes]
        align [0 bytes]
        align [0 bytes]

    G_M000_IG04:  ;; offset=0028H
        D37FF883  lsl x3, x4, #1
        8B030003  add x3, x0, x3
        79400066  ldrh w6, [x3]
        53003C27  uxth w7, w1
        6B0600FF  cmp w7, w6
        54000580  beq G_M000_IG13
        53003C48  uxth w8, w2
        6B06011F  cmp w8, w6
        54000520  beq G_M000_IG13
        79400466  ldrh w6, [x3,#2]
        6B0600FF  cmp w7, w6
        54000480  beq G_M000_IG12
        6B06011F  cmp w8, w6
        54000440  beq G_M000_IG12
        79400866  ldrh w6, [x3,#4]
        6B0600FF  cmp w7, w6
        540003A0  beq G_M000_IG11
        6B06011F  cmp w8, w6
        54000360  beq G_M000_IG11
        79400C66  ldrh w6, [x3,#6]
        6B0600FF  cmp w7, w6
        540002C0  beq G_M000_IG10
        6B06011F  cmp w8, w6
        54000280  beq G_M000_IG10
        91001084  add x4, x4, #4
        D10010A5  sub x5, x5, #4

    G_M000_IG05:  ;; offset=0090H
        F10010BF  cmp x5, #4
        54FFFCA2  bhs G_M000_IG04

    G_M000_IG06:  ;; offset=0098H
        B4000185  cbz x5, G_M000_IG08
        53003C27  uxth w7, w1

    G_M000_IG07:  ;; offset=00A0H
        D37FF886  lsl x6, x4, #1
        78666806  ldrh w6, [x0, x6]
        6B0600FF  cmp w7, w6
        54000200  beq G_M000_IG13
        53003C48  uxth w8, w2
        6B06011F  cmp w8, w6
        540001A0  beq G_M000_IG13
        91000484  add x4, x4, #1
        D10004A5  sub x5, x5, #1
        B5FFFEE5  cbnz x5, G_M000_IG07

    G_M000_IG08:  ;; offset=00C8H
        12800000  movn w0, #0

    G_M000_IG09:  ;; offset=00CCH
        A8C17BFD  ldp fp, lr, [sp],#16
        D65F03C0  ret lr

    G_M000_IG10:  ;; offset=00D4H
        11000C84  add w4, w4, #3
        14000039  b G_M000_IG18
        align [0 bytes for IG15]
        align [0 bytes]
        align [0 bytes]
        align [0 bytes]

    G_M000_IG11:  ;; offset=00DCH
        11000884  add w4, w4, #2
        14000037  b G_M000_IG18

    G_M000_IG12:  ;; offset=00E4H
        11000484  add w4, w4, #1
        14000035  b G_M000_IG18

    G_M000_IG13:  ;; offset=00ECH
        14000034  b G_M000_IG18

    G_M000_IG14:  ;; offset=00F0H
        53003C27  uxth w7, w1
        4E020CF0  dup v16.8h, w7
        53003C48  uxth w8, w2
        4E020D11  dup v17.8h, w8
        B40002C5  cbz x5, G_M000_IG16
        9C000672  ldr q18, [@RWD00]

    G_M000_IG15:  ;; offset=0108H
        D37FF881  lsl x1, x4, #1
        3CE16813  ldr q19, [x0, x1]
        6E738E14  cmeq v20.8h, v16.8h, v19.8h
        6E738E33  cmeq v19.8h, v17.8h, v19.8h
        4EB31E93  orr v19.8h, v20.8h, v19.8h
        4E321E73  and v19.16b, v19.16b, v18.16b
        9C000614  ldr q20, [@RWD16]
        6E344673  ushl v19.16b, v19.16b, v20.16b
        4F000414  movi v20.4s, #0x00
        6E144274  ext v20.16b, v19.16b, v20.16b, #8
        0E31BA94  addv b20, v20.8b
        0E013E81  umov w1, v20.b[0]
        53185C21  lsl w1, w1, #8
        0E31BA73  addv b19, v19.8b
        0E013E62  umov w2, v19.b[0]
        2A020021  orr w1, w1, w2
        350002E1  cbnz w1, G_M000_IG17
        91002084  add x4, x4, #8
        EB0400BF  cmp x5, x4
        54FFFDA8  bhi G_M000_IG15

    G_M000_IG16:  ;; offset=0158H
        D37FF8A4  lsl x4, x5, #1
        3CE46812  ldr q18, [x0, x4]
        AA0503E4  mov x4, x5
        6E728E10  cmeq v16.8h, v16.8h, v18.8h
        6E728E32  cmeq v18.8h, v17.8h, v18.8h
        4EB21E10  orr v16.8h, v16.8h, v18.8h
        9C000312  ldr q18, [@RWD00]
        4E321E12  and v18.16b, v16.16b, v18.16b
        9C000351  ldr q17, [@RWD16]
        6E314650  ushl v16.16b, v18.16b, v17.16b
        4F000411  movi v17.4s, #0x00
        6E114211  ext v17.16b, v16.16b, v17.16b, #8
        0E31BA31  addv b17, v17.8b
        0E013E20  umov w0, v17.b[0]
        53185C00  lsl w0, w0, #8
        0E31BA10  addv b16, v16.8b
        0E013E01  umov w1, v16.b[0]
        2A010001  orr w1, w0, w1
        34FFF941  cbz w1, G_M000_IG08

    G_M000_IG17:  ;; offset=01A4H
        5AC00020  rbit w0, w1
        5AC01000  clz w0, w0
        2A0003E0  mov w0, w0
        D341FC00  lsr x0, x0, #1
        8B000084  add x4, x4, x0
        17FFFFCD  b G_M000_IG13

    G_M000_IG18:  ;; offset=01BCH
        2A0403E0  mov w0, w4

    G_M000_IG19:  ;; offset=01C0H
        A8C17BFD  ldp fp, lr, [sp],#16
        D65F03C0  ret lr

    RWD00 dq 8080808080808080h, 8080808080808080h
    RWD16 dq 00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h

    ; Total bytes of code 456

stephentoub · 2022-08-11T18:57:29Z

Is this PR needed given #73469?

SwapnilGaikwad · 2022-08-12T17:40:47Z

> Is this PR needed given #73469?

Sure, we don't need this PR. I will close it once we transfer its useful parts to #73469.

bartonjs · 2022-08-15T17:23:18Z

> I will close this one once we transfer the useful parts of this to #73469.

#73469 has been merged, so should this be closed now?

stephentoub · 2022-08-16T13:30:55Z

I'm going to close this as it doesn't appear to be actionable now. @SwapnilGaikwad, if there are specific pieces that should be ported over, can you open a new PR for that? Thanks!

SwapnilGaikwad · 2022-08-16T14:15:45Z

I slightly mistimed the updates. It seems we can squeeze out some more performance at the cost of readability.
I created a new PR, #74010.

Add intrinsic for SpanHelpers.Char.IndexOfAny on AArch64

dec3a74

ghost added the area-System.Memory and community-contribution labels Aug 11, 2022

SwapnilGaikwad mentioned this pull request Aug 11, 2022

port SpanHelpers.IndexOfAny(ref byte, byte, byte, int) to Vector128/256 #73556

Closed

stephentoub reviewed Aug 11, 2022

stephentoub closed this Aug 16, 2022

ghost locked as resolved and limited conversation to collaborators Sep 15, 2022

SwapnilGaikwad wants to merge 1 commit into dotnet:main from SwapnilGaikwad:github-indexOfAny-intrinsics
