Add native ARM64 popcount for sizeof(T) above 1 to TensorPrimitives #103214
Merged
tannergooding merged 1 commit into dotnet:main on Jun 12, 2024
Tagging subscribers to this area: @dotnet/area-system-numerics
This was referenced Jun 10, 2024
stephentoub (Member) approved these changes on Jun 11, 2024 and left a comment:
LGTM. Thanks. @tannergooding ?
tannergooding approved these changes on Jun 11, 2024
I validated that the codegen before and after the change for sizeof(T) is 1 in InvokeSpanIntoSpan.Vectorized128 is identical, so whatever difference there is, it is likely noise (from non-temporal stores? I found the numbers unstable even at 256KiB of data).
For sizeof(T) is 8, I also tested Vector128 into Vector64x2 with vaddvq_u32 and back into V128, but it was slower than continuing to bruteforce.
Environment:
Benchmark:
Current:
PR:
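For readers unfamiliar with how a byte-only popcount can serve wider element types, here is a minimal portable C sketch of the general idea. It is not the code from this PR. It assumes the common ARM64 pattern of a per-byte count (the CNT instruction) followed by pairwise widening adds (UADDLP), and emulates both on a single 64-bit word with plain integer arithmetic. All function names are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: per-byte population count of a 64-bit word.
   Each of the 8 result bytes holds the bit count (0..8) of the matching
   input byte, similar to what a byte-wise vector popcount produces. */
static uint64_t popcount_bytes(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return x;
}

/* Hypothetical helpers: add adjacent lanes and widen the result,
   emulating a pairwise widening add on 8-, 16- and 32-bit lanes. */
static uint64_t pairwise_widen_8_to_16(uint64_t x) {
    return (x & 0x00FF00FF00FF00FFULL) + ((x >> 8) & 0x00FF00FF00FF00FFULL);
}
static uint64_t pairwise_widen_16_to_32(uint64_t x) {
    return (x & 0x0000FFFF0000FFFFULL) + ((x >> 16) & 0x0000FFFF0000FFFFULL);
}
static uint64_t pairwise_widen_32_to_64(uint64_t x) {
    return (x & 0x00000000FFFFFFFFULL) + (x >> 32);
}

/* Popcount of each 16-bit lane: byte counts, then one widening step. */
uint64_t popcount_u16_lanes(uint64_t x) {
    return pairwise_widen_8_to_16(popcount_bytes(x));
}

/* Popcount of each 32-bit lane: two widening steps. */
uint64_t popcount_u32_lanes(uint64_t x) {
    return pairwise_widen_16_to_32(popcount_u16_lanes(x));
}

/* Popcount of the single 64-bit lane: three widening steps. */
uint64_t popcount_u64_lane(uint64_t x) {
    return pairwise_widen_32_to_64(popcount_u32_lanes(x));
}

/* Bit-by-bit reference used only to cross-check the sketch. */
unsigned popcount_ref(uint64_t x) {
    unsigned n = 0;
    while (x) {
        n += (unsigned)(x & 1u);
        x >>= 1;
    }
    return n;
}

/* Returns 1 if the widened result agrees with the reference for v. */
int popcount_selfcheck(uint64_t v) {
    assert(popcount_u64_lane(v) == popcount_ref(v));
    return 1;
}
```

Each additional widening step adds one more operation per vector, so wider element types pay for more steps. The sketch does not attempt to reproduce the timings discussed in the description.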