Accelerate Vector128<long>::op_Multiply on x64 #103555
Conversation
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
Note: results should be better if we do this in the JIT, since that enables loop hoisting, CSE, etc. for MUL.

See #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from the xxHash128 issue.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs
…sics/Vector128_1.cs Co-authored-by: Tanner Gooding <tagoo@outlook.com>
@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

```cs
using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Bench
{
    static readonly byte[] Data = new byte[1000000];

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        XxHash128 hash = new();
        hash.Append(Data);
        return hash.GetHashAndReset();
    }
}
```
Benchmark results on Intel
Flame graphs: Main vs PR 🔥
Benchmark results on AMD
Flame graphs: Main vs PR 🔥
Benchmark results on Arm64
Flame graphs: Main vs PR 🔥
/azp list

/azp run runtime-coreclr jitstress-isas-x86

Azure Pipelines successfully started running 1 pipeline(s).
@tannergooding PTAL. I'll add arm64 separately; I need to test different impls. Benchmark improvement: #103555 (comment)
```cpp
// Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
GenTreeHWIntrinsic* tmp3 =
    gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
                             is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
                             CORINFO_TYPE_UINT, simdSize);
```
I know in other places we've started avoiding hadd in favor of shuffle+add; it might be worth seeing whether that's appropriate here too (low priority, non-blocking).
I tried benchmarking different implementations for it and they were all equally fast, e.g. #99871 (comment)
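For context on the hadd-vs-shuffle+add discussion above, here is a scalar sketch (hypothetical helper names, plain arrays rather than actual SIMD intrinsics) of why the two forms are interchangeable here: `HorizontalAdd(v, zero)` on four 32-bit lanes yields `[v0+v1, v2+v3, 0, 0]`, and the same result falls out of compacting the even and odd lanes with a shuffle and adding them.

```cpp
#include <array>
#include <cassert>

// Four 32-bit lanes, modeled as a plain array for illustration.
using Vec4 = std::array<int, 4>;

// What hadd(v, zero) produces: pairwise sums of v in the low lanes,
// pairwise sums of the zero vector (i.e. zeros) in the high lanes.
Vec4 hadd_with_zero(const Vec4& v)
{
    return {v[0] + v[1], v[2] + v[3], 0, 0};
}

// The shuffle+add equivalent: "shuffle" the even lanes and the odd lanes
// into the low positions, add the two results, leave the upper lanes zero.
Vec4 shuffle_add_with_zero(const Vec4& v)
{
    Vec4 even = {v[0], v[2], 0, 0};
    Vec4 odd  = {v[1], v[3], 0, 0};
    return {even[0] + odd[0], even[1] + odd[1], 0, 0};
}
```

On real hardware the choice between the two comes down to port pressure and latency of `phaddd` versus a shuffle plus `paddd`, which is why the benchmarks above showing them equally fast settles it either way.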
```cpp
if (TARGET_POINTER_SIZE == 4)
{
    // TODO-XARCH-CQ: We should support long/ulong multiplication
    // TODO-XARCH-CQ: 32bit support
```
What's blocking 32-bit support? It doesn't look like we're using any _X64 intrinsics in the fallback logic?
Not sure, to be honest; that check was pre-existing, I only changed the comment.
This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX512 is not present in the system. It makes XxHash128 faster, see #103555 (comment).

Current codegen on x64 CPU without AVX512:

New codegen:
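For readers unfamiliar with why 64-bit element multiplication needs special handling pre-AVX512: SSE2/AVX2 have no 64x64-bit element multiply, only a 32x32→64 one (`pmuludq`), so the fallback has to build the 64-bit product from 32-bit halves. Below is a scalar sketch of that identity (my own illustration with a hypothetical helper name, not the JIT code from this PR):

```cpp
#include <cstdint>

// Identity the pmuludq-style fallback relies on:
//   a*b mod 2^64 = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32)
// The hi(a)*hi(b) term contributes only to bits >= 64 and is dropped.
uint64_t mul64_via_32bit_parts(uint64_t a, uint64_t b)
{
    uint64_t aLo = a & 0xFFFFFFFFu, aHi = a >> 32;
    uint64_t bLo = b & 0xFFFFFFFFu, bHi = b >> 32;

    uint64_t low   = aLo * bLo;              // full 64-bit low product
    uint64_t cross = aLo * bHi + aHi * bLo;  // only its low 32 bits survive the shift
    return low + (cross << 32);
}
```

In the vectorized form each of the three partial products maps to one `pmuludq` (or two plus shifts/adds per lane pair), which is exactly the kind of sequence that benefits from being emitted in the JIT where CSE and hoisting can clean it up.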