Conversation
| int remainder; | ||
| if (Avx2.IsSupported) | ||
|
|
||
| if (Avx512BW.IsSupported) |
There was a problem hiding this comment.
This one's probably ok, but you'll want to consider bypassing the AVX-512 path on older hardware like Skylake-X, because of the well-known downclocking issues.
Vector512.IsHardwareAccelerated will return false on these processors, even though they support the AVX-512 instructions, so you can check both conditions to make sure the code only runs where AVX-512 is supported and fast.
There was a problem hiding this comment.
Foot meet gun!
I'll add the additional checks.
| result = AdvSimd.Insert(result, 3, AdvSimd.Extract(vector, (byte)((control >> 6) & 0x3))); | ||
| #pragma warning restore CA1857 // A constant is expected for the parameter | ||
| return result; | ||
| } |
There was a problem hiding this comment.
This is going to be a perf killer in most cases. It'll be worth testing any path that uses this to make sure it doesn't regress to slower than scalar when Arm starts taking a SIMD path it skipped before.
There was a problem hiding this comment.
I tossed a coin over this one. My thought was I was making enough savings elsewhere in calling method to justify the cost here. I need to see what the codegen is like, but I have no means to do so currently.
There was a problem hiding this comment.
🤮I'll um, i'll remove that then.
There was a problem hiding this comment.
I wish that you could define constants from a value given a parameter is a constant. I bet the codegen would be much better then.
- public static Vector128<float> Shuffle(Vector128<float> vector, [ConstantExpected] byte control)
+ public static Vector128<float> Shuffle(Vector128<float> vector, const byte control)Then...
const byte w = (byte)((control >> 6) & 0x3)There was a problem hiding this comment.
I would have thought the fact that it's inlined and the control byte is a constant would have been enough, so I'm quite surprised at that codegen. I'll have a look in the morning and see if I can figure out what's up.
There was a problem hiding this comment.
Looks to be an issue with where the constant folding happens.
The inlininer is happening and the operands seen are effectively:
[000000] ----------- arg0 +--* LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1> V00 arg0
[000006] ----------- arg1 +--* CNS_INT int 0
[000007] ----------- arg2 +--* LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1> V00 arg0
[000010] ----------- arg3 \--* AND int
[000008] ----------- +--* CNS_INT int 228
[000009] ----------- \--* CNS_INT int 3
We're likely missing an attempt to constant fold AND which is blocking it, which is unfortunate.
There was a problem hiding this comment.
Going to see if its a trivial fix I can get in...
There was a problem hiding this comment.
dotnet/runtime#97901 fixes this case (and similar scenarios) in particular
| /// <param name="vector">The value to convert.</param> | ||
| /// <returns>The <see cref="Vector256{Int32}"/>.</returns> | ||
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| public static Vector256<int> ConvertToInt32RoundToEven(Vector256<float> vector) |
There was a problem hiding this comment.
Personally, I'd avoid helpers like this for now. As of .NET 8, only AVX hardware supports hardware-accelerated Vector256 at all, so this is just that with more steps. You may be able to expand that when Arm SVE becomes available in .NET 9 (or 10 if you stick to the current pattern of targeting LTS), but one would hope by that time, you'll have a BCL-provided xplat helper you can use to replace any explicit AVX calls.
That is to say, at the time this becomes actually useful, you'll have to make changes anyway.
Same goes for Vector512, ofc.
There was a problem hiding this comment.
The hope is that bythe time 10 comes out I can replace Vector256Utilities.XX with Vector256.XX and it will be a simple job to refactor. Plus, the consistency feeds my soul.
There was a problem hiding this comment.
the consistency feeds my soul
I feel this 😄
My concern would be that the fallback code here is either
- Never used, so just a maintenance burden with no payoff, or
- Used accidentally, resulting in a perf drop over the optimized
Vector128path
Perhaps if you really do want to hide all intrinsics behind helpers like this, it would be better to throw PNSE outside the one optimized implementation. Be aware, though, adding these extra layers over intrinsics can hinder inlining when they're used in larger methods.
When refactoring time comes, most of these will hopefully just be Avx2.XX -> Vector256.XX anyway.
| DebugGuard.IsTrue(source.Length == destination.Length, nameof(source), "Input spans must be of same length!"); | ||
|
|
||
| if (Avx2.IsSupported || Sse2.IsSupported) | ||
| if (Avx512BW.IsSupported || Avx2.IsSupported || Sse2.IsSupported || AdvSimd.IsSupported) |
There was a problem hiding this comment.
| if (Avx512BW.IsSupported || Avx2.IsSupported || Sse2.IsSupported || AdvSimd.IsSupported) | |
| if (Sse2.IsSupported || AdvSimd.IsSupported) |
Avx512BW.IsSupported and Avx2.IsSupported strictly imply Sse2.IsSupported, so this is redundant.
| } | ||
|
|
||
| Vector512<float> sign = vector & Vector512.Create(-0.0f); | ||
| Vector512<float> val_2p23_f32 = sign | Vector512.Create(8388608.0f); |
There was a problem hiding this comment.
Can you make 8388608.0f a named constant and use that in the other VectorXyzUtilities too?
So it's defined three times.
There was a problem hiding this comment.
I'll have a look. Not sure where to put it.
There was a problem hiding this comment.
Worth noting that 8388608 is 2^23, which is the boundary value at which fractional data can no longer be represented (only whole integers at or above this boundary).
There was a problem hiding this comment.
Never found a place to put this.
Co-authored-by: Clinton Ingram <clinton.ingram@outlook.com>
…ageSharp into js/xplat-intrinsics
saucecontrol
left a comment
There was a problem hiding this comment.
Looks good aside from the tangle of IsSupported checks 👍
| if ((Vector512.IsHardwareAccelerated && Avx512F.IsSupported) || | ||
| Avx2.IsSupported || | ||
| Sse2.IsSupported || | ||
| AdvSimd.IsSupported) |
There was a problem hiding this comment.
| if ((Vector512.IsHardwareAccelerated && Avx512F.IsSupported) || | |
| Avx2.IsSupported || | |
| Sse2.IsSupported || | |
| AdvSimd.IsSupported) | |
| if (Vector128.IsHardwareAccelerated) |
The check as written won't prevent you running the AVX-512 code on the slow processors, because those will still pass the Avx2.IsSupported check and enter this block, where they'll also pass the Avx512F.IsSupported check below and use Vector512 anyway.
This entire expression boils down to the lowest common denominator, which is Sse2.IsSupported || AdvSimd.IsSupported. That would seem to be unnecessarily restrictive though, because your Vector128 path is 100% xplat intrinsics. I don't know if Vector128.IsHardwareAccelerated reports true for wasm on .NET 8, but if it does now or in .NET 9, you'd be getting that for free if you loosen this check to just Vector128.IsHardwareAccelerated.
There was a problem hiding this comment.
Ah yeah, this is much smarter, nice to know about wasm too.
| { | ||
| int remainder; | ||
| if (Avx2.IsSupported) | ||
| if (Avx512F.IsSupported) |
There was a problem hiding this comment.
| if (Avx512F.IsSupported) | |
| if (Vector512.IsHardwareAccelerated && Avx512F.IsSupported) |
There was a problem hiding this comment.
BTW, I'm not 100% sure you want to include the Vector512.IsHardwareAccelerated restriction on all of these. You do have the option of restricting on IsSupported only and telling users who don't want to use AVX-512 on the slower hardware to disable it with DOTNET_EnableAVX512F=0, though that does also disable the new instructions on narrower vectors (AVX-512 VL) as well. I haven't made this decision for my own libs yet.
There was a problem hiding this comment.
I think I'll stick and include the check. I want my users to get no surprises.
| Span<float> destination) | ||
| { | ||
| fixed (byte* sourceBase = source) | ||
| if (Avx512F.IsSupported) |
There was a problem hiding this comment.
| if (Avx512F.IsSupported) | |
| if (Vector512.IsHardwareAccelerated && Avx512F.IsSupported) |
| Unsafe.Add(ref d, 3) = f3; | ||
| } | ||
| } | ||
| else if (Sse2.IsSupported || AdvSimd.IsSupported) |
There was a problem hiding this comment.
| else if (Sse2.IsSupported || AdvSimd.IsSupported) | |
| else |
| if ((Vector512.IsHardwareAccelerated && Avx512BW.IsSupported) || | ||
| (Vector256.IsHardwareAccelerated && Avx2.IsSupported) || | ||
| (Vector128.IsHardwareAccelerated && (Sse2.IsSupported || AdvSimd.IsSupported))) |
There was a problem hiding this comment.
| if ((Vector512.IsHardwareAccelerated && Avx512BW.IsSupported) || | |
| (Vector256.IsHardwareAccelerated && Avx2.IsSupported) || | |
| (Vector128.IsHardwareAccelerated && (Sse2.IsSupported || AdvSimd.IsSupported))) | |
| if (Sse2.IsSupported || AdvSimd.IsSupported) |
Edit: Oops, this one doesn't have xplat fallback for saturated narrowing. Need to keep the platform checks.
There was a problem hiding this comment.
Is there one I can add? Would be great to do so.
There was a problem hiding this comment.
There's Vector128.Narrow (truncating), plus Min and Max, so you could put something together. It's a bit awkward because you really want to go int->byte, and splitting the helpers to plug in for each of the narrowing steps separately would be a lot of extra codegen. A single clamp to byte range followed by double narrow would be slightly cleaner.
There was a problem hiding this comment.
Thanks. I think I'll leave it for now. Don't want to spend too long figuring it out.
| int remainder; | ||
| if (Avx2.IsSupported) | ||
|
|
||
| if (Avx512BW.IsSupported) |
There was a problem hiding this comment.
| if (Avx512BW.IsSupported) | |
| if (Vector512.IsHardwareAccelerated && Avx512BW.IsSupported) |
| Span<byte> destination) | ||
| { | ||
| if (Avx2.IsSupported) | ||
| if (Avx512BW.IsSupported) |
There was a problem hiding this comment.
| if (Avx512BW.IsSupported) | |
| if (Vector512.IsHardwareAccelerated && Avx512BW.IsSupported) |
gfoidl
left a comment
There was a problem hiding this comment.
Modulo some minor things and the feedback from @saucecontrol: LGTM
|
|
||
| for (int i = 0; i < source.Length; i++) | ||
| { | ||
| Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, i) / 255f; |
There was a problem hiding this comment.
| Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, i) / 255f; | |
| Unsafe.Add(ref dBase, (uint)i) = Unsafe.Add(ref sBase, (uint)i) / 255f; |
| ref byte dBase = ref MemoryMarshal.GetReference(destination); | ||
| for (int i = 0; i < source.Length; i++) |
There was a problem hiding this comment.
| ref byte dBase = ref MemoryMarshal.GetReference(destination); | |
| for (int i = 0; i < source.Length; i++) | |
| ref byte dBase = ref MemoryMarshal.GetReference(destination); | |
| for (int i = 0; i < source.Length; i++) |
For code style to have the same blank line as above.
| ref byte dBase = ref MemoryMarshal.GetReference(destination); | ||
| for (int i = 0; i < source.Length; i++) | ||
| { | ||
| Unsafe.Add(ref dBase, i) = ConvertToByte(Unsafe.Add(ref sBase, i)); |
There was a problem hiding this comment.
| Unsafe.Add(ref dBase, i) = ConvertToByte(Unsafe.Add(ref sBase, i)); | |
| Unsafe.Add(ref dBase, (uint)i) = ConvertToByte(Unsafe.Add(ref sBase, (uint)i)); |
|
Thanks for the review chaps. I've actioned the lot now. Will merge on successful build. |
|
Stating to see some improvement in numbers here.
|

Prerequisites
Description
The plan here is to be able to remove
ExtendedIntrinsicsandFallbackIntrinsics128by normalizing theHwIntrinsicsimplementation addingVector512support and extendingVector128support to include ARM.Any and all input is welcome.
CC @tannergooding @saucecontrol @gfoidl @br3aker