Skip to content

Cleanup SimdUtils#2654

Merged
JimBobSquarePants merged 8 commits intomainfrom
js/xplat-intrinsics
Feb 14, 2024
Merged

Cleanup SimdUtils#2654
JimBobSquarePants merged 8 commits intomainfrom
js/xplat-intrinsics

Conversation

@JimBobSquarePants
Copy link
Member

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

The plan here is to be able to remove ExtendedIntrinsics and FallbackIntrinsics128 by normalizing the HwIntrinsics implementation adding Vector512 support and extending Vector128 support to include ARM.

Any and all input is welcome.

CC @tannergooding @saucecontrol @gfoidl @br3aker

Copy link
Contributor

@saucecontrol saucecontrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there's a test failure on Arm64. Nothing jumps out as a cause, but I left some comments anyway :)

int remainder;
if (Avx2.IsSupported)

if (Avx512BW.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one's probably ok, but you'll want to consider bypassing the AVX-512 path on older hardware like Skylake-X, because of the well-known downclocking issues.

Vector512.IsHardwareAccelerated will return false on these processors, even though they support the AVX-512 instructions, so you can check both conditions to make sure the code only runs where AVX-512 is supported and fast.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Foot meet gun!
I'll add the additional checks.

result = AdvSimd.Insert(result, 3, AdvSimd.Extract(vector, (byte)((control >> 6) & 0x3)));
#pragma warning restore CA1857 // A constant is expected for the parameter
return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be a perf killer in most cases. It'll be worth testing any path that uses this to make sure it doesn't regress to slower than scalar when Arm starts taking a SIMD path it skipped before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tossed a coin over this one. My thought was I was making enough savings elsewhere in calling method to justify the cost here. I need to see what the codegen is like, but I have no means to do so currently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed codegen would just be the 4 extracts and 3 inserts (which would make the shuffle a minimum of 8 cycles vs the 1 it takes on x86), but it's actually worse.

image

Copy link
Member Author

@JimBobSquarePants JimBobSquarePants Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤮I'll um, i'll remove that then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish that you could define constants from a value given a parameter is a constant. I bet the codegen would be much better then.

- public static Vector128<float> Shuffle(Vector128<float> vector, [ConstantExpected] byte control)
+ public static Vector128<float> Shuffle(Vector128<float> vector, const byte control)

Then...

const byte w = (byte)((control >> 6) & 0x3)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have thought the fact that it's inlined and the control byte is a constant would have been enough, so I'm quite surprised at that codegen. I'll have a look in the morning and see if I can figure out what's up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks to be an issue with where the constant folding happens.

The inlininer is happening and the operands seen are effectively:

               [000000] ----------- arg0                    +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V00 arg0         
               [000006] ----------- arg1                    +--*  CNS_INT   int    0
               [000007] ----------- arg2                    +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V00 arg0         
               [000010] ----------- arg3                    \--*  AND       int   
               [000008] -----------                            +--*  CNS_INT   int    228
               [000009] -----------                            \--*  CNS_INT   int    3

We're likely missing an attempt to constant fold AND which is blocking it, which is unfortunate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to see if its a trivial fix I can get in...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dotnet/runtime#97901 fixes this case (and similar scenarios) in particular

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hero!

/// <param name="vector">The value to convert.</param>
/// <returns>The <see cref="Vector256{Int32}"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<int> ConvertToInt32RoundToEven(Vector256<float> vector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'd avoid helpers like this for now. As of .NET 8, only AVX hardware supports hardware-accelerated Vector256 at all, so this is just that with more steps. You may be able to expand that when Arm SVE becomes available in .NET 9 (or 10 if you stick to the current pattern of targeting LTS), but one would hope by that time, you'll have a BCL-provided xplat helper you can use to replace any explicit AVX calls.

That is to say, at the time this becomes actually useful, you'll have to make changes anyway.

Same goes for Vector512, ofc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hope is that bythe time 10 comes out I can replace Vector256Utilities.XX with Vector256.XX and it will be a simple job to refactor. Plus, the consistency feeds my soul.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the consistency feeds my soul

I feel this 😄

My concern would be that the fallback code here is either

  1. Never used, so just a maintenance burden with no payoff, or
  2. Used accidentally, resulting in a perf drop over the optimized Vector128 path

Perhaps if you really do want to hide all intrinsics behind helpers like this, it would be better to throw PNSE outside the one optimized implementation. Be aware, though, adding these extra layers over intrinsics can hinder inlining when they're used in larger methods.

When refactoring time comes, most of these will hopefully just be Avx2.XX -> Vector256.XX anyway.

DebugGuard.IsTrue(source.Length == destination.Length, nameof(source), "Input spans must be of same length!");

if (Avx2.IsSupported || Sse2.IsSupported)
if (Avx512BW.IsSupported || Avx2.IsSupported || Sse2.IsSupported || AdvSimd.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Avx512BW.IsSupported || Avx2.IsSupported || Sse2.IsSupported || AdvSimd.IsSupported)
if (Sse2.IsSupported || AdvSimd.IsSupported)

Avx512BW.IsSupported and Avx2.IsSupported strictly imply Sse2.IsSupported, so this is redundant.

}

Vector512<float> sign = vector & Vector512.Create(-0.0f);
Vector512<float> val_2p23_f32 = sign | Vector512.Create(8388608.0f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make 8388608.0f a named constant and use that in the other VectorXyzUtilities too?
So it's defined three times.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look. Not sure where to put it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that 8388608 is 2^23, which is the boundary value at which fractional data can no longer be represented (only whole integers at or above this boundary).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never found a place to put this.

@JimBobSquarePants JimBobSquarePants changed the title WIP Cleanup SimdUtils Cleanup SimdUtils Feb 6, 2024
@JimBobSquarePants JimBobSquarePants marked this pull request as ready for review February 6, 2024 06:31
@JimBobSquarePants JimBobSquarePants added this to the v4.0.0 milestone Feb 6, 2024
Copy link
Contributor

@saucecontrol saucecontrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good aside from the tangle of IsSupported checks 👍

Comment on lines 764 to 767
if ((Vector512.IsHardwareAccelerated && Avx512F.IsSupported) ||
Avx2.IsSupported ||
Sse2.IsSupported ||
AdvSimd.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ((Vector512.IsHardwareAccelerated && Avx512F.IsSupported) ||
Avx2.IsSupported ||
Sse2.IsSupported ||
AdvSimd.IsSupported)
if (Vector128.IsHardwareAccelerated)

The check as written won't prevent you running the AVX-512 code on the slow processors, because those will still pass the Avx2.IsSupported check and enter this block, where they'll also pass the Avx512F.IsSupported check below and use Vector512 anyway.

This entire expression boils down to the lowest common denominator, which is Sse2.IsSupported || AdvSimd.IsSupported. That would seem to be unnecessarily restrictive though, because your Vector128 path is 100% xplat intrinsics. I don't know if Vector128.IsHardwareAccelerated reports true for wasm on .NET 8, but if it does now or in .NET 9, you'd be getting that for free if you loosen this check to just Vector128.IsHardwareAccelerated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, this is much smarter, nice to know about wasm too.

{
int remainder;
if (Avx2.IsSupported)
if (Avx512F.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Avx512F.IsSupported)
if (Vector512.IsHardwareAccelerated && Avx512F.IsSupported)

Copy link
Contributor

@saucecontrol saucecontrol Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I'm not 100% sure you want to include the Vector512.IsHardwareAccelerated restriction on all of these. You do have the option of restricting on IsSupported only and telling users who don't want to use AVX-512 on the slower hardware to disable it with DOTNET_EnableAVX512F=0, though that does also disable the new instructions on narrower vectors (AVX-512 VL) as well. I haven't made this decision for my own libs yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll stick and include the check. I want my users to get no surprises.

Span<float> destination)
{
fixed (byte* sourceBase = source)
if (Avx512F.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Avx512F.IsSupported)
if (Vector512.IsHardwareAccelerated && Avx512F.IsSupported)

Unsafe.Add(ref d, 3) = f3;
}
}
else if (Sse2.IsSupported || AdvSimd.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else if (Sse2.IsSupported || AdvSimd.IsSupported)
else

Comment on lines 934 to 936
if ((Vector512.IsHardwareAccelerated && Avx512BW.IsSupported) ||
(Vector256.IsHardwareAccelerated && Avx2.IsSupported) ||
(Vector128.IsHardwareAccelerated && (Sse2.IsSupported || AdvSimd.IsSupported)))
Copy link
Contributor

@saucecontrol saucecontrol Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ((Vector512.IsHardwareAccelerated && Avx512BW.IsSupported) ||
(Vector256.IsHardwareAccelerated && Avx2.IsSupported) ||
(Vector128.IsHardwareAccelerated && (Sse2.IsSupported || AdvSimd.IsSupported)))
if (Sse2.IsSupported || AdvSimd.IsSupported)

Edit: Oops, this one doesn't have xplat fallback for saturated narrowing. Need to keep the platform checks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there one I can add? Would be great to do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's Vector128.Narrow (truncating), plus Min and Max, so you could put something together. It's a bit awkward because you really want to go int->byte, and splitting the helpers to plug in for each of the narrowing steps separately would be a lot of extra codegen. A single clamp to byte range followed by double narrow would be slightly cleaner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think I'll leave it for now. Don't want to spend too long figuring it out.

int remainder;
if (Avx2.IsSupported)

if (Avx512BW.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Avx512BW.IsSupported)
if (Vector512.IsHardwareAccelerated && Avx512BW.IsSupported)

Span<byte> destination)
{
if (Avx2.IsSupported)
if (Avx512BW.IsSupported)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Avx512BW.IsSupported)
if (Vector512.IsHardwareAccelerated && Avx512BW.IsSupported)

Copy link
Contributor

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo some minor things and the feedback from @saucecontrol: LGTM


for (int i = 0; i < source.Length; i++)
{
Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, i) / 255f;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, i) / 255f;
Unsafe.Add(ref dBase, (uint)i) = Unsafe.Add(ref sBase, (uint)i) / 255f;

Comment on lines 68 to 69
ref byte dBase = ref MemoryMarshal.GetReference(destination);
for (int i = 0; i < source.Length; i++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ref byte dBase = ref MemoryMarshal.GetReference(destination);
for (int i = 0; i < source.Length; i++)
ref byte dBase = ref MemoryMarshal.GetReference(destination);
for (int i = 0; i < source.Length; i++)

For code style to have the same blank line as above.

ref byte dBase = ref MemoryMarshal.GetReference(destination);
for (int i = 0; i < source.Length; i++)
{
Unsafe.Add(ref dBase, i) = ConvertToByte(Unsafe.Add(ref sBase, i));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Unsafe.Add(ref dBase, i) = ConvertToByte(Unsafe.Add(ref sBase, i));
Unsafe.Add(ref dBase, (uint)i) = ConvertToByte(Unsafe.Add(ref sBase, (uint)i));

@JimBobSquarePants
Copy link
Member Author

Thanks for the review chaps. I've actioned the lot now. Will merge on successful build.

@JimBobSquarePants JimBobSquarePants merged commit 1f22bce into main Feb 14, 2024
@JimBobSquarePants JimBobSquarePants deleted the js/xplat-intrinsics branch February 14, 2024 02:13
@JimBobSquarePants
Copy link
Member Author

Stating to see some improvement in numbers here.

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.3085/23H2/2023Update/SunValley3)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.101
  [Host]   : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  ShortRun : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Job=ShortRun  IterationCount=5  LaunchCount=1
WarmupCount=5
Method Mean Error StdDev Ratio Gen0 Gen1 Gen2 Allocated Alloc Ratio
'System.Drawing Load, Resize, Save' 368.56 ms 12.636 ms 1.955 ms 1.00 - - - 13.23 KB 1.00
'ImageFlow Load, Resize, Save' 270.28 ms 12.142 ms 3.153 ms 0.74 500.0000 500.0000 500.0000 4642.53 KB 351.00
'ImageSharp Load, Resize, Save' 114.87 ms 2.408 ms 0.625 ms 0.31 - - - 1314.02 KB 99.35
'ImageSharp TD Load, Resize, Save' 75.31 ms 1.923 ms 0.499 ms 0.20 166.6667 - - 1307.22 KB 98.83
'ImageMagick Load, Resize, Save' 393.97 ms 7.115 ms 1.101 ms 1.07 - - - 54.48 KB 4.12
'ImageFree Load, Resize, Save' 235.16 ms 2.376 ms 0.617 ms 0.64 6000.0000 6000.0000 6000.0000 95.8 KB 7.24
'MagicScaler Load, Resize, Save' 66.67 ms 0.873 ms 0.135 ms 0.18 - - - 45.42 KB 3.43
'SkiaSharp Load, Resize, Save' 136.42 ms 2.833 ms 0.736 ms 0.37 - - - 88.15 KB 6.66
'NetVips Load, Resize, Save' 122.52 ms 7.601 ms 1.974 ms 0.33 - - - 50.95 KB 3.85

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants