Cleanup SimdUtils by JimBobSquarePants · Pull Request #2654 · SixLabors/ImageSharp

JimBobSquarePants · 2024-02-02T06:02:07Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

The plan here is to be able to remove ExtendedIntrinsics and FallbackIntrinsics128 by normalizing the HwIntrinsics implementation adding Vector512 support and extending Vector128 support to include ARM.

Any and all input is welcome.

CC @tannergooding @saucecontrol @gfoidl @br3aker

saucecontrol

I see there's a test failure on Arm64. Nothing jumps out as a cause, but I left some comments anyway :)

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

saucecontrol · 2024-02-02T07:28:49Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

                int remainder;
-                if (Avx2.IsSupported)
+
+                if (Avx512BW.IsSupported)


This one's probably ok, but you'll want to consider bypassing the AVX-512 path on older hardware like Skylake-X, because of the well-known downclocking issues.

Vector512.IsHardwareAccelerated will return false on these processors, even though they support the AVX-512 instructions, so you can check both conditions to make sure the code only runs where AVX-512 is supported and fast.

Foot meet gun!
I'll add the additional checks.

saucecontrol · 2024-02-02T07:31:19Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+            result = AdvSimd.Insert(result, 3, AdvSimd.Extract(vector, (byte)((control >> 6) & 0x3)));
+#pragma warning restore CA1857 // A constant is expected for the parameter
+            return result;
+        }


This is going to be a perf killer in most cases. It'll be worth testing any path that uses this to make sure it doesn't regress to slower than scalar when Arm starts taking a SIMD path it skipped before.

I tossed a coin over this one. My thought was I was making enough savings elsewhere in calling method to justify the cost here. I need to see what the codegen is like, but I have no means to do so currently.

I assumed codegen would just be the 4 extracts and 3 inserts (which would make the shuffle a minimum of 8 cycles vs the 1 it takes on x86), but it's actually worse.

🤮I'll um, i'll remove that then.

I wish that you could define constants from a value given a parameter is a constant. I bet the codegen would be much better then.

- public static Vector128<float> Shuffle(Vector128<float> vector, [ConstantExpected] byte control) + public static Vector128<float> Shuffle(Vector128<float> vector, const byte control)

Then...

const byte w = (byte)((control >> 6) & 0x3)

I would have thought the fact that it's inlined and the control byte is a constant would have been enough, so I'm quite surprised at that codegen. I'll have a look in the morning and see if I can figure out what's up.

Looks to be an issue with where the constant folding happens.

The inlininer is happening and the operands seen are effectively:

[000000] ----------- arg0 +--* LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1> V00 arg0 [000006] ----------- arg1 +--* CNS_INT int 0 [000007] ----------- arg2 +--* LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1> V00 arg0 [000010] ----------- arg3 \--* AND int [000008] ----------- +--* CNS_INT int 228 [000009] ----------- \--* CNS_INT int 3

We're likely missing an attempt to constant fold AND which is blocking it, which is unfortunate.

Going to see if its a trivial fix I can get in...

dotnet/runtime#97901 fixes this case (and similar scenarios) in particular

saucecontrol · 2024-02-02T07:36:27Z

src/ImageSharp/Common/Helpers/Vector256Utilities.cs

+    /// <param name="vector">The value to convert.</param>
+    /// <returns>The <see cref="Vector256{Int32}"/>.</returns>
+    [MethodImpl(MethodImplOptions.AggressiveInlining)]
+    public static Vector256<int> ConvertToInt32RoundToEven(Vector256<float> vector)


Personally, I'd avoid helpers like this for now. As of .NET 8, only AVX hardware supports hardware-accelerated Vector256 at all, so this is just that with more steps. You may be able to expand that when Arm SVE becomes available in .NET 9 (or 10 if you stick to the current pattern of targeting LTS), but one would hope by that time, you'll have a BCL-provided xplat helper you can use to replace any explicit AVX calls.

That is to say, at the time this becomes actually useful, you'll have to make changes anyway.

Same goes for Vector512, ofc.

The hope is that bythe time 10 comes out I can replace Vector256Utilities.XX with Vector256.XX and it will be a simple job to refactor. Plus, the consistency feeds my soul.

the consistency feeds my soul

I feel this 😄

My concern would be that the fallback code here is either

Never used, so just a maintenance burden with no payoff, or

Used accidentally, resulting in a perf drop over the optimized Vector128 path

Perhaps if you really do want to hide all intrinsics behind helpers like this, it would be better to throw PNSE outside the one optimized implementation. Be aware, though, adding these extra layers over intrinsics can hinder inlining when they're used in larger methods.

When refactoring time comes, most of these will hopefully just be Avx2.XX -> Vector256.XX anyway.

saucecontrol · 2024-02-02T07:43:27Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

+            DebugGuard.IsTrue(source.Length == destination.Length, nameof(source), "Input spans must be of same length!");

-            if (Avx2.IsSupported || Sse2.IsSupported)
+            if (Avx512BW.IsSupported || Avx2.IsSupported || Sse2.IsSupported || AdvSimd.IsSupported)


Suggested change

if (Avx512BW.IsSupported || Avx2.IsSupported || Sse2.IsSupported || AdvSimd.IsSupported)

if (Sse2.IsSupported || AdvSimd.IsSupported)

Avx512BW.IsSupported and Avx2.IsSupported strictly imply Sse2.IsSupported, so this is redundant.

gfoidl · 2024-02-02T09:23:34Z

src/ImageSharp/Common/Helpers/Vector512Utilities.cs

+        }
+
+        Vector512<float> sign = vector & Vector512.Create(-0.0f);
+        Vector512<float> val_2p23_f32 = sign | Vector512.Create(8388608.0f);


Can you make 8388608.0f a named constant and use that in the other VectorXyzUtilities too?
So it's defined three times.

I'll have a look. Not sure where to put it.

Worth noting that 8388608 is 2^23, which is the boundary value at which fractional data can no longer be represented (only whole integers at or above this boundary).

Never found a place to put this.

Co-authored-by: Clinton Ingram <clinton.ingram@outlook.com>

…ageSharp into js/xplat-intrinsics

saucecontrol

Looks good aside from the tangle of IsSupported checks 👍

saucecontrol · 2024-02-08T03:11:07Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

+            if ((Vector512.IsHardwareAccelerated && Avx512F.IsSupported) ||
+                Avx2.IsSupported ||
+                Sse2.IsSupported ||
+                AdvSimd.IsSupported)


Suggested change

if ((Vector512.IsHardwareAccelerated && Avx512F.IsSupported) ||

Avx2.IsSupported ||

Sse2.IsSupported ||

AdvSimd.IsSupported)

if (Vector128.IsHardwareAccelerated)

The check as written won't prevent you running the AVX-512 code on the slow processors, because those will still pass the Avx2.IsSupported check and enter this block, where they'll also pass the Avx512F.IsSupported check below and use Vector512 anyway.

This entire expression boils down to the lowest common denominator, which is Sse2.IsSupported || AdvSimd.IsSupported. That would seem to be unnecessarily restrictive though, because your Vector128 path is 100% xplat intrinsics. I don't know if Vector128.IsHardwareAccelerated reports true for wasm on .NET 8, but if it does now or in .NET 9, you'd be getting that for free if you loosen this check to just Vector128.IsHardwareAccelerated.

Ah yeah, this is much smarter, nice to know about wasm too.

saucecontrol · 2024-02-08T03:11:45Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

            {
                int remainder;
-                if (Avx2.IsSupported)
+                if (Avx512F.IsSupported)


Suggested change

if (Avx512F.IsSupported)

if (Vector512.IsHardwareAccelerated && Avx512F.IsSupported)

BTW, I'm not 100% sure you want to include the Vector512.IsHardwareAccelerated restriction on all of these. You do have the option of restricting on IsSupported only and telling users who don't want to use AVX-512 on the slower hardware to disable it with DOTNET_EnableAVX512F=0, though that does also disable the new instructions on narrower vectors (AVX-512 VL) as well. I haven't made this decision for my own libs yet.

I think I'll stick and include the check. I want my users to get no surprises.

saucecontrol · 2024-02-08T03:12:17Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

+            Span<float> destination)
        {
-            fixed (byte* sourceBase = source)
+            if (Avx512F.IsSupported)


Suggested change

if (Avx512F.IsSupported)

if (Vector512.IsHardwareAccelerated && Avx512F.IsSupported)

saucecontrol · 2024-02-08T03:12:36Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

+                    Unsafe.Add(ref d, 3) = f3;
+                }
+            }
+            else if (Sse2.IsSupported || AdvSimd.IsSupported)


Suggested change

else if (Sse2.IsSupported || AdvSimd.IsSupported)

else

saucecontrol · 2024-02-08T03:15:26Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

+            if ((Vector512.IsHardwareAccelerated && Avx512BW.IsSupported) ||
+                (Vector256.IsHardwareAccelerated && Avx2.IsSupported) ||
+                (Vector128.IsHardwareAccelerated && (Sse2.IsSupported || AdvSimd.IsSupported)))


Suggested change

if ((Vector512.IsHardwareAccelerated && Avx512BW.IsSupported) ||

(Vector256.IsHardwareAccelerated && Avx2.IsSupported) ||

(Vector128.IsHardwareAccelerated && (Sse2.IsSupported || AdvSimd.IsSupported)))

if (Sse2.IsSupported || AdvSimd.IsSupported)

Edit: Oops, this one doesn't have xplat fallback for saturated narrowing. Need to keep the platform checks.

Is there one I can add? Would be great to do so.

There's Vector128.Narrow (truncating), plus Min and Max, so you could put something together. It's a bit awkward because you really want to go int->byte, and splitting the helpers to plug in for each of the narrowing steps separately would be a lot of extra codegen. A single clamp to byte range followed by double narrow would be slightly cleaner.

Thanks. I think I'll leave it for now. Don't want to spend too long figuring it out.

saucecontrol · 2024-02-08T03:15:46Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

                int remainder;
-                if (Avx2.IsSupported)
+
+                if (Avx512BW.IsSupported)


Suggested change

if (Avx512BW.IsSupported)

if (Vector512.IsHardwareAccelerated && Avx512BW.IsSupported)

saucecontrol · 2024-02-08T03:16:13Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

+            Span<byte> destination)
        {
-            if (Avx2.IsSupported)
+            if (Avx512BW.IsSupported)


Suggested change

if (Avx512BW.IsSupported)

if (Vector512.IsHardwareAccelerated && Avx512BW.IsSupported)

gfoidl

Modulo some minor things and the feedback from @saucecontrol: LGTM

gfoidl · 2024-02-08T13:43:49Z

src/ImageSharp/Common/Helpers/SimdUtils.Convert.cs

+
+        for (int i = 0; i < source.Length; i++)
+        {
+            Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, i) / 255f;


Suggested change

Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, i) / 255f;

Unsafe.Add(ref dBase, (uint)i) = Unsafe.Add(ref sBase, (uint)i) / 255f;

gfoidl · 2024-02-08T13:44:23Z

src/ImageSharp/Common/Helpers/SimdUtils.Convert.cs

+        ref byte dBase = ref MemoryMarshal.GetReference(destination);
+        for (int i = 0; i < source.Length; i++)


Suggested change

ref byte dBase = ref MemoryMarshal.GetReference(destination);

for (int i = 0; i < source.Length; i++)

ref byte dBase = ref MemoryMarshal.GetReference(destination);

for (int i = 0; i < source.Length; i++)

For code style to have the same blank line as above.

gfoidl · 2024-02-08T13:44:46Z

src/ImageSharp/Common/Helpers/SimdUtils.Convert.cs

+        ref byte dBase = ref MemoryMarshal.GetReference(destination);
+        for (int i = 0; i < source.Length; i++)
+        {
+            Unsafe.Add(ref dBase, i) = ConvertToByte(Unsafe.Add(ref sBase, i));


Suggested change

Unsafe.Add(ref dBase, i) = ConvertToByte(Unsafe.Add(ref sBase, i));

Unsafe.Add(ref dBase, (uint)i) = ConvertToByte(Unsafe.Add(ref sBase, (uint)i));

JimBobSquarePants · 2024-02-14T01:55:41Z

Thanks for the review chaps. I've actioned the lot now. Will merge on successful build.

JimBobSquarePants · 2024-02-14T06:26:32Z

Stating to see some improvement in numbers here.

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.3085/23H2/2023Update/SunValley3)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.101
  [Host]   : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  ShortRun : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Job=ShortRun  IterationCount=5  LaunchCount=1
WarmupCount=5

Method	Mean	Error	StdDev	Ratio	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
'System.Drawing Load, Resize, Save'	368.56 ms	12.636 ms	1.955 ms	1.00	-	-	-	13.23 KB	1.00
'ImageFlow Load, Resize, Save'	270.28 ms	12.142 ms	3.153 ms	0.74	500.0000	500.0000	500.0000	4642.53 KB	351.00
'ImageSharp Load, Resize, Save'	114.87 ms	2.408 ms	0.625 ms	0.31	-	-	-	1314.02 KB	99.35
'ImageSharp TD Load, Resize, Save'	75.31 ms	1.923 ms	0.499 ms	0.20	166.6667	-	-	1307.22 KB	98.83
'ImageMagick Load, Resize, Save'	393.97 ms	7.115 ms	1.101 ms	1.07	-	-	-	54.48 KB	4.12
'ImageFree Load, Resize, Save'	235.16 ms	2.376 ms	0.617 ms	0.64	6000.0000	6000.0000	6000.0000	95.8 KB	7.24
'MagicScaler Load, Resize, Save'	66.67 ms	0.873 ms	0.135 ms	0.18	-	-	-	45.42 KB	3.43
'SkiaSharp Load, Resize, Save'	136.42 ms	2.833 ms	0.736 ms	0.37	-	-	-	88.15 KB	6.66
'NetVips Load, Resize, Save'	122.52 ms	7.601 ms	1.974 ms	0.33	-	-	-	50.95 KB	3.85

ENhance NormalizedFloatToByteSaturate

980347e

JimBobSquarePants added area:performance arch:arm64 labels Feb 2, 2024

JimBobSquarePants requested a review from antonfirsov February 2, 2024 06:02

saucecontrol reviewed Feb 2, 2024

View reviewed changes

Remove unused methods

89cd849

gfoidl reviewed Feb 2, 2024

View reviewed changes

JimBobSquarePants and others added 3 commits February 2, 2024 19:48

Update src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

68b7dbc

Co-authored-by: Clinton Ingram <clinton.ingram@outlook.com>

Avoid downclocking

9d737b7

Merge branch 'js/xplat-intrinsics' of https://github.com/SixLabors/Im…

805c7c4

…ageSharp into js/xplat-intrinsics

tannergooding mentioned this pull request Feb 2, 2024

Try to fold basic arithmetic operations as part of import dotnet/runtime#97901

Merged

Optimize and cleanup ByteToNormalizedFloatReduce

c6758df

JimBobSquarePants changed the title ~~WIP Cleanup SimdUtils~~ Cleanup SimdUtils Feb 6, 2024

JimBobSquarePants marked this pull request as ready for review February 6, 2024 06:31

JimBobSquarePants added this to the v4.0.0 milestone Feb 6, 2024

Update SimdUtils.HwIntrinsics.cs

0ec839f

JimBobSquarePants requested review from gfoidl, saucecontrol and tannergooding February 8, 2024 00:35

saucecontrol approved these changes Feb 8, 2024

View reviewed changes

gfoidl approved these changes Feb 8, 2024

View reviewed changes

Simplify checks

4e494d4

JimBobSquarePants merged commit 1f22bce into main Feb 14, 2024

JimBobSquarePants deleted the js/xplat-intrinsics branch February 14, 2024 02:13

	if (Avx512BW.IsSupported \|\| Avx2.IsSupported \|\| Sse2.IsSupported \|\| AdvSimd.IsSupported)
	if (Sse2.IsSupported \|\| AdvSimd.IsSupported)

	if (Avx512F.IsSupported)
	if (Vector512.IsHardwareAccelerated && Avx512F.IsSupported)

	if (Avx512BW.IsSupported)
	if (Vector512.IsHardwareAccelerated && Avx512BW.IsSupported)

	Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, i) / 255f;
	Unsafe.Add(ref dBase, (uint)i) = Unsafe.Add(ref sBase, (uint)i) / 255f;

		ref byte dBase = ref MemoryMarshal.GetReference(destination);
		for (int i = 0; i < source.Length; i++)

	Unsafe.Add(ref dBase, i) = ConvertToByte(Unsafe.Add(ref sBase, i));
	Unsafe.Add(ref dBase, (uint)i) = ConvertToByte(Unsafe.Add(ref sBase, (uint)i));

Uh oh!

Conversation

JimBobSquarePants commented Feb 2, 2024

Prerequisites

Description

Uh oh!

saucecontrol left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JimBobSquarePants Feb 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

saucecontrol left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

saucecontrol Feb 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

saucecontrol Feb 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

saucecontrol left a comment •

edited

Loading

JimBobSquarePants Feb 2, 2024 •

edited

Loading

saucecontrol Feb 8, 2024 •

edited

Loading

saucecontrol Feb 8, 2024 •

edited

Loading