JIT: Skip redundant AND masking in NarrowWithSaturation codegen#122898
JIT: Skip redundant AND masking in NarrowWithSaturation codegen#122898laveeshb wants to merge 2 commits intodotnet:mainfrom
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
There was a problem hiding this comment.
Pull request overview
This PR optimizes the codegen for Vector128/256.NarrowWithSaturation by eliminating redundant AND masking instructions on x86/x64. The optimization recognizes that when inputs are already clamped to the target range by preceding Min/Max operations, the subsequent AND operations are unnecessary.
Key Changes:
- Added an optional
inputsAlreadyClampedparameter togtNewSimdNarrowNodeto skip AND masking when inputs are guaranteed to be within target range - Updated the
NarrowWithSaturationcodegen path to passinputsAlreadyClamped=trueafter explicit clamping operations - Added regression test with disasm checks to verify the optimization
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/jit/compiler.h | Added optional inputsAlreadyClamped parameter (default false) to gtNewSimdNarrowNode function signature |
| src/coreclr/jit/gentree.cpp | Implemented conditional logic to skip AND masking operations when inputsAlreadyClamped is true for four specific narrowing scenarios (TYP_UBYTE and TYP_USHORT paths in both AVX2 and SSE2) |
| src/coreclr/jit/hwintrinsicxarch.cpp | Modified NarrowWithSaturation implementation to pass inputsAlreadyClamped=true when calling gtNewSimdNarrowNode after explicit Min/Max clamping |
| src/tests/JIT/Regression/JitBlue/Runtime_116526/Runtime_116526.csproj | Added test project configuration with disasm checking enabled and JIT optimization settings |
| src/tests/JIT/Regression/JitBlue/Runtime_116526/Runtime_116526.cs | Added regression test with disasm assertions to verify vpand is not generated and functional tests to verify correctness for UInt16→Byte and UInt32→UShort narrowing |
bd48091 to
ea91488
Compare
src/tests/JIT/Regression/JitBlue/Runtime_116526/Runtime_116526.cs
Outdated
Show resolved
Hide resolved
ea91488 to
431213e
Compare
|
Friendly ping - @tannergooding since you suggested the approach in #116526, would you mind taking a look when you have a moment? |
|
@tannergooding I see some CI tasks have failed. AFAICT these are not due to my changes. I had tested them locally. Can you advise how to unblock the PR? |
src/coreclr/jit/hwintrinsicxarch.cpp
Outdated
| /* isMax */ false, /* isMagnitude */ false, /* isNumber */ false); | ||
|
|
||
| retNode = gtNewSimdNarrowNode(retType, op1, op2, narrowSimdBaseType, simdSize); | ||
| retNode = gtNewSimdNarrowNode(retType, op1, op2, narrowSimdBaseType, simdSize, |
There was a problem hiding this comment.
Looks like this is failing System.Numerics.Tests.GenericVectorTests.NarrowWithSaturationInt32Test
There was a problem hiding this comment.
@tannergooding Thanks for pointing this out! I investigated the root cause:
The Problem:
The inputsAlreadyClamped = true optimization was incorrectly applied to all narrow types, but it's only valid for unsigned types.
For signed narrowing (e.g., Int32 → Int16):
- Values are clamped to
[-32768, 32767] - But negative values still have sign-extended upper bits (e.g.,
-1is0xFFFFFFFF) PackUnsignedSaturateneeds the AND masking to clear these upper bits before packing- Without the mask,
PackUnsignedSaturatesees the full 32-bit value and produces incorrect results
For unsigned narrowing (e.g., UInt32 → UInt16):
- Values are clamped to
[0, 65535] - The upper bits are naturally zero after clamping
- The AND masking is redundant and can safely be skipped
The Fix:
Changed the code to only set inputsAlreadyClamped = true for unsigned narrow types:
bool inputsAlreadyClamped = varTypeIsUnsigned(narrowSimdBaseType);I've pushed the fix and updated the PR description to clarify this limitation. The optimization now only applies to TYP_UBYTE and TYP_USHORT narrow types.
Head branch was pushed to by a user without write access
Vector128/256.NarrowWithSaturation was generating redundant vpand instructions because gtNewSimdNarrowNode didn't know the inputs were already clamped to the target range by the preceding Min() operations. This change adds an optional parameter to gtNewSimdNarrowNode to skip the AND masking when the caller knows inputs are already in range. The fix applies to x86/x64 only. ARM64 uses native AdvSimd instructions that don't have this issue. Before: vpminuw xmm1, xmm0, xmmword ptr [...] vpand xmm1, xmm1, xmm0 ; redundant vpminuw xmm2, xmm0, xmmword ptr [...] vpand xmm0, xmm2, xmm0 ; redundant vpackuswb xmm0, xmm1, xmm0 After: vpminuw xmm1, xmm0, xmmword ptr [...] vpminuw xmm0, xmm0, xmmword ptr [...] vpackuswb xmm0, xmm1, xmm0
847f263 to
e6cef00
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/coreclr/jit/gentree.cpp:25216
inputsAlreadyClampedis not referenced on theTARGET_ARM64side ofgtNewSimdNarrowNode, which can trigger unused-parameter warnings on non-xarch builds. Consider explicitly suppressing it for the ARM64 path (or only naming/using it underTARGET_XARCH) to avoid build breaks under Werror configurations.
| // For unsigned narrow types (TYP_UBYTE, TYP_USHORT), the clamping already ensures | ||
| // the upper bits are zero, so the AND masking in gtNewSimdNarrowNode is redundant. | ||
| // For signed narrow types, we still need the AND masking to clear sign-extended bits | ||
| // before PackUnsignedSaturate can correctly pack the values. |
There was a problem hiding this comment.
The comment says this is only relevant for unsigned narrow types TYP_UBYTE/TYP_USHORT, but the implementation sets inputsAlreadyClamped based on varTypeIsUnsigned(narrowSimdBaseType), which also includes TYP_UINT. Either tighten the condition to the types that actually benefit from skipping the mask (the PackUnsignedSaturate paths) or broaden the comment to match the condition.
| // For unsigned narrow types (TYP_UBYTE, TYP_USHORT), the clamping already ensures | |
| // the upper bits are zero, so the AND masking in gtNewSimdNarrowNode is redundant. | |
| // For signed narrow types, we still need the AND masking to clear sign-extended bits | |
| // before PackUnsignedSaturate can correctly pack the values. | |
| // For unsigned narrow element types (for example, TYP_UBYTE, TYP_USHORT, TYP_UINT), | |
| // the clamping already ensures the upper bits are zero, so the AND masking in | |
| // gtNewSimdNarrowNode is redundant. For signed narrow types, we still need the AND | |
| // masking to clear sign-extended bits before PackUnsignedSaturate can correctly pack | |
| // the values. |
…uration The inputsAlreadyClamped optimization was incorrectly applied to all types, but it's only valid for unsigned narrow types (TYP_UBYTE, TYP_USHORT). For signed narrow types, the AND masking is still required to clear the sign-extended upper bits before PackUnsignedSaturate can correctly pack the values. Without this masking, negative values would have incorrect upper bits that would cause PackUnsignedSaturate to produce wrong results. This fixes System.Numerics.Tests.GenericVectorTests.NarrowWithSaturationInt32Test
e6cef00 to
d600dbc
Compare
Fixes #116526
When
NarrowWithSaturationis used with unsigned narrow types (e.g.,UInt32→UInt16,UInt16→Byte), we're already clamping viaMin()internally - so the subsequentvpandto mask the result is redundant. This adds aninputsAlreadyClampedparam togtNewSimdNarrowNodeso callers likeNarrowWithSaturationcan skip the extra AND for unsigned types.Note: For signed narrow types (e.g.,
Int32→Int16), the AND masking is still required. After clamping, negative values have sign-extended upper bits that must be cleared beforePackUnsignedSaturatecan correctly pack them.Approach suggested by @tannergooding in the issue.
Before (unsigned narrowing):
After (unsigned narrowing):
ARM64 isn't affected - it uses AdvSimd intrinsics that handle this natively.
Tested with System.Runtime.Intrinsics and System.Numerics.Vectors test suites.