Skip to content

JIT: Skip redundant AND masking in NarrowWithSaturation codegen#122898

Open
laveeshb wants to merge 2 commits intodotnet:mainfrom
laveeshb:fix/narrow-with-saturation-codegen
Open

JIT: Skip redundant AND masking in NarrowWithSaturation codegen#122898
laveeshb wants to merge 2 commits intodotnet:mainfrom
laveeshb:fix/narrow-with-saturation-codegen

Conversation

@laveeshb
Copy link
Contributor

@laveeshb laveeshb commented Jan 5, 2026

Fixes #116526

When NarrowWithSaturation is used with unsigned narrow types (e.g., UInt32UInt16, UInt16Byte), we're already clamping via Min() internally - so the subsequent vpand to mask the result is redundant. This adds an inputsAlreadyClamped param to gtNewSimdNarrowNode so callers like NarrowWithSaturation can skip the extra AND for unsigned types.

Note: For signed narrow types (e.g., Int32Int16), the AND masking is still required. After clamping, negative values have sign-extended upper bits that must be cleared before PackUnsignedSaturate can correctly pack them.

Approach suggested by @tannergooding in the issue.

Before (unsigned narrowing):

vpminuw  xmm1, xmm0, xmmword ptr [...]
vpand    xmm1, xmm1, xmm0
vpminuw  xmm2, xmm0, xmmword ptr [...]
vpand    xmm0, xmm2, xmm0
vpackuswb xmm0, xmm1, xmm0

After (unsigned narrowing):

vpminuw  xmm1, xmm0, xmmword ptr [...]
vpminuw  xmm0, xmm0, xmmword ptr [...]
vpackuswb xmm0, xmm1, xmm0

ARM64 isn't affected - it uses AdvSimd intrinsics that handle this natively.

Tested with System.Runtime.Intrinsics and System.Numerics.Vectors test suites.

Copilot AI review requested due to automatic review settings January 5, 2026 21:05
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 5, 2026
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jan 5, 2026
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the codegen for Vector128/256.NarrowWithSaturation by eliminating redundant AND masking instructions on x86/x64. The optimization recognizes that when inputs are already clamped to the target range by preceding Min/Max operations, the subsequent AND operations are unnecessary.

Key Changes:

  • Added an optional inputsAlreadyClamped parameter to gtNewSimdNarrowNode to skip AND masking when inputs are guaranteed to be within target range
  • Updated the NarrowWithSaturation codegen path to pass inputsAlreadyClamped=true after explicit clamping operations
  • Added regression test with disasm checks to verify the optimization

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/coreclr/jit/compiler.h Added optional inputsAlreadyClamped parameter (default false) to gtNewSimdNarrowNode function signature
src/coreclr/jit/gentree.cpp Implemented conditional logic to skip AND masking operations when inputsAlreadyClamped is true for four specific narrowing scenarios (TYP_UBYTE and TYP_USHORT paths in both AVX2 and SSE2)
src/coreclr/jit/hwintrinsicxarch.cpp Modified NarrowWithSaturation implementation to pass inputsAlreadyClamped=true when calling gtNewSimdNarrowNode after explicit Min/Max clamping
src/tests/JIT/Regression/JitBlue/Runtime_116526/Runtime_116526.csproj Added test project configuration with disasm checking enabled and JIT optimization settings
src/tests/JIT/Regression/JitBlue/Runtime_116526/Runtime_116526.cs Added regression test with disasm assertions to verify vpand is not generated and functional tests to verify correctness for UInt16→Byte and UInt32→UShort narrowing

@laveeshb laveeshb force-pushed the fix/narrow-with-saturation-codegen branch from ea91488 to 431213e Compare January 6, 2026 02:25
@laveeshb
Copy link
Contributor Author

Friendly ping - @tannergooding since you suggested the approach in #116526, would you mind taking a look when you have a moment?

@laveeshb
Copy link
Contributor Author

@tannergooding I see some CI tasks have failed. AFAICT these are not due to my changes. I had tested them locally. Can you advise how to unblock the PR?

/* isMax */ false, /* isMagnitude */ false, /* isNumber */ false);

retNode = gtNewSimdNarrowNode(retType, op1, op2, narrowSimdBaseType, simdSize);
retNode = gtNewSimdNarrowNode(retType, op1, op2, narrowSimdBaseType, simdSize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is failing System.Numerics.Tests.GenericVectorTests.NarrowWithSaturationInt32Test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding Thanks for pointing this out! I investigated the root cause:

The Problem:
The inputsAlreadyClamped = true optimization was incorrectly applied to all narrow types, but it's only valid for unsigned types.

For signed narrowing (e.g., Int32Int16):

  • Values are clamped to [-32768, 32767]
  • But negative values still have sign-extended upper bits (e.g., -1 is 0xFFFFFFFF)
  • PackUnsignedSaturate needs the AND masking to clear these upper bits before packing
  • Without the mask, PackUnsignedSaturate sees the full 32-bit value and produces incorrect results

For unsigned narrowing (e.g., UInt32UInt16):

  • Values are clamped to [0, 65535]
  • The upper bits are naturally zero after clamping
  • The AND masking is redundant and can safely be skipped

The Fix:
Changed the code to only set inputsAlreadyClamped = true for unsigned narrow types:

bool inputsAlreadyClamped = varTypeIsUnsigned(narrowSimdBaseType);

I've pushed the fix and updated the PR description to clarify this limitation. The optimization now only applies to TYP_UBYTE and TYP_USHORT narrow types.

auto-merge was automatically disabled February 11, 2026 15:33

Head branch was pushed to by a user without write access

Copilot AI review requested due to automatic review settings February 11, 2026 15:33
Vector128/256.NarrowWithSaturation was generating redundant vpand instructions because gtNewSimdNarrowNode didn't know the inputs were already clamped to the target range by the preceding Min() operations.

This change adds an optional parameter to gtNewSimdNarrowNode to skip the AND masking when the caller knows inputs are already in range.

The fix applies to x86/x64 only. ARM64 uses native AdvSimd instructions that don't have this issue.

Before:
  vpminuw  xmm1, xmm0, xmmword ptr [...]
  vpand    xmm1, xmm1, xmm0          ; redundant
  vpminuw  xmm2, xmm0, xmmword ptr [...]
  vpand    xmm0, xmm2, xmm0          ; redundant
  vpackuswb xmm0, xmm1, xmm0

After:
  vpminuw  xmm1, xmm0, xmmword ptr [...]
  vpminuw  xmm0, xmm0, xmmword ptr [...]
  vpackuswb xmm0, xmm1, xmm0
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/coreclr/jit/gentree.cpp:25216

  • inputsAlreadyClamped is not referenced on the TARGET_ARM64 side of gtNewSimdNarrowNode, which can trigger unused-parameter warnings on non-xarch builds. Consider explicitly suppressing it for the ARM64 path (or only naming/using it under TARGET_XARCH) to avoid build breaks under Werror configurations.

Comment on lines +3544 to +3547
// For unsigned narrow types (TYP_UBYTE, TYP_USHORT), the clamping already ensures
// the upper bits are zero, so the AND masking in gtNewSimdNarrowNode is redundant.
// For signed narrow types, we still need the AND masking to clear sign-extended bits
// before PackUnsignedSaturate can correctly pack the values.
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says this is only relevant for unsigned narrow types TYP_UBYTE/TYP_USHORT, but the implementation sets inputsAlreadyClamped based on varTypeIsUnsigned(narrowSimdBaseType), which also includes TYP_UINT. Either tighten the condition to the types that actually benefit from skipping the mask (the PackUnsignedSaturate paths) or broaden the comment to match the condition.

Suggested change
// For unsigned narrow types (TYP_UBYTE, TYP_USHORT), the clamping already ensures
// the upper bits are zero, so the AND masking in gtNewSimdNarrowNode is redundant.
// For signed narrow types, we still need the AND masking to clear sign-extended bits
// before PackUnsignedSaturate can correctly pack the values.
// For unsigned narrow element types (for example, TYP_UBYTE, TYP_USHORT, TYP_UINT),
// the clamping already ensures the upper bits are zero, so the AND masking in
// gtNewSimdNarrowNode is redundant. For signed narrow types, we still need the AND
// masking to clear sign-extended bits before PackUnsignedSaturate can correctly pack
// the values.

Copilot uses AI. Check for mistakes.
…uration

The inputsAlreadyClamped optimization was incorrectly applied to all types,
but it's only valid for unsigned narrow types (TYP_UBYTE, TYP_USHORT).

For signed narrow types, the AND masking is still required to clear the
sign-extended upper bits before PackUnsignedSaturate can correctly pack
the values. Without this masking, negative values would have incorrect
upper bits that would cause PackUnsignedSaturate to produce wrong results.

This fixes System.Numerics.Tests.GenericVectorTests.NarrowWithSaturationInt32Test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Suboptimal codgen for Vector128.NarrowWithSaturation

3 participants