Add `WidenLatin1ToUtf16_MisalignedAddress` by xtqqczze · Pull Request #90319 · dotnet/runtime

xtqqczze · 2023-08-10T15:20:12Z

Handle the case where the pUtf16Buffer parameter is misaligned, and fallback to a simple loop.

The current code uses of Sse2.StoreAligned (MOVDQA), which is documented to throw a general-protection exception if the memory operand is misaligned. However, it is possible data could be silently corrupted if the existing logic does not account for misaligned address.

ghost · 2023-08-10T15:20:27Z

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

Handle the case where the pUtf16Buffer parameter is misaligned.

The current code uses of Sse2.StoreAligned (MOVDQA, which is documented to throw a general-protection exception if the memory operand is misaligned. However it is likely the method will silently corrupt data instead
.

Author:	xtqqczze
Assignees:	-
Labels:	`area-System.Text.Encoding`, `community-contribution`
Milestone:	-

EgorBo · 2023-08-10T18:44:08Z

src/libraries/System.Private.CoreLib/src/System/Text/Latin1Utility.cs

+            if (((nuint)pUtf16Buffer & 1) != 0)
+            {
+                // Input isn't char aligned, we won't be able to vectorize.
+                WidenLatin1ToUtf16_MisalignedAddress(pLatin1Buffer, pUtf16Buffer, elementCount);


I am not sure I understand the need of this, I am seeing that we already align data in

runtime/src/libraries/System.Private.CoreLib/src/System/Text/Latin1Utility.cs

Lines 990 to 1015 in 4218f4c

if (elementCount >= SizeOfVector128)

{

// First, perform an unaligned 1x 64-bit read from the input buffer and an unaligned

// 1x 128-bit write to the destination buffer.

latin1Vector = Sse2.LoadScalarVector128((ulong*)pLatin1Buffer).AsByte(); // unaligned load

Sse2.Store((byte*)pUtf16Buffer, Sse2.UnpackLow(latin1Vector, zeroVector)); // unaligned write

// Calculate how many elements we wrote in order to get pOutputBuffer to its next alignment

// point, then use that as the base offset going forward. Remember the >> 1 to account for

// that we wrote chars, not bytes. This means we may re-read data in the next iteration of

// the loop, but this is ok.

currentOffset = (SizeOfVector128 >> 1) - (((nuint)pUtf16Buffer >> 1) & (MaskOfAllBitsInVector128 >> 1));

Debug.Assert(0 < currentOffset && currentOffset <= SizeOfVector128 / sizeof(char));

// Calculating the destination address outside the loop results in significant

// perf wins vs. relying on the JIT to fold memory addressing logic into the

// write instructions. See: https://github.com/dotnet/runtime/issues/33002

char* pCurrentWriteAddress = pUtf16Buffer + currentOffset;

// Now run the main 1x 128-bit read + 2x 128-bit write loop.

nuint finalOffsetWhereCanIterateLoop = elementCount - SizeOfVector128;

while (currentOffset <= finalOffsetWhereCanIterateLoop)

, don't we?

If the address that pUtf16Buffer points to is naturally aligned, then this logic should work as expected.

The problem is:

char* pCurrentWriteAddress = pUtf16Buffer + currentOffset;

which does not effectively ensure correct alignment for a misaligned char*.

@EgorBo As an alternative approach, change the following to Sse2.Store and accept the performance hit pre-Nehalem.

runtime/src/libraries/System.Private.CoreLib/src/System/Text/Latin1Utility.cs

Line 899 in 4218f4c

Sse2.StoreAligned(pLatin1Buffer + currentOffsetInElements, latin1Vector); // aligned

We should just use unaligned load/store here. It may have some small hit on hardware that's more than 10-12 years old, but it simplifies things overall and will be increasingly rare to encounter such hardware over time. Plus, we don't use aligned loads/stores in most of our workloads already, so this won't be any different.

Opportunistically aligning the data is still goodness and will avoid cache line splits.

It looks like Sse2.Store approach is better since unaligned char* should be rare? (e.g. char[] is always aligned)

The consideration would be that Sse2.Store doesn't allow containment on pre-AVX hardware and on pre-AVX hardware movups was frequently slower than movaps even if the data was definitely aligned.

But, given the age of the hardware involved and the logic we use elsewhere, it shouldn't be a huge consideration (and the cost also wasn't huge, typically an extra cycle or so; the split cache line accesses were much worse and closer to 20 cycles for loads).

Plus, we don't use aligned loads/stores in most of our workloads already, so this won't be any different.

WidenLatin1ToUtf16_Sse2 and NarrowUtf16ToLatin1_Sse2 are the only methods in the repository that use StoreAligned.

It looks like Sse2.Store approach is better since unaligned char* should be rare? (e.g. char[] is always aligned)

A potential case is the public API System.Text.Encoding.GetChars(System.Byte*, System.Int32, System.Char*, System.Int32).

WidenLatin1ToUtf16_Sse2 and NarrowUtf16ToLatin1_Sse2 are the only methods in the repository that use StoreAligned.

Right, and so preserving the alignment here won't make a big difference to downlevel hardware. We should just normalize to not using StoreAligned so the code can be simpler and robust in the face of unalignable data.

adamsitnik

We should just normalize to not using StoreAligned so the code can be simpler and robust in the face of unalignable data.

I agree with @tannergooding . @xtqqczze please apply the suggested change

@xtqqczze thank you for your contribution!

ghost · 2023-11-05T21:01:17Z

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

ghost · 2023-11-20T00:01:27Z

This pull request will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the pull request, but please note that it will be locked if it remains inactive for another 30 days.

Add WidenLatin1ToUtf16_MisalignedAddress

34fd97b

ghost added area-System.Text.Encoding community-contribution Indicates that the PR has been added by a community member labels Aug 10, 2023

xtqqczze marked this pull request as draft August 10, 2023 15:50

xtqqczze added 2 commits August 10, 2023 16:53

fix SA1028

83119c3

fix CS0127

4218f4c

xtqqczze marked this pull request as ready for review August 10, 2023 15:56

xtqqczze mentioned this pull request Aug 10, 2023

Use StoreAligned not Store in WidenAsciiToUtf16 #89892

Closed

EgorBo reviewed Aug 10, 2023

View reviewed changes

tarekgh added this to the 9.0.0 milestone Aug 16, 2023

adamsitnik requested changes Oct 20, 2023

View reviewed changes

ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 20, 2023

adamsitnik self-assigned this Oct 22, 2023

ghost added the no-recent-activity label Nov 5, 2023

ghost closed this Nov 20, 2023

github-actions bot locked and limited conversation to collaborators Dec 20, 2023

This pull request was closed.

	if (elementCount >= SizeOfVector128)
	{
	// First, perform an unaligned 1x 64-bit read from the input buffer and an unaligned
	// 1x 128-bit write to the destination buffer.

	latin1Vector = Sse2.LoadScalarVector128((ulong*)pLatin1Buffer).AsByte(); // unaligned load
	Sse2.Store((byte*)pUtf16Buffer, Sse2.UnpackLow(latin1Vector, zeroVector)); // unaligned write

	// Calculate how many elements we wrote in order to get pOutputBuffer to its next alignment
	// point, then use that as the base offset going forward. Remember the >> 1 to account for
	// that we wrote chars, not bytes. This means we may re-read data in the next iteration of
	// the loop, but this is ok.

	currentOffset = (SizeOfVector128 >> 1) - (((nuint)pUtf16Buffer >> 1) & (MaskOfAllBitsInVector128 >> 1));
	Debug.Assert(0 < currentOffset && currentOffset <= SizeOfVector128 / sizeof(char));

	// Calculating the destination address outside the loop results in significant
	// perf wins vs. relying on the JIT to fold memory addressing logic into the
	// write instructions. See: https://github.com/dotnet/runtime/issues/33002

	char* pCurrentWriteAddress = pUtf16Buffer + currentOffset;

	// Now run the main 1x 128-bit read + 2x 128-bit write loop.

	nuint finalOffsetWhereCanIterateLoop = elementCount - SizeOfVector128;
	while (currentOffset <= finalOffsetWhereCanIterateLoop)

Conversation

xtqqczze commented Aug 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Aug 10, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamsitnik left a comment

Choose a reason for hiding this comment

Uh oh!

ghost commented Nov 5, 2023

Uh oh!

ghost commented Nov 20, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

xtqqczze commented Aug 10, 2023 •

edited

Loading