Add WidenLatin1ToUtf16_MisalignedAddress
#90319
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-encoding
if (((nuint)pUtf16Buffer & 1) != 0)
{
    // Input isn't char aligned, we won't be able to vectorize.
    WidenLatin1ToUtf16_MisalignedAddress(pLatin1Buffer, pUtf16Buffer, elementCount);
}
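The body of the fallback isn't quoted in this thread. As a rough illustration of what a scalar fallback for a misaligned destination looks like, here is a sketch in C (the actual runtime code is C#, and the function name is borrowed from the PR title; everything else here is this sketch's own):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of a scalar fallback: widen each Latin-1 byte to a 16-bit UTF-16
 * code unit one element at a time, making no alignment assumptions about
 * the destination. memcpy is used for the store so that a destination
 * pointer with an odd address is still handled safely. */
static void widen_latin1_to_utf16_misaligned(const uint8_t *src,
                                             uint16_t *dst,
                                             size_t elementCount)
{
    for (size_t i = 0; i < elementCount; i++) {
        uint16_t widened = src[i]; /* zero-extend 0x00..0xFF */
        memcpy((unsigned char *)dst + 2 * i, &widened, sizeof widened);
    }
}
```

Because every element is written independently, this is correct for any destination address, at the cost of no vectorization.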
I am not sure I understand the need for this; it looks like we already align the data in
runtime/src/libraries/System.Private.CoreLib/src/System/Text/Latin1Utility.cs
Lines 990 to 1015 in 4218f4c
if (elementCount >= SizeOfVector128)
{
    // First, perform an unaligned 1x 64-bit read from the input buffer and an unaligned
    // 1x 128-bit write to the destination buffer.
    latin1Vector = Sse2.LoadScalarVector128((ulong*)pLatin1Buffer).AsByte(); // unaligned load
    Sse2.Store((byte*)pUtf16Buffer, Sse2.UnpackLow(latin1Vector, zeroVector)); // unaligned write

    // Calculate how many elements we wrote in order to get pOutputBuffer to its next alignment
    // point, then use that as the base offset going forward. Remember the >> 1 to account for
    // that we wrote chars, not bytes. This means we may re-read data in the next iteration of
    // the loop, but this is ok.
    currentOffset = (SizeOfVector128 >> 1) - (((nuint)pUtf16Buffer >> 1) & (MaskOfAllBitsInVector128 >> 1));
    Debug.Assert(0 < currentOffset && currentOffset <= SizeOfVector128 / sizeof(char));

    // Calculating the destination address outside the loop results in significant
    // perf wins vs. relying on the JIT to fold memory addressing logic into the
    // write instructions. See: https://github.com/dotnet/runtime/issues/33002
    char* pCurrentWriteAddress = pUtf16Buffer + currentOffset;

    // Now run the main 1x 128-bit read + 2x 128-bit write loop.
    nuint finalOffsetWhereCanIterateLoop = elementCount - SizeOfVector128;
    while (currentOffset <= finalOffsetWhereCanIterateLoop)
If the address that pUtf16Buffer points to is naturally aligned, then this logic works as expected. The problem is:
char* pCurrentWriteAddress = pUtf16Buffer + currentOffset;
which does not produce a correctly aligned address when pUtf16Buffer itself is misaligned: the offset is computed in whole chars, so an odd base address stays odd no matter how many chars are added.
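To make the failure concrete, here is a C transcription of the offset arithmetic quoted above, with the constants substituted by hand (SizeOfVector128 = 16, MaskOfAllBitsInVector128 = 15, so the shifted masks become 8 and 7). It computes the byte address that the first supposedly aligned 128-bit store would target:

```c
#include <stdint.h>

/* C transcription of the offset math from Latin1Utility.cs, with
 * SizeOfVector128 = 16 and MaskOfAllBitsInVector128 = 15 substituted.
 * Returns the byte address of the first vectorized 128-bit write. */
static uintptr_t first_vector_write_address(uintptr_t pUtf16Buffer)
{
    /* currentOffset is measured in chars (2 bytes each). */
    uintptr_t currentOffset =
        (16 >> 1) - ((pUtf16Buffer >> 1) & (15 >> 1));
    return pUtf16Buffer + currentOffset * sizeof(uint16_t);
}
```

For a 2-byte-aligned destination such as 0x1002 this lands on a 16-byte boundary (0x1002 + 7 chars = 0x1010), but for an odd destination such as 0x1003 it lands on 0x1011, and the subsequent aligned store would fault.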
@EgorBo As an alternative approach, change the following to Sse2.Store and accept the performance hit on pre-Nehalem hardware:
Sse2.StoreAligned(pLatin1Buffer + currentOffsetInElements, latin1Vector); // aligned
We should just use unaligned load/store here. It may have some small hit on hardware that's more than 10-12 years old, but it simplifies things overall and will be increasingly rare to encounter such hardware over time. Plus, we don't use aligned loads/stores in most of our workloads already, so this won't be any different.
Opportunistically aligning the data is still goodness and will avoid cache line splits.
It looks like the Sse2.Store approach is better, since an unaligned char* should be rare? (e.g. a char[] is always aligned)
The consideration would be that Sse2.Store doesn't allow containment on pre-AVX hardware, and on pre-AVX hardware movups was frequently slower than movaps even if the data was definitely aligned.
But, given the age of the hardware involved and the logic we use elsewhere, it shouldn't be a huge consideration (and the cost also wasn't huge, typically an extra cycle or so; the split cache line accesses were much worse and closer to 20 cycles for loads).
> Plus, we don't use aligned loads/stores in most of our workloads already, so this won't be any different.

WidenLatin1ToUtf16_Sse2 and NarrowUtf16ToLatin1_Sse2 are the only methods in the repository that use StoreAligned.
> It looks like the Sse2.Store approach is better since unaligned char* should be rare? (e.g. char[] is always aligned)

A potential case is the public API System.Text.Encoding.GetChars(System.Byte*, System.Int32, System.Char*, System.Int32).
> WidenLatin1ToUtf16_Sse2 and NarrowUtf16ToLatin1_Sse2 are the only methods in the repository that use StoreAligned.

Right, and so preserving the alignment here won't make a big difference to downlevel hardware. We should just normalize to not using StoreAligned so the code can be simpler and robust in the face of unalignable data.
> We should just normalize to not using StoreAligned so the code can be simpler and robust in the face of unalignable data.

I agree with @tannergooding. @xtqqczze please apply the suggested change
@xtqqczze thank you for your contribution!
This pull request has been automatically marked
This pull request will now be closed since it had been marked
Handle the case where the pUtf16Buffer parameter is misaligned, and fall back to a simple loop. The current code uses Sse2.StoreAligned (MOVDQA), which is documented to raise a general-protection exception if the memory operand is misaligned. However, it is possible that data could be silently corrupted if the existing logic does not account for a misaligned address.
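The direction the reviewers converged on, using unconditionally unaligned stores, can be sketched in C with SSE2 intrinsics, where _mm_storeu_si128 is the counterpart of Sse2.Store (the names and loop structure here are illustrative, not the runtime's):

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Illustrative widening loop that uses only unaligned loads/stores, so it is
 * correct even when the destination is not 2-byte aligned. The destination is
 * taken as a byte pointer and the output is UTF-16LE: interleaving each input
 * byte with zero produces the pairs (byte, 0x00) in memory order. */
static void widen_latin1_to_utf16(const uint8_t *src, uint8_t *dst, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i bytes = _mm_loadu_si128((const __m128i *)(src + i));
        /* Zero-extend the low 8 bytes, then the high 8 bytes. */
        _mm_storeu_si128((__m128i *)(dst + 2 * i),
                         _mm_unpacklo_epi8(bytes, zero));
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16),
                         _mm_unpackhi_epi8(bytes, zero));
    }
    for (; i < n; i++) { /* scalar tail */
        dst[2 * i] = src[i];
        dst[2 * i + 1] = 0;
    }
}
```

Since _mm_storeu_si128 (movdqu) tolerates any address, no separate misaligned-address fallback path is needed, which is the simplification @tannergooding argued for above.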