-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512 #95568
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsOverviewThis PR upgrades Vector128/256/512 Sum() implementations. The existing Sum() implementations(Vector128 and Vector256) use Vector128Case 1: Without PR With PR
Case 2: Without PR With PR
Case 3: Without PR With PR
Case 4: Without PR With PR
Case 5: Without PR With PR
Case 6: Without PR With PR
Vector256Case 1: Without PR With PR
Case 2: Without PR With PR
Case 3: Without PR With PR
Case 4: Without PR With PR
Case 5: Without PR With PR
Case 6: Without PR With PR
Vector512Case 1: Without PR With PR
Case 2: Without PR With PR
Case 3: Without PR With PR
Case 4: Without PR With PR
Case 5: Without PR With PR
Case 6: Without PR With PR
Instructions and dependenciesVector128 8 bits integer 16 bits integer 32 bits integer 64 bits integer float double Vector256 All integer types All float types Vector512 All integer types All float types
|
26e8e12
to
2408ec9
Compare
Does new implementation change the resulting value compared to the previous? Due to IEEE754 rules |
The resulting values should remain the same. I see some test failures. Will verify/fix those before making PR 'ReadyForReview' re Feel free to let me know if you see any issues here. |
@EgorBo The test failures confuse me. They are with AVX512F disabled. I looked at the disassembly and the disasm is the same with my change vs main. I'll look at this but pointing it out incase I'm missing anything obvious
With AVX512F Enabled - This passes
|
216f087
to
044aebe
Compare
src/coreclr/jit/hwintrinsicxarch.cpp
Outdated
break; | ||
} | ||
else if (varTypeIsByte(simdBaseType) || varTypeIsLong(simdBaseType)) | ||
#if defined(TARGET_X86) | ||
else if (varTypeIsLong(simdBaseType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What support is missing for long
on 32-bit?
I would have expected this to generally "just work" and for us to be able to use _ToScalar
provided SSE4.1
is supported given that GetElement
has the required decomposition support for that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a failure with decomposition. I tried again and passes locally. I enabled that code to get it to run on CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is the right way to handle this - 9e3657f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That won't quite work unfortunately and the actual handling is going to be a little bit more complex.
Noting that I'm fine with this being handled as a separate PR (and am happy to do that work, seeing as I'm pretty sure I know what will need to be done here), I had initially thought it might be slightly simpler but looks like its not quite all there..
The "proper" fix likely entails:
- Extracting this logic to a
gtNewSimdToScalarNode
helper: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L2917-L2937 - Updating various places that are manually doing
gtNewSimdHWIntrinsicNode(retType, op1, NI_Vector###_ToScalar, simdBaseJitType, simdSize)
to call the introduced helper - Have the
impSpecialIntrinsic
handling forNI_Vector###_Sum
do:
#if defined(TARGET_X86)
else if (varTypeIsLong(simdBaseType) && !compOpportunisticallyDependsOn(InstructionSet_SSE41))
{
// We need SSE41 to handle long, use software fallback
break;
}
#endif // TARGET_X86
At some future point we can ensure that NI_Vector###_GetElement
has decomp handling for pre-SSE4.1 as well, it's just slightly more complex given that TYP_INT
currently requires SSE4.1
as well. The fix to get that working is likely to just reuse the TYP_FLOAT
handling for getting the relevant 32-bit part and then using ToScalar
(shifting can then be used for the relevant 8-bit or 16-bit part for byte/short).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the implementation to use the helper. Waiting for CI tests to pass. My local tests passed
df322c6
to
6e77487
Compare
6e77487
to
9e3657f
Compare
d7a3c62
to
6b3eec7
Compare
6b3eec7
to
76e307e
Compare
// | ||
GenTree* Compiler::gtNewSimdToScalarNode(var_types type, GenTree* op1, CorInfoType simdBaseJitType, unsigned simdSize) | ||
{ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stray empty line:
CC. @dotnet/jit-contrib, this needs a secondary review but should be good to merge otherwise |
…wering for Vector512 (dotnet#95568) * Updating Sum() implementation. * Fixing codegen * Addressing review comments. * Fix Formatting * Enabling for long on x86. * Cleaning up ToScalar implementation
Overview
This PR upgrades Vector128/256/512 Sum() implementations. The existing Sum() implementations(Vector128 and Vector256) use
hadd
and are not the most efficient. This commit modifies the existing implementations and adds the Vector512 sum(0 implementationsVector128
Case 1:
byte/ubyte
typesWithout PR
Not lowered
With PR
Case 2:
Int16
typesWithout PR
With PR
Case 3:
Int32
typesWithout PR
With PR
Case 4:
long
typesWithout PR
Not lowered
With PR
Case 5:
float
typesWithout PR
With PR
Case 6:
double
typesWithout PR
With PR
Vector256
Case 1:
byte/ubyte
typesWithout PR
Not lowered
With PR
Case 2:
Int16
typesWithout PR
With PR
Case 3:
Int32
typesWithout PR
With PR
Case 4:
long
typesWithout PR
Not lowered
With PR
Case 5:
float
typesWithout PR
With PR
Case 6:
double
typesWithout PR
With PR
Vector512
Case 1:
byte/ubyte
typesWithout PR
Not lowered
With PR
Case 2:
Int16
typesWithout PR
Not lowered
With PR
Case 3:
Int32
typesWithout PR
Not lowered
With PR
Case 4:
long
typesWithout PR
Not lowered
With PR
Case 5:
float
typesWithout PR
Not lowered
With PR
Case 6:
double
typesWithout PR
Not lowered
With PR
Instructions and dependencies
Vector128
8 bits integer
vpsrldq
- sse2vpaddb
- sse216 bits integer
vpsrldq
- sse2vpaddw
- sse232 bits integer
vpsrldq
- sse2vpaddd
- sse264 bits integer
vpsrldq
- sse2vpaddq
- sse2float
shufps
- sseaddps
- ssedouble
shufpd
- sseaddpd
- sse2Vector256
the corresponding Vector128 instr + the following
All integer types
vextracti128
- AVX2All float types
vextractf128
- AVXVector512
the corresponding Vector128 instr + the following
All integer types
vextracti64x4
- AVX512FAll float types
vextractf64x4
- AVX512FPerformance numbers
On ICX
On SPR