Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Enabled embedded broadcast for binary ops #87946

Merged
merged 9 commits into from
Jul 14, 2023

Conversation

Ruihan-Yin
Copy link
Contributor

This PR provides the embedded broadcast support for binary ops, to be more specific, ops using the genHWIntrinsic_R_R_RM path.

Including the following ops:
and, andn, or, xor,
min, max,
div, mul, mull, sub,
variable shiftleftlogical/rightarithmetic/rightlogical

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 22, 2023
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jun 22, 2023
@ghost
Copy link

ghost commented Jun 22, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR provides the embedded broadcast support for binary ops, to be more specific, ops using the genHWIntrinsic_R_R_RM path.

Including the following ops:
and, andn, or, xor,
min, max,
div, mul, mull, sub,
variable shiftleftlogical/rightarithmetic/rightlogical

Author: Ruihan-Yin
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@Ruihan-Yin Ruihan-Yin changed the title Enabled embedded broadcast for binary ops JIT: Enabled embedded broadcast for binary ops Jun 23, 2023
@Ruihan-Yin Ruihan-Yin closed this Jun 23, 2023
@Ruihan-Yin Ruihan-Yin reopened this Jun 23, 2023
@tannergooding tannergooding added the avx512 Related to the AVX-512 architecture label Jun 27, 2023
@Ruihan-Yin Ruihan-Yin marked this pull request as ready for review June 27, 2023 22:01
@Ruihan-Yin
Copy link
Contributor Author

Ruihan-Yin commented Jun 28, 2023

Hi @tannergooding , the PR should be ready for review, would you please take a look at it?
it mostly includes:

  1. some updates on the metadata in the related tables to enable embedded broadcast in more instructions.
  2. some workaround on the bitwise instructions to make it work correctly with embedded broadcast when the input data type is in 64-bit.
  3. a bug fix where some broadcast nodes with data type less than 32 bits are accidentally contained, made some changes in the contain check to filter out these situations.

@Ruihan-Yin
Copy link
Contributor Author

windows x64

Diffs are based on 1,627,004 contexts (467,427 MinOpts, 1,159,577 FullOpts).

MISSED contexts: 2,227 (0.14%)

Overall (+2,406 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 44,062,516 +156
benchmarks.run.windows.x64.checked.mch 7,244,558 +152
benchmarks.run_pgo.windows.x64.checked.mch 26,963,808 +143
benchmarks.run_tiered.windows.x64.checked.mch 10,507,436 +157
coreclr_tests.run.windows.x64.checked.mch 385,009,583 +736
libraries.pmi.windows.x64.checked.mch 59,506,418 +190
libraries_tests.pmi.windows.x64.checked.mch 121,311,207 +220
realworld.run.windows.x64.checked.mch 13,650,175 +652
MinOpts (+383 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run_pgo.windows.x64.checked.mch 10,473,053 +73
benchmarks.run_tiered.windows.x64.checked.mch 7,567,856 +73
coreclr_tests.run.windows.x64.checked.mch 269,424,110 +237
FullOpts (+2,023 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 29,278,684 +156
benchmarks.run.windows.x64.checked.mch 6,870,267 +152
benchmarks.run_pgo.windows.x64.checked.mch 16,490,755 +70
benchmarks.run_tiered.windows.x64.checked.mch 2,939,580 +84
coreclr_tests.run.windows.x64.checked.mch 115,585,473 +499
libraries.pmi.windows.x64.checked.mch 57,987,107 +190
libraries_tests.pmi.windows.x64.checked.mch 114,958,091 +220
realworld.run.windows.x64.checked.mch 12,500,933 +652

@Ruihan-Yin
Copy link
Contributor Author

Ruihan-Yin commented Jun 29, 2023

Fails should be irrelevant.

We had regression of about 2,400 bytes in the code size introduced by the changes. This should be expected as the regression is the conversion from VEX encoding to EVEX encoding when enabling embedded broadcast.
And there is no influence on the throughput.

Do we accept the results?

@tannergooding
Copy link
Member

Do we accept the results?

The results look good/within reason to me. There are two important bits to call out...

First this change, in the typical case, trades a slight increase in code size (typically +2-bytes) for a decrease in data size (going from sizeof(VectorXXX<T>) to sizeof(T)) thereby increasing data locality and minimizing impact to the cache. This will often be a performance win.

Second, SPMI diffs are not currently tracking/including the data size allocations as part of "bytes of code" metric. HardwareIntrinsics.RayTracer.Packet256Tracer:Shade for example is +2 bytes of code, but -32 bytes of data so its actually a net win of -30 bytes of allocations

@@ -8138,11 +8138,18 @@ bool Lowering::IsContainableHWIntrinsicOp(GenTreeHWIntrinsic* parentNode, GenTre
case NI_AVX2_BroadcastScalarToVector256:
case NI_AVX512F_BroadcastScalarToVector512:
{
var_types baseType = hwintrinsic->GetSimdBaseType();
if (baseType == TYP_BYTE || baseType == TYP_UBYTE || baseType == TYP_SHORT || baseType == TYP_USHORT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better handled as

Suggested change
if (baseType == TYP_BYTE || baseType == TYP_UBYTE || baseType == TYP_SHORT || baseType == TYP_USHORT)
if (varTypeIsSmall(baseType))

@tannergooding
Copy link
Member

CC. @dotnet/jit-contrib for secondary review.

@BruceForstall
Copy link
Member

superpmi-replay and superpmi-diffs pipelines are failing with messages like:

[17:40:21] ERROR: Couldn't load base metrics summary created by child process
[17:40:21] General fatal error

I wonder if the JIT is crashing (not asserting) with this PR?

@Ruihan-Yin
Copy link
Contributor Author

I wonder if the JIT is crashing (not asserting) with this PR?

First time seeing this fail in this PR, the recent changes might not cause the crashing, I will rebase the PR and re-run the CI again to confirm.

@@ -598,6 +598,23 @@ struct HWIntrinsicInfo
#endif
}

#if defined(TARGET_XARCH)
static bool IsBitwiseOperation(instruction ins)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an odd location for this function. All the other functions here take a NamedIntrinsic. Should this instead be in emitxarch.h/cpp like IsMovInstruction or HasKMaskRegisterDest ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the function into emitxarch.h/cpp as a static function.

@@ -399,7 +399,6 @@ INST3(aesenclast, "aesenclast", IUM_WR, BAD_CODE, BAD_CODE,
INST3(aesimc, "aesimc", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0xDB), INS_TT_NONE, REX_WIG | Encoding_VEX) // Perform the AES InvMixColumn Transformation
INST3(aeskeygenassist, "aeskeygenassist", IUM_WR, BAD_CODE, BAD_CODE, SSE3A(0xDF), INS_TT_NONE, REX_WIG | Encoding_VEX) // AES Round Key Generation Assist
INST3(pclmulqdq, "pclmulqdq" , IUM_WR, BAD_CODE, BAD_CODE, SSE3A(0x44), INS_TT_NONE, Input_64Bit | REX_WIG | Encoding_VEX | INS_Flags_IsDstDstSrcAVXInstruction) // Perform a carry-less multiplication of two quadwords

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need to delete this empty line?

@tannergooding
Copy link
Member

I wonder if the JIT is crashing (not asserting) with this PR?

There are similar failures in multiple PRs from what I've seen recently.

Noting that SPMI diffs is faulting in Python when processing the diffs summary:

Traceback (most recent call last):
  File "C:\h\w\B2080989\p\superpmi.py", line 4667, in <module>
    sys.exit(main(args))
  File "C:\h\w\B2080989\p\superpmi.py", line 4558, in main
    success = asm_diffs.replay_with_asm_diffs()
  File "C:\h\w\B2080989\p\superpmi.py", line 2019, in replay_with_asm_diffs
    self.write_asmdiffs_markdown_summary(write_fh, asm_diffs)
  File "C:\h\w\B2080989\p\superpmi.py", line 2170, in write_asmdiffs_markdown_summary
    write_row(*t)
  File "C:\h\w\B2080989\p\superpmi.py", line 2165, in write_row
    num_missed_base / num_contexts * 100,
ZeroDivisionError: division by zero

Replay is similarly failing with things like:

[17:42:50] Invoking: C:\h\w\B6DA095E\p\superpmi.exe -v ewi -r C:\h\w\B6DA095E\t\tmpdsiibugo\repro -p -jitoption JitStressRegs=0x80 -f C:\h\w\B6DA095E\t\tmpdsiibugo\coreclr_tests.run.windows.x64.checked.mch_fail.mcl -metricsSummary C:\h\w\B6DA095E\t\tmpdsiibugo\coreclr_tests.run.windows.x64.checked.mch_metrics.csv C:\h\w\B6DA095E\p\clrjit_win_x64_x64.dll C:\h\w\B6DA095E\p\artifacts\spmi\mch\02e334af-4e6e-4a68-9feb-308d3d2661bc.windows.x64\coreclr_tests.run.windows.x64.checked.mch
[17:42:50] ERROR: Couldn't load base metrics summary created by child process
[17:42:50] ERROR: Couldn't load base metrics summary created by child process
[17:42:50] ERROR: Couldn't load base metrics summary created by child process
[17:42:50] ERROR: Couldn't load base metrics summary created by child process
[17:42:50] General fatal error
[17:42:50] Running SuperPMI replay of C:\h\w\B6DA095E\p\artifacts\spmi\mch\02e334af-4e6e-4a68-9feb-308d3d2661bc.windows.x64\libraries.crossgen2.windows.x64.checked.mch

Did we recently get a JIT/EE version change and potentially not have up to date baselines?

@BruceForstall
Copy link
Member

Noting that SPMI diffs is faulting in Python when processing the diffs summary:

superpmi.py should gracefully handle that. But I think that's a symptom that superpmi.exe crashed.

Did we recently get a JIT/EE version change and potentially not have up to date baselines?

If the GUID changed, we wouldn't have this problem because we would only pick up matched MCH files.

If someone changed the interface without changing the GUID, then anything is possible.

@BruceForstall
Copy link
Member

@BruceForstall
Copy link
Member

@Ruihan-Yin The superpmi pipelines are now running clean. I would suggest to merge up to HEAD and re-push your PR to trigger re-run of these pipelines.

@Ruihan-Yin
Copy link
Contributor Author

Ruihan-Yin commented Jul 7, 2023

The superpmi pipelines are now running clean. I would suggest to merge up to HEAD and re-push your PR to trigger re-run of these pipelines.

Thanks for the notification, will resolve the reviews and rebase the branch soon.

and, andn, or, xor,
min, max,
div, mul, mull, sub,
variable shiftleftlogical/rightarithmetic/rightlogical
 JIT used to use a uniform intrinsic for bitwise operations with all data
 types, embedded broadcast is sensitive to input size in this case,
 adding a helper to let emitter aware when input size is long/ulong.
when embedded broadcast is actually enabled
There are cases when broadcast node are falsely contained by a embedded
broadcast compatible node, while the data type is actually not supported
Adding extra logics to avoid this situation.
instructions with either long or ulong as basetype should be reset to
qword instructions.
make the typecheck based on broadcast node it self.
use `varTypeIsSmall` type check to cover all the unsupported data type
in embedded broadcast.
1. put the IsBitwiseInstruction to a proper place.
2. nit: restored unnecessary line delete.
@Ruihan-Yin
Copy link
Contributor Author

Hi @BruceForstall, some tests are cancelled for some reasons, I suppose it is not relevant to the changes in this PR, shall I re-run the test again? And can you please review the new changes, if any. Thanks!

@@ -1230,6 +1230,30 @@ void CodeGen::inst_RV_RV_TT(
if (IsEmbBroadcast)
{
instOptions = INS_OPTS_EVEX_b;
if (emitter::IsBitwiseInstruction(ins) && varTypeIsLong(op2->AsHWIntrinsic()->GetSimdBaseType()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need IsBitwiseInstruction at all since we have a switch over ins anyway? (we can just remove the unreached in debug

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly not. I'll let Ruihan-Yin fix this in a follow up PR if they want to, that way we can land the important bit before the snap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback, we will fix this point in the following PR on embedded broadcast.

@tannergooding tannergooding merged commit ef9a07c into dotnet:main Jul 14, 2023
132 of 135 checks passed
@ghost ghost locked as resolved and limited conversation to collaborators Aug 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants