Default to -Z plt=yes #106380

MaskRay · 2023-01-02T20:09:38Z

6009da0 defaulted to -Z plt=no (like
clang -fno-plt), which is not a useful default[1].

On x86-64, if the target symbol is preemptible, there is an
R_X86_64_GLOB_DAT relocation, and the (very minor) optimization works as
intended. However, if the target is non-preemptible, i.e. the target is
resolved to the same component, this is actually a pessimization due to
the longer instruction.

On RISC architectures, there is typically no single instruction which
can load a GOT entry and perform an indirect call. -fno-plt has a longer
code sequence. For example, AArch64 needs 3 instructions:

adrp    x0, _GLOBAL_OFFSET_TABLE_
ldr     x0, [x0, #:gotpage_lo15:bar]
br      x0

This does not end up with a serious code size issue, because LLVM
"RtLibUseGOT" is not implemented for non-x86 targets.

On x86-32, very new lld[2] (2022-12-31) is needed to support
general-dynamic/local-dynamic TLS models.

-Z plt=no is not an appropriate default, so just default to -Z plt=yes
for all targets.

[1] https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table#fno-plt
[2] llvm/llvm-project@8dc7366

rustbot · 2023-01-02T20:09:46Z

r? @oli-obk

(rustbot has picked a reviewer for you, use r? to override)

rustbot · 2023-01-02T20:09:48Z

These commits modify compiler targets.
(See the Target Tier Policy.)

MaskRay · 2023-01-02T20:18:34Z

@GabrielMajeri FYI

GabrielMajeri · 2023-01-02T20:49:14Z

For the record, I'll leave a link to the original pull request which enabled -fno-plt by default: #54592. There's more discussion there, and a few synthetic benchmarks.

As for this:

> However, if the target is non-preemptible, i.e. the target is
> resolved to the same component, this is actually a pessimization due to
> the longer instruction.

I'm afraid I'm not familiar enough with CPU (micro)architectures to counter this claim. On x86-64, at the very least, the -fno-plt optimization seemed to do more good than harm, and it was enabled by default for all arches due to an implicit assumption that it would help them as well (although, again, I haven't benchmarked the changes for anything except x86).

nikic · 2023-01-02T22:41:42Z

@bors try @rust-timer queue

bors · 2023-01-02T22:41:50Z

⌛ Trying commit 5a3791f76dde2490094ebeb8977614a55a72fc25 with merge 5f4f27a0a98c60c9218675e35d2cc60c47546394...

tmiasko · 2023-01-03T09:42:20Z

@rust-timer build 5f4f27a0a98c60c9218675e35d2cc60c47546394

rust-timer · 2023-01-03T11:03:04Z

Finished benchmarking commit (5f4f27a0a98c60c9218675e35d2cc60c47546394): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.4%	[0.2%, 1.0%]	28
Regressions ❌ (secondary)	0.4%	[0.2%, 0.5%]	24
Improvements ✅ (primary)	-1.1%	[-2.3%, -0.4%]	34
Improvements ✅ (secondary)	-1.2%	[-2.0%, -0.6%]	11
All ❌✅ (primary)	-0.4%	[-2.3%, 1.0%]	62

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-1.2%	[-2.3%, -0.1%]	2
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-1.2%	[-2.3%, -0.1%]	2

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.9%	[0.9%, 2.8%]	15
Regressions ❌ (secondary)	2.2%	[1.5%, 3.5%]	10
Improvements ✅ (primary)	-1.7%	[-2.5%, -1.1%]	13
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.2%	[-2.5%, 2.8%]	28

krasimirgg · 2023-01-05T10:15:35Z

@rustbot label: +llvm-main

nikic · 2023-01-05T10:32:45Z

@krasimirgg Is that because llvm/llvm-project@2679e8b broke the Rust build? If so, consider this PR blocked on that change getting reverted. We can evaluate changing our defaults, but it's not okay to break LLVM and force us to change defaults because of that.

MaskRay · 2023-01-11T18:40:09Z

Rebase

nikic · 2023-01-21T17:39:55Z

Nominating for discussion at the next compiler meeting. The proposal here is to change the current -Z plt=no default to -Z plt=yes (ignoring some details about relro levels and target-specific overrides here). The blog post https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table has some background information on what GOT and PLT are.

In short, the basic idea is that if you call a function from a different shared object (say from libc.so), what actually gets called is a PLT stub, which then looks up the function to call in the GOT and calls it. With -Z plt=no, we instead directly look up the function in the GOT and call it. On x86-64, this can be done in a single instruction, which is one byte longer than the call to the PLT stub. This means we trade off one less level of call indirection for one byte more code size. (The PLT stub can also be used to perform lazy binding, but this is irrelevant in the context of this discussion.)
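The difference in call indirection described above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Got` struct, `plt_stub`, `got_call`, and `external_double` are hypothetical names, the GOT is modelled as a plain table of function pointers, and lazy binding is left out.

```rust
// Toy model only: the GOT is a table of function pointers that the dynamic
// loader would normally fill in. All names here are hypothetical.
type Func = fn(i32) -> i32;

// Stand-in for a function living in another shared object (e.g. libc.so).
fn external_double(x: i32) -> i32 {
    x * 2
}

struct Got {
    slots: Vec<Func>,
}

// With a PLT: the call site calls a stub, and the stub loads the GOT slot
// and transfers control to the real function.
fn plt_stub(got: &Got, slot: usize, arg: i32, hops: &mut u32) -> i32 {
    *hops += 1; // hop 1: call site -> PLT stub
    let target = got.slots[slot];
    *hops += 1; // hop 2: PLT stub -> real function
    target(arg)
}

// Without a PLT: the call site itself loads the GOT slot and calls through it.
fn got_call(got: &Got, slot: usize, arg: i32, hops: &mut u32) -> i32 {
    let target = got.slots[slot];
    *hops += 1; // single hop: call site -> real function
    target(arg)
}

fn main() {
    let got = Got { slots: vec![external_double as Func] };
    let (mut plt_hops, mut got_hops) = (0u32, 0u32);
    let a = plt_stub(&got, 0, 21, &mut plt_hops);
    let b = got_call(&got, 0, 21, &mut got_hops);
    println!("plt: result={a} hops={plt_hops}; got: result={b} hops={got_hops}");
}
```

Both paths reach the same function; the model only counts control transfers and says nothing about real-world timing.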

If the called function is in the same shared object / executable, it will get resolved to a direct call. In this case, the longer -Z plt=no encoding is a pure code size loss.

On non-x86 targets, the encoding for a GOT call rather than PLT call may be substantially larger. Rather than going from 5 to 6 bytes, it may go from 4 to 12 bytes (I guess that would be the case for the AArch64 example).
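As a rough worked example of the size arithmetic, the sketch below multiplies the per-call-site difference by a number of call sites. The byte counts are the ones quoted in this thread (5 vs 6 on x86-64, and the guessed 4 vs 12 on AArch64); `CallEncoding`, `extra_bytes`, and the call-site count are hypothetical and chosen purely for illustration.

```rust
// Bytes per call site, taken from the figures quoted in the discussion.
// The struct and function names are hypothetical.
#[derive(Clone, Copy)]
struct CallEncoding {
    arch: &'static str,
    plt_bytes: u32, // size of a call through the PLT (or a direct call)
    got_bytes: u32, // size of a call that loads the target from the GOT
}

const X86_64: CallEncoding = CallEncoding { arch: "x86-64", plt_bytes: 5, got_bytes: 6 };
const AARCH64: CallEncoding = CallEncoding { arch: "aarch64", plt_bytes: 4, got_bytes: 12 };

// Extra code size paid when every call site uses the GOT form.
fn extra_bytes(enc: CallEncoding, call_sites: u32) -> u32 {
    (enc.got_bytes - enc.plt_bytes) * call_sites
}

fn main() {
    let call_sites = 10_000; // hypothetical number of affected call sites
    for enc in [X86_64, AARCH64] {
        println!("{}: +{} bytes", enc.arch, extra_bytes(enc, call_sites));
    }
}
```

Under these assumptions the same number of call sites costs eight times as many extra bytes on AArch64 as on x86-64, which is the asymmetry the paragraph above points at.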

Relevant results from the perf run above are:

cycles: About 2% regressions on doc builds, i.e. generated code becomes 2% slower, for this workload. About 2% improvement on debug builds, which indicates that generating PLT binaries is faster.
binary size: Improvements in the 1% range.

I believe these are the relevant facts. My personal recommendation based on my understanding of the tradeoffs here would be to default to -Z plt=no for x86-64 only and use -Z plt=yes for all other targets. The current default was only ever performance evaluated for x86-64 (see #54592 for the original change), and the tradeoffs are clearly very different on other targets. However, for x86-64 there do seem to be clear performance benefits in typical usage scenarios, so I'm not convinced by the proposal to use -Z plt=yes for all targets, including x86-64.

I'd like some broader input from T-compiler on what to do here. Also cc @nagisa who reviewed the original PR.

Finally, I think whatever we do here, it probably makes sense to stabilize -C plt, so people can override this.
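The recommended default could be sketched roughly as follows. This is not rustc's actual implementation: `use_plt` and its parameters are hypothetical, and relro levels and target-specific overrides are deliberately ignored, as in the summary above.

```rust
// Hypothetical sketch of the recommended default; not rustc's real code.
// Relro levels and target-specific overrides are deliberately ignored.
fn use_plt(arch: &str, explicit: Option<bool>) -> bool {
    match explicit {
        // An explicit plt=yes/no from the user always wins.
        Some(choice) => choice,
        // Otherwise: skip the PLT on x86-64 only, use it everywhere else.
        None => arch != "x86_64",
    }
}

fn main() {
    for arch in ["x86_64", "aarch64", "riscv64"] {
        println!("{arch}: default plt = {}", use_plt(arch, None));
    }
    println!("x86_64 with explicit plt=yes: {}", use_plt("x86_64", Some(true)));
}
```

Keeping the user override as the first match arm mirrors the point that whatever the default ends up being, people should be able to override it.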

pnkfelix · 2023-01-26T16:32:42Z

I have filed an MCP for the plan regarding the change in defaults outlined by @nikic; that is rust-lang/compiler-team#581

(It probably would be good to expose control of the PLT via a proper -C flag as @nikic suggests, but I opted not to fold that into rust-lang/compiler-team#581.)

bors · 2023-02-21T13:07:51Z

☔ The latest upstream changes (presumably #108301) made this pull request unmergeable. Please resolve the merge conflicts.

durin42 · 2023-02-21T19:51:55Z

@rustbot label: -llvm-main

apiraino · 2023-03-09T11:31:09Z

The MCP has been approved: rust-lang/compiler-team#581

pnkfelix · 2023-03-23T14:17:49Z

@MaskRay can you revise this PR to reflect the refinement described in rust-lang/compiler-team#581 (namely, that the default should be PLT=yes for everything except x86_64)?

@rustbot label: -S-waiting-on-review +S-waiting-on-author

MaskRay · 2023-03-26T01:50:43Z

> @MaskRay can you revise this PR to reflect the refinement described in rust-lang/compiler-team#581 (namely, that the default should be PLT=yes for everything except x86_64)?
>
> @rustbot label: -S-waiting-on-review +S-waiting-on-author

I disagree with keeping -Z plt=no for x86-64. We should just use -Z plt=yes for all architectures.

nekopsykose · 2023-03-26T02:26:13Z

> I disagree with keeping -Z plt=no for x86-64. We should just use -Z plt=yes for all architectures.

for the sake of nonperfection, could you make the change anyway?

i fear this change would otherwise get lost in a cyclic argument, because both sides are correct in some capacity. on x86 specifically, in the most common case of no special user-optimised static linking overrides (few people are doing this in rust to my knowledge), plt=no is generally a small gain (per the rustc-itself benchmarks), by everything established above. i feel like everyone here has already discussed this to the fullest extent possible, and there is no new information left to take in on the subject.

but it would be nice to at least indeed change it everywhere else (as acked via MCP above), and people then would get a real gain.

Dylan-DPC · 2023-05-18T12:58:40Z

@MaskRay any updates on this?

rustc_session: default to -Z plt=yes on non-x86_64

Per the discussion in rust-lang#106380 plt=no isn't a great default, and rust-lang/compiler-team#581 decided that the default should be PLT=yes for everything except x86_64. Not everyone agrees about the x86_64 part of this change, but this at least is an improvement in the state of things without changing the x86_64 situation, so I've attempted making this change in the name of not letting the perfect be the enemy of the good.

Please let me know if I've messed this up somehow - I'm not wholly confident I got this right.

r? @nikic

nikic · 2023-06-27T20:12:18Z

Superseded by #109982.

rustbot assigned oli-obk Jan 2, 2023

rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) labels Jan 2, 2023

MaskRay force-pushed the needs_plt branch from 1aa438f to 8032ec4 Compare January 2, 2023 20:14

MaskRay mentioned this pull request Jan 2, 2023

Support for disabling PLT for better function call performance #54592

Merged