Use load+store instead of memcpy for small integer arrays #111999

Merged · 2 commits into rust-lang:master from scottmcm:codegen-less-memcpy · Jun 6, 2023

Conversation

@scottmcm (Member) commented May 26, 2023

I was inspired by #98892 to see whether, rather than making mem::swap do something smart in the library, we could update MIR assignments like *_1 = *_2 to do something smarter than memcpy for sufficiently-small types, where doing the copy inline beats a memcpy call in assembly anyway. After all, special code may help mem::swap, but if the "obvious" MIR can just produce the right thing, that helps everything: other code like mem::replace, people doing it manually, and passing around by value in general. It also makes MIR inlining happier, since it doesn't need to deal with all the complicated library code if it just sees a couple of assignments.
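
For concreteness, here is a manual version of the pattern this targets (my sketch, not from the PR): a hand-written swap is just three `*_1 = *_2`-style copies in MIR, and now gets the same load/store lowering as the library helpers.

```rust
// Minimal sketch (assumed names): each line below is one MIR copy of a
// small integer array, which this PR lowers as a typed load+store pair
// instead of a memcpy call.
pub fn manual_swap(x: &mut [u16; 3], y: &mut [u16; 3]) {
    let tmp = *x; // copy out of *x
    *x = *y;      // copy *y into *x
    *y = tmp;     // copy tmp into *y
}
```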

LLVM will turn short, known-length memcpys into direct instructions in the backend, but by then it's too late for it to remove the allocas. In general, replacing memcpys with typed instructions is hard in the middle-end -- even for memcpy.inline, where it knows the copy won't be a function call -- because of poison-propagation issues. But because rustc knows more about the type invariants -- these are typed copies -- it can emit something more specific, allowing LLVM to mem2reg away the allocas in some situations.

#52051 previously did something like this in the library for mem::swap, but it regressed when MIR inlining was enabled (cbbf06b), so this has been suboptimal on stable for ≈5 releases now.

The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the LayoutTypeMethods trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them. I don't want to try to bite off too much in this PR, though. (Transparent newtypes and simple things like the 3×usize String would be obvious candidates for a follow-up.)
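
Concretely, the hook looks like this (the signature is as quoted in the review thread below; the default body returning `None`, i.e. "keep using memcpy", is my assumption about the fallback):

```rust
// On the LayoutTypeMethods trait in cg_ssa. Returning Some(backend type)
// asks the copy path to emit a typed load+store of that type; None keeps
// the plain memcpy lowering. (Default body is an assumption.)
fn scalar_copy_backend_type(&self, layout: TyAndLayout<'tcx>) -> Option<Self::Type> {
    let _ = layout;
    None
}
```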

Codegen demonstrations: https://llvm.godbolt.org/z/fK8hT9aqv
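
For reference, the swap demonstration corresponds to source roughly like this (reconstructed from the IR signatures -- `mem::swap` of a 6-byte `[u16; 3]`, i.e. an RGB48 pixel; not copied from the godbolt page):

```rust
use std::mem;

// Reconstructed sketch of the demo function: swap two RGB48 pixels
// ([u16; 3], 6 bytes) through references.
#[no_mangle]
pub fn swap_rgb48(x: &mut [u16; 3], y: &mut [u16; 3]) {
    mem::swap(x, y);
}
```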

Before:

define void @swap_rgb48_old(ptr noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 {
  %a.i = alloca [3 x i16], align 2
  call void @llvm.lifetime.start.p0(i64 6, ptr nonnull %a.i)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false)
  tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false)
  call void @llvm.lifetime.end.p0(i64 6, ptr nonnull %a.i)
  ret void
}

Note it going through the stack:

swap_rgb48_old:                         # @swap_rgb48_old
        movzx   eax, word ptr [rdi + 4]
        mov     word ptr [rsp - 4], ax
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rsp - 8], eax
        movzx   eax, word ptr [rsi + 4]
        mov     word ptr [rdi + 4], ax
        mov     eax, dword ptr [rsi]
        mov     dword ptr [rdi], eax
        movzx   eax, word ptr [rsp - 4]
        mov     word ptr [rsi + 4], ax
        mov     eax, dword ptr [rsp - 8]
        mov     dword ptr [rsi], eax
        ret

Now:

define void @swap_rgb48(ptr noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 {
start:
  %0 = load <3 x i16>, ptr %x, align 2
  %1 = load <3 x i16>, ptr %y, align 2
  store <3 x i16> %1, ptr %x, align 2
  store <3 x i16> %0, ptr %y, align 2
  ret void
}

still lowers to dword+word operations, but has no stack traffic:

swap_rgb48:                             # @swap_rgb48
        mov     eax, dword ptr [rdi]
        movzx   ecx, word ptr [rdi + 4]
        movzx   edx, word ptr [rsi + 4]
        mov     r8d, dword ptr [rsi]
        mov     dword ptr [rdi], r8d
        mov     word ptr [rdi + 4], dx
        mov     word ptr [rsi + 4], cx
        mov     dword ptr [rsi], eax
        ret

And as a demonstration that this isn't just mem::swap: a mem::replace on a small array (replace doesn't use swap since #83022), which used to be memcpys in LLVM, now changes in IR to

define void @replace_short_array(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}

but that still lowers to reasonable dword+qword instructions:

replace_short_array:                    # @replace_short_array
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     edi, dword ptr [rsi + 8]
        mov     dword ptr [rax + 8], edi
        mov     qword ptr [rax], rcx
        mov     rcx, qword ptr [rdx]
        mov     edx, dword ptr [rdx + 8]
        mov     dword ptr [rsi + 8], edx
        mov     qword ptr [rsi], rcx
        ret
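
The source behind this demonstration is roughly the following (reconstructed: the `sret([3 x i32])` return slot and `readonly` by-value argument in the IR match `mem::replace` on a `[u32; 3]`):

```rust
use std::mem;

// Reconstructed sketch: swap v into *r and return the old contents --
// twelve bytes each way, with no memcpy needed after this PR.
#[no_mangle]
pub fn replace_short_array(r: &mut [u32; 3], v: [u32; 3]) -> [u32; 3] {
    mem::replace(r, v)
}
```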

@rustbot (Collaborator) commented May 26, 2023

r? @b-naber

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 26, 2023

@scottmcm (Member, Author) commented:

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 26, 2023
@bors (Contributor) commented May 26, 2023

⌛ Trying commit 44c45ca0492db8f40cdcb19ff7dbfca05e95be75 with merge daa8f4a9835d98cf1139c088d10b29b4e8e084a5...

@the8472 (Member) commented May 26, 2023

> since replace doesn't use swap since #83226)

Wrong link?

@scottmcm (Member, Author) commented:

@the8472 Oops, yup, wrong link. Should be #83022

@bors (Contributor) commented May 26, 2023

☀️ Try build successful - checks-actions
Build commit: daa8f4a9835d98cf1139c088d10b29b4e8e084a5 (daa8f4a9835d98cf1139c088d10b29b4e8e084a5)



@rust-timer (Collaborator) commented:

Finished benchmarking commit (daa8f4a9835d98cf1139c088d10b29b4e8e084a5): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 0.7% | [0.7%, 0.7%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.1% | [3.1%, 3.1%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -1.4% | [-1.9%, -1.0%] | 2 |
| Improvements ✅ (secondary) | -2.2% | [-2.7%, -1.5%] | 3 |
| All ❌✅ (primary) | 0.1% | [-1.9%, 3.1%] | 3 |

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 1.7% | [1.2%, 2.1%] | 2 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.0% | [0.0%, 0.0%] | 4 |
| Regressions ❌ (secondary) | 0.7% | [0.3%, 2.9%] | 8 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [0.0%, 0.0%] | 4 |

Bootstrap: 643.928s -> 645.194s (0.20%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 26, 2023
@b-naber (Contributor) commented May 28, 2023

r? compiler

@rustbot rustbot assigned compiler-errors and unassigned b-naber May 28, 2023
if flags == MemFlags::empty()
&& let Some(bty) = bx.cx().scalar_copy_backend_type(layout)
{
// I look forward to only supporting opaque pointers
A reviewer (Member) commented:

Maybe a FIXME so someone may remove this when llvm only has opaque ptrs? (Well, I guess removing this logic would also preclude other backends with typed ptrs, too. In that case, maybe no comment at all.)

@scottmcm (Member, Author) replied:

Yeah, it's a bigger question because GCC would need to move, so I'll leave it tracked by conversations like https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/llvm.20bitcasts.20in.20codegen/near/356591425 instead of something specific in this place.

///
/// This *should* return `None` for particularly-large types, where leaving
/// the `memcpy` may well be important to avoid code size explosion.
fn scalar_copy_backend_type(&self, layout: TyAndLayout<'tcx>) -> Option<Self::Type> {
A reviewer (Member) commented:

Having a default impl here may make it not obvious that other backends should emit typed load/store. Is it bad style to just add this to cranelift and codegen_gcc here too?

@scottmcm (Member, Author) replied:

Well, cg_clif doesn't use cg_ssa, so it's not impacted.

I have no practical way to test cg_gcc, being on Windows, and I also don't have any information that doing something other than memcpy would actually be an improvement for it, so I figured I'd just leave the default here, since it's a semantically sufficient implementation.

Notably, GCC apparently doesn't have the poison semantics that nikic mentioned as the obstacle to optimizing this better:

fn const_poison(&self, typ: Type<'gcc>) -> RValue<'gcc> {
    // No distinction between undef and poison.
    self.const_undef(typ)
}

so indeed it might just never need to do this.
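
For intuition on the poison point (my illustration, not something from the thread): a byte-wise memcpy keeps poison confined to the byte it came from, while a single wide load makes the whole loaded value poison if any byte is, so the middle-end can't do this rewrite in general. rustc, knowing the Rust-level type, can pick cases where the distinction doesn't matter -- such as the padding-free integer arrays this PR limits itself to.

```rust
// Illustration (assumed example, modeling the poison concern described
// above). For a padding-free integer array, every byte is part of an
// initialized integer, so a typed copy is equivalent to the memcpy:
pub fn copy_no_padding(src: &[u16; 3]) -> [u16; 3] {
    *src // no padding: a typed <3 x i16>-style copy loses nothing
}

// A padded type is the kind of case a middle-end rewrite has to worry
// about: bytes 1..4 here are padding (undef), and folding them into one
// 8-byte typed load would fold that undef into the loaded value instead
// of preserving it byte-by-byte the way memcpy does.
#[repr(C)]
pub struct Padded {
    pub a: u8,  // byte 0; bytes 1..4 are padding
    pub b: u32, // bytes 4..8
}
```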

Comment on lines +388 to +389
let src = bx.pointercast(src, pty);
let dst = bx.pointercast(dst, pty);
A reviewer (Member) commented:

When do these not match pty? Why not just return a more "accurate" type in scalar_copy_llvm_type? Like the actual LLVM type that corresponds to src/dst? (Or am I misunderstanding something?)

@scottmcm (Member, Author) replied May 29, 2023:

(Well, the shallow answer is that they don't match because it's -> Option<Type>, not should_load_store_instead_of_memcpy() -> bool.)

Because emitting these as the backend type doesn't necessarily do what we want. Notably, this PR is emitting the load/store as <4 x i8>, the vector type, rather than the llvm_backend_type of [4 x i8] for arrays.

I tried using the LLVM type first, but with arrays that results in exploding out the IR: https://llvm.godbolt.org/z/vjjsdea9e

Optimizing the version that loads/stores arrays

define void @replace_short_array_using_arrays(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load [3 x i32], ptr %r, align 4
  store [3 x i32] %1, ptr %0, align 4
  %2 = load [3 x i32], ptr %v, align 4
  store [3 x i32] %2, ptr %r, align 4
  ret void
}

gives

define void @replace_short_array_using_arrays(ptr noalias nocapture noundef writeonly sret([3 x i32]) dereferenceable(12) %0, ptr noalias nocapture noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
  %.unpack = load i32, ptr %r, align 4
  %.elt1 = getelementptr inbounds [3 x i32], ptr %r, i64 0, i64 1
  %.unpack2 = load i32, ptr %.elt1, align 4
  %.elt3 = getelementptr inbounds [3 x i32], ptr %r, i64 0, i64 2
  %.unpack4 = load i32, ptr %.elt3, align 4
  store i32 %.unpack, ptr %0, align 4
  %.repack5 = getelementptr inbounds [3 x i32], ptr %0, i64 0, i64 1
  store i32 %.unpack2, ptr %.repack5, align 4
  %.repack7 = getelementptr inbounds [3 x i32], ptr %0, i64 0, i64 2
  store i32 %.unpack4, ptr %.repack7, align 4
  %.unpack9 = load i32, ptr %v, align 4
  %.elt10 = getelementptr inbounds [3 x i32], ptr %v, i64 0, i64 1
  %.unpack11 = load i32, ptr %.elt10, align 4
  %.elt12 = getelementptr inbounds [3 x i32], ptr %v, i64 0, i64 2
  %.unpack13 = load i32, ptr %.elt12, align 4
  store i32 %.unpack9, ptr %r, align 4
  store i32 %.unpack11, ptr %.elt1, align 4
  store i32 %.unpack13, ptr %.elt3, align 4
  ret void
}

whereas optimizing the version with vectors leaves the operations together

define void @replace_short_array_using_vectors(ptr noalias nocapture noundef writeonly sret([3 x i32]) dereferenceable(12) %0, ptr noalias nocapture noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}

for the backend to legalize instead.

@scottmcm (Member, Author) added:

Oh, and I originally expected to do this with LLVM's arbitrary-length integer support https://llvm.godbolt.org/z/hzG9aeqhM, like I did for comparisons back in #85828

define void @replace_short_array_using_long_integer(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load i96, ptr %r, align 4
  store i96 %1, ptr %0, align 4
  %2 = load i96, ptr %v, align 4
  store i96 %2, ptr %r, align 4
  ret void
}

But that broke some of the autovectorization codegen tests for operations on arrays.

The reviewer (Member) replied:

Yeah ok, so we explicitly want to be lowering these moves to vector types specifically because they can optimize better than arrays.
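
In test form, that expectation looks something like the following (a hypothetical FileCheck-style codegen test sketch; the CHECK lines are mine, inferred from the IR shown above, not copied from the PR's actual tests):

```rust
// compile-flags: -O
#![crate_type = "lib"]

// CHECK-LABEL: @swap_small_array
#[no_mangle]
pub fn swap_small_array(x: &mut [u16; 3], y: &mut [u16; 3]) {
    // CHECK-NOT: alloca
    // CHECK: load <3 x i16>
    // CHECK: load <3 x i16>
    // CHECK: store <3 x i16>
    // CHECK: store <3 x i16>
    // CHECK-NOT: call void @llvm.memcpy
    std::mem::swap(x, y)
}
```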

(A further review thread on tests/codegen/array-map.rs was marked outdated and resolved.)
@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 31, 2023
@rustbot rustbot removed the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Jun 4, 2023
@bors bors merged commit fd9bf59 into rust-lang:master Jun 6, 2023
@rustbot rustbot added this to the 1.72.0 milestone Jun 6, 2023
@scottmcm scottmcm deleted the codegen-less-memcpy branch June 6, 2023 04:44
@rust-timer (Collaborator) commented:

Finished benchmarking commit (fd9bf59): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.2% | [0.1%, 5.7%] | 3 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.3% | [-0.3%, -0.3%] | 4 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 3.0% | [1.1%, 4.8%] | 2 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.4% | [2.4%, 2.4%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.3% | [0.0%, 0.5%] | 12 |
| Regressions ❌ (secondary) | 0.6% | [0.2%, 2.9%] | 68 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.3% | [0.0%, 0.5%] | 12 |

Bootstrap: 647.471s -> 646.658s (-0.13%)

@rustbot rustbot added the perf-regression Performance regression. label Jun 6, 2023
@pnkfelix (Member) commented:

  • changes are all to secondary benchmarks
  • only notable was coercions debug incr-full regressing by 5.7%, but that's acceptable given what we expect benefits to be here w.r.t. codegen.
  • marking as triaged.

@rustbot label: +perf-regression-triaged

@tpelkone commented:

I noticed a quite large (15-20%) performance regression in an analytical database I'm building, involving [u8; 24] and [u8; 28] arrays. I bisected the regression to between nightly commits e6d4725 and b2b34bd. This PR was the obvious suspect, since it changed how arrays that small are copied around, and there's a lot of small arrays being moved around in the database I'm building. I didn't notice any performance regressions for other data types (u128 or u64, etc.). I've only tested on aarch64-apple-darwin so far. I haven't published my work yet, but I can come up with a benchmark program if needed.

@lqd (Member) commented Aug 25, 2023

(In case you don’t already know about it: cargo-bisect-rustc should bisect within the rollup PR of that last commit, to hopefully give you the offending PR or in/validate it’s this one)

@tpelkone commented:

I'll verify the offending PR using cargo-bisect-rustc. In the meantime here's a simple benchmark which reproduces the issue on my Macbook M1 Pro:

#![feature(test)]
extern crate test;
use test::Bencher;

#[bench]
fn bench(b: &mut Bencher) {
    b.iter(|| {
        let data = vec![[1u8; 24]; 1024 * 1024];
        let mut new_vec = Vec::with_capacity(1024 * 1024);
        for entry in data {
            new_vec.push(entry);
        }
    });
}

$ cargo +nightly-2023-06-06-aarch64-apple-darwin bench
    Finished bench [optimized] target(s) in 0.00s
     Running unittests src/main.rs (target/release/deps/benchmarks-0c0dfc3834e16b96)

running 1 test
test bench ... bench:   5,899,666 ns/iter (+/- 137,707)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out; finished in 5.34s
$ cargo +nightly-2023-06-07-aarch64-apple-darwin bench
   Compiling benchmarks v0.1.0 (/Users/tuomas/rust/benchmarks)
    Finished bench [optimized] target(s) in 0.18s
     Running unittests src/main.rs (target/release/deps/benchmarks-0c0dfc3834e16b96)

running 1 test
test bench ... bench:   9,094,616 ns/iter (+/- 46,638)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out; finished in 8.23s

@the8472 (Member) commented Aug 25, 2023

Please open an issue so we can track this. Closed PRs aren't a good place for it.

@tpelkone commented:

Will do.

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 29, 2024
Remove my `scalar_copy_backend_type` optimization attempt

I added this back in rust-lang#111999, but I no longer think it's a good idea:
- It had to get scaled back to only power-of-two things to not break a bunch of targets
- LLVM seems to be getting better at memcpy removal anyway
- Introducing vector instructions has seemed to sometimes (rust-lang#115515 (comment)) make autovectorization worse

So this removes it from the codegen crates entirely, and instead just tries to use <https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/builder/trait.BuilderMethods.html#method.typed_place_copy> instead of direct `memcpy` so things will still use load/store when a type isn't `OperandValue::Ref`.