Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a real InstCombine MIR pass #105808

Closed
wants to merge 1 commit into from
Closed

Conversation

saethlin
Copy link
Member

@saethlin saethlin commented Dec 17, 2022

This pass is built around an observation that in optimized MIR, after inlining, we often end up with a lot of temporary locals that do not even escape their block. When the temporary is only used in two statements which are adjacent, we can fold it away.

This happens with place projections:

_2 = _1.a; // First use of _2
_3 = _2.b; // Last use of _2

With pointer casts (and also numeric casts, but those are more complicated):

_2 = move _1 as *mut T (PtrToPtr);
_3 = move _2 as *mut U (PtrToPtr);

And also with temporary (including raw) references:

_2 = &_1;
_3 = *_2;

r? @cjgillot

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 17, 2022
@saethlin
Copy link
Member Author

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 17, 2022
@bors
Copy link
Contributor

bors commented Dec 17, 2022

⌛ Trying commit 6b9e5d1d28e19dbdd14c6b1993cec3e0ee2283df with merge 50bd766f65a66a9b9b92fccecf21d94135864fc2...

@saethlin
Copy link
Member Author

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@bors
Copy link
Contributor

bors commented Dec 17, 2022

⌛ Trying commit 9a6c0dc50c5886bd7906450bc0a7001ac1b2bf1d with merge 186ff00597beff5caa4a628567a8b026333c6a8f...

@bors
Copy link
Contributor

bors commented Dec 17, 2022

💔 Test failed - checks-actions

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 17, 2022
@saethlin
Copy link
Member Author

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@bors
Copy link
Contributor

bors commented Dec 17, 2022

⌛ Trying commit 9db861ff637d5add29e5102d4ca82d83f57ab2da with merge 139f427fcbd48a5c8f71a159f3d1be058bf7fffd...

@bors
Copy link
Contributor

bors commented Dec 17, 2022

💔 Test failed - checks-actions

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@bors
Copy link
Contributor

bors commented Dec 18, 2022

☔ The latest upstream changes (presumably #105876) made this pull request unmergeable. Please resolve the merge conflicts.

@saethlin
Copy link
Member Author

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@bors
Copy link
Contributor

bors commented Feb 19, 2023

⌛ Trying commit cdec0209cc3a9ff8b0fb0296e908481459fabc39 with merge 995f7db06ddfd96031de6e2fcf91247b1416dcbf...

@saethlin
Copy link
Member Author

I opened #111518 for the niche problem. I just want to land this instead of letting it languish. I think even with a little hack it stands on it own.

@saethlin saethlin marked this pull request as ready for review May 12, 2023 22:45
@rustbot
Copy link
Collaborator

rustbot commented May 12, 2023

Some changes occurred to MIR optimizations

cc @rust-lang/wg-mir-opt

@bors
Copy link
Contributor

bors commented May 13, 2023

☔ The latest upstream changes (presumably #111447) made this pull request unmergeable. Please resolve the merge conflicts.

// _2 = &_1;
// _3 = *_2;
// Into:
// _3 = _1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the explanation why this is correct?
This boils down to the second statement being the only use of _2, but that needs to be explicit.

Could you specify in the comment which among _1/_2/_3 is temp_place, temp_rvalue...

let Some(second_place) = final_operand.place() else { return None; };
if second_place.projection.get(0) != Some(&ProjectionElem::Deref) {
return None;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you assert that Some(second_place.local) == _temp_place.as_local())?

compiler/rustc_mir_transform/src/instcombine.rs Outdated Show resolved Hide resolved

if second_place.projection.get(0) != Some(&ProjectionElem::Deref) {
return None;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, an assertion on second_place.local could be useful.

compiler/rustc_mir_transform/src/instcombine.rs Outdated Show resolved Hide resolved
// our StorageDead statements into the last slots, and our new statement somewhere in
// the middle.
// This ensures that we do not change the location of any statements that we have not
// optimized, which minimizes the amount of our analysis that we have invalidated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if:

_2 = _1 as *const u8;
StorageDead(_4)
StorageLive(_4)
_3 = _2 as *const ();

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_4 isn't mentioned in the combined statement, so those statements would be ignored. Did you mean _3?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. My question is about the StorageDead-then-StorageLive getting swapped, so if that's the one triggering the behaviour, _3 it is.

for (slot, statement) in slots.iter().rev().zip(storage_dead.into_iter().rev()) {
assert!(matches!(statements[*slot].kind, StatementKind::Nop));
statements[*slot] = statement;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment explaining why we don't add those storage statements to invalidated_statements?

}
PlaceContext::MutatingUse(
MutatingUseContext::Borrow
| MutatingUseContext::Projection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Projection cannot happen in visit_place.

Comment on lines 375 to 416
PlaceContext::MutatingUse(
MutatingUseContext::Borrow
| MutatingUseContext::Projection
| MutatingUseContext::AddressOf,
) => {
self.analysis[place.local].read.insert(location);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch needs an explanation.
We usually consider all refs and addressofs, either mutable or immutable, to be writes in optimizations.
Why are they all reads here?

@scottmcm
Copy link
Member

If I could propose a test here, https://rust.godbolt.org/z/zh8EP9KcT

pub fn simple_swap<T>(x: &mut T, y: &mut T) {
    use std::ptr::{read, write};
    unsafe {
        let temp = read(x);
        write(x, read(y));
        write(y, temp);
    }
}

Today that's

    bb0: {
        _4 = &raw const (*_1);           // scope 1 at /app/example.rs:4:25: 4:26
        _3 = (*_4);                      // scope 4 at /rustc/2c41369acc445d04129db40ba998dd7a89fb0d2e/library/core/src/ptr/mod.rs:1172:9: 1172:46
        _5 = &raw mut (*_1);             // scope 2 at /app/example.rs:5:15: 5:16
        _7 = &raw const (*_2);           // scope 2 at /app/example.rs:5:23: 5:24
        _6 = (*_7);                      // scope 7 at /rustc/2c41369acc445d04129db40ba998dd7a89fb0d2e/library/core/src/ptr/mod.rs:1172:9: 1172:46
        (*_5) = move _6;                 // scope 10 at /rustc/2c41369acc445d04129db40ba998dd7a89fb0d2e/library/core/src/ptr/mod.rs:1370:9: 1370:45
        _8 = &raw mut (*_2);             // scope 2 at /app/example.rs:6:15: 6:16
        (*_8) = move _3;                 // scope 13 at /rustc/2c41369acc445d04129db40ba998dd7a89fb0d2e/library/core/src/ptr/mod.rs:1370:9: 1370:45
        return;                          // scope 0 at /app/example.rs:8:2: 8:2
    }

but with only combining adjacent statements I think that could inprove all the way down to

    bb0: {
        _3 = (*_1);                      // scope 4 at /rustc/2c41369acc445d04129db40ba998dd7a89fb0d2e/library/core/src/ptr/mod.rs:1172:9: 1172:46
        (*_1) = (*_2);                   // scope 10 at /rustc/2c41369acc445d04129db40ba998dd7a89fb0d2e/library/core/src/ptr/mod.rs:1370:9: 1370:45
        (*_2) = move _3;                 // scope 13 at /rustc/2c41369acc445d04129db40ba998dd7a89fb0d2e/library/core/src/ptr/mod.rs:1370:9: 1370:45
        return;                          // scope 0 at /app/example.rs:8:2: 8:2
    }

using mostly things that it looks like you might have already implemented here.

@Dylan-DPC
Copy link
Member

@saethlin any updates on this?

@saethlin
Copy link
Member Author

This has only been idle for 5 days. I'm having a busy week and haven't had much time for volunteering.

@Dylan-DPC
Copy link
Member

My bad :P oversaw that part :P no worries

@saethlin
Copy link
Member Author

Just rebasing, don't get excited just yet

@rust-log-analyzer

This comment has been minimized.

@bors
Copy link
Contributor

bors commented Jun 19, 2023

☔ The latest upstream changes (presumably #112238) made this pull request unmergeable. Please resolve the merge conflicts.

@rust-log-analyzer
Copy link
Collaborator

The job mingw-check-tidy failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
Prepare all required actions
Getting action download info
Download action repository 'actions/checkout@v4' (SHA:b4ffde65f46336ab88eb53be808477a3936bae11)
Download action repository 'actions/upload-artifact@v3' (SHA:a8a3f3ad30e3422c9c7b888a15615d19a852ae32)
Complete job name: PR - mingw-check-tidy
git config --global core.autocrlf false
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
---
Removing intermediate container be00e418118b
 ---> b7b02b74592c
Step 6/10 : COPY host-x86_64/mingw-check/reuse-requirements.txt /tmp/
 ---> 491731421d92
Step 7/10 : RUN pip3 install --no-deps --no-cache-dir --require-hashes -r /tmp/reuse-requirements.txt     && pip3 install virtualenv
Collecting binaryornot==0.4.4
  Downloading binaryornot-0.4.4-py2.py3-none-any.whl (9.0 kB)
Collecting boolean-py==4.0
  Downloading boolean.py-4.0-py3-none-any.whl (25 kB)
---
Building wheels for collected packages: reuse
  Building wheel for reuse (pyproject.toml): started
  Building wheel for reuse (pyproject.toml): finished with status 'done'
  Created wheel for reuse: filename=reuse-1.1.0-cp310-cp310-manylinux_2_35_x86_64.whl size=180117 sha256=2196c9034bf565528bbb1ee6dad4f753eb813f58822363e6b768f09c73e4d4ff
  Stored in directory: /tmp/pip-ephem-wheel-cache-oamd25x6/wheels/c2/3c/b9/1120c2ab4bd82694f7e6f0537dc5b9a085c13e2c69a8d0c76d
Installing collected packages: boolean-py, binaryornot, setuptools, reuse, python-debian, markupsafe, license-expression, jinja2, chardet
  Attempting uninstall: setuptools
    Found existing installation: setuptools 59.6.0
    Not uninstalling setuptools at /usr/lib/python3/dist-packages, outside environment /usr
    Not uninstalling setuptools at /usr/lib/python3/dist-packages, outside environment /usr
    Can't uninstall 'setuptools'. No files were found to uninstall.
Successfully installed binaryornot-0.4.4 boolean-py-4.0 chardet-5.1.0 jinja2-3.1.2 license-expression-30.0.0 markupsafe-2.1.1 python-debian-0.1.49 reuse-1.1.0 setuptools-66.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting virtualenv
  Downloading virtualenv-20.24.6-py3-none-any.whl (3.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.8/3.8 MB 62.4 MB/s eta 0:00:00
Collecting platformdirs<4,>=3.9.1
  Downloading platformdirs-3.11.0-py3-none-any.whl (17 kB)
Collecting filelock<4,>=3.12.2
  Downloading filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting distlib<1,>=0.3.7
  Downloading distlib-0.3.7-py2.py3-none-any.whl (468 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 468.9/468.9 KB 127.4 MB/s eta 0:00:00
Installing collected packages: distlib, platformdirs, filelock, virtualenv
Successfully installed distlib-0.3.7 filelock-3.13.1 platformdirs-3.11.0 virtualenv-20.24.6
Removing intermediate container 3f9a3a5770c8
 ---> 15ae60e6f0a5
Step 8/10 : COPY host-x86_64/mingw-check/validate-toolstate.sh /scripts/
 ---> 48cc9780f505
 ---> 48cc9780f505
Step 9/10 : COPY host-x86_64/mingw-check/validate-error-codes.sh /scripts/
 ---> cec34b512e31
Step 10/10 : ENV SCRIPT TIDY_PRINT_DIFF=1 python2.7 ../x.py test            --stage 0 src/tools/tidy tidyselftest --extra-checks=py:lint
Removing intermediate container 93629dd9b190
 ---> a16921979805
Successfully built a16921979805
Successfully tagged rust-ci:latest
Successfully tagged rust-ci:latest
##[endgroup]
Built container sha256:a16921979805e0c075820a6b186dc881555a2dc9aece59984be206fdfb36f80e
Uploading finished image sha256:a16921979805e0c075820a6b186dc881555a2dc9aece59984be206fdfb36f80e to https://ci-caches.rust-lang.org/docker/8849b25aebb63c7041ab10114da59fac9c6c89ff409673e53f6251b7e63c69daeaca7298d30885d05004ab27b231421908523f297222d07a53450f37e4691d72
IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
a16921979805   1 second ago     /bin/sh -c #(nop)  ENV SCRIPT=TIDY_PRINT_DIF…   0B        
48cc9780f505   2 seconds ago    /bin/sh -c #(nop) COPY file:078ea1d11e7b7cda…   367B      
15ae60e6f0a5   3 seconds ago    |1 DEBIAN_FRONTEND=noninteractive /bin/sh -c…   23.9MB    
491731421d92   9 seconds ago    /bin/sh -c #(nop) COPY file:ac591dd6bc5afa66…   5.33kB    
b7b02b74592c   10 seconds ago   |1 DEBIAN_FRONTEND=noninteractive /bin/sh -c…   23.1MB    
---
<missing>      3 weeks ago      /bin/sh -c #(nop)  LABEL org.opencontainers.…   0B        
<missing>      3 weeks ago      /bin/sh -c #(nop)  ARG LAUNCHPAD_BUILD_ARCH     0B        
<missing>      3 weeks ago      /bin/sh -c #(nop)  ARG RELEASE                  0B        

<botocore.awsrequest.AWSRequest object at 0x7ff75b3eef10>
gzip: stdout: Broken pipe
xargs: docker: terminated by signal 13
https://ci-caches.rust-lang.org/docker/8849b25aebb63c7041ab10114da59fac9c6c89ff409673e53f6251b7e63c69daeaca7298d30885d05004ab27b231421908523f297222d07a53450f37e4691d72
sha256:a16921979805e0c075820a6b186dc881555a2dc9aece59984be206fdfb36f80e
---
DirectMap4k:      182208 kB
DirectMap2M:     5060608 kB
DirectMap1G:    13631488 kB
##[endgroup]
Executing TIDY_PRINT_DIFF=1 python2.7 ../x.py test            --stage 0 src/tools/tidy tidyselftest --extra-checks=py:lint
+ TIDY_PRINT_DIFF=1 python2.7 ../x.py test --stage 0 src/tools/tidy tidyselftest --extra-checks=py:lint
    Finished dev [unoptimized] target(s) in 0.03s
##[endgroup]
downloading https://ci-artifacts.rust-lang.org/rustc-builds-alt/ffb7ed9fa420e9bcd98d84b431b2009445b7b967/rust-dev-nightly-x86_64-unknown-linux-gnu.tar.xz
extracting /checkout/obj/build/cache/llvm-ffb7ed9fa420e9bcd98d84b431b2009445b7b967-true/rust-dev-nightly-x86_64-unknown-linux-gnu.tar.xz to /checkout/obj/build/x86_64-unknown-linux-gnu/ci-llvm
---
    Finished release [optimized] target(s) in 25.67s
##[endgroup]
fmt check
tidy check
tidy error: the following output file is not associated with any mir-opt test, you can remove it: /checkout/tests/mir-opt/nested_getter.outer_get.InstCombine.diff
tidy error: the following output file is not associated with any mir-opt test, you can remove it: /checkout/tests/mir-opt/casts.roundtrip.PreCodegen.after.mir
removing old virtual environment
removing old virtual environment
creating virtual environment at '/checkout/obj/build/venv' using 'python3.10'
Requirement already satisfied: pip in ./build/venv/lib/python3.10/site-packages (23.3.1)
Collecting black==23.3.0 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 7))
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 32.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 32.8 MB/s eta 0:00:00
Collecting click==8.1.3 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 34))
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Collecting importlib-metadata==6.7.0 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 38))
  Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
  Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Collecting mypy-extensions==1.0.0 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 42))
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Collecting packaging==23.1 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 46))
  Downloading packaging-23.1-py3-none-any.whl (48 kB)
Collecting pathspec==0.11.1 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 50))
  Downloading pathspec-0.11.1-py3-none-any.whl (29 kB)
  Downloading pathspec-0.11.1-py3-none-any.whl (29 kB)
Collecting platformdirs==3.6.0 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 54))
  Downloading platformdirs-3.6.0-py3-none-any.whl (16 kB)
Collecting ruff==0.0.272 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 58))
  Downloading ruff-0.0.272-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
Collecting tomli==2.0.1 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 77))
  Downloading tomli-2.0.1-py3-none-any.whl (12 kB)
Collecting typed-ast==1.5.4 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 81))
  Downloading typed_ast-1.5.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (877 kB)
  Downloading typed_ast-1.5.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (877 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 877.7/877.7 kB 81.9 MB/s eta 0:00:00
Collecting typing-extensions==4.6.3 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 107))
  Downloading typing_extensions-4.6.3-py3-none-any.whl (31 kB)
Collecting zipp==3.15.0 (from -r /checkout/src/tools/tidy/config/requirements.txt (line 114))
  Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
Installing collected packages: zipp, typing-extensions, typed-ast, tomli, ruff, platformdirs, pathspec, packaging, mypy-extensions, click, importlib-metadata, black
Successfully installed black-23.3.0 click-8.1.3 importlib-metadata-6.7.0 mypy-extensions-1.0.0 packaging-23.1 pathspec-0.11.1 platformdirs-3.6.0 ruff-0.0.272 tomli-2.0.1 typed-ast-1.5.4 typing-extensions-4.6.3 zipp-3.15.0
some tidy checks failed
Build completed unsuccessfully in 0:00:49
  local time: Tue Oct 31 12:40:30 UTC 2023
  network time: Tue, 31 Oct 2023 12:40:30 GMT

bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 26, 2023
Add `Ord::cmp` for primitives as a `BinOp` in MIR

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches.  Those differences are irrelevant at the rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic.  Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical.  Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)?  But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?  And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers.  Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR.  The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR change it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.  Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues.  (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`
But first I should check that perf is ok with this
~~...and my true nemesis, tidy.~~
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 27, 2023
Add `Ord::cmp` for primitives as a `BinOp` in MIR

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches.  Those differences are irrelevant at the rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic.  Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical.  Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)?  But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?  And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers.  Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR.  The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR change it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.  Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues.  (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`
But first I should check that perf is ok with this
~~...and my true nemesis, tidy.~~
@saethlin
Copy link
Member Author

Nearly all the ideas in this PR have been better-implemented in GVN, which is for the best.

I'm closing this because while I think GVN still doesn't handle &*ref, I don't know how to fix the storage marker issue above and I don't intend to figure that out just to land what value remains in this PR.

@saethlin saethlin closed this Feb 14, 2024
@saethlin saethlin deleted the fold branch February 14, 2024 21:11
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 24, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches.  Those differences are irrelevant at the rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic.  Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical.  Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)?  But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?  And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers.  Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR.  The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR change it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.  Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues.  (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`
But first I should check that perf is ok with this
~~...and my true nemesis, tidy.~~
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 24, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches.  Those differences are irrelevant at the rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic.  Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical.  Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)?  But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?  And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers.  Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR.  The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR change it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.  Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues.  (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`
But first I should check that perf is ok with this
~~...and my true nemesis, tidy.~~
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 2, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

Update: most of this OP was written months ago.  See rust-lang#118310 (comment) below for where we got to recently that made it ready for review.

---

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches.  Those differences are irrelevant at the rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic.  Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical.  Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)?  But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?  And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers.  Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR.  The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR change it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.  Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues.  (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`
But first I should check that perf is ok with this
~~...and my true nemesis, tidy.~~
bjorn3 pushed a commit to bjorn3/rust that referenced this pull request Apr 5, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

Update: most of this OP was written months ago.  See rust-lang#118310 (comment) below for where we got to recently that made it ready for review.

---

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches.  Those differences are irrelevant at the rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic.  Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical.  Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)?  But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?  And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers.  Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR.  The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR change it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.  Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues.  (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`
But first I should check that perf is ok with this
~~...and my true nemesis, tidy.~~
GuillaumeGomez pushed a commit to GuillaumeGomez/rust that referenced this pull request Jul 10, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

Update: most of this OP was written months ago.  See rust-lang#118310 (comment) below for where we got to recently that made it ready for review.

---

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches.  Those differences are irrelevant at the rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic.  Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical.  Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)?  But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?  And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers.  Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR.  The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR change it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.  Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues.  (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`
But first I should check that perf is ok with this
~~...and my true nemesis, tidy.~~
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-mir-opt Area: MIR optimizations S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants