MIR: Bootstrap up to a half-hour longer with --enable-orbit #33111

alexcrichton · 2016-04-20T17:01:49Z

I just took a look at our list of bots and was surprised that the builder lagging behind was the auto-win-msvc-64-opt-mir builder. Investigating it looks like a MIR bootstrap (--enable-orbit) is taking about a half hour longer than a normal bootstrap. For example compare:

In the normal case a bootstrap took 58m and the MIR case took 1h22m. The difference is not always quite as drastic, but historically the normal builders are consistently faster than MIR. The discrepancy also exists on the Linux MIR builders, although it's not quite as large as the Windows machines (~15m slower).

cc @eddyb, @nagisa

The text was updated successfully, but these errors were encountered:

nagisa · 2016-04-20T17:11:50Z

Not surprising, honestly. We still generate worse LLVM-IR (lack of cheap optimising/simplifying transformations) compared to the regular trans, so despite our translation taking a small fraction of the time old trans takes, the LLVM recoups doubly.

Aatch · 2016-04-21T00:32:32Z

I checked against libsyntax, since it's a nice large crate that LLVM can chew through.

LLVM takes ~50% longer using the MIR trans path compared to the normal one to optimise. Translating to LLVM is actually about the same between them though.

It looks like the main issue is that we're simply generating more IR than before. Some of the LLVM passes have moved up in terms of relative time. One that has shot up quite a lot is SROA. I have a feeling that our handling of function arguments, especially fat pointers, is probably a big part of this.

sanxiyn · 2016-05-04T12:12:07Z

nickel seems to exhibit an extreme form of this. With nightly-2016-04-29 (which includes #32980), LLVM takes nearly 8x longer.

$ git clone https://github.com/nickel-org/nickel.rs.git
$ cd nickel.rs
$ git checkout 0.8.0
$ cargo build
$ cargo clean -p nickel
$ cargo rustc -- -Z time-passes | grep LLVM
time: 0.740; rss: 144MB LLVM passes
$ cargo clean -p nickel
$ cargo rustc -- -Z orbit -Z time-passes | grep LLVM
time: 5.738; rss: 146MB LLVM passes

MagaTailor · 2016-05-04T20:16:02Z

Apart from being slower, what's the reason for much larger code size it generates? (e.g. a newly bootstrapped lib directory weighs at 208M vs 180M normally)

nagisa · 2016-05-04T22:51:49Z

For the same reasons it takes longer to compile. I believe that fixing one will fix another.

Then again, MIR already generates more correct code than the old trans in some corner cases. It should cost some bytes too.

dotdash · 2016-05-10T18:08:09Z

@sanxiyn A major reason for the difference in nickel in particular is a C-like enum with about 800 variants and a derived PartialEq implementation. That leads to nested matches with about 800 arms each and a basic block for each match arm. So we get about 640k basic blocks in total. SimplifyCfg actually removes most of that going down to about 1.6k basic blocks, but before we enter trans, we break all critical edges and that blows it up to about the initial size again, making LLVM struggle.

Currently, to prepare for MIR trans, we break _all_ critical edges, although we only actually need to do this for edges originating from a call that gets translated to an invoke instruction in LLVM. This has the unfortunate effect of undoing a bunch of the things that SimplifyCfg has done. A particularly bad case arises when you have a C-like enum with N variants and a derived PartialEq implementation. In that case, the match on the (&lhs, &rhs) tuple gets translated into nested matches with N arms each and a basic block each, resulting in N² basic blocks. SimplifyCfg reduces that to roughly 2*N basic blocks, but breaking the critical edges means that we go back to N². In nickel.rs, there is such an enum with roughly N=800. So we get about 640K basic blocks or 2.5M lines of LLVM IR. LLVM takes a while to reduce that to the final "disr_a == disr_b". So before this patch, we had 2.5M lines of IR with 640K basic blocks, which took about about 3.6s in LLVM to get optimized and translated. After this patch, we get about 650K lines with about 1.6K basic blocks and spent a little less than 0.2s in LLVM. cc rust-lang#33111

Currently, all switches in MIR are exhausitive, meaning that we can have a lot of arms that all go to the same basic block, the extreme case being an if-let expression which results in just 2 possible cases, be might end up with hundreds of arms for large enums. To improve this situation and give LLVM less code to chew on, we can detect whether there's a pre-dominant target basic block in a switch and then promote this to be the default target, not translating the corresponding arms at all. In combination with rust-lang#33544 this makes unoptimized MIR trans of nickel.rs as fast as using old trans and greatly improves the times for optimized builds, which are only 30-40% slower instead of ~300%. cc rust-lang#33111

@Aatch

… r=Aatch Only break critical edges where actually needed Currently, to prepare for MIR trans, we break _all_ critical edges, although we only actually need to do this for edges originating from a call that gets translated to an invoke instruction in LLVM. This has the unfortunate effect of undoing a bunch of the things that SimplifyCfg has done. A particularly bad case arises when you have a C-like enum with N variants and a derived PartialEq implementation. In that case, the match on the (&lhs, &rhs) tuple gets translated into nested matches with N arms each and a basic block each, resulting in N² basic blocks. SimplifyCfg reduces that to roughly 2*N basic blocks, but breaking the critical edges means that we go back to N². In nickel.rs, there is such an enum with roughly N=800. So we get about 640K basic blocks or 2.5M lines of LLVM IR. LLVM takes a while to reduce that to the final "disr_a == disr_b". So before this patch, we had 2.5M lines of IR with 640K basic blocks, which took about about 3.6s in LLVM to get optimized and translated. After this patch, we get about 650K lines with about 1.6K basic blocks and spent a little less than 0.2s in LLVM. cc rust-lang#33111 r? @Aatch

[MIR trans] Optimize trans for biased switches Currently, all switches in MIR are exhausitive, meaning that we can have a lot of arms that all go to the same basic block, the extreme case being an if-let expression which results in just 2 possible cases, be might end up with hundreds of arms for large enums. To improve this situation and give LLVM less code to chew on, we can detect whether there's a pre-dominant target basic block in a switch and then promote this to be the default target, not translating the corresponding arms at all. In combination with rust-lang#33544 this makes unoptimized MIR trans of nickel.rs as fast as using old trans and greatly improves the times for optimized builds, which are only 30-40% slower instead of ~300%. cc rust-lang#33111

@Aatch

… r=Aatch Only break critical edges where actually needed Currently, to prepare for MIR trans, we break _all_ critical edges, although we only actually need to do this for edges originating from a call that gets translated to an invoke instruction in LLVM. This has the unfortunate effect of undoing a bunch of the things that SimplifyCfg has done. A particularly bad case arises when you have a C-like enum with N variants and a derived PartialEq implementation. In that case, the match on the (&lhs, &rhs) tuple gets translated into nested matches with N arms each and a basic block each, resulting in N² basic blocks. SimplifyCfg reduces that to roughly 2*N basic blocks, but breaking the critical edges means that we go back to N². In nickel.rs, there is such an enum with roughly N=800. So we get about 640K basic blocks or 2.5M lines of LLVM IR. LLVM takes a while to reduce that to the final "disr_a == disr_b". So before this patch, we had 2.5M lines of IR with 640K basic blocks, which took about about 3.6s in LLVM to get optimized and translated. After this patch, we get about 650K lines with about 1.6K basic blocks and spent a little less than 0.2s in LLVM. cc rust-lang#33111 r? @Aatch

[MIR trans] Optimize trans for biased switches Currently, all switches in MIR are exhausitive, meaning that we can have a lot of arms that all go to the same basic block, the extreme case being an if-let expression which results in just 2 possible cases, be might end up with hundreds of arms for large enums. To improve this situation and give LLVM less code to chew on, we can detect whether there's a pre-dominant target basic block in a switch and then promote this to be the default target, not translating the corresponding arms at all. In combination with rust-lang#33544 this makes unoptimized MIR trans of nickel.rs as fast as using old trans and greatly improves the times for optimized builds, which are only 30-40% slower instead of ~300%. cc rust-lang#33111

@Aatch

… r=Aatch Only break critical edges where actually needed Currently, to prepare for MIR trans, we break _all_ critical edges, although we only actually need to do this for edges originating from a call that gets translated to an invoke instruction in LLVM. This has the unfortunate effect of undoing a bunch of the things that SimplifyCfg has done. A particularly bad case arises when you have a C-like enum with N variants and a derived PartialEq implementation. In that case, the match on the (&lhs, &rhs) tuple gets translated into nested matches with N arms each and a basic block each, resulting in N² basic blocks. SimplifyCfg reduces that to roughly 2*N basic blocks, but breaking the critical edges means that we go back to N². In nickel.rs, there is such an enum with roughly N=800. So we get about 640K basic blocks or 2.5M lines of LLVM IR. LLVM takes a while to reduce that to the final "disr_a == disr_b". So before this patch, we had 2.5M lines of IR with 640K basic blocks, which took about about 3.6s in LLVM to get optimized and translated. After this patch, we get about 650K lines with about 1.6K basic blocks and spent a little less than 0.2s in LLVM. cc rust-lang#33111 r? @Aatch

[MIR trans] Optimize trans for biased switches Currently, all switches in MIR are exhausitive, meaning that we can have a lot of arms that all go to the same basic block, the extreme case being an if-let expression which results in just 2 possible cases, be might end up with hundreds of arms for large enums. To improve this situation and give LLVM less code to chew on, we can detect whether there's a pre-dominant target basic block in a switch and then promote this to be the default target, not translating the corresponding arms at all. In combination with rust-lang#33544 this makes unoptimized MIR trans of nickel.rs as fast as using old trans and greatly improves the times for optimized builds, which are only 30-40% slower instead of ~300%. cc rust-lang#33111

@Aatch

… r=Aatch Only break critical edges where actually needed Currently, to prepare for MIR trans, we break _all_ critical edges, although we only actually need to do this for edges originating from a call that gets translated to an invoke instruction in LLVM. This has the unfortunate effect of undoing a bunch of the things that SimplifyCfg has done. A particularly bad case arises when you have a C-like enum with N variants and a derived PartialEq implementation. In that case, the match on the (&lhs, &rhs) tuple gets translated into nested matches with N arms each and a basic block each, resulting in N² basic blocks. SimplifyCfg reduces that to roughly 2*N basic blocks, but breaking the critical edges means that we go back to N². In nickel.rs, there is such an enum with roughly N=800. So we get about 640K basic blocks or 2.5M lines of LLVM IR. LLVM takes a while to reduce that to the final "disr_a == disr_b". So before this patch, we had 2.5M lines of IR with 640K basic blocks, which took about about 3.6s in LLVM to get optimized and translated. After this patch, we get about 650K lines with about 1.6K basic blocks and spent a little less than 0.2s in LLVM. cc rust-lang#33111 r? @Aatch

[MIR trans] Optimize trans for biased switches Currently, all switches in MIR are exhausitive, meaning that we can have a lot of arms that all go to the same basic block, the extreme case being an if-let expression which results in just 2 possible cases, be might end up with hundreds of arms for large enums. To improve this situation and give LLVM less code to chew on, we can detect whether there's a pre-dominant target basic block in a switch and then promote this to be the default target, not translating the corresponding arms at all. In combination with rust-lang#33544 this makes unoptimized MIR trans of nickel.rs as fast as using old trans and greatly improves the times for optimized builds, which are only 30-40% slower instead of ~300%. cc rust-lang#33111

eddyb · 2016-06-08T07:19:55Z

Looking at the latest builds, MIR takes 7 minutes less on x64 linux and 45 minutes less on x64 MSVC.

alexcrichton added the A-mir Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html label Apr 20, 2016

nagisa added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Apr 20, 2016

sanxiyn added I-compiletime Issue: Problems and improvements with respect to compile times. and removed I-slow Issue: Problems and improvements with respect to performance of generated code. labels Apr 21, 2016

dotdash mentioned this issue May 10, 2016

Only break critical edges where actually needed #33544

Merged

dotdash mentioned this issue May 11, 2016

[MIR trans] Optimize trans for biased switches #33566

Merged

eddyb closed this as completed Jun 8, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MIR: Bootstrap up to a half-hour longer with --enable-orbit #33111

MIR: Bootstrap up to a half-hour longer with --enable-orbit #33111

alexcrichton commented Apr 20, 2016

nagisa commented Apr 20, 2016

Aatch commented Apr 21, 2016

sanxiyn commented May 4, 2016 •

edited

Loading

MagaTailor commented May 4, 2016

nagisa commented May 4, 2016 •

edited

Loading

dotdash commented May 10, 2016

eddyb commented Jun 8, 2016

MIR: Bootstrap up to a half-hour longer with --enable-orbit #33111

MIR: Bootstrap up to a half-hour longer with --enable-orbit #33111

Comments

alexcrichton commented Apr 20, 2016

nagisa commented Apr 20, 2016

Aatch commented Apr 21, 2016

sanxiyn commented May 4, 2016 • edited Loading

MagaTailor commented May 4, 2016

nagisa commented May 4, 2016 • edited Loading

dotdash commented May 10, 2016

eddyb commented Jun 8, 2016

sanxiyn commented May 4, 2016 •

edited

Loading

nagisa commented May 4, 2016 •

edited

Loading