
Discard LLVM modules earlier when performing ThinLTO #56487

Merged Dec 7, 2018 (6 commits)

Conversation

nikic (Contributor) commented Dec 4, 2018

Currently ThinLTO is performed by first compiling all modules (and keeping them in memory), and then serializing them into ThinLTO buffers in a separate, synchronized step. Modules are later read back from ThinLTO buffers when running the ThinLTO optimization pipeline.

We can also find the following comment in lto.rs:

    // FIXME: right now, like with fat LTO, we serialize all in-memory
    //        modules before working with them and ThinLTO. We really
    //        shouldn't do this, however, and instead figure out how to
    //        extract a summary from an in-memory module and then merge that
    //        into the global index. It turns out that this loop is by far
    //        the most expensive portion of this small bit of global
    //        analysis!

I don't think that what is suggested here is the right approach: One of the primary benefits of using ThinLTO over ordinary LTO is that it's not necessary to keep all the modules (merged or not) in memory for the duration of the linking step.

However, we currently don't really make use of this (at least for crate-local ThinLTO), because we keep all modules in memory until the start of the LTO step. This PR changes the implementation to instead perform the serialization into ThinLTO buffers directly after the initial optimization step.
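
To make the before/after concrete, here is a minimal sketch in Rust. The type and function names are loosely modeled on rustc_codegen_llvm, but all of the bodies are illustrative stand-ins, not the actual compiler code:

    // Stand-in for an in-memory LLVM module produced by codegen.
    struct ModuleCodegen {
        name: String,
        bitcode: Vec<u8>, // stands in for the real llvm::Module handle
    }

    // Stand-in for a serialized ThinLTO buffer; the real one is produced
    // by LLVM and includes the module's ThinLTO summary.
    struct ThinBuffer {
        data: Vec<u8>,
    }

    impl ThinBuffer {
        fn new(module: &ModuleCodegen) -> ThinBuffer {
            ThinBuffer { data: module.bitcode.clone() }
        }
    }

    // Before this PR: every optimized module stays alive until the LTO
    // step, where all of them are serialized in one synchronized pass.
    fn start_thin_lto_before(optimized: Vec<ModuleCodegen>) -> Vec<ThinBuffer> {
        optimized.iter().map(ThinBuffer::new).collect()
    }

    // After this PR: each module is serialized as soon as its initial
    // optimization finishes, so the module can be discarded immediately.
    fn optimize_after(module: ModuleCodegen) -> ThinBuffer {
        // run_pass_manager(&module); // initial optimization pipeline
        let buffer = ThinBuffer::new(&module);
        drop(module); // the LLVM module is dropped here, not at LTO time
        buffer
    }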

Most of the changes here are plumbing to separate out fat and thin LTO handling in write.rs, as the two now use different intermediate artifacts: for fat LTO these are in-memory modules, for thin LTO they are ThinLTO buffers.
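
Sketched with the same stand-in types as above (the actual names in write.rs may differ), the split in intermediate artifacts might look like this:

    // Hypothetical: what a codegen work item hands to the LTO step.
    enum LtoWorkProduct {
        // Fat LTO merges whole modules, so it needs them in memory.
        Fat(ModuleCodegen),
        // Thin LTO only needs the module's name and serialized buffer.
        Thin(String, ThinBuffer),
    }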

r? @alexcrichton

rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label Dec 4, 2018
alexcrichton (Member)

@bors: try

Whoa this is a great idea! This seems like the correct strategy for incremental too, although I forget if that takes different paths.

This also makes me think that for fat LTO we should link modules in ASAP instead of synchronizing and then linking, as that would ideally allow a bit of pipelining. In any case, that's a patch for another time!

I'll take a closer look later, but I'm curious about the perf impact here too; in theory it should both make builds (slightly) faster and decrease peak memory usage.

bors (Contributor) commented Dec 4, 2018

⌛ Trying commit 94131ebd21ed6d64acdeae6d4766c5669414c488 with merge 360659bd585b2529141e8fc3228fc3bf2e1fa6d1...

bors (Contributor) commented Dec 4, 2018

☀️ Test successful - status-travis

alexcrichton (Member)

@rust-timer build 360659bd585b2529141e8fc3228fc3bf2e1fa6d1

rust-timer (Collaborator)

Success: Queued 360659bd585b2529141e8fc3228fc3bf2e1fa6d1 with parent 0c999ed, comparison URL.

rust-timer (Collaborator)

Finished benchmarking try commit 360659bd585b2529141e8fc3228fc3bf2e1fa6d1

nikic (Contributor, Author) commented Dec 4, 2018

The max-rss results basically look like noise to me. The wall-time results seem to be a minor win for opt builds (a few percent for clean/baseline-incremental).

I guess that means that LLVM memory usage is dominated by rustc memory usage, at least for the build types used here (opt without debuginfo), so it has no impact on max-rss. Unfortunately, I was not able to get massif working with rustc; it always segfaults early on :(

nikic (Contributor, Author) commented Dec 4, 2018

I've got massif to run (one needs to directly call the jemalloc-free rustc, not rustc +stage2). Here's a comparison of two runs for a crate where I see a minor max-rss reduction of about 3% for a release/debuginfo build. Note that the numbers vary a lot between runs, so exact values are not meaningful:

Before:

    MB
217.0^                         #
     |                       ::#                       :::
     |                  :@@@:::#                 :::::::::
     |               @@@:@ @:::#               ::::: ::::::
     |              @@@ :@ @:::#              :::::: :::::::  ::::
     |         :::::@@@ :@ @:::#              :::::: ::::::::::::: :@: : :
     |        ::::: @@@ :@ @:::#      ::     ::::::: :::::::::::::::@:::::
     |      ::::::: @@@ :@ @:::#::    :      ::::::: :::::::::::::::@::::::
     |      ::::::: @@@ :@ @:::#: :::::      ::::::: :::::::::::::::@::::::
     |    ::::::::: @@@ :@ @:::#: ::: :     :::::::: :::::::::::::::@::::::
     |    : ::::::: @@@ :@ @:::#: ::: :    @:::::::: :::::::::::::::@::::::
     |    : ::::::: @@@ :@ @:::#: ::: :    @:::::::: :::::::::::::::@::::::@
     |   @: ::::::: @@@ :@ @:::#: ::: :   :@:::::::: :::::::::::::::@::::::@
     |   @: ::::::: @@@ :@ @:::#: ::: :  ::@:::::::: :::::::::::::::@::::::@
     |   @: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@:
     |  @@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
     |  @@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
     |::@@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
     |: @@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
     |: @@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   133.8

After:

    MB
217.5^                                                    ##
     |                                                @@::#                   
     |                   @                           :@ : # ::::              
     |           :::@@:::@:::                     ::::@ : # :::               
     |         :::: @@:: @:: :                ::::: ::@ : # ::: @  
     |       ::: :: @@:: @:: ::             ::::: : ::@ : # ::: @:::
     |       : : :: @@:: @:: ::  @         :: ::: : ::@ : # ::: @:: :@@       
     |    :::: : :: @@:: @:: ::::@         :: ::: : ::@ : # ::: @:: :@     
     |    :: : : :: @@:: @:: ::: @        ::: ::: : ::@ : # ::: @:: :@ :::::
     |    :: : : :: @@:: @:: ::: @        ::: ::: : ::@ : # ::: @:: :@ ::::   
     |    :: : : :: @@:: @:: ::: @        ::: ::: : ::@ : # ::: @:: :@ ::::   
     |    :: : : :: @@:: @:: ::: @      ::::: ::: : ::@ : # ::: @:: :@ ::::
     |    :: : : :: @@:: @:: ::: @      : ::: ::: : ::@ : # ::: @:: :@ :::: 
     |    :: : : :: @@:: @:: ::: @::    : ::: ::: : ::@ : # ::: @:: :@ :::: :
     |  :::: : : :: @@:: @:: ::: @: ::  : ::: ::: : ::@ : # ::: @:: :@ :::: : 
     |  : :: : : :: @@:: @:: ::: @: : ::: ::: ::: : ::@ : # ::: @:: :@ :::: ::
     | @: :: : : :: @@:: @:: ::: @: : : : ::: ::: : ::@ : # ::: @:: :@ :::: ::
     | @: :: : : :: @@:: @:: ::: @: : : : ::: ::: : ::@ : # ::: @:: :@ :::: ::
     | @: :: : : :: @@:: @:: ::: @: : : : ::: ::: : ::@ : # ::: @:: :@ :::: ::
     | @: :: : : :: @@:: @:: ::: @: : : : ::: ::: : ::@ : # ::: @:: :@ :::: ::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   133.8

The first hump is peak optimization, the second hump is peak LTO. So in this case peak memory usage is moved from the optimization stage to the LTO stage, but it ultimately does not make much of a difference.

alexcrichton (Member)

Ah, bummer! In any case this looks good to me, and the graphs do confirm a shift in peaks, so this seems good to land.

r=me with a rebase!

Commits pushed to the PR:

- Instead of only determining whether some form of LTO is necessary, determine whether thin, fat or no LTO is necessary. I've rewritten the conditions in a way that I think is more obvious, i.e. specified LTO type + additional preconditions (see the sketch below).
- These are going to have different intermediate artifacts, so create separate codepaths for them.
- Fat LTO merges into one module, so only return one module.
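
As a rough illustration of "specified LTO type + additional preconditions" from the first commit, the decision could be sketched as follows (hypothetical names, with a single placeholder precondition standing in for the real checks in write.rs):

    #[derive(Clone, Copy)]
    enum Lto { No, ThinLocal, Thin, Fat }

    enum ComputedLtoType { No, Thin, Fat }

    // Hypothetical: map the requested LTO mode plus a precondition to
    // the kind of LTO this module actually needs.
    fn compute_lto_type(requested: Lto, is_allocator_module: bool) -> ComputedLtoType {
        match requested {
            Lto::Fat if !is_allocator_module => ComputedLtoType::Fat,
            Lto::Thin | Lto::ThinLocal if !is_allocator_module => ComputedLtoType::Thin,
            _ => ComputedLtoType::No,
        }
    }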
nikic (Contributor, Author) commented Dec 4, 2018

@bors r=alexcrichton

bors (Contributor) commented Dec 4, 2018

📌 Commit 96cc381285c8b1d83aea776282232022ed949fd7 has been approved by alexcrichton

bors added the S-waiting-on-bors (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) label and removed the S-waiting-on-review label Dec 4, 2018

Another commit pushed to the PR:

- Instead of keeping all modules in memory until thin LTO and only serializing them then, serialize the module immediately after it finishes optimizing.
nikic (Contributor, Author) commented Dec 4, 2018

@bors r=alexcrichton

Rebase mistake with submodules...

bors (Contributor) commented Dec 4, 2018

📌 Commit 8128d0d has been approved by alexcrichton

bors (Contributor) commented Dec 7, 2018

⌛ Testing commit 8128d0d with merge f504d3f...

bors added a commit that referenced this pull request Dec 7, 2018