Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce compile time of variadic functions #243

Merged
merged 6 commits into from
Dec 9, 2015
Merged

Conversation

dean0x7d
Copy link
Contributor

@dean0x7d dean0x7d commented Dec 6, 2015

The proposed PR would reduce the compile time of variadic templates like fmt::print and fmt::format. The improvement on bloat-test is small but measurable (reduced from 35 to 33 seconds on my system), but that doesn't tell the whole story.

In order to get a more detailed measurement I modified bloat-test to generate tests for any number of arguments. The resulting variadic-test can be found here. When invoked with make variadic-test ARGS="1 25 10" it will generate fmt::print statements with 1 <= x < 25 arguments and it will compile each with 10 translation units.

Results

The following figures show the results for Clang and GCC, before and after the proposed PR changes. The data is gathered from make variadic-test ARGS="1 25 10". Clang 7 was tested on OS X 10.11, while GCC 5.2 was on Antergos (Arch).

Compile time

appleclang_time

gcc5_time

Binary size

appleclang_size

gcc5_size

The Good

Clang benefits the most. With the current implementation there is a weird spike between 8 and 12 arguments in release mode. This spike gets ironed out with the proposed changes and there is a nice overall speed up.

GCC also sees some strange compile time spikes at 20 and 22 arguments - these show up consistently for me. The proposed changes clear that up and there is an overall speedup, although a bit hard to see in the figure because the two spikes mess with the scale.

There is a small binary size improvement for GCC with > 16 arguments in release mode.

Debug mode sees benefits across the board. Not that important, but nice.

The Bad

Both GCC and Clang see a binary size regression at exactly 16 arguments (clearly visible in the figures). This is because the proposed implementation must generate all unpacked types for >=16 arguments, while currently 16 is a special case with part packed and part unpacked types.

The Ugly

The proposed implementation replaces the ArgArray type from Value[]/Arg[] to MakeValue[]/MakeArg[]. When the argument array is passed, it's implicitly converted in the chain MakeValue[] -> MakeValue* -> Value*, and Value* is then used as Value[]. This works fine because MakeValue and Value are equivalent, standard layout types, but I feel bad writing it...
If Value and MakeValue change so they are not equivalent, the code will still compile, but it will not work correctly. I added tests to trigger in this case.

Alternatives

I attempted a few alternatives for the ugly bit, but they didn't fare too well. This

MakeValue<Char>[] value = {args...};

could be avoided by writing this

Value[] = {make_arg<Char>(args, is_packed_tag{})...};

where is_packed_tag would select if a Value or Arg should be created. Unfortunately, this results in a huge increase in binary size on GCC, which refuses to inline more than 6 or 7 arguments.

The array could be placed into a struct:

template <unsigned N>
struct ArgArray {
    Value data[N];

    template <typename... Args>
    ArgArray(const Args & ... args) : data{make_arg<Char>(args, is_packed_tag{})...} {}
}

This restores proper inlining on GCC, but the compile time becomes much worse.

Epilogue

This started because I noticed that compile times were slower on Clang than on GCC which is not something I usually see. I may have fallen into a rabbit hole, but it's been interesting. There are tradeoffs as outlined above, so I'd understand if the proposed changes aren't completely desirable, but I hope at least some part will be useful.

@Spartan322
Copy link

It appears that the defects on Clang and GCC in Release Binary Size may be worth it for the compile time, especially with the erratic release time of Clang and GCC (according to the benchmark graphs, but how many tests were performed to get those graphs?)

@dean0x7d
Copy link
Contributor Author

dean0x7d commented Dec 6, 2015

@Spartan322 In general, each data point is measured 3 times, but for those outliers I did a lot of extra passes and different configurations and it always returned the same, at least on my system. The benchmark is here if anyone wants to try it. The Clang results seem reasonable to me: it probably switches to a different optimization path after certain threshold. But GCC really is erratic.

@dean0x7d
Copy link
Contributor Author

dean0x7d commented Dec 6, 2015

Extra Benchmarks

I think this might be interesting. Here is the data for all the formatting libraries. C++ Format does really well:

  • Compile time is close to IOStreams for 1 to 6 arguments, matches it for 7-9 and outpaces it after that.
  • Up to 16 arguments, binary size is only a tiny bit above printf.

all_gcc5_time

all_gcc5_size

@vitaut
Copy link
Contributor

vitaut commented Dec 6, 2015

Thanks for such a detailed PR. This looks like a nice improvement and I think that regression in binary size for 16 arguments is not a problem because it is not a very common case. I'll probably have more comments and/or questions after I look in more details at the implementation and do some tests.

@vitaut
Copy link
Contributor

vitaut commented Dec 7, 2015

@dean0x7d I looked a bit into the implementation and have a few questions:

  1. Do I understand correctly that the key to the compile time improvement is replacing template recursion with array initialization in dean0x7d@f942856?
  2. What is the reason for switching from inheritance to composition in dean0x7d@8a9ad00?

@dean0x7d
Copy link
Contributor Author

dean0x7d commented Dec 7, 2015

  1. Yes, removing template recursion helps a lot because it reduces the number of template instantiations and the code generated before optimizations. The previous commit (1383efb) also helps by removing the complete set of set_types template instantiations.

  2. I initially wanted to avoid adding MakeArg by using aggregate construction with the POD Arg. But aggregate construction wasn't possible with inheritance so I made the switch to composition. Later, I ended up needing MakeArg after all, but I kept the composition since I already had it done and I thought it was nicer to write

    arg.value = x;

    than

    Value &value = arg;
    value = x;

    As far as I can see, the only downside with composition is that arg.int_value becomes arg.value.int_value.

@vitaut vitaut merged commit 98ed65d into fmtlib:master Dec 9, 2015
@vitaut
Copy link
Contributor

vitaut commented Dec 9, 2015

Accidentally merged this PR into master, but "undid" it by moving to the variadic branch for now. Unfortunately I don't see a way to reopen the merged PR. Sorry for the mess, I think I urgently need more coffee %).

@dean0x7d
Copy link
Contributor Author

dean0x7d commented Dec 9, 2015

No worries. Let me know if I should open a new PR for any changes.

@vitaut
Copy link
Contributor

vitaut commented Dec 10, 2015

I've done a few tests myself using gcc 4.8.4 confirming the improvement in compile time for large number of arguments:

optimized

There are no weird spikes at 20 and 22 in this version of gcc.

Unfortunately the results for smaller number of arguments which I care more about are somewhat erratic:

small

and vary noticeably between the runs with no clear winner.

Overall I think it's a nice improvement and will be happy to merge this in. But I'd prefer to keep the inheritance of Arg from Value if it is not essential for this implementation. This will avoid specifying .value too many times. Is it possible to revert this change in some clean way?

@dean0x7d
Copy link
Contributor Author

I can do an interactive rebase to get rid of the composition (and also squash that alias template fix). I'll submit a new PR.

I have an older Ubuntu with GCC 4.8 on a virtual machine. I'll try to see what's up with that strange spike at 5 arguments.

It's entirely possible that the benchmark is a bit too artificial (consists only of fmt::print calls and nothing else) which causes some strange behavior with the compiler optimization heuristics. Following some benchmarking advice, the variadic-test has a use_clobber flag, but it's currently disabled. With use_clobber = True an asm clobber is placed between each fmt::print statement. The idea is that this will make the code more realistic to the compiler (it will see more than just print statements with invariant arguments). Testing it on GCC 5.2, it did help both the 'current' and 'proposed' cases, but it helped 'proposed' more. I wasn't really sure about it, so I left it off. But just FYI if you want to try it.

@vitaut
Copy link
Contributor

vitaut commented Dec 10, 2015

I can do an interactive rebase to get rid of the composition (and also squash that alias template fix). I'll submit a new PR.

Sounds good.

I have an older Ubuntu with GCC 4.8 on a virtual machine. I'll try to see what's up with that strange spike at 5 arguments.

Don't worry about it, it's just an artifact of this particular run. I've rerun the test on 5 arguments a few times and there was no spike in other runs.

Thanks for the info about the use_clobber flag, I'll check it out.

@dean0x7d
Copy link
Contributor Author

I reworked this to keep the inheritance, but there are a couple of issues I didn't realize before I started. The modified variadic2 branch is here.

First off, the tests that I added earlier are now failing, specifically:

6: Value of: std::is_pod<Arg>::value
6:   Actual: false
6: Expected: true

Basically, Arg was never strictly a POD type because it inherits from a base class. It only becomes a strict POD with composition. This also means that the existing comment right above Arg is not quite right:

// A formatting argument. It is a POD type to allow storage in
// internal::MemoryBuffer.
struct Arg : Value {
  Type type;
};

By extension MakeArg is not a standard layout type for the same reason:

6: Value of: std::is_standard_layout<MakeArg>::value
6:   Actual: false
6: Expected: true

Without these two properties, Arg may be equivalent to MakeArg on all major compilers, but it's not really guaranteed by the language and may break. The Arg : Value inheritance definitely can't stay.

If the inheritance is mainly to avoid repeating value, maybe the fields could be renamed to something like arg.value.int_?

@vitaut
Copy link
Contributor

vitaut commented Dec 11, 2015

Good point about POD. I think the correct term here is trivially copyable although I need to check if Arg satisfies all the requirements.

If the inheritance is mainly to avoid repeating value, maybe the fields could be renamed to something like arg.value.int_?

That's an option, but ideally it would be nice to find a way not to rely on the "ugly" chain of conversions.

I did a quick and dirty implementation in https://github.com/cppformat/cppformat/tree/variadic-test to test an alternative and got similar compile time improvement on gcc:

variadic-test

but haven't tested clang yet.

The compile code size for large (>16) number of arguments is not as good as in your implementation, but better than in the current one.

@dean0x7d
Copy link
Contributor Author

I experimented a bit more. The results are below:

  • baseline: current code on master
  • v1: the implementation in this PR
  • v2: Make typedef based on your implementation
  • v3: make function. Similar to v2 but with a function instead of a typedef. Keeps the inheritance and there is no conversion chain.

appleclang_time

All 3 versions solve the 8-12 arg slowdown on Clang, but v2 has a negative effect for >=16 args. I had to cut off the last point (42 seconds) to keep a reasonable scale. Note that all versions a have negligible effect (< 5%) for 4 or less arguments, but it's pretty nice for 6-12.

appleclang_size

Versions 2 and 3 see a small regression (~2.5 %) for >= 16 args, but I don't think it's too bad.

gcc5_time

GCC doesn't have any trouble with v2 like Clang did.

gcc5_size

There's a small regression in binary size for v3. As far as I can tell from the assembly, this is because the inlining behavior changes for certain arg counts. Actually, v3 is close to what I initially did but dropped it because of this binary size regression.

Let me know what you think.

@vitaut
Copy link
Contributor

vitaut commented Dec 15, 2015

Thanks for another detailed analysis. I'd go with v3 because it solves the slowdown on Clang and doesn't rely on the chain of conversions. Since the binary size regression is not too big and only affects a large number of arguments (>= 9), I wouldn't worry about it.

I think you are right about inlining because making ArgArray<N, false>::make always_inline significantly improves binary sizes for >= 16 arguments on gcc (sometimes almost 2x reduction in size). Unfortunately doing the same for ArgArray<N, true>::make makes things worse.

There is also a question of API safety. In the current version one has to pass args... once to make_arg_list which handles both type and ArgList construction:

    typename fmt::internal::ArgArray<sizeof...(Args)>::Type array;
    call(FMT_FOR_EACH(FMT_GET_ARG_NAME, __VA_ARGS__),
      fmt::internal::make_arg_list<
        fmt::BasicFormatter<Char> >(array, args...));

while in the new version types and ArgList are handled separately making it possible to pass different
arguments by mistake:

    typename ArgArray::Type array{
      ArgArray::template make<fmt::BasicFormatter<Char> >(args)...};
    call(FMT_FOR_EACH(FMT_GET_ARG_NAME, __VA_ARGS__),
      fmt::ArgList(fmt::internal::make_type(args...), array));

But hopefully this can be addressed separately.

Anyway, if you agree with using v3, please submit a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants