Reduce compile time of variadic functions #243

dean0x7d · 2015-12-06T17:04:25Z

The proposed PR would reduce the compile time of variadic templates like fmt::print and fmt::format. The improvement on bloat-test is small but measurable (reduced from 35 to 33 seconds on my system), but that doesn't tell the whole story.

In order to get a more detailed measurement I modified bloat-test to generate tests for any number of arguments. The resulting variadic-test can be found here. When invoked with make variadic-test ARGS="1 25 10" it will generate fmt::print statements with 1 <= x < 25 arguments and it will compile each with 10 translation units.

Results

The following figures show the results for Clang and GCC, before and after the proposed PR changes. The data is gathered from make variadic-test ARGS="1 25 10". Clang 7 was tested on OS X 10.11, while GCC 5.2 was on Antergos (Arch).

Compile time

Binary size

The Good

Clang benefits the most. With the current implementation there is a weird spike between 8 and 12 arguments in release mode. This spike gets ironed out with the proposed changes and there is a nice overall speed up.

GCC also sees some strange compile time spikes at 20 and 22 arguments - these show up consistently for me. The proposed changes clear that up and there is an overall speedup, although a bit hard to see in the figure because the two spikes mess with the scale.

There is a small binary size improvement for GCC with > 16 arguments in release mode.

Debug mode sees benefits across the board. Not that important, but nice.

The Bad

Both GCC and Clang see a binary size regression at exactly 16 arguments (clearly visible in the figures). This is because the proposed implementation must generate all unpacked types for >=16 arguments, while currently 16 is a special case with part packed and part unpacked types.

The Ugly

The proposed implementation replaces the ArgArray type from Value[]/Arg[] to MakeValue[]/MakeArg[]. When the argument array is passed, it's implicitly converted in the chain MakeValue[] -> MakeValue* -> Value*, and Value* is then used as Value[]. This works fine because MakeValue and Value are equivalent, standard layout types, but I feel bad writing it...
If Value and MakeValue change so they are not equivalent, the code will still compile, but it will not work correctly. I added tests to trigger in this case.

Alternatives

I attempted a few alternatives for the ugly bit, but they didn't fare too well. This

MakeValue<Char>[] value = {args...};

could be avoided by writing this

Value[] = {make_arg<Char>(args, is_packed_tag{})...};

where is_packed_tag would select if a Value or Arg should be created. Unfortunately, this results in a huge increase in binary size on GCC, which refuses to inline more than 6 or 7 arguments.

The array could be placed into a struct:

template <unsigned N>
struct ArgArray {
    Value data[N];

    template <typename... Args>
    ArgArray(const Args & ... args) : data{make_arg<Char>(args, is_packed_tag{})...} {}
}

This restores proper inlining on GCC, but the compile time becomes much worse.

Epilogue

This started because I noticed that compile times were slower on Clang than on GCC which is not something I usually see. I may have fallen into a rabbit hole, but it's been interesting. There are tradeoffs as outlined above, so I'd understand if the proposed changes aren't completely desirable, but I hope at least some part will be useful.

Spartan322 · 2015-12-06T17:25:22Z

It appears that the defects on Clang and GCC in Release Binary Size may be worth it for the compile time, especially with the erratic release time of Clang and GCC (according to the benchmark graphs, but how many tests were performed to get those graphs?)

dean0x7d · 2015-12-06T17:46:02Z

@Spartan322 In general, each data point is measured 3 times, but for those outliers I did a lot of extra passes and different configurations and it always returned the same, at least on my system. The benchmark is here if anyone wants to try it. The Clang results seem reasonable to me: it probably switches to a different optimization path after certain threshold. But GCC really is erratic.

dean0x7d · 2015-12-06T18:32:47Z

Extra Benchmarks

I think this might be interesting. Here is the data for all the formatting libraries. C++ Format does really well:

Compile time is close to IOStreams for 1 to 6 arguments, matches it for 7-9 and outpaces it after that.
Up to 16 arguments, binary size is only a tiny bit above printf.

vitaut · 2015-12-06T22:26:04Z

Thanks for such a detailed PR. This looks like a nice improvement and I think that regression in binary size for 16 arguments is not a problem because it is not a very common case. I'll probably have more comments and/or questions after I look in more details at the implementation and do some tests.

ArgArray construction depends on this behavior. Note that this is only relevant in the C++11 code path.

Alias templates may not be supported even if variadic templates are.

vitaut · 2015-12-07T17:05:38Z

@dean0x7d I looked a bit into the implementation and have a few questions:

Do I understand correctly that the key to the compile time improvement is replacing template recursion with array initialization in dean0x7d@f942856?
What is the reason for switching from inheritance to composition in dean0x7d@8a9ad00?

dean0x7d · 2015-12-07T18:34:03Z

Yes, removing template recursion helps a lot because it reduces the number of template instantiations and the code generated before optimizations. The previous commit (1383efb) also helps by removing the complete set of set_types template instantiations.
I initially wanted to avoid adding MakeArg by using aggregate construction with the POD Arg. But aggregate construction wasn't possible with inheritance so I made the switch to composition. Later, I ended up needing MakeArg after all, but I kept the composition since I already had it done and I thought it was nicer to write
```
arg.value = x;
```
than
```
Value &value = arg;
value = x;
```
As far as I can see, the only downside with composition is that arg.int_value becomes arg.value.int_value.

vitaut · 2015-12-09T17:06:26Z

Accidentally merged this PR into master, but "undid" it by moving to the variadic branch for now. Unfortunately I don't see a way to reopen the merged PR. Sorry for the mess, I think I urgently need more coffee %).

dean0x7d · 2015-12-09T17:40:24Z

No worries. Let me know if I should open a new PR for any changes.

vitaut · 2015-12-10T16:56:48Z

I've done a few tests myself using gcc 4.8.4 confirming the improvement in compile time for large number of arguments:

There are no weird spikes at 20 and 22 in this version of gcc.

Unfortunately the results for smaller number of arguments which I care more about are somewhat erratic:

and vary noticeably between the runs with no clear winner.

Overall I think it's a nice improvement and will be happy to merge this in. But I'd prefer to keep the inheritance of Arg from Value if it is not essential for this implementation. This will avoid specifying .value too many times. Is it possible to revert this change in some clean way?

dean0x7d · 2015-12-10T18:07:02Z

I can do an interactive rebase to get rid of the composition (and also squash that alias template fix). I'll submit a new PR.

I have an older Ubuntu with GCC 4.8 on a virtual machine. I'll try to see what's up with that strange spike at 5 arguments.

It's entirely possible that the benchmark is a bit too artificial (consists only of fmt::print calls and nothing else) which causes some strange behavior with the compiler optimization heuristics. Following some benchmarking advice, the variadic-test has a use_clobber flag, but it's currently disabled. With use_clobber = True an asm clobber is placed between each fmt::print statement. The idea is that this will make the code more realistic to the compiler (it will see more than just print statements with invariant arguments). Testing it on GCC 5.2, it did help both the 'current' and 'proposed' cases, but it helped 'proposed' more. I wasn't really sure about it, so I left it off. But just FYI if you want to try it.

vitaut · 2015-12-10T18:21:34Z

I can do an interactive rebase to get rid of the composition (and also squash that alias template fix). I'll submit a new PR.

Sounds good.

I have an older Ubuntu with GCC 4.8 on a virtual machine. I'll try to see what's up with that strange spike at 5 arguments.

Don't worry about it, it's just an artifact of this particular run. I've rerun the test on 5 arguments a few times and there was no spike in other runs.

Thanks for the info about the use_clobber flag, I'll check it out.

dean0x7d · 2015-12-11T00:33:53Z

I reworked this to keep the inheritance, but there are a couple of issues I didn't realize before I started. The modified variadic2 branch is here.

First off, the tests that I added earlier are now failing, specifically:

6: Value of: std::is_pod<Arg>::value
6:   Actual: false
6: Expected: true

Basically, Arg was never strictly a POD type because it inherits from a base class. It only becomes a strict POD with composition. This also means that the existing comment right above Arg is not quite right:

// A formatting argument. It is a POD type to allow storage in
// internal::MemoryBuffer.
struct Arg : Value {
  Type type;
};

By extension MakeArg is not a standard layout type for the same reason:

6: Value of: std::is_standard_layout<MakeArg>::value
6:   Actual: false
6: Expected: true

Without these two properties, Arg may be equivalent to MakeArg on all major compilers, but it's not really guaranteed by the language and may break. The Arg : Value inheritance definitely can't stay.

If the inheritance is mainly to avoid repeating value, maybe the fields could be renamed to something like arg.value.int_?

vitaut · 2015-12-11T16:52:03Z

Good point about POD. I think the correct term here is trivially copyable although I need to check if Arg satisfies all the requirements.

If the inheritance is mainly to avoid repeating value, maybe the fields could be renamed to something like arg.value.int_?

That's an option, but ideally it would be nice to find a way not to rely on the "ugly" chain of conversions.

I did a quick and dirty implementation in https://github.com/cppformat/cppformat/tree/variadic-test to test an alternative and got similar compile time improvement on gcc:

but haven't tested clang yet.

The compile code size for large (>16) number of arguments is not as good as in your implementation, but better than in the current one.

dean0x7d · 2015-12-13T14:31:41Z

I experimented a bit more. The results are below:

baseline: current code on master
v1: the implementation in this PR
v2: Make typedef based on your implementation
v3: make function. Similar to v2 but with a function instead of a typedef. Keeps the inheritance and there is no conversion chain.

All 3 versions solve the 8-12 arg slowdown on Clang, but v2 has a negative effect for >=16 args. I had to cut off the last point (42 seconds) to keep a reasonable scale. Note that all versions a have negligible effect (< 5%) for 4 or less arguments, but it's pretty nice for 6-12.

Versions 2 and 3 see a small regression (~2.5 %) for >= 16 args, but I don't think it's too bad.

GCC doesn't have any trouble with v2 like Clang did.

There's a small regression in binary size for v3. As far as I can tell from the assembly, this is because the inlining behavior changes for certain arg counts. Actually, v3 is close to what I initially did but dropped it because of this binary size regression.

Let me know what you think.

vitaut · 2015-12-15T15:33:48Z

Thanks for another detailed analysis. I'd go with v3 because it solves the slowdown on Clang and doesn't rely on the chain of conversions. Since the binary size regression is not too big and only affects a large number of arguments (>= 9), I wouldn't worry about it.

I think you are right about inlining because making ArgArray<N, false>::make always_inline significantly improves binary sizes for >= 16 arguments on gcc (sometimes almost 2x reduction in size). Unfortunately doing the same for ArgArray<N, true>::make makes things worse.

There is also a question of API safety. In the current version one has to pass args... once to make_arg_list which handles both type and ArgList construction:

    typename fmt::internal::ArgArray<sizeof...(Args)>::Type array;
    call(FMT_FOR_EACH(FMT_GET_ARG_NAME, __VA_ARGS__),
      fmt::internal::make_arg_list<
        fmt::BasicFormatter<Char> >(array, args...));

while in the new version types and ArgList are handled separately making it possible to pass different
arguments by mistake:

    typename ArgArray::Type array{
      ArgArray::template make<fmt::BasicFormatter<Char> >(args)...};
    call(FMT_FOR_EACH(FMT_GET_ARG_NAME, __VA_ARGS__),
      fmt::ArgList(fmt::internal::make_type(args...), array));

But hopefully this can be addressed separately.

Anyway, if you agree with using v3, please submit a PR.

dean0x7d force-pushed the variadic branch from 41255ca to 062064d Compare December 6, 2015 17:57

vitaut force-pushed the master branch from a26af2b to 5c76d10 Compare December 6, 2015 22:17

dean0x7d added 6 commits December 6, 2015 23:45

Compose Arg instead of inheriting from Value

8a9ad00

Add MakeArg constructor

1383efb

Replace template recursion with variadic array initialization

f942856

Test that arrays of Value/Arg and MakeValue/MakeArg are equivalent

6ea14f0

ArgArray construction depends on this behavior. Note that this is only relevant in the C++11 code path.

Use MakeArg constructor instead of manually constructing Arg

983b883

Fix compilation in C++0x mode

98ed65d

Alias templates may not be supported even if variadic templates are.

dean0x7d force-pushed the variadic branch from 062064d to 98ed65d Compare December 6, 2015 22:46

vitaut merged commit 98ed65d into fmtlib:master Dec 9, 2015

dean0x7d mentioned this pull request Dec 19, 2015

Reduce compile time of variadic functions (2) #249

Merged

vitaut mentioned this pull request Dec 21, 2015

Reduce size of variadic-test result data fmtlib/format-benchmark#3

Merged

dean0x7d deleted the variadic branch February 11, 2016 00:25

vitaut mentioned this pull request Jun 25, 2017

fmtlib update seanmiddleditch/formatxx#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce compile time of variadic functions #243

Reduce compile time of variadic functions #243

dean0x7d commented Dec 6, 2015

Spartan322 commented Dec 6, 2015

dean0x7d commented Dec 6, 2015

dean0x7d commented Dec 6, 2015

vitaut commented Dec 6, 2015

vitaut commented Dec 7, 2015

dean0x7d commented Dec 7, 2015

vitaut commented Dec 9, 2015

dean0x7d commented Dec 9, 2015

vitaut commented Dec 10, 2015

dean0x7d commented Dec 10, 2015

vitaut commented Dec 10, 2015

dean0x7d commented Dec 11, 2015

vitaut commented Dec 11, 2015

dean0x7d commented Dec 13, 2015

vitaut commented Dec 15, 2015

Reduce compile time of variadic functions #243

Reduce compile time of variadic functions #243

Conversation

dean0x7d commented Dec 6, 2015

Results

Compile time

Binary size

The Good

The Bad

The Ugly

Alternatives

Epilogue

Spartan322 commented Dec 6, 2015

dean0x7d commented Dec 6, 2015

dean0x7d commented Dec 6, 2015

Extra Benchmarks

vitaut commented Dec 6, 2015

vitaut commented Dec 7, 2015

dean0x7d commented Dec 7, 2015

vitaut commented Dec 9, 2015

dean0x7d commented Dec 9, 2015

vitaut commented Dec 10, 2015

dean0x7d commented Dec 10, 2015

vitaut commented Dec 10, 2015

dean0x7d commented Dec 11, 2015

vitaut commented Dec 11, 2015

dean0x7d commented Dec 13, 2015

vitaut commented Dec 15, 2015