
Documenting most recent bottlenecks detected #127

Open

springmeyer opened this issue Aug 10, 2018 · 3 comments

@springmeyer
Contributor

@apendleton provided a cache_bench benchmark which exercises cxxcache.RocksDBCache and cxxcache.coalesce.

Using latest master at fec8605 (with node v8.11.3), here are the things I notice when running the benchmark locally on my OS X machine:

  • The benchmark takes about 22-24s
  • CPU usage hovers around 80-110% (so about one core), which makes sense given that the benchmark synchronizes calls to cxxcache.coalesce using queue(1) and therefore cannot utilize the threadpool much
  • Memory usage grows consistently after the initial data load, from 400 to 600 MB over the course of 15-20 seconds, and keeps growing until the process exits
  • The main event loop (aka main thread) is very busy, with 44% in uv_work_done
  • 10% of the time spent in uv_work_done is garbage collection of v8 objects (initiated by node::NodePlatform::FlushForegroundTasksInternal())
  • The majority of the 44% spent in uv_work_done goes to returning data from the threadpool to the main event loop, which happens in the carmen::coalesceAfter function.
    • carmen::coalesceAfter looks to be bottlenecked on v8::Object::Set, which is called inside carmen::contextToArray and carmen::coverToObject to convert the carmen data into JS objects (see the sketch after this list).
  • The worker threads in the C++ threadpool (where carmen::coalesceMulti does its critical work) appear to be mostly idle, which is surprising. Ideally the main event loop is mostly idle (the profiler mostly sees the CPU executing in uv__io_poll, which signifies waiting for work) while the threads are nearly 100% busy. But in this case the opposite is true: the main event loop is busier than the threads.
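To make that cost concrete, here is a minimal sketch (not the actual carmen-cache code; the Cover fields are illustrative) of the kind of per-cover conversion that coverToObject performs. Every property is a separate Nan::Set / v8::Object::Set call, and each key string is freshly allocated, so the engine's property machinery runs once per field per cover:

```cpp
// Minimal sketch of a coverToObject-style conversion. Not the actual
// carmen-cache code; the Cover fields here are illustrative.
#include <nan.h>

struct Cover {
    double relev;
    uint32_t id;
    uint32_t x;
    uint32_t y;
};

v8::Local<v8::Object> CoverToObject(Cover const& cover) {
    Nan::EscapableHandleScope scope;
    v8::Local<v8::Object> object = Nan::New<v8::Object>();
    // Each Set call allocates a fresh key string and crosses into the
    // engine's property machinery: this is where the profile shows the
    // main event loop burning its time.
    Nan::Set(object, Nan::New("relev").ToLocalChecked(), Nan::New<v8::Number>(cover.relev));
    Nan::Set(object, Nan::New("id").ToLocalChecked(), Nan::New<v8::Number>(cover.id));
    Nan::Set(object, Nan::New("x").ToLocalChecked(), Nan::New<v8::Number>(cover.x));
    Nan::Set(object, Nan::New("y").ToLocalChecked(), Nan::New<v8::Number>(cover.y));
    return scope.Escape(object);
}
```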

Why are the threads so idle?

One hypothesis is that the threads are idle because the queue(1); used in the bench.js script sends so little work to the threadpool that each task lands on a different thread, fans out, and returns quickly. This appears not to be the case: if I run the benchmark with a threadpool size of 1, the profiling output (and the script timing) is similar:

UV_THREADPOOL_SIZE=1 node cache_bench/bench.js

Instead, it seems the problem is that the main event loop is doing too much work. Normally you want the hard work done in the threads to avoid blocking the main loop, but on balance that is not happening here. The fact that this balance is so far off in the benchmark is concerning: in real-world usage many other functions will need to execute on the main event loop, so carmen-cache taking up this much of it is a serious problem.

Profiling in Activity Monitor shows the significant cost of the JS object conversion in carmen::coalesceAfter:

[screenshot: Activity Monitor profile, 2018-08-10 12:16 PM]

If I change the queue(1); to queue(); in the benchmark (to allow more async use of the threadpool), the result further confirms that we've got a bad bottleneck on the JS conversion in the main event loop. Running UV_THREADPOOL_SIZE=16 node cache_bench/bench.js finishes faster (in 14s), but at the cost of nearly entirely jamming up the main event loop with JS object conversion: > 93% of the time is spent in carmen::coalesceAfter. In all my profiling of async C++ code I've never seen this much of the main event loop taken up by object conversion from C++ to JS (even in a micro benchmark like this):

[screenshot: Activity Monitor profile, 2018-08-10 12:22 PM]

At the time this profile was taken, the threads were completely idle with no work to do, likely because the main event loop was blocked on object conversion and could not dispatch any. This is bad.

How likely is this to be impacting production?

I don't know. Surely not as severely as this micro benchmark, but profiling in production would show the level of severity of this bottleneck, so I think that is worth doing.

Potential solutions

A solution to this problem would involve:

  • 🍒 Reducing the # of objects sent back from C++ to JS
  • 🌸 Reducing the cost of passing objects from C++ to JS

As far as 🍒: we currently limit to 40 features to reduce this cost. Could we possibly limit further, to 20 features? It looks like the 40 cutoff was added in #65, perhaps?

As far as 🌸, two ideas come to mind. Instead of converting C++ objects to JS JSON-like objects (which is clearly crazy expensive) we could either:

  • Pass C++ objects directly into JS. Those C++ objects would use pointers internally to avoid copying the data, and would expose methods to access the underlying feature properties, so in JS-land you would call context.getRelev() instead of reading context.relev (see the sketch after this list)
  • Serialize using some small, efficient format like protobuf or the like. This way we'd serialize to protobuf in C++ and then decode lazily in JS to get access to the properties. The protobuf message could be efficiently passed from C++ to JS using this extremely fast and efficient method: Enhancement: pass data created in threadpool back to main JS thread in zero-copy way node-cpp-skel#69
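A hedged sketch of the first idea, assuming Nan: wrap the C++ context in a Nan::ObjectWrap so that only a small handle crosses the boundary, hold the native data by pointer, and expose accessors like getRelev() that convert a single field on demand. The class and method names here are illustrative, not an actual carmen-cache API:

```cpp
// Sketch of passing C++ objects directly into JS via Nan::ObjectWrap.
// Class and method names are illustrative, not an actual carmen-cache API.
#include <nan.h>
#include <memory>

struct Context {
    double relev;
    // ... other feature fields ...
};

class JSContext : public Nan::ObjectWrap {
public:
    static NAN_MODULE_INIT(Init) {
        v8::Local<v8::FunctionTemplate> tpl = Nan::New<v8::FunctionTemplate>(New);
        tpl->SetClassName(Nan::New("Context").ToLocalChecked());
        tpl->InstanceTemplate()->SetInternalFieldCount(1);
        Nan::SetPrototypeMethod(tpl, "getRelev", GetRelev);
        constructor().Reset(Nan::GetFunction(tpl).ToLocalChecked());
    }

    // Wrap an existing C++ context: only a handle crosses the boundary;
    // the feature data itself is held by pointer and never copied.
    static v8::Local<v8::Object> NewInstance(std::shared_ptr<Context> ctx) {
        Nan::EscapableHandleScope scope;
        v8::Local<v8::Object> obj =
            Nan::NewInstance(Nan::New(constructor())).ToLocalChecked();
        Nan::ObjectWrap::Unwrap<JSContext>(obj)->ctx_ = std::move(ctx);
        return scope.Escape(obj);
    }

private:
    static NAN_METHOD(New) {
        (new JSContext())->Wrap(info.This());
        info.GetReturnValue().Set(info.This());
    }

    // Property access is deferred: JS calls context.getRelev(), and only
    // then does a single field get converted to a JS number.
    static NAN_METHOD(GetRelev) {
        JSContext* self = Nan::ObjectWrap::Unwrap<JSContext>(info.Holder());
        info.GetReturnValue().Set(Nan::New<v8::Number>(self->ctx_->relev));
    }

    static Nan::Persistent<v8::Function>& constructor() {
        static Nan::Persistent<v8::Function> ctor;
        return ctor;
    }

    std::shared_ptr<Context> ctx_;
};
```

With this approach, coalesceAfter would create one small wrapper per feature instead of eagerly converting every property, trading up-front conversion cost for per-access cost.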
@springmeyer
Contributor Author

springmeyer commented Aug 10, 2018

Also should mention that the above is a braindump because I'm stepping away from this work after pushing on #126.

I'll add that I did get as far as confirming that, if the JS conversion bottleneck is removed by trying the "pass C++ objects directly into JS" method, the performance of the benchmark doubles and the main event loop is no longer blocked but idle.

My comparison is not quite fair, since I only implemented the object passing; the actual access to the object properties is not yet implemented, so all the unit tests are broken. However, the cache_bench benchmark does not actually depend on access to properties like scoredist or maxrelev, so I was able to compare decently using the benchmark.

Here is the unfinished branch that starts passing the C++ features without copying into JS land: https://github.com/mapbox/carmen-cache/compare/pass-by-object-rather-than-json.

The next step here, if someone were to take this on, would be adding methods to that Features object to actually be able to get the data off it, and then writing a benchmark that exercises both the call to coalesce and the usage of the properties, to see if this approach is actually faster. My hunch is that the additional property access would not be free, but it would cost less than converting everything all at once.

@springmeyer
Contributor Author

Update since some time has passed: I think this still applies; the overhead of passing objects between JS and C++ is non-trivial.

@apendleton
Contributor

Agreed. There are probably some intermediate improvements that can be made (for example: not allocating new strings for the keys of all the JS objects we return; they all have an "id" key, so we should make one "id" string rather than one per object returned), but ultimately I think the solution will be to move the API boundary so we have to pass less stuff back and forth.
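As a sketch of that intermediate improvement (assuming Nan; names are illustrative, not actual carmen-cache code): cache the "id" key in a persistent handle once, then reuse the same v8::String for every object returned instead of allocating a fresh key string per object:

```cpp
// Sketch of caching a shared "id" key string. Assumes Nan; names are
// illustrative rather than actual carmen-cache code.
#include <nan.h>

static Nan::Persistent<v8::String> id_key;

v8::Local<v8::String> IdKey() {
    // Allocate the "id" string once and reuse it for every object we build.
    if (id_key.IsEmpty()) {
        id_key.Reset(Nan::New("id").ToLocalChecked());
    }
    return Nan::New(id_key);
}

void SetId(v8::Local<v8::Object> object, double id) {
    // One shared key string; no per-object allocation for the key.
    Nan::Set(object, IdKey(), Nan::New<v8::Number>(id));
}
```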
