
Documenting most recent bottlenecks detected #127

Open

springmeyer opened this issue Aug 10, 2018 · 3 comments

@springmeyer
Contributor

@apendleton provided a cache_bench benchmark which exercises cxxcache.RocksDBCache and cxxcache.coalesce.

Using latest master at fec8605 (with node v8.11.3), here are the things I notice when running the benchmark locally on my OS X machine:

  • The benchmark takes about 22-24s
  • CPU usage hovers around 80-110% (so about one core), which makes sense given that the benchmark synchronizes calls to cxxcache.coalesce using queue(1) and therefore cannot utilize the threadpool much
  • Memory usage grows consistently after the initial data load, from 400 to 600 MB over the course of 15-20 seconds, and keeps growing until the process exits
  • The main event loop (aka main thread) is very busy, with 44% in uv_work_done
  • 10% of the time spent in uv_work_done is garbage collection of v8 objects (initiated by node::NodePlatform::FlushForegroundTasksInternal())
  • The majority of the 44% spent in uv_work_done goes to returning data from the threadpool to the main event loop, which happens in the carmen::coalesceAfter function.
    • carmen::coalesceAfter looks to be bottlenecked on v8::Object::Set, which is called inside carmen::contextToArray and carmen::coverToObject to convert the carmen data into JS objects (see the sketch after this list).
  • The worker threads in the C++ threadpool (where carmen::coalesceMulti does its critical work) appear to be mostly idle, which is surprising. Ideally the main event loop is mostly idle (the profiler mostly sees the CPU executing in uv__io_poll, which signifies waiting for work) while the threads are nearly 100% busy. But in this case the opposite is true: the main event loop is busier than the threads.
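To make that cost concrete, here is a minimal sketch (not the actual carmen-cache code; the Cover fields are illustrative) of the kind of per-cover conversion that coverToObject performs. Every property is a separate Nan::Set / v8::Object::Set call, and each key string is freshly allocated, so the engine's property machinery runs once per field per cover:

```cpp
// Minimal sketch of a coverToObject-style conversion. Not the actual
// carmen-cache code; the Cover fields here are illustrative.
#include <nan.h>

struct Cover {
    double relev;
    uint32_t id;
    uint32_t x;
    uint32_t y;
};

v8::Local<v8::Object> CoverToObject(Cover const& cover) {
    Nan::EscapableHandleScope scope;
    v8::Local<v8::Object> object = Nan::New<v8::Object>();
    // Each Set call allocates a fresh key string and crosses into the
    // engine's property machinery: this is where the profile shows the
    // main event loop burning its time.
    Nan::Set(object, Nan::New("relev").ToLocalChecked(), Nan::New<v8::Number>(cover.relev));
    Nan::Set(object, Nan::New("id").ToLocalChecked(), Nan::New<v8::Number>(cover.id));
    Nan::Set(object, Nan::New("x").ToLocalChecked(), Nan::New<v8::Number>(cover.x));
    Nan::Set(object, Nan::New("y").ToLocalChecked(), Nan::New<v8::Number>(cover.y));
    return scope.Escape(object);
}
```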

Why are the threads so idle?

One hypothesis is that the threads are idle because the queue(1); used in the bench.js script sends so little work to the threadpool that each task lands on a different thread, fans out, and returns quickly. This appears not to be the case: if I run the benchmark with a threadpool size of 1, the profiling output (and the script timing) is similar:

UV_THREADPOOL_SIZE=1 node cache_bench/bench.js

Instead, it seems the problem is that the main event loop is doing too much work. Normally you want the hard work done in the threads to avoid blocking the main loop, but on balance that is not happening here. The fact that this balance is so far off in the benchmark is concerning: in real-world usage many other functions will need to execute on the main event loop, so carmen-cache taking up this much of it is a serious problem.

Profiling in Activity Monitor shows the significant cost of the JS object conversion in carmen::coalesceAfter:

[screenshot: Activity Monitor profile, 2018-08-10 12:16 PM]

If I change the queue(1); to queue(); in the benchmark (to allow more async use of the threadpool), the result further confirms that we've got a bad bottleneck on the JS conversion in the main event loop. Running UV_THREADPOOL_SIZE=16 node cache_bench/bench.js finishes faster (in 14s), but at the cost of nearly entirely jamming up the main event loop with JS object conversion: > 93% of the time is spent in carmen::coalesceAfter. In all my profiling of async C++ code I've never seen this much of the main event loop taken up by object conversion from C++ to JS (even in a micro benchmark like this):

[screenshot: Activity Monitor profile, 2018-08-10 12:22 PM]

At the time this profile was taken, the threads were completely idle with no work to do, likely because the main event loop was blocked on object conversion and could not dispatch any. This is bad.

How likely is this to be impacting production?

I don't know. Surely not as severely as this micro benchmark, but profiling in production would show the level of severity of this bottleneck, so I think that is worth doing.

Potential solutions

A solution to this problem would involve:

  • 🍒 Reducing the # of objects sent back from C++ to JS
  • 🌸 Reducing the cost of passing objects from C++ to JS

As far as 🍒: we currently limit to 40 features to reduce this cost. Could we possibly limit further, to 20 features? It looks like the 40 cutoff was added in #65, perhaps?

As far as 🌸, two ideas come to mind. Instead of converting C++ objects to JS JSON-like objects (which is clearly crazy expensive) we could either:

  • Pass C++ objects directly into JS. Those C++ objects would use pointers internally to avoid copying the data, and would expose methods to access the underlying feature properties, so in JS-land you would call context.getRelev() instead of reading context.relev (see the sketch after this list)
  • Serialize using some small, efficient format like protobuf or the like. This way we'd serialize to protobuf in C++ and then decode lazily in JS to get access to the properties. The protobuf message could be efficiently passed from C++ to JS using this extremely fast and efficient method: Enhancement: pass data created in threadpool back to main JS thread in zero-copy way node-cpp-skel#69
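A hedged sketch of the first idea, assuming Nan: wrap the C++ context in a Nan::ObjectWrap so that only a small handle crosses the boundary, hold the native data by pointer, and expose accessors like getRelev() that convert a single field on demand. The class and method names here are illustrative, not an actual carmen-cache API:

```cpp
// Sketch of passing C++ objects directly into JS via Nan::ObjectWrap.
// Class and method names are illustrative, not an actual carmen-cache API.
#include <nan.h>
#include <memory>

struct Context {
    double relev;
    // ... other feature fields ...
};

class JSContext : public Nan::ObjectWrap {
public:
    static NAN_MODULE_INIT(Init) {
        v8::Local<v8::FunctionTemplate> tpl = Nan::New<v8::FunctionTemplate>(New);
        tpl->SetClassName(Nan::New("Context").ToLocalChecked());
        tpl->InstanceTemplate()->SetInternalFieldCount(1);
        Nan::SetPrototypeMethod(tpl, "getRelev", GetRelev);
        constructor().Reset(Nan::GetFunction(tpl).ToLocalChecked());
    }

    // Wrap an existing C++ context: only a handle crosses the boundary;
    // the feature data itself is held by pointer and never copied.
    static v8::Local<v8::Object> NewInstance(std::shared_ptr<Context> ctx) {
        Nan::EscapableHandleScope scope;
        v8::Local<v8::Object> obj =
            Nan::NewInstance(Nan::New(constructor())).ToLocalChecked();
        Nan::ObjectWrap::Unwrap<JSContext>(obj)->ctx_ = std::move(ctx);
        return scope.Escape(obj);
    }

private:
    static NAN_METHOD(New) {
        (new JSContext())->Wrap(info.This());
        info.GetReturnValue().Set(info.This());
    }

    // Property access is deferred: JS calls context.getRelev(), and only
    // then does a single field get converted to a JS number.
    static NAN_METHOD(GetRelev) {
        JSContext* self = Nan::ObjectWrap::Unwrap<JSContext>(info.Holder());
        info.GetReturnValue().Set(Nan::New<v8::Number>(self->ctx_->relev));
    }

    static Nan::Persistent<v8::Function>& constructor() {
        static Nan::Persistent<v8::Function> ctor;
        return ctor;
    }

    std::shared_ptr<Context> ctx_;
};
```

With this approach, coalesceAfter would create one small wrapper per feature instead of eagerly converting every property, trading up-front conversion cost for per-access cost.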
@springmeyer
Contributor Author

springmeyer commented Aug 10, 2018

Also should mention that the above is a braindump because I'm stepping away from this work after pushing on #126.

I'll add that I did get as far as confirming that, if the JS conversion bottleneck is removed by trying the "pass C++ objects directly into JS" method, the performance of the benchmark doubles and the main event loop is no longer blocked but idle.

My comparison is not quite fair, since I only implemented the object passing; the actual access to the object properties is not yet implemented, so all the unit tests are broken. However, the cache_bench benchmark does not actually depend on access to properties like scoredist or maxrelev, so I was able to compare decently using the benchmark.

Here is the unfinished branch that starts passing the C++ features without copying into JS land: https://github.com/mapbox/carmen-cache/compare/pass-by-object-rather-than-json.

The next step here, if someone were to take this on, would be adding methods to that Features object to actually be able to get the data off it, and then writing a benchmark that exercises both the call to coalesce and the usage of the properties, to see if this approach is actually faster. My hunch is that the additional property access would not be free, but it would cost less than converting everything all at once.

@springmeyer
Contributor Author

Update since some time has passed: I think this still applies; the overhead of passing objects between JS and C++ is non-trivial.

@apendleton
Contributor

Agreed. There are probably some intermediate improvements that can be made (for example: not allocating new strings for the keys of all the JS objects we return; they all have an "id" key, so we should make one "id" string rather than one per object returned), but ultimately I think the solution will be to move the API boundary so we have to pass less stuff back and forth.
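As a sketch of that intermediate improvement (assuming Nan; names are illustrative, not actual carmen-cache code): cache the "id" key in a persistent handle once, then reuse the same v8::String for every object returned instead of allocating a fresh key string per object:

```cpp
// Sketch of caching a shared "id" key string. Assumes Nan; names are
// illustrative rather than actual carmen-cache code.
#include <nan.h>

static Nan::Persistent<v8::String> id_key;

v8::Local<v8::String> IdKey() {
    // Allocate the "id" string once and reuse it for every object we build.
    if (id_key.IsEmpty()) {
        id_key.Reset(Nan::New("id").ToLocalChecked());
    }
    return Nan::New(id_key);
}

void SetId(v8::Local<v8::Object> object, double id) {
    // One shared key string; no per-object allocation for the key.
    Nan::Set(object, IdKey(), Nan::New<v8::Number>(id));
}
```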
