Performance optimizations for meta tag queries #1517

replay · 2019-11-01T21:04:20Z

This PR does 4 things, in 4 different commits, so I recommend reviewing them separately:

1. It adds benchmarks for executing meta tag queries in a relatively realistic scenario.
2. It limits the number of sub-queries that can run concurrently to the value of TagQueryWorkers. When building the initial result set, if the query expression includes a meta tag, we start a sub-query in a goroutine for each of the meta records involved. If a meta tag has a very large number of meta records associated with it (e.g. dc=dc1 resulting in host=host1, host=host2, host=host3, etc.), we don't want to start all of those goroutines at once, so this introduces a gate to limit their concurrency.
3. After we have built the initial result set based on one query expression, we filter it down based on the other given query expressions, using separate goroutines that do the filtering. But if a query has only one expression, there is nothing to filter and we can return the initial result set directly. This change avoids creating the filter workers when there are no expressions to filter by, and it also saves copying the values from the filter workers' input channel into the output channel. This required changing the interface of the tag selector a bit: we now pass the result channel into .getIds() when calling it, and .getIds() is now a blocking method.
4. It removes a bunch of old code that is no longer used in the current master.
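To illustrate the concurrency gate described above, here is a minimal sketch, not the code from this PR. It assumes a buffered channel used as a counting semaphore; `runSubQueries`, `records`, `workers` and `query` are hypothetical names chosen for the example, with `workers` playing the role of TagQueryWorkers.

```go
package main

import "sync"

// runSubQueries is a hypothetical stand-in for launching one sub-query per
// meta record. At most `workers` sub-queries run at the same time.
// `workers` must be at least 1, otherwise the first send on the gate blocks forever.
func runSubQueries(records []string, workers int, query func(record string) []string) []string {
	// A buffered channel acts as a counting semaphore (the "gate").
	gate := make(chan struct{}, workers)

	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []string
	)

	for _, rec := range records {
		// Blocks while `workers` sub-queries are already in flight.
		gate <- struct{}{}
		wg.Add(1)
		go func(rec string) {
			defer wg.Done()
			// Free the slot so the next sub-query can start.
			defer func() { <-gate }()

			ids := query(rec)

			mu.Lock()
			results = append(results, ids...)
			mu.Unlock()
		}(rec)
	}

	// Wait for all sub-queries before returning the merged result.
	wg.Wait()
	return results
}
```

Acquiring the slot before starting the goroutine means the number of goroutines in existence is bounded as well, not only the number doing work. The order of the merged results is not deterministic in this sketch.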

This also changes the default value for tag query workers from 50 to 5. 50 has always been relatively high, but since we now also create sub-queries from the expressions associated with meta tags, it is far too high. It also changes the id selector so that it doesn't unnecessarily deduplicate results when it is called by a sub-query: sub-queries don't evaluate meta tags, and duplicates are only possible when meta tags get evaluated.
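The conditional deduplication could look roughly like the sketch below. This is an illustration under assumptions, not the actual id selector: `emitIds` and its `dedupe` flag are hypothetical, with `dedupe` standing for "meta tags are being evaluated, so duplicates are possible".

```go
package main

// emitIds sends ids to out. It only tracks already-seen ids when dedupe is
// true, so callers that cannot produce duplicates (such as sub-queries in this
// description) skip the cost of the map entirely.
func emitIds(ids []string, dedupe bool, out chan<- string) {
	var seen map[string]struct{}
	if dedupe {
		seen = make(map[string]struct{}, len(ids))
	}

	for _, id := range ids {
		if dedupe {
			if _, ok := seen[id]; ok {
				continue // duplicate, already sent
			}
			seen[id] = struct{}{}
		}
		out <- id
	}
}
```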

Since we still always need to filter by the from timestamp, we now do this in the id selector by calling a new method on the tag query context called newerThanFrom. To make this change work, it was necessary to pass the result channel into getIds() instead of returning it from getIds(). This resulted in some rewiring of the channels, especially of where channels get closed. To make it possible to close the result channel when the id selector is finished, getIds() is now a blocking method.
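A rough sketch of that wiring, under stated assumptions: the `series` and `queryContext` types, the `collect` helper, and the `>=` comparison against `from` are all hypothetical and only illustrate the pattern of a blocking getIds() that writes into a channel owned by the caller.

```go
package main

// series is a hypothetical index entry; the real index types are not shown here.
type series struct {
	id         string
	lastUpdate int64
}

// queryContext is a hypothetical stand-in for the tag query context.
type queryContext struct {
	from  int64
	index []series
}

// newerThanFrom reports whether the series was updated at or after `from`.
// The exact comparison is an assumption made for this sketch.
func (c *queryContext) newerThanFrom(s series) bool {
	return s.lastUpdate >= c.from
}

// getIds blocks until every matching id has been sent on resCh.
// It does not close resCh; the caller owns the channel.
func (c *queryContext) getIds(resCh chan<- string) {
	for _, s := range c.index {
		if c.newerThanFrom(s) {
			resCh <- s.id
		}
	}
}

// collect shows the caller side: because getIds is blocking, the channel can
// be closed right after it returns, when no further sends are possible.
func collect(c *queryContext) []string {
	resCh := make(chan string)
	go func() {
		c.getIds(resCh)
		close(resCh)
	}()

	var ids []string
	for id := range resCh {
		ids = append(ids, id)
	}
	return ids
}
```

With a single query expression, a caller can hand its final output channel straight to getIds(), so no filter worker has to copy values from one channel into another.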

replay · 2019-11-05T12:40:38Z

Rebased onto the latest master

robert-milan

LGTM

replay added 4 commits November 5, 2019 09:40

benchmark of find by meta tag

0d17df7

remove old code that is not used anymore

2ec3d6e

replay force-pushed the performance_optimizations_for_meta_tag_queries branch from 61eeb13 to 2ec3d6e Compare November 5, 2019 12:40

robert-milan approved these changes Nov 6, 2019

replay merged commit 16c7bf7 into master Nov 6, 2019

replay deleted the performance_optimizations_for_meta_tag_queries branch November 6, 2019 22:59
