
remove excessive request alignment, add MDP optimization and pre-normalisation #951

Merged
merged 46 commits into master from refactor-alignrequests on Jan 30, 2020

Conversation


Dieterbe commented Jun 26, 2018

see #926

  • we now no longer excessively align metric requests to the same interval. if metrics have different intervals, we can return them as such
  • implement MDP-optimization and PN-optimization as described in #926 ("alignRequests is too coarse grained (for timeshift functions etc)"). we introduce the concept of PN-groups and a per-request MDP field, and track optimizability through the construction of the plan via the Context, so we can set the MDP hints on data requests and classify them in PN-groups.
  • functions that now need to work with multiple series (e.g. aggregators) get runtime normalization to deal with different-interval inputs (see the sketch after this list).
  • remove prometheus query path. I had the choice of reworking it to fit the new APIs, or just removing it, which we wanted to do anyway (#1454, "remove promql support").
  • max-points-per-req now accurately works against the number of points fetched (#1556, "max-points-per-req-soft / max-points-per-req-hard misleading"). Also, if a single target has multiple MetricDefs (e.g. due to an interval change), you no longer get that extra data for free.
  • refactor code related to construction of requests and planning. in some places it's quite a bit cleaner now
  • Note that MDP and PN group get communicated via models.Req and models.Series. For queries that were fanned out, we typically reconstruct the expr.Req out of the models.Series so we can associate the right data to the originating expression from the query. During an in-place upgrade, this could result in an MDP- and PN-group aware query initiator receiving data back with these fields set to 0, so it would not be able to associate the data. I suggest handling this by adding a flag to opt in to the optimizations, so data requests from old nodes are equivalent to requests that have the optimizations disabled. They will just be treated as non-optimizable. Once the entire cluster is upgraded, we can then enable the flag.
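For illustration, here is a minimal sketch of what that runtime normalization amounts to, with hypothetical simplified types (not metrictank's actual implementation, and averaging stands in for whatever consolidation function applies): mixed-interval inputs get consolidated up to the least common multiple of their intervals, so an aggregator can combine them point by point.

```go
package normalize

// Series is a hypothetical simplified series: values at a fixed interval.
type Series struct {
	Interval uint32 // seconds between points
	Values   []float64
}

func gcd(a, b uint32) uint32 {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

func lcm(a, b uint32) uint32 { return a / gcd(a, b) * b }

// normalize consolidates s up to the target interval (a multiple of
// s.Interval) by averaging each group of points into one.
func normalize(s Series, target uint32) Series {
	if target == s.Interval {
		return s
	}
	group := int(target / s.Interval)
	out := make([]float64, 0, len(s.Values)/group)
	for i := 0; i+group <= len(s.Values); i += group {
		sum := 0.0
		for _, v := range s.Values[i : i+group] {
			sum += v
		}
		out = append(out, sum/float64(group))
	}
	return Series{Interval: target, Values: out}
}

// NormalizeAll brings mixed-interval inputs onto their LCM interval:
// the work that pre-normalization avoids having to do at runtime.
func NormalizeAll(in []Series) []Series {
	target := in[0].Interval
	for _, s := range in[1:] {
		target = lcm(target, s.Interval)
	}
	out := make([]Series, len(in))
	for i, s := range in {
		out[i] = normalize(s, target)
	}
	return out
}
```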

Note that when MDP/PN optimizations are disabled (or for non-MDP-optimizable singles), the behavior of request planning is the same as in master (set highest resolution based on TTL and ready status) except:

  • after setting the archive etc, we don't align anymore.
  • we honor max-points-per-req-soft by reducing some series to have a pointcount of MDP/2, rather than forcing all of them to be at a lower (shared/equal) resolution.

TODO:

  • metrics? output metadata on the number of times we've MDP-optimized, pre-normalized or had to runtime-normalize?
  • update models.Req and models.Series json/msgp marshal funcs to include the new fields
  • make MDP-optimizations opt-in, and PN opt-out
  • (MDP optimizations: reduce to MDP rather than MDP/2; see the sketch after this list)
  • figure out how to handle our newfound inability to pre-canonicalize series (because we can't predict runtime normalization)
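To make the MDP/2 item concrete, a minimal hypothetical sketch of the archive selection that MDP-optimization implies (names and signature are made up; the real planner also weighs TTL and ready status):

```go
package plan

// pickArchive picks the coarsest archive that still yields at least
// mdp/2 points over the requested time range (per the TODO above, this
// threshold may become mdp). intervals is ordered rawest first.
func pickArchive(intervals []uint32, from, to int64, mdp uint32) int {
	span := uint32(to - from)
	for i := len(intervals) - 1; i > 0; i-- { // walk coarsest -> rawest
		if span/intervals[i] >= mdp/2 {
			return i
		}
	}
	return 0 // no rollup yields enough points: use raw data
}
```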

@Dieterbe Dieterbe added this to the 1.0 milestone Aug 22, 2018
@Dieterbe Dieterbe modified the milestones: vnext, sprint-2 Oct 7, 2019
@fkaleo fkaleo modified the milestones: sprint-2, sprint-4 Oct 28, 2019
@robert-milan robert-milan modified the milestones: sprint-4, sprint-5 Dec 9, 2019
@fkaleo fkaleo modified the milestones: sprint-5, sprint-6 Jan 6, 2020
@Dieterbe Dieterbe force-pushed the refactor-alignrequests branch 6 times, most recently from c525a2d to 1f78d5a Compare January 17, 2020 19:13

fitzoh commented Jan 17, 2020

Thoughts on moving the prometheus removal to a separate PR?


Dieterbe commented Jan 17, 2020

fair enough. it's now a separate PR: #1613
also #1616 should be merged first, then i can rebase this one.

@Dieterbe Dieterbe changed the title from "WIP: less coarse alignRequests. see #926" to "remove excessive request alignment, add MDP optimization and pre-normalisation" Jan 17, 2020
@Dieterbe Dieterbe changed the title from "remove excessive request alignment, add MDP optimization and pre-normalisation" to "[WIP] remove excessive request alignment, add MDP optimization and pre-normalisation" Jan 17, 2020
@Dieterbe Dieterbe force-pushed the refactor-alignrequests branch 6 times, most recently from ffa1608 to 6b5b654 Compare January 20, 2020 17:39

Dieterbe commented Jan 29, 2020

maxPointsPerReqSoft workload benchmark

goal

progressively fetch longer time windows of different-interval data. track response sizes and latencies. check the overhead of maxPointsPerReqSoft when it's pushed to the extreme.
Note that we work with the default config values:

  • mppr soft: 1M
  • mppr hard: 20M

how?

  • pn groups get downgraded in one shot. that should be fast, and thus is not interesting
  • no pngroups -> lowers individual requests one at a time and repeatedly recomputes the pointcount of the entire plan (sketched below). more expensive -> interesting
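A minimal sketch of that one-at-a-time path, with hypothetical simplified types (the real planner works on archive metadata and also handles the PN-group one-shot case; picking whichever request currently fetches the most points is an assumed heuristic):

```go
package plan

// Req is a hypothetical simplified request: Points holds the pointcount
// it would fetch at each archive, rawest first (so coarser = fewer).
type Req struct {
	Archive int // currently selected archive, index into Points
	Points  []uint32
}

func (r Req) pointsFetched() uint32 { return r.Points[r.Archive] }

// totalPoints recomputes the pointcount of the entire plan.
func totalPoints(reqs []Req) uint32 {
	var total uint32
	for _, r := range reqs {
		total += r.pointsFetched()
	}
	return total
}

// soften coarsens requests one at a time until the plan fits under the
// soft limit, or nothing can be coarsened further.
func soften(reqs []Req, soft uint32) {
	for totalPoints(reqs) > soft {
		best := -1
		for i, r := range reqs {
			if r.Archive+1 < len(r.Points) &&
				(best == -1 || r.pointsFetched() > reqs[best].pointsFetched()) {
				best = i
			}
		}
		if best == -1 {
			return // soft limit stays breached; the hard limit still applies
		}
		reqs[best].Archive++
	}
}
```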

We keep using the same storage-schemas.conf as posted above, and a similar fakemetrics command, but with a longer time span to backfill:

fakemetrics schemasbackfill --schemas-file storage-schemas.conf --kafka-mdm-addr localhost:9092 --mpr 1000 --speedup 840 --offset 8d

The main question is, how many points do we fetch per hour worth of data? Well it depends...

  • without PN (3 archives at their raw intervals): 1000 * 6*60 + 1000 * 4*60 + 1000 * 60 = 660,000
  • with PN / when all at 60s: 3000*60 = 180,000
  • softened as much as possible (all series at their coarsest): 2000*60 + 1000*30 = 150,000 (see the snippet after this list)
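Spelling out that arithmetic (the 10s/15s/60s raw intervals and the 60s/120s coarsest intervals are inferred from the per-hour counts above, not stated explicitly):

```go
package main

import "fmt"

func main() {
	perHour := func(interval int) int { return 3600 / interval } // points per series per hour
	withoutPN := 1000*perHour(10) + 1000*perHour(15) + 1000*perHour(60)
	withPN := 3000 * perHour(60)
	softened := 2000*perHour(60) + 1000*perHour(120)
	fmt.Println(withoutPN, withPN, softened) // 660000 180000 150000
	fmt.Println(20_000_000 / softened)       // ~133: hours that fit under the hard limit
}
```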

let's try a spectrum starting at no maxPointsPerReqSoft breach, all the way to breaching it as significantly as possible. the hard limit allows us to query 20M/150k = ~133 hours of data.
so we can push the query up to 133h, which will be coarsened quite extensively to meet soft, and still just fit under the hard limit

execution

observations

  • 1h data -> nothing special happens. raw data
  • 2h+ -> need to soften. we start going into 60s rollups. up until 6h we're able to keep the points fetched roughly constant around 1M
  • at 7h we reach the limit of how much we can soften; we start breaking the soft limit, and points-fetched grows at 150k per hour
  • at 134h offset, we get bad response errors back in this branch, but not in master (!). master allows breaking the hard limit.
  • the main observation here is that latency is roughly constant up until a 7h window, and beyond that starts growing, as points fetched and returned also grow. in fact, latency is proportional to the number of returned points, though i'm not sure why it grows a tad faster than linear.

cpu profiling

I collected a CPU profile in a separate run, and noticed that neither planRequests nor the PointsFetch functions showed up in the top 50 (even when we were querying time ranges of 130-133h). Meaning the performance is definitely good enough (item 50 on the list was at 0.24% flat).
see https://gist.github.com/Dieterbe/35f7ffb55f1a65a951d64de22d146411#gistcomment-3160350
(again most time is spent generating the response body)

@Dieterbe

Conclusions

based on the experiments done so far:

  • on mixed-resolution data, you can expect the new code to be slower than master, because we return more points, unless we can PN-optimize or MDP consolidation kicks in. slowdowns are caused only by the increased response size, nothing else.
  • on same-resolution data, performance between this branch and master is identical (I didn't do a separate experiment, but we can base that off the PN experiments we did before)
  • master will happily fetch more points than maxPointsPerReqHard allows. this PR fixes that (#1556, "max-points-per-req-soft / max-points-per-req-hard misleading")

@Dieterbe Dieterbe changed the title from "[WIP] remove excessive request alignment, add MDP optimization and pre-normalisation" to "remove excessive request alignment, add MDP optimization and pre-normalisation" Jan 29, 2020
@robert-milan robert-milan merged commit 21d1dcd into master Jan 30, 2020
@robert-milan robert-milan deleted the refactor-alignrequests branch January 30, 2020 10:12

Dieterbe commented Jan 30, 2020

Did some more benchmarks....

  • master shows the same superlinear latency growth (wrt response size) as this branch. I haven't gotten to the bottom of it; knowing that master does the same was enough of a relief for me.
    also, on a single node this branch seems to be a bit faster than master, but on a cluster the inverse is true. not sure if it's noise? digging much deeper would require time i don't have right now.
    (charts: latencies-single-master-vs-951, latencies-cluster-master-vs-951)

  • both for this branch and master, a cluster has significantly more latency than a single node. this was really interesting, and something to look into further (outside the scope of this work).
    (charts: latencies-master-single-vs-cluster, latencies-951-single-vs-cluster)


Dieterbe commented Feb 6, 2020

fix #926
