once a certain amount of cpu is reached, [aggregator] output becomes unreliable #241

Open
Dieterbe opened this issue Dec 1, 2017 · 3 comments

Dieterbe commented Dec 1, 2017

A user reported that when they have many aggregators and CPU reaches about 60% (on each of 8 cores), data no longer comes through properly.

IMHO it's not correct, and not acceptable, for the relay to drop data when the CPU is not maxed out, and definitely not to do so non-transparently.

Dieterbe added a commit that referenced this issue Apr 23, 2019
it probably makes sense for similarly configured aggregators to
not all trigger their flushing at the same time, to keep things
flowing smoothly (to keep enough routines available for other tasks
like table dispatching etc. without queues forming) and to smooth
cpu utilisation over time. This may help with #241 but I'm not sure
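A minimal sketch of that staggering idea, assuming a plain time.Ticker per aggregator; the function name and shape are illustrative only, not the actual carbon-relay-ng change:

package aggregator

import (
	"math/rand"
	"time"
)

// startFlushTicker runs flush on every tick, but delays the first tick by a
// random fraction of the interval, so aggregators configured with the same
// interval do not all flush at the same instant (illustrative sketch only).
func startFlushTicker(interval time.Duration, flush func(t time.Time)) {
	go func() {
		time.Sleep(time.Duration(rand.Int63n(int64(interval)))) // random offset in [0, interval)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for t := range ticker.C {
			flush(t)
		}
	}()
}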
Dieterbe commented:

I think I was able to reproduce this today, even with a modest workload.

[init]
# init commands (DEPRECATED)
# see https://github.com/graphite-ng/carbon-relay-ng/blob/master/docs/config.md#Imperatives
cmds = [
'addAgg count regex=s(.*)\..* s$1.count.1 10 20 cache=false',
'addAgg count regex=so(.*)\..* so$1.count.2 10 20 cache=false',
'addAgg count regex=som(.*)\..* som$1.count.3 10 20 cache=false',
'addAgg count regex=some(.*)\..* some$1.count.4 10 20 cache=false',
'addAgg count regex=some.(.*)\..* some.$1.count.5 10 20 cache=false',
'addAgg count regex=some.i(.*)\..* some.i$1.count.6 10 20 cache=false',
'addAgg count regex=some.id(.*)\..* some.id$1.count.7 10 20 cache=false',
'addAgg count regex=some.id.(.*)\..* some.id.$1.count.8 10 20 cache=false',
'addAgg count regex=some.id.o(.*)\..* some.id.o$1.count.9 10 20 cache=false',
'addAgg count regex=some.id.of(.*)\..* some.id.of$1.count.10 10 20 cache=false',
'addAgg count regex=some.id.of.(.*)\..* some.id.of.$1.count.11 10 20 cache=false',
'addAgg count regex=some.id.of.a(.*)\..* some.id.of.a$1.count.12 10 20 cache=false',
'addAgg count regex=some.id.of.a.(.*)\..* some.id.of.a.$1.count.13 10 20 cache=false',
'addAgg count regex=some.id.of.a.m(.*)\..* some.id.of.a.m$1.count.14 10 20 cache=false',
'addAgg count regex=some.id.of.a.me(.*)\..* some.id.of.a.me$1.count.15 10 20 cache=false',
'addAgg count regex=(.*s.*.*o.*a.*m.*)[1-9]*.* $1.count.16 10 20 cache=false',
'addAgg count regex=(.*s.*.*o.*a.*m.*)[1-9]*.* $1.count.17 10 20 cache=false',
'addAgg count regex=(.*o.*.*o.*a.*m.*)[1-9]*.* $1.count.18 10 20 cache=false',
'addAgg count regex=(.*m.*.*o.*a.*m.*)[1-9]*.* $1.count.19 10 20 cache=false',
'addAgg count regex=(.*e.*.*o.*a.*m.*)[1-9]*.* $1.count.20 10 20 cache=false',
'addAgg count regex=(.*[1-9])*.* count-global.$1 10 20 cache=false',
'addRoute sendAllMatch carbon-default  localhost:2003 spool=false pickle=false'
]
fakemetrics feed --carbon-addr localhost:2002 --mpo 10000
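For a sense of where the CPU goes: with cache=false, my reading of this config is that every incoming metric key is matched against every aggregator's regex on every point, so 21 aggregators at 10000 metrics per interval is on the order of 200k regex evaluations per interval. A small standalone Go sketch of that per-point cost, using a few of the patterns from the config above:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// a few of the addAgg patterns from the config above (the full config has 21)
	patterns := []string{
		`s(.*)\..*`,
		`so(.*)\..*`,
		`some.id.of.a.me(.*)\..*`,
		`(.*[1-9])*.*`,
	}
	regexes := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		regexes = append(regexes, regexp.MustCompile(p))
	}

	// every incoming point pays this matching cost against every aggregator
	key := "some.id.of.a.metric.1"
	matches := 0
	for _, re := range regexes {
		if re.MatchString(key) {
			matches++
		}
	}
	fmt.Printf("%q matched %d of %d aggregator patterns\n", key, matches, len(regexes))
}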

(screenshot: cng-issue)

Generally the relay uses ~50% of an 8-core machine here.
Not sure yet why the output suddenly dropped:
https://snapshot.raintank.io/dashboard/snapshot/t1pXo75xZMgESUzFDN5d2IRtq6UsifdI?orgId=2
https://snapshot.raintank.io/dashboard/snapshot/zbXly5gbiiCDgwVfBqkV85asNlu2Bm05?orgId=2

Out of the possible effects:

  1. raw datapoints missing
  2. aggregate datapoints missing
  3. aggregate datapoints being incorrect (e.g. as if they didn't take all raw data into account for certain points)

This reproduced 2, but not 1 (querying a few subsets looks correct), nor 3 (they're all 10k or null).

I imagine I'll be able to repro 3 once I crank up CPU usage more.

Dieterbe commented:

I have a WIP branch which shows data incoming into each aggregator and also outgoing.
I can confirm that sometimes, for one or several aggregators, data comes in as usual but no points go out.
Digging deeper...
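For illustration, per-aggregator in/out counters along these lines would surface exactly that symptom (points coming in, nothing going out); the type and method names here are hypothetical, not the actual WIP branch:

package aggregator

import "sync/atomic"

// perAggregatorStats counts points flowing into and out of a single aggregator,
// so a dashboard can show when an aggregator keeps receiving data but stops
// emitting aggregates.
type perAggregatorStats struct {
	in  uint64
	out uint64
}

func (s *perAggregatorStats) pointIn()  { atomic.AddUint64(&s.in, 1) }
func (s *perAggregatorStats) pointOut() { atomic.AddUint64(&s.out, 1) }

// snapshot returns the current counts for reporting.
func (s *perAggregatorStats) snapshot() (in, out uint64) {
	return atomic.LoadUint64(&s.in), atomic.LoadUint64(&s.out)
}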

Dieterbe commented Apr 30, 2019

I think I know what the problem is:

  • if an aggregator's ticker tries to send on its channel but Aggregator.run() is not ready to receive (e.g. the select happens to be in another clause, receiving an incoming point), the tick is dropped (see the sketch after this list)
  • normally, the occasional dropped tick is not a problem, because upon a subsequent tick that does make it through, all pending aggregations will be processed
  • however, as reported in #356 ("when aggregations run behind, the random limiter acquisitions cause data to be dropped"), due to random map iteration order we may process a more recent aggregation before an older one, leading to out-of-order data. In this case we should see "too old" in the metrictank snapshot, and we didn't; not sure why. I manually confirmed that if you send out-of-order carbon data, the new metric (which was recently renamed) increments fine and the dashboard shows it fine too... ??
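A simplified Go sketch of the interaction in the first bullet. The real aggregator code differs, but the shape is the point: whether the ticker does a non-blocking send or is a plain time.Ticker (whose channel holds at most one pending tick), ticks arriving while the select is busy in the point-receiving clause are dropped rather than queued:

package aggregator

import "time"

type point struct {
	key string
	val float64
	ts  uint32
}

type aggregator struct {
	in     chan point   // incoming raw points
	ticker *time.Ticker // fires every flush interval
}

// run is the single goroutine owning the aggregator's state. Under heavy load
// the first case wins almost every iteration; ticks that cannot be delivered
// in the meantime are dropped, so flushes keep getting postponed.
func (a *aggregator) run() {
	for {
		select {
		case p := <-a.in:
			a.add(p)
		case now := <-a.ticker.C:
			a.flush(now)
		}
	}
}

func (a *aggregator) add(p point)         {} // accumulate p into pending buckets (stub)
func (a *aggregator) flush(now time.Time) {} // emit aggregations that are due (stub)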

I have fixed the out-of-order processing and will also add a buffer, so that we queue up at least a handful of ticks before dropping any; that way people still see their aggregate data come in sooner rather than later (sketched below).
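A minimal sketch of those two remedies; the buffer size, map shape and names are assumptions, not the actual patch:

package aggregator

import "sort"

// a tick channel with a small buffer: a handful of ticks can queue up while the
// aggregator is busy receiving points, instead of being dropped outright.
var ticks = make(chan uint32, 10)

// flushInOrder emits all pending aggregation buckets oldest-first, so that
// flushing several buckets at once (e.g. after delayed ticks) can no longer
// produce out-of-order output due to random map iteration order.
func flushInOrder(pending map[uint32][]float64, emit func(ts uint32, vals []float64)) {
	tss := make([]uint32, 0, len(pending))
	for ts := range pending {
		tss = append(tss, ts)
	}
	sort.Slice(tss, func(i, j int) bool { return tss[i] < tss[j] })
	for _, ts := range tss {
		emit(ts, pending[ts])
		delete(pending, ts)
	}
}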

Dieterbe added a commit that referenced this issue Apr 30, 2019
previously, we too aggressively postponed flushing to subsequent ticks
(whenever we happened to be not ready to receive the current tick).
Now we always process flushes in a timely manner, except when the relay
is severely overloaded.
This means end users will see their aggregate data sooner.

See #241 (comment)
hnakamur added a commit to hnakamur/go-carbon-fakemetrics-docker-compose that referenced this issue Mar 3, 2020