once a certain amount of cpu is reached, [aggregator] output becomes unreliable #241

Open
Dieterbe opened this issue Dec 1, 2017 · 3 comments

Dieterbe commented Dec 1, 2017

A user reported that when they have many aggregators and CPU reaches about 60% (on each of 8 cores), data no longer comes through properly.

IMHO it's not correct, and not acceptable, for the relay to drop data when the CPU is not maxed out, and definitely not to do so non-transparently.

Dieterbe added a commit that referenced this issue Apr 23, 2019
it probably makes sense for similarly configured aggregators to
not all trigger their flushing at the same time, to keep things
flowing smoothly (to keep enough routines available for other tasks
like table dispatching etc. without queues forming) and to smooth
cpu utilisation over time. This may help with #241 but I'm not sure
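A minimal sketch of that staggering idea, assuming a plain time.Ticker per aggregator; the function name and shape are illustrative only, not the actual carbon-relay-ng change:

package aggregator

import (
	"math/rand"
	"time"
)

// startFlushTicker runs flush on every tick, but delays the first tick by a
// random fraction of the interval, so aggregators configured with the same
// interval do not all flush at the same instant (illustrative sketch only).
func startFlushTicker(interval time.Duration, flush func(t time.Time)) {
	go func() {
		time.Sleep(time.Duration(rand.Int63n(int64(interval)))) // random offset in [0, interval)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for t := range ticker.C {
			flush(t)
		}
	}()
}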
Dieterbe commented:

I think I was able to reproduce this today, even with a modest workload.

[init]
# init commands (DEPRECATED)
# see https://github.com/graphite-ng/carbon-relay-ng/blob/master/docs/config.md#Imperatives
cmds = [
'addAgg count regex=s(.*)\..* s$1.count.1 10 20 cache=false',
'addAgg count regex=so(.*)\..* so$1.count.2 10 20 cache=false',
'addAgg count regex=som(.*)\..* som$1.count.3 10 20 cache=false',
'addAgg count regex=some(.*)\..* some$1.count.4 10 20 cache=false',
'addAgg count regex=some.(.*)\..* some.$1.count.5 10 20 cache=false',
'addAgg count regex=some.i(.*)\..* some.i$1.count.6 10 20 cache=false',
'addAgg count regex=some.id(.*)\..* some.id$1.count.7 10 20 cache=false',
'addAgg count regex=some.id.(.*)\..* some.id.$1.count.8 10 20 cache=false',
'addAgg count regex=some.id.o(.*)\..* some.id.o$1.count.9 10 20 cache=false',
'addAgg count regex=some.id.of(.*)\..* some.id.of$1.count.10 10 20 cache=false',
'addAgg count regex=some.id.of.(.*)\..* some.id.of.$1.count.11 10 20 cache=false',
'addAgg count regex=some.id.of.a(.*)\..* some.id.of.a$1.count.12 10 20 cache=false',
'addAgg count regex=some.id.of.a.(.*)\..* some.id.of.a.$1.count.13 10 20 cache=false',
'addAgg count regex=some.id.of.a.m(.*)\..* some.id.of.a.m$1.count.14 10 20 cache=false',
'addAgg count regex=some.id.of.a.me(.*)\..* some.id.of.a.me$1.count.15 10 20 cache=false',
'addAgg count regex=(.*s.*.*o.*a.*m.*)[1-9]*.* $1.count.16 10 20 cache=false',
'addAgg count regex=(.*s.*.*o.*a.*m.*)[1-9]*.* $1.count.17 10 20 cache=false',
'addAgg count regex=(.*o.*.*o.*a.*m.*)[1-9]*.* $1.count.18 10 20 cache=false',
'addAgg count regex=(.*m.*.*o.*a.*m.*)[1-9]*.* $1.count.19 10 20 cache=false',
'addAgg count regex=(.*e.*.*o.*a.*m.*)[1-9]*.* $1.count.20 10 20 cache=false',
'addAgg count regex=(.*[1-9])*.* count-global.$1 10 20 cache=false',
'addRoute sendAllMatch carbon-default  localhost:2003 spool=false pickle=false'
]
fakemetrics feed --carbon-addr localhost:2002 --mpo 10000
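For a sense of where the CPU goes: with cache=false, my reading of this config is that every incoming metric key is matched against every aggregator's regex on every point, so 21 aggregators at 10000 metrics per interval is on the order of 200k regex evaluations per interval. A small standalone Go sketch of that per-point cost, using a few of the patterns from the config above:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// a few of the addAgg patterns from the config above (the full config has 21)
	patterns := []string{
		`s(.*)\..*`,
		`so(.*)\..*`,
		`some.id.of.a.me(.*)\..*`,
		`(.*[1-9])*.*`,
	}
	regexes := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		regexes = append(regexes, regexp.MustCompile(p))
	}

	// every incoming point pays this matching cost against every aggregator
	key := "some.id.of.a.metric.1"
	matches := 0
	for _, re := range regexes {
		if re.MatchString(key) {
			matches++
		}
	}
	fmt.Printf("%q matched %d of %d aggregator patterns\n", key, matches, len(regexes))
}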

(screenshot: cng-issue)

Generally the relay uses ~50% of an 8-core machine here.
Not sure yet why the output suddenly dropped:
https://snapshot.raintank.io/dashboard/snapshot/t1pXo75xZMgESUzFDN5d2IRtq6UsifdI?orgId=2
https://snapshot.raintank.io/dashboard/snapshot/zbXly5gbiiCDgwVfBqkV85asNlu2Bm05?orgId=2

Out of the possible effects:

  1. raw datapoints missing
  2. aggregate datapoints missing
  3. aggregate datapoints being incorrect (e.g. as if they didn't take all raw data into account for certain points)

This reproduced 2, but not 1 (querying a few subsets looks correct), nor 3 (they're all 10k or null).

I imagine I'll be able to repro 3 once I crank up CPU usage more.

Dieterbe commented:

I have a WIP branch which shows data incoming into each aggregator and also outgoing.
I can confirm that sometimes, for one or several aggregators, data comes in as usual but no points go out.
Digging deeper...
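For illustration, per-aggregator in/out counters along these lines would surface exactly that symptom (points coming in, nothing going out); the type and method names here are hypothetical, not the actual WIP branch:

package aggregator

import "sync/atomic"

// perAggregatorStats counts points flowing into and out of a single aggregator,
// so a dashboard can show when an aggregator keeps receiving data but stops
// emitting aggregates.
type perAggregatorStats struct {
	in  uint64
	out uint64
}

func (s *perAggregatorStats) pointIn()  { atomic.AddUint64(&s.in, 1) }
func (s *perAggregatorStats) pointOut() { atomic.AddUint64(&s.out, 1) }

// snapshot returns the current counts for reporting.
func (s *perAggregatorStats) snapshot() (in, out uint64) {
	return atomic.LoadUint64(&s.in), atomic.LoadUint64(&s.out)
}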

Dieterbe commented Apr 30, 2019

I think I know what the problem is:

  • if an aggregator's ticker tries to send on its channel but Aggregator.run() is not ready to receive (e.g. the select happens to be in another clause, receiving an incoming point), the tick is dropped (see the sketch after this list)
  • normally, the occasional dropped tick is not a problem, because upon a subsequent tick that does make it through, all pending aggregations will be processed
  • however, as reported in #356 ("when aggregations run behind, the random limiter acquisitions cause data to be dropped"), due to random map iteration order we may process a more recent aggregation before an older one, leading to out-of-order data. In this case we should see "too old" in the metrictank snapshot, and we didn't; not sure why. I manually confirmed that if you send out-of-order carbon data, the new metric (which was recently renamed) increments fine and the dashboard shows it fine too... ??
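A simplified Go sketch of the interaction in the first bullet. The real aggregator code differs, but the shape is the point: whether the ticker does a non-blocking send or is a plain time.Ticker (whose channel holds at most one pending tick), ticks arriving while the select is busy in the point-receiving clause are dropped rather than queued:

package aggregator

import "time"

type point struct {
	key string
	val float64
	ts  uint32
}

type aggregator struct {
	in     chan point   // incoming raw points
	ticker *time.Ticker // fires every flush interval
}

// run is the single goroutine owning the aggregator's state. Under heavy load
// the first case wins almost every iteration; ticks that cannot be delivered
// in the meantime are dropped, so flushes keep getting postponed.
func (a *aggregator) run() {
	for {
		select {
		case p := <-a.in:
			a.add(p)
		case now := <-a.ticker.C:
			a.flush(now)
		}
	}
}

func (a *aggregator) add(p point)         {} // accumulate p into pending buckets (stub)
func (a *aggregator) flush(now time.Time) {} // emit aggregations that are due (stub)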

I have fixed the out-of-order processing and will also add a buffer, so that we queue up at least a handful of ticks before dropping any; that way people still see their aggregate data come in sooner rather than later (sketched below).
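A minimal sketch of those two remedies; the buffer size, map shape and names are assumptions, not the actual patch:

package aggregator

import "sort"

// a tick channel with a small buffer: a handful of ticks can queue up while the
// aggregator is busy receiving points, instead of being dropped outright.
var ticks = make(chan uint32, 10)

// flushInOrder emits all pending aggregation buckets oldest-first, so that
// flushing several buckets at once (e.g. after delayed ticks) can no longer
// produce out-of-order output due to random map iteration order.
func flushInOrder(pending map[uint32][]float64, emit func(ts uint32, vals []float64)) {
	tss := make([]uint32, 0, len(pending))
	for ts := range pending {
		tss = append(tss, ts)
	}
	sort.Slice(tss, func(i, j int) bool { return tss[i] < tss[j] })
	for _, ts := range tss {
		emit(ts, pending[ts])
		delete(pending, ts)
	}
}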

Dieterbe added a commit that referenced this issue Apr 30, 2019
previously, we too aggressively postponed flushing to subsequent ticks
(whenever we happened to be not ready to receive the current tick).
Now we always process flushes in a timely manner, except when the relay
is severely overloaded.
This means end users will see their aggregate data sooner.

See #241 (comment)
hnakamur added a commit to hnakamur/go-carbon-fakemetrics-docker-compose that referenced this issue Mar 3, 2020