once a certain amount of cpu is reached, [aggregator] output becomes unreliable #241
It probably makes sense for similarly configured aggregators not to all trigger their flushing at the same time, both to keep things flowing smoothly (keeping enough goroutines available for other tasks like table dispatching, without queues forming) and to smooth cpu utilisation over time. This may help with #241, but I'm not sure.
Think I was able to reproduce this today, even with a modest workload.
Generally the relay uses ~50% of an 8-core machine here. Out of the possible effects: this reproduced 2, but not 1 (querying a few subsets looks correct), nor 3 (they're all 10k or null). I imagine I'll be able to repro 3 once I crank up cpu usage more.
I have a WIP branch which shows data incoming into each aggregator and also outgoing.
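Per-aggregator in/out instrumentation like that can be as simple as a pair of atomic counters: any gap between the two reveals silent drops under load. A minimal sketch with hypothetical names (`aggStats` is not from the WIP branch):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// aggStats counts points entering and leaving an aggregator; a
// persistent gap between in and out reveals silent drops under load.
// atomic.Int64 requires Go 1.19+.
type aggStats struct {
	in  atomic.Int64
	out atomic.Int64
}

func main() {
	var s aggStats
	for i := 0; i < 100; i++ {
		s.in.Add(1) // a point arrives at the aggregator
	}
	s.out.Add(97) // simulate 3 points lost under load
	fmt.Printf("in=%d out=%d lost=%d\n",
		s.in.Load(), s.out.Load(), s.in.Load()-s.out.Load())
	// prints: in=100 out=97 lost=3
}
```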
I think I know what the problem is:
I have fixed the out-of-order processing and will also add a buffer, so that we queue up at least a handful of ticks before dropping them, and people still see their aggregate data come in sooner rather than later.
Previously, we too aggressively postponed flushing to subsequent ticks (whenever we happened not to be ready to receive the current tick). Now we always process flushes timely, except when the relay is severely overloaded. This means end users will see their aggregate data quicker. See #241 (comment)
A user reported that when they have many aggregators and cpu reaches about 60% (on each of 8 cores), data no longer comes through properly.
IMHO it's not correct, and not acceptable, for the relay to drop data when cpu is not maxed out — and definitely not to do so non-transparently.