Skip to content
This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

MemoryReporter: make call to runtime.ReadMemStats time bound to avoid lost metrics #1494

Merged
merged 10 commits into from
Oct 16, 2019

Conversation

fkaleo
Copy link
Contributor

@fkaleo fkaleo commented Oct 16, 2019

MemoryReporter: make call to runtime.ReadMemStats time bound to avoid lost metrics. If the timeout is reached, use the previous result of the function.
Added a generic decorator to limit function execution time.

Inspired by prometheus/client_golang#568

Fixes #1207

Copy link
Contributor

@robert-milan robert-milan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fkaleo fkaleo merged commit 2715788 into master Oct 16, 2019
@fkaleo fkaleo deleted the memstats_timebound branch October 16, 2019 13:28
mem := runtime.MemStats{}
runtime.ReadMemStats(&mem)
return mem
}, 5*time.Second, 1*time.Minute)
Copy link
Contributor

@Dieterbe Dieterbe Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that all graphite/metrictank systems support down to 1 second resolution.
that's also what metrictank is configured to emit by default.
as such a 5s timeout seems too long for such a setup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more discussion we think it will be best to model the timeout after 65% of the set interval. However, we don't know if this is enough time for the rest of the reporters to complete their operations. What are your thoughts on the best way to proceed with fixing this issue? Also of note, once we update it here we will need to update it in a few other projects.

Another option is to launch them all at the same time and wait for results. Any reporter that doesn't return a result within the interval will not get reported for that tick, but all the others will.
We can also add a budget to the launched reporting function and it can decide if it would like to use caching or not. We feel this is the better option, but it does add a bit of overhead which takes away from the overall time allocated for that tick. It still seems best. Thoughts?

@Dieterbe
Copy link
Contributor

Any reporter that doesn't return a result within the interval will not get reported for that tick

So if an operation is consistently slow we will never report data for it ?

I don't know what you mean with budget, but I'm sure you guys can easily implement a decent solution for this. We don't have to debate minutiae

@fkaleo
Copy link
Contributor Author

fkaleo commented Oct 19, 2019

Any reporter that doesn't return a result within the interval will not get reported for that tick

So if an operation is consistently slow we will never report data for it ?

I don't know what you mean with budget, but I'm sure you guys can easily implement a decent solution for this. We don't have to debate minutiae

Actually, I think we are spending way too much time on this given that the fix for readmemstats stopped by GC will be in Go 1.14. I suggest we do nothing more than what was done unless we have a big customer who uses a 1 second resolution timer.

@Dieterbe
Copy link
Contributor

sounds good

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GC resulting in missed stats
3 participants