-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NOPS-1034 added threshold handling for asPercent and divideSeries #165
Conversation
if(result.target && asPercentRegex.test(result.target) || result.target && divideSeriesRegex.test(result.target)){ | ||
const fetchCountPerTimeUnit = result.datapoints.map(item => Number(item[0])); | ||
const isFailing = this.direction === 'above' ? | ||
Number(fetchCountPerTimeUnit[0]) > this.threshold : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: I think I need some more context here in understanding why we're only using the first element of fetchCountPerTimeUnit to determine whether it is failing. Is it because we can be sure that there is only 1 datapoint in the array for divideSeries or asPercent, and that datapoint is the result of the query? From what I understand, if there is more than 1 datapoint in the array, the api does not return the result of the query so the datapoints look like
"datapoints": [[1.0, 1635125460], [null, 1635129060], ...]
so I would have thought that we would need to do the calculations ourselves here, similar to how summarize(sumSeries) is done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, I just saw your tests and it answered my question - it is indeed 1 item in the datapoints array
datapoints: [[ 8, 1635125640]]
suggestion: as there's only 1 item in the datapoints array, perhaps we can just change line 54 to
const ratioOfErrorsToSuccesses = Number(result.datapoints[0][0])
to make it more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a slightly different approach. All the health checks I've seen have only one datapoint, but you are correct that this may not always be the case - I added a log to let us know if that's not the case (so we can iterate as needed).
}); | ||
}); | ||
|
||
it('Should be unhealthy if below threshold', function (done) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: would it make sense for it to be healthy (instead of unhealthy) below the threshold? In my mind, if we get the % of errors against the successes, a low value suggests that there's a low failure rate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tests checks that the direction: 'below'
works. I doubt the actual functionality would be used because this functionality is generally used for rater of errors to successes
. But sometimes you may want to track ratios of other events.
Hey Ania, this looks good to me! I think we should merge this PR in because it helps identify the ratio of errors to successes and therefore is an improvement on what was there before. For option 2, I'd like to understand why Graphite is considered a weak point |
NOPS-1034 added threshold handling for asPercent and divideSeries
Companion to #164
Graphite datapoints are values stored with their timestamp. But, if there are no events during that timestamp, the value will be
null
and not0
, ie denoting explicit lack of a value. (https://graphite.readthedocs.io/en/latest/terminology.html#graphite-terminology). So in some casesnull
is an expected - and even desirable value. For example if we're checking theerror.count
then0
is ideal as that means no errors are occurring.Things get a little bit interesting when we look at
asPercent
anddivideSeries
. Those two graphite functions are used in the same way - we take asum of errors in _time_
and asum of successes in _time_
and we either express the result as a percentage or division result. I think the desired metrics is the ratio of errors to successes; I'm not quite sure eitherasPercent
ordivideSeries
reflects that, but it's good enough to be used as a valid metric. Anyhow, what this means is practice is most of the timesum of successes = _positive integer_
andsum of errors = null (ie 0)
. Butzero divided by positive integer
is stillzero
(for divideSeries), andzero times 100, then divided by positive integer
is alsozero
(for asPercent).What this PR does
asPercent
anddivideSeries
sumSeries
because I got very confusedWhat this PR does not do
My initial though would be - once we checked that the result of
asPercent
ordivideSeries
wasnull
, we'd run afetch
against the second metric of the check and fail if that'snull
too. I totally forgot you can't doawait
inside amap
. As such we have two options: 1️⃣ merge this PR and call it done, given that's an improvement on what was there before; and 2️⃣ if we are really concerned about the possibility of Graphite being the weak point, we should go through the repos that are using those checks and add an-health
graphiteWorking
check.When reviewing this PR please state your preference for 1️⃣ vs 2️⃣ vs any other thought you might have on this matter.