Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ruler: add more details to failures for TestRulerEvaluationDelay #7217

Merged

Conversation

dimitarvdimitrov
Copy link
Contributor

Related to #4857

This test is flaky and it's always flaky with the same failure - the first sample's value is 4 instead of 2 here:

assert.Equal(t, expectedValue, v.Value)

My hypothesis is that when now.Add(-evaluationDelay) is done the time range [now.Add(-evaluationDelay), now] may becomes so very slightly smaller than evaluationDelay (5m) when taken with the 1s step. This results in sending a query which doesn't quite include the 5m of samples, but 4.99999(9)m of samples, which excludes the first sample with value 2 and returns 4 instead.

I couldn't get the test to fail locally, so I'm changing the assertion type to not abort the test on the first mismatch so that we can investigate the all returned samples. Also some debug logs to verify this hypothesis with the step alignment.

Related to XXX

This test is flaky and it's always flaky with the same failure - the first sample's value is 4 instead of 2. My hypothesis is that when `now.Add(-evaluationDelay)` is done, then the time range `[now.Add(-evaluationDelay), now]` might become so very slightly smaller than `evaluationDelay` when taken with the `1s` step. This results in sending a query which doesn't quite include the `5m` of samples, but `4.99999(9)m` of samples, which excludes the first sample with value `2`

I couldn't get the test to fail locally, so I'm changing the assertion type to not abort the test on the first mismatch so that we can investigate the all returned samples. Also some debug logs to verify this hypothesis with the step alignment.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@@ -293,6 +293,7 @@ func TestRulerEvaluationDelay(t *testing.T) {
require.NoError(t, mimir.WaitSumMetrics(e2e.Greater(ruleEvaluationsAfterPush[0]+float64(samplesToSend)), "cortex_prometheus_rule_evaluations_total"))

// query all results to verify rules have been evaluated correctly
t.Log("querying from ", now.Add(-evaluationDelay), "to", now)
Copy link
Contributor

@gotjosh gotjosh Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach to this would be to add another assertion on the metric cortex_prometheus_last_evaluation_samples instead of cortex_prometheus_rule_evaluations_total.

This metric is incremented before the samples are pushed.

This is not the case for last_evaluation_samples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that the problem is that we ignore the sample even though it was pushed.

Pushed:

2, 4, 6, 8, 10, 12, 14, 16, 18, 20

Queried:

4, 6, 8, 10, 12, 14, 16, 18, 20

The failure happens because the first sample we query is 4 and not 2. TO me that fact that 4 exists means that 2 must exist as well, we just don't get it in the query results.

@dimitarvdimitrov dimitarvdimitrov merged commit f941f47 into main Jan 26, 2024
28 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/ruler/TestRulerEvaluationDelay-more-failures branch January 26, 2024 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants