ruler: add more details to failures for TestRulerEvaluationDelay #7217

dimitarvdimitrov · 2024-01-25T11:41:30Z

Related to #4857

This test is flaky and it's always flaky with the same failure - the first sample's value is 4 instead of 2 here:

Line 314 in 9313dfc

assert.Equal(t, expectedValue, v.Value)

My hypothesis is that when now.Add(-evaluationDelay) is done the time range [now.Add(-evaluationDelay), now] may becomes so very slightly smaller than evaluationDelay (5m) when taken with the 1s step. This results in sending a query which doesn't quite include the 5m of samples, but 4.99999(9)m of samples, which excludes the first sample with value 2 and returns 4 instead.

I couldn't get the test to fail locally, so I'm changing the assertion type to not abort the test on the first mismatch so that we can investigate the all returned samples. Also some debug logs to verify this hypothesis with the step alignment.

Related to XXX This test is flaky and it's always flaky with the same failure - the first sample's value is 4 instead of 2. My hypothesis is that when `now.Add(-evaluationDelay)` is done, then the time range `[now.Add(-evaluationDelay), now]` might become so very slightly smaller than `evaluationDelay` when taken with the `1s` step. This results in sending a query which doesn't quite include the `5m` of samples, but `4.99999(9)m` of samples, which excludes the first sample with value `2` I couldn't get the test to fail locally, so I'm changing the assertion type to not abort the test on the first mismatch so that we can investigate the all returned samples. Also some debug logs to verify this hypothesis with the step alignment. Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

gotjosh · 2024-01-25T12:32:28Z

integration/ruler_test.go

@@ -293,6 +293,7 @@ func TestRulerEvaluationDelay(t *testing.T) {
 	require.NoError(t, mimir.WaitSumMetrics(e2e.Greater(ruleEvaluationsAfterPush[0]+float64(samplesToSend)), "cortex_prometheus_rule_evaluations_total"))

 	// query all results to verify rules have been evaluated correctly
+	t.Log("querying from ", now.Add(-evaluationDelay), "to", now)


Another approach to this would be to add another assertion on the metric cortex_prometheus_last_evaluation_samples instead of cortex_prometheus_rule_evaluations_total.

This metric is incremented before the samples are pushed.

This is not the case for last_evaluation_samples.

I suspect that the problem is that we ignore the sample even though it was pushed.

Pushed:

2, 4, 6, 8, 10, 12, 14, 16, 18, 20

Queried:

4, 6, 8, 10, 12, 14, 16, 18, 20

The failure happens because the first sample we query is 4 and not 2. TO me that fact that 4 exists means that 2 must exist as well, we just don't get it in the query results.

dimitarvdimitrov requested a review from a team as a code owner January 25, 2024 11:41

dimitarvdimitrov mentioned this pull request Jan 25, 2024

Flaky test TestRulerEvaluationDelay #4857

Open

gotjosh approved these changes Jan 25, 2024

View reviewed changes

gotjosh reviewed Jan 25, 2024

View reviewed changes

dimitarvdimitrov merged commit f941f47 into main Jan 26, 2024
28 checks passed

dimitarvdimitrov deleted the dimitar/ruler/TestRulerEvaluationDelay-more-failures branch January 26, 2024 18:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ruler: add more details to failures for TestRulerEvaluationDelay #7217

ruler: add more details to failures for TestRulerEvaluationDelay #7217

dimitarvdimitrov commented Jan 25, 2024

gotjosh Jan 25, 2024 •

edited

Loading

dimitarvdimitrov Jan 25, 2024

ruler: add more details to failures for TestRulerEvaluationDelay #7217

ruler: add more details to failures for TestRulerEvaluationDelay #7217

Conversation

dimitarvdimitrov commented Jan 25, 2024

gotjosh Jan 25, 2024 • edited Loading

Choose a reason for hiding this comment

dimitarvdimitrov Jan 25, 2024

Choose a reason for hiding this comment

gotjosh Jan 25, 2024 •

edited

Loading