
Adopt common timeout/retry/backoff strategy for system tests. #1619

Closed
tseaver opened this issue Mar 16, 2016 · 10 comments

tseaver commented Mar 16, 2016

From #1609 (comment)

We have a number of sporadic system test failures (#1085, #1089, #1100, #1104) due to "eventual consistency" in the tested APIs. We have adopted a set of ad-hoc (and obviously not 100% reliable) strategies for working around the delays involved.

The point of this issue is to adopt a consistent policy / mechanism across all system tests for addressing delays caused by eventual consistency.

@jonparrott can you comment here, collecting your thoughts expressed elsewhere?

theacodes commented

In python-docs-samples we're using a combination of the flaky package and not running flaky tests on Travis (we have a Jenkins instance for those).
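
For reference, this is roughly what that pattern looks like (a minimal sketch; the test name and the TRAVIS environment check are illustrative, not lifted from python-docs-samples):

import os
import unittest

from flaky import flaky


class EventualConsistencyTests(unittest.TestCase):

    @unittest.skipIf(os.environ.get('TRAVIS') == 'true',
                     'Eventually-consistent test: run on Jenkins, not Travis.')
    @flaky(max_runs=3, min_passes=1)
    def test_list_resources(self):
        # flaky re-runs this test (up to three attempts) when the
        # assertion fails before the backend has caught up.
        ...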

tseaver commented Mar 16, 2016

#1391 is related.

dhermes commented Mar 16, 2016

I have an old branch around from #535 when I made the first stab at this.

tseaver commented Mar 16, 2016

Maybe one of the following libraries would be useful:

daspecster commented

So now, in the system tests path we have retry.py:

from retry import Retry

# Retry method_to_retry up to 3 times, sleeping 30 seconds between
# attempts, whenever SomeException is raised.
@Retry(SomeException, tries=3, delay=30)
def method_to_retry():
    ...
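
For context, a decorator along these lines fits in a few dozen lines; the following is only an illustrative sketch of the idea, not the actual contents of retry.py:

import functools
import time


class Retry(object):
    """Retry the wrapped callable when ``exception`` is raised.

    ``tries`` attempts are made in total, sleeping ``delay`` seconds
    between attempts (multiplied by ``backoff`` each time).
    """

    def __init__(self, exception, tries=4, delay=1, backoff=2):
        self.exception = exception
        self.tries = tries
        self.delay = delay
        self.backoff = backoff

    def __call__(self, to_wrap):
        @functools.wraps(to_wrap)
        def wrapped(*args, **kwargs):
            tries_remaining = self.tries
            delay = self.delay
            while tries_remaining > 1:
                try:
                    return to_wrap(*args, **kwargs)
                except self.exception:
                    time.sleep(delay)
                    delay *= self.backoff
                    tries_remaining -= 1
            # Final attempt: let the exception propagate.
            return to_wrap(*args, **kwargs)
        return wrapped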

tseaver commented Aug 3, 2016

I added classes to handle two other cases in #2050:

  • Retry until method result passes a predicate.
  • Retry until method's __self__ passes a predicate.
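
Roughly, the result-predicate case looks like the following; the class name and usage are illustrative, not necessarily what #2050 landed:

import functools
import time


class RetryResult(object):
    """Retry the wrapped callable until ``predicate(result)`` is true."""

    def __init__(self, predicate, tries=4, delay=1, backoff=2):
        self.predicate = predicate
        self.tries = tries
        self.delay = delay
        self.backoff = backoff

    def __call__(self, to_wrap):
        @functools.wraps(to_wrap)
        def wrapped(*args, **kwargs):
            delay = self.delay
            for attempt in range(self.tries):
                result = to_wrap(*args, **kwargs)
                if self.predicate(result):
                    return result
                time.sleep(delay)
                delay *= self.backoff
            raise AssertionError('Predicate never matched after retries.')
        return wrapped


# Example: keep calling bucket.exists until it returns True.
#   RetryResult(lambda found: found, tries=5, delay=2)(bucket.exists)()
#
# The __self__ variant is the same idea, except the predicate is applied
# to to_wrap.__self__ (the bound instance) after each call instead of to
# the return value.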

dhermes commented Aug 9, 2016

@tseaver What about this issue? And the novel failures?

tseaver commented Aug 9, 2016

I think we should create separate issues for the failures we still see, and resolve them one by one as we add backoff/retry for them. E.g., today's BQ failure should be its own issue, rather than reopening #1104. Likewise, today's Happybase failure should be a separate issue.

dhermes commented Aug 9, 2016

@tseaver 500, 502, and 503 are just unavoidable for a general service; gcloud-node simply retries them by default.
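
A hedged sketch of that kind of status-code-based retry, using requests purely for illustration (not the transport this library actually uses):

import time

import requests

RETRYABLE_STATUSES = frozenset([500, 502, 503])


def get_with_retries(url, max_attempts=5, delay=1.0):
    """Retry GETs whose responses carry a transient 5xx status."""
    for attempt in range(1, max_attempts + 1):
        response = requests.get(url)
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        if attempt == max_attempts:
            # Out of attempts: surface the 5xx as an HTTPError.
            response.raise_for_status()
        time.sleep(delay)
        delay *= 2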

tseaver commented Aug 9, 2016

@dhermes #2075 tracks the BQ failure. #2076 tracks the Happybase one. I assigned the first to me, and the second to you, somewhat arbitrarily.
