
Adopt common timeout/retry/backoff strategy for system tests. #1619

Closed
tseaver opened this issue Mar 16, 2016 · 10 comments

tseaver commented Mar 16, 2016

From #1609 (comment)

We have a number of sporadic system test failures (#1085, #1089, #1100, #1104) due to "eventual consistency" in the tested APIs. We have adopted a set of ad-hoc (and obviously not 100% reliable) strategies for working around the delays involved.

The point of this issue is to adopt a consistent policy / mechanism across all system tests for addressing delays caused by eventual consistency.

@jonparrott can you comment here, collecting your thoughts expressed elsewhere?

theacodes commented

In python-docs-samples we're using a combination of the flaky package and not running flaky tests on Travis (we have a Jenkins instance for those).
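
For reference, this is roughly what that pattern looks like (a minimal sketch; the test name and the TRAVIS environment check are illustrative, not lifted from python-docs-samples):

import os
import unittest

from flaky import flaky


class EventualConsistencyTests(unittest.TestCase):

    @unittest.skipIf(os.environ.get('TRAVIS') == 'true',
                     'Eventually-consistent test: run on Jenkins, not Travis.')
    @flaky(max_runs=3, min_passes=1)
    def test_list_resources(self):
        # flaky re-runs this test (up to three attempts) when the
        # assertion fails before the backend has caught up.
        ...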

tseaver commented Mar 16, 2016

#1391 is related.

dhermes commented Mar 16, 2016

I have an old branch around from #535 when I made the first stab at this.

tseaver commented Mar 16, 2016

Maybe one of the following libraries would be useful:

daspecster commented

So now, in the system tests path we have retry.py:

from retry import Retry

# Retry method_to_retry up to 3 times, sleeping 30 seconds between
# attempts, whenever SomeException is raised.
@Retry(SomeException, tries=3, delay=30)
def method_to_retry():
    ...
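
For context, a decorator along these lines fits in a few dozen lines; the following is only an illustrative sketch of the idea, not the actual contents of retry.py:

import functools
import time


class Retry(object):
    """Retry the wrapped callable when ``exception`` is raised.

    ``tries`` attempts are made in total, sleeping ``delay`` seconds
    between attempts (multiplied by ``backoff`` each time).
    """

    def __init__(self, exception, tries=4, delay=1, backoff=2):
        self.exception = exception
        self.tries = tries
        self.delay = delay
        self.backoff = backoff

    def __call__(self, to_wrap):
        @functools.wraps(to_wrap)
        def wrapped(*args, **kwargs):
            tries_remaining = self.tries
            delay = self.delay
            while tries_remaining > 1:
                try:
                    return to_wrap(*args, **kwargs)
                except self.exception:
                    time.sleep(delay)
                    delay *= self.backoff
                    tries_remaining -= 1
            # Final attempt: let the exception propagate.
            return to_wrap(*args, **kwargs)
        return wrapped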

tseaver commented Aug 3, 2016

I added classes to handle two other cases in #2050:

  • Retry until method result passes a predicate.
  • Retry until method's __self__ passes a predicate.
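
Roughly, the result-predicate case looks like the following; the class name and usage are illustrative, not necessarily what #2050 landed:

import functools
import time


class RetryResult(object):
    """Retry the wrapped callable until ``predicate(result)`` is true."""

    def __init__(self, predicate, tries=4, delay=1, backoff=2):
        self.predicate = predicate
        self.tries = tries
        self.delay = delay
        self.backoff = backoff

    def __call__(self, to_wrap):
        @functools.wraps(to_wrap)
        def wrapped(*args, **kwargs):
            delay = self.delay
            for attempt in range(self.tries):
                result = to_wrap(*args, **kwargs)
                if self.predicate(result):
                    return result
                time.sleep(delay)
                delay *= self.backoff
            raise AssertionError('Predicate never matched after retries.')
        return wrapped


# Example: keep calling bucket.exists until it returns True.
#   RetryResult(lambda found: found, tries=5, delay=2)(bucket.exists)()
#
# The __self__ variant is the same idea, except the predicate is applied
# to to_wrap.__self__ (the bound instance) after each call instead of to
# the return value.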

dhermes commented Aug 9, 2016

@tseaver What about this issue? And the novel failures?

tseaver commented Aug 9, 2016

I think we should create separate issues for the failures we still see, and resolve them one by one as we add backoff/retry for them. E.g., today's BQ failure should be its own issue, rather than reopening #1104. Likewise, today's Happybase failure should be a separate issue.

dhermes commented Aug 9, 2016

@tseaver 500, 502, and 503 are just unavoidable for a general service; gcloud-node simply retries them by default.
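
A hedged sketch of that kind of status-code-based retry, using requests purely for illustration (not the transport this library actually uses):

import time

import requests

RETRYABLE_STATUSES = frozenset([500, 502, 503])


def get_with_retries(url, max_attempts=5, delay=1.0):
    """Retry GETs whose responses carry a transient 5xx status."""
    for attempt in range(1, max_attempts + 1):
        response = requests.get(url)
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        if attempt == max_attempts:
            # Out of attempts: surface the 5xx as an HTTPError.
            response.raise_for_status()
        time.sleep(delay)
        delay *= 2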

tseaver commented Aug 9, 2016

@dhermes #2075 tracks the BQ failure. #2076 tracks the Happybase one. I assigned the first to me, and the second to you, somewhat arbitrarily.
