feat: add retries & update ping/go interop test #32

Merged: 12 commits, Aug 29, 2022

Conversation

@laurentsenta (Collaborator) commented Aug 23, 2022:

  • Add a retry to the build step (see the sketch after this list)
    • Note: we use a few lines of shell instead of adding a new external dependency.
  • Add a status check on Testground setup
  • Reduce the run timeout
    • The test succeeds in ~2 min at the moment; with this timeout we lose at worst 4 min when we detect a breaking change that hangs the test.
  • Update ping/go/compat and add the corresponding go.v0.22.mod file to fix the test.
  • Reorder the go.toml and rust.toml resources
  • Add a trigger on push & PR
    • This checks the interop tests on pull requests.
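
For reference, the retry is just a small shell loop in the workflow rather than an external action. A minimal sketch of the pattern, assuming a hypothetical composition path and a fixed attempt count (the real workflow takes the composition file as an input):

# Retry the testground build up to 3 times; the sleep gives flaky remote
# services (e.g. a package mirror returning 404s) a chance to recover.
# The composition path below is illustrative only.
for attempt in 1 2 3; do
  echo "build attempt ${attempt}"
  testground build composition -f compositions/ping-interop.toml --wait && exit 0
  sleep 10
done
echo "build failed after 3 attempts" >&2
exit 1

This keeps the dependency surface small: no marketplace retry action, just plain shell that runs anywhere.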

@laurentsenta force-pushed the feat/interop-retries branch 3 times, most recently from d657743 to 1e6905b on August 23, 2022, 12:21
@laurentsenta (Collaborator, Author) commented:

@marten-seemann @mxinden Ready for review.

In the results I shared above, the largest test takes more than 30 min on failure; that's the one where we build and test 13 versions together.

I would like to:

  • enable the cross-versions test on PRs for go-libp2p (again) and for rust-libp2p.
  • enable the "small" cross-implementation tests on PR.
    • this one tests go master + rust master + the current branch

We could run the "large" cross-implementation tests nightly.

Next I'll work on stability, Testground releases, and caching to get all of these runs below 10 min.

@laurentsenta (Collaborator, Author) commented Aug 25, 2022:

I ran 200 tests tonight and got 2 errors:

  • still a 404 during a package download
  • an EOF during testground setup

I added a few more temporary tweaks.

I'll re-run 400 tests (100 for each configuration), which should take ~20 hours. I hope to get below a 1% error rate for all the tests except "all interop".

@marten-seemann (Contributor) commented:

I ran 200 tests tonight and got 2 errors:

That seems very high. That's a 1% error rate on 3 retries, which would suggest that every single run has a ~22% probability of failing (0.22^3 ≈ 1%). Do we have any idea why that is?

@laurentsenta (Collaborator, Author) commented Aug 25, 2022:

I ran 200 tests tonight and got 2 errors:

That seems very high. That's a 1% error rate on 3 retries, which would suggest that every single run has a ~22% probability of failing (0.22^3 ≈ 1%). Do we have any idea why that is?

My first intuition is that this simplification is incorrect:

  • we didn't retry the testground setup,
  • 404 errors during a package download are not independent events (probabilistically speaking).

We observed 0.5% of workflows failing during the testground setup, so I added a (temporary) retry on the testground install (rough sketch below).

We observed 0.5% of workflows failing during the build because goproxy.io started throwing 404 errors for a package. The build step failed 3 times in a row, but the probability of getting a 404 on attempts 2 and 3 is much, much higher once you got a 404 on attempt 1.
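
For reference, the temporary retry on the testground install follows the same pattern. This is a rough sketch only; install_testground.sh is a hypothetical stand-in for the actual setup step, which may be implemented differently in the workflow:

# Hypothetical retry around the testground setup: give up after 3 failed
# attempts, sleeping between tries so transient errors can clear.
attempt=0
until ./install_testground.sh; do
  attempt=$((attempt + 1))
  if [ "${attempt}" -ge 3 ]; then
    echo "testground setup failed after ${attempt} attempts" >&2
    exit 1
  fi
  sleep 10
done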

@marten-seemann (Contributor) commented:

Why are we getting 404s in the first place? That's different from a connection timeout, and seems to indicate a more fundamental problem.

@laurentsenta (Collaborator, Author) commented:

@marten-seemann To me, that's a case of our proxy (proxy.golang.org btw, not goproxy.io) not having a 100% uptime SLA, but you might have more insight on this.

Here's the run:
https://github.com/laurentsenta/test-plans/runs/8000694045?check_suite_focus=true

go: github.com/yuin/goldmark@v1.4.1: reading https://proxy.golang.org/github.com/yuin/goldmark/@v/v1.4.1.info: 404 Not Found
	server response: not found: temporarily unavailable

@marten-seemann (Contributor) commented:

That's annoying; in this case this shouldn't even be a 4xx error, but rather a 5xx error.
I'm still wondering why this occurs so frequently. We have other CI builds that download a lot of dependencies (every run of go-libp2p, kubo, lotus, etc. will), and they don't seem to be plagued by those issues.

@laurentsenta (Collaborator, Author) commented Aug 26, 2022:

@marten-seemann Agreed, we might have caught a very unlikely error with the goproxy 404.

Results for the tests I ran yesterday:

  • go versions interop: 100 / 100 successes
  • rust versions interop: 100 / 100 successes
  • rust - go interop (only masters): 100 / 100 successes
  • rust - go interop (all tests): 94 / 100 successes
    • Reason: I added a 25 min build timeout as a safeguard (sketched below), and it was reached 6 times (see actions).
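
The 25 min safeguard is just an upper bound on the build step. A sketch of the idea, assuming the GNU coreutils timeout wrapper; the actual workflow may instead use a step-level timeout, and the composition path is again illustrative:

# Fail fast if the build hangs: abort after 25 minutes instead of
# blocking the workflow until a much later job-level timeout.
timeout 25m testground build composition -f compositions/ping-interop.toml --wait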

@mxinden (Member) left a comment:

This looks good to me. Not sure my review should have much weight for the Go-related changes.

testground build composition \
  -f ${{ inputs.composition_file }} \
  --wait && exit 0;
sleep 10
@mxinden (Member) commented on this code:

Why do we need to sleep in between retries?

@laurentsenta (Collaborator, Author) replied:

The retry might happen almost instantaneously because of Docker caching. The goal here is to wait some time to give remote services a chance to recover (like the 404 above).

I'll merge as is, but there will be at least one follow-up PR for improvements, where we can remove this sleep if we don't want it.

@laurentsenta (Collaborator, Author) commented:

@mxinden I updated with v0.47
