feat: add retries & update ping/go interop test #32
Conversation
Force-pushed from d657743 to 1e6905b
Force-pushed from 1e6905b to ca68ffc
@marten-seemann @mxinden Ready for review. Based on the results I shared above, I would like to run the "large" cross-implementation tests nightly. Next I'll work on stability, Testground releases, and caching to get all of these stats below 10 minutes.
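A nightly run could be wired up with a scheduled trigger. A minimal sketch, assuming a GitHub Actions workflow; the cron time and the manual-trigger fallback are placeholders, not necessarily what this repo uses:

```yaml
# Hypothetical trigger block for the "large" interop workflow.
# The 03:00 UTC schedule is an arbitrary placeholder.
on:
  schedule:
    - cron: "0 3 * * *"  # once per day
  workflow_dispatch: {}  # keep a manual trigger for debugging
```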
I ran 200 tests tonight and got 2 errors:
I added a few more temporary tweaks. I'll re-run 400 tests (100 for each configuration), which should take ~20 hours; I hope to get below a 1% error rate for all the tests except "all interop".
That seems very high. That's a 1% error rate with 3 retries, which would suggest that every single run has a ~22% probability of failing. Do we have any idea why that is?
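The 22% figure presumably comes from treating the three attempts as independent, each failing with the same probability $p$:

$$p^3 = 0.01 \quad\Rightarrow\quad p = 0.01^{1/3} \approx 0.215 \approx 22\%$$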
My first intuition is that this simplification is incorrect:
We observed 0.5% failed workflows with the Testground setup, so I added a (temporary) retry on the Testground install. We observed another 0.5% failed workflows during the build because goproxy.io started returning 404 errors for a package. The build step failed 3 times in a row, but the probability of getting a 404 on attempts 2 and 3 is much, MUCH higher once you've already gotten a 404 on attempt 1.
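A sketch of what that temporary install retry could look like as a workflow step; the step name, the `make install` placeholder, and the attempt/sleep values are illustrative, not the exact code in this PR:

```yaml
# Hypothetical retry wrapper around the Testground install step.
- name: Install Testground (temporary retry)
  run: |
    for attempt in 1 2 3; do
      make install && exit 0  # placeholder for the actual install command
      echo "install attempt ${attempt} failed, retrying"
      sleep 10
    done
    exit 1  # every attempt failed
```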
Why are we getting 404s in the first place? That's different from a connection timeout, and seems to indicate a more fundamental problem.
@marten-seemann To me that's a case of our proxy (proxy.golang.org btw, not goproxy.io) not having a 100% uptime SLA, but you might have more insight on this. Here's the run:
That's annoying. In that case it shouldn't even be a 4xx error, but rather a 5xx error.
@marten-seemann Agreed, we might have caught a very unlikely error with the goproxy 404. Results for the tests I ran yesterday:
This looks good to me. I'm not sure my review should carry much weight for the Go-related changes, though.
```sh
testground build composition \
  -f ${{ inputs.composition_file }} \
  --wait && exit 0;
sleep 10
```
Why do we need to sleep in between retries?
The retry might otherwise happen almost instantaneously because of Docker caching. The goal here is to wait a bit and give remote services a chance to recover (like the 404 above).
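For context, a sketch of how the excerpt above might sit inside a bounded retry loop; only the inner command and the sleep come from the excerpt, and the attempt count is illustrative:

```yaml
# Hypothetical full shape of the retried step.
- name: Build composition (with retries)
  run: |
    for attempt in 1 2 3; do
      testground build composition \
        -f ${{ inputs.composition_file }} \
        --wait && exit 0;
      sleep 10  # give remote services (e.g. the Go module proxy) time to recover
    done
    exit 1  # all attempts failed
```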
I'll merge as is, but there will be at least one follow-up PR for improvements, where we can remove this sleep if we don't want it.
@mxinden I updated `ping/go/compat` with v0.47 and added the corresponding `go.v0.22.mod` file to fix the test, along with the `go.toml` and `rust.toml` resources.