feat: add retries & update ping/go interop test #32

Merged: 12 commits, Aug 29, 2022

Conversation

@laurentsenta (Collaborator) commented Aug 23, 2022:

  • Add a retry to the build step (see the sketch after this list)
    • Note: we use a few lines of shell instead of adding a new external dependency.
  • Add a status check on Testground setup
  • Reduce the run timeout
    • The test succeeds in ~2 min at the moment; with this timeout we lose at worst 4 min when we detect a breaking change that hangs the test.
  • Update ping/go/compat and add the corresponding go.v0.22.mod file to fix the test.
  • Reorder the go.toml and rust.toml resources
  • Add a trigger on push & PR
    • This checks the interop tests on pull requests.
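
For reference, the retry is just a small shell loop in the workflow rather than an external action. A minimal sketch of the pattern, assuming a hypothetical composition path and a fixed attempt count (the real workflow takes the composition file as an input):

# Retry the testground build up to 3 times; the sleep gives flaky remote
# services (e.g. a package mirror returning 404s) a chance to recover.
# The composition path below is illustrative only.
for attempt in 1 2 3; do
  echo "build attempt ${attempt}"
  testground build composition -f compositions/ping-interop.toml --wait && exit 0
  sleep 10
done
echo "build failed after 3 attempts" >&2
exit 1

This keeps the dependency surface small: no marketplace retry action, just plain shell that runs anywhere.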

@laurentsenta force-pushed the feat/interop-retries branch 3 times, most recently from d657743 to 1e6905b on August 23, 2022, 12:21
@laurentsenta (Collaborator, Author) commented:

@marten-seemann @mxinden Ready for review.

In the results I shared above, the largest test takes more than 30 min on failure; that's the one where we build and test 13 versions together.

I would like to:

  • enable the cross-versions test on PRs for go-libp2p (again) and for rust-libp2p.
  • enable the "small" cross-implementation tests on PR.
    • this one tests go master + rust master + the current branch

We could run the "large" cross-implementation tests nightly.

Next I'll work on stability, Testground releases, and caching to get all of these runs below 10 min.

@laurentsenta (Collaborator, Author) commented Aug 25, 2022:

I ran 200 tests tonight and got 2 errors:

  • still a 404 during a package download
  • an EOF during testground setup

I added a few more temporary tweaks.

I'll re-run 400 tests (100 for each configuration), which should take ~20 hours. I hope to get below a 1% error rate for all the tests except "all interop".

@marten-seemann (Contributor) commented:

I ran 200 tests tonight and got 2 errors:

That seems very high. That's a 1% error rate on 3 retries, which would suggest that every single run has a ~22% probability of failing (0.22^3 ≈ 1%). Do we have any idea why that is?

@laurentsenta (Collaborator, Author) commented Aug 25, 2022:

I ran 200 tests tonight and got 2 errors:

That seems very high. That's a 1% error rate on 3 retries, which would suggest that every single run has a ~22% probability of failing (0.22^3 ≈ 1%). Do we have any idea why that is?

My first intuition is that this simplification is incorrect:

  • we didn't retry the testground setup,
  • 404 errors during a package download are not independent events (probabilistically speaking).

We observed 0.5% of workflows failing during the testground setup, so I added a (temporary) retry on the testground install (rough sketch below).

We observed 0.5% of workflows failing during the build because goproxy.io started throwing 404 errors for a package. The build step failed 3 times in a row, but the probability of getting a 404 on attempts 2 and 3 is much, much higher once you got a 404 on attempt 1.
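
For reference, the temporary retry on the testground install follows the same pattern. This is a rough sketch only; install_testground.sh is a hypothetical stand-in for the actual setup step, which may be implemented differently in the workflow:

# Hypothetical retry around the testground setup: give up after 3 failed
# attempts, sleeping between tries so transient errors can clear.
attempt=0
until ./install_testground.sh; do
  attempt=$((attempt + 1))
  if [ "${attempt}" -ge 3 ]; then
    echo "testground setup failed after ${attempt} attempts" >&2
    exit 1
  fi
  sleep 10
done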

@marten-seemann (Contributor) commented:

Why are we getting 404s in the first place? That's different from a connection timeout, and seems to indicate a more fundamental problem.

@laurentsenta (Collaborator, Author) commented:

@marten-seemann To me, that's a case of our proxy (proxy.golang.org btw, not goproxy.io) not having a 100% uptime SLA, but you might have more insight on this.

Here's the run:
https://github.com/laurentsenta/test-plans/runs/8000694045?check_suite_focus=true

go: github.com/yuin/goldmark@v1.4.1: reading https://proxy.golang.org/github.com/yuin/goldmark/@v/v1.4.1.info: 404 Not Found
	server response: not found: temporarily unavailable

@marten-seemann (Contributor) commented:

That's annoying; in this case this shouldn't even be a 4xx error, but rather a 5xx error.
I'm still wondering why this occurs so frequently. We have other CI builds that download a lot of dependencies (every run of go-libp2p, kubo, lotus, etc. will), and they don't seem to be plagued by those issues.

@laurentsenta (Collaborator, Author) commented Aug 26, 2022:

@marten-seemann Agreed, we might have caught a very unlikely error with the goproxy 404.

Results for the tests I ran yesterday:

  • go versions interop: 100 / 100 successes
  • rust versions interop: 100 / 100 successes
  • rust - go interop (only masters): 100 / 100 successes
  • rust - go interop (all tests): 94 / 100 successes
    • Reason: I added a 25 min build timeout as a safeguard (sketched below), and it was reached 6 times (see actions).
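
The 25 min safeguard is just an upper bound on the build step. A sketch of the idea, assuming the GNU coreutils timeout wrapper; the actual workflow may instead use a step-level timeout, and the composition path is again illustrative:

# Fail fast if the build hangs: abort after 25 minutes instead of
# blocking the workflow until a much later job-level timeout.
timeout 25m testground build composition -f compositions/ping-interop.toml --wait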

@mxinden (Member) left a comment:

This looks good to me. Not sure my review should have much weight for the Go-related changes.

testground build composition \
  -f ${{ inputs.composition_file }} \
  --wait && exit 0;
sleep 10
@mxinden (Member) commented on this code:

Why do we need to sleep in between retries?

@laurentsenta (Collaborator, Author) replied:

The retry might happen almost instantaneously because of Docker caching. The goal here is to wait some time to give remote services a chance to recover (like the 404 above).

I'll merge as is, but there will be at least one follow-up PR for improvements, where we can remove this sleep if we don't want it.

@laurentsenta (Collaborator, Author) commented:

@mxinden I updated with v0.47
