
ci: ci: create gh workflow that runs go tests #11762

Merged: 11 commits merged into master from ipdx-gha-test on Mar 26, 2024

Conversation

@galargh (Contributor) commented Mar 20, 2024

Related Issues

#11734

Proposed Changes

This PR creates a new GitHub Actions workflow (test.yml) that performs the following jobs previously handled exclusively by CircleCI:

  • test
  • test-conformance
  • test-itest-*
  • test-unit-*

Additional Info

You can find an example run of the workflow at: https://github.com/filecoin-project/lotus/actions/runs/8376141826?pr=11762

Unlike the CircleCI test jobs, the ones from the newly added workflow do not have to wait for the build (make lotus deps) to finish. Instead, the test jobs call make deps as needed (we didn't find an instance where make lotus was needed). This results in some work duplication but reduces the overall workflow runtime by ~3 minutes.
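
As a rough sketch of the difference (the step names here are assumptions, not the exact contents of test.yml), each test job simply prepares its own dependencies:

```yaml
# Hypothetical steps of a single test job: dependencies are built in-job
# instead of waiting for a separate build job to finish.
steps:
  - uses: actions/checkout@v4
    with:
      submodules: recursive   # lotus pulls in filecoin-ffi as a git submodule
  - name: Build dependencies
    run: make deps
  - name: Run tests
    run: go test ./...
```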

We used a configuration matrix in the newly added workflow because the number of jobs it executes is large. We were able to use a single job "template" since all the test jobs are executed in almost exactly the same manner. In particular, we did not split out the test-conformance definition as was done in CircleCI.
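
To give a rough idea of the shape of that template (the entries and field names below are assumptions, not the actual contents of test.yml):

```yaml
# Hypothetical single job "template"; every test job differs only in its matrix entry.
jobs:
  test:
    strategy:
      fail-fast: false        # let the remaining test jobs finish if one fails
      matrix:
        include:
          - name: conformance
          - name: itest-gateway
          - name: unit-cli
    name: ${{ matrix.name }}
    runs-on: ubuntu-latest
    steps:
      # shared checkout / deps / test steps, parameterised by matrix.name
      - uses: actions/checkout@v4
```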

The newly added workflow makes use of 3 helper actions (a sketch of one of them follows the list):

  • install-ubuntu-deps: installs ocl-icd-opencl-dev, libhwloc-dev, and pkg-config on the runner
  • install-go: installs Go on the runner (using the version found in the go.mod file)
  • start-yugabytedb: starts a YugabyteDB Docker container and waits for the database to become ready
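
As a rough idea of what one of these composite actions might look like (the path and step details are assumptions):

```yaml
# Hypothetical sketch of .github/actions/install-ubuntu-deps/action.yml
name: install-ubuntu-deps
description: Install system packages required to build and test lotus
runs:
  using: composite
  steps:
    - name: Install system packages
      shell: bash
      run: |
        sudo apt-get update
        sudo apt-get install -y ocl-icd-opencl-dev libhwloc-dev pkg-config
```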

The workflow's jobs run on a combination of self-hosted and hosted runners. We use self-hosted runners for two reasons: for resource-intensive jobs (2xlarge and 4xlarge) and to increase runner availability (large and xlarge). We can only use a limited number of hosted runners concurrently (60).

How did we decide which jobs run on which runners?

  • 4xlarge (16 CPU, 32 GB RAM), the largest self-hosted runners, are used only for some of the jobs that used to run on 2xlarge (16 CPU, 32 GB RAM) in CircleCI; namely itest-deals_concurrent, itest-sector_pledge, and itest-worker.
  • 2xlarge (8 CPU, 16 GB RAM) self-hosted runners are used for jobs that, for whatever reason, we saw failing on hosted runners: itest-gateway, itest-sector_import_full, itest-sector_import_simple, itest-wdpost, and unit-storage. Our assumption is that more resources could help reduce the flakiness, but this is still to be verified in practice.
  • xlarge (4 CPU, 8 GB RAM) self-hosted runners are used for the 42 jobs (half of the overall 84) that were the quickest in one of the test runs where we ran almost everything on xlarge runners. In other words, the jobs scheduled to run on xlarge now should also be fine on hosted runners (or on large (2 CPU, 4 GB RAM) runners to reduce cost); we're sticking with xlarge so as not to accidentally introduce more flakiness during the evaluation period.
  • Everything else runs on the default GitHub-hosted runners (4 CPU, 16 GB RAM).

This entire assignment is subject to change, but we have to start somewhere, and this seems like a sensible starting point.
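
As an illustration of how such an assignment could be expressed (the labels, field names, and entries are assumptions, not the actual configuration):

```yaml
# Hypothetical matrix entries pinning jobs to runner classes; labels are illustrative.
matrix:
  include:
    - name: itest-sector_pledge
      runners: ["self-hosted", "linux", "x64", "4xlarge"]
    - name: itest-gateway
      runners: ["self-hosted", "linux", "x64", "2xlarge"]
    - name: unit-cli
      runners: ["ubuntu-latest"]
# ...and the templated job would then use:
#   runs-on: ${{ matrix.runners }}
```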

We decided to cache Proof Parameters because trying to download them from many jobs at the same time resulted in a number of connection closed errors. The job that ensures the proof parameters are cached properly adds only seconds of overhead to the workflow.
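
A minimal sketch of the idea, assuming the default proof-parameter location and a made-up cache key:

```yaml
# Hypothetical caching step; the path and key are assumptions.
- name: Restore cached proof parameters
  uses: actions/cache@v4
  with:
    path: /var/tmp/filecoin-proof-parameters
    key: proof-params-${{ runner.os }}
# A subsequent step would download only the parameters the cache did not provide.
```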

We decided to generate the test job matrix on the fly (instead of pre-generating it as was done for CircleCI). The matrix generation runs on every workflow run and uses bash tools to combine JSON inputs into the final matrix. We evaluated keeping the matrix generation as a Go script, but, in our opinion, that made it harder to reason about what the generation does and how the final matrix is configured. It also added a bit of extra overhead for Go setup (~40s vs ~10s).
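
For a sense of the approach, a generation step could combine JSON fragments with jq along these lines (the file names, fields, and output shape are assumptions):

```yaml
# Hypothetical matrix-generation step; inputs and fields are illustrative only.
- name: Generate test matrix
  id: matrix
  shell: bash
  run: |
    # merge per-group runner overrides into the base list of test groups
    jq -c -n \
      --argjson groups "$(cat test-groups.json)" \
      --argjson overrides "$(cat runner-overrides.json)" \
      '[ $groups[] | . + ($overrides[.name] // {}) ]' > matrix.json
    echo "matrix=$(cat matrix.json)" >> "$GITHUB_OUTPUT"
```

A downstream job could then feed this output into its matrix via fromJSON.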

The newly added workflow is intended to run alongside its CircleCI counterpart for at least 1-2 weeks. After that period, we want to evaluate its success rate and execution time. Based on that information, we'll either remove the applicable CircleCI jobs or apply the necessary fixes to the new workflow and repeat the evaluation.

During testing, we have seen one instance of self-hosted runners failing to be scheduled (https://github.com/filecoin-project/lotus/actions/runs/8362652752/attempts/2). This was caused by the self-hosted runners' setup being rate-limited by the AWS Parameter Store. We'll apply for a higher quota (https://aws.amazon.com/about-aws/whats-new/2023/07/aws-systems-manager-parameter-store-api-limit/) and evaluate whether a patch that introduces retries for that error is in order.

We have also noticed one instance where make deps failed due to rate limiting applied by GitHub. Since then, we have provisioned the step with a proper GitHub token, which should make it less prone to failure, but we'll also watch whether this issue reappears during normal usage. If so, we'd propose restoring caching for the make deps result. This would add only <30 seconds of overhead to the workflow.

We have seen some test flakiness around tests like itest-sector_import_simple, itest-deals_concurrent, itest-sector_import_full, itest-gateway, and itest-deals_anycid. The flakiness happened at different stages of the workflow development, and it may or may not be related to the new setup. At this stage, we think it would be most beneficial to start running the CircleCI and GitHub Actions workflows alongside each other, as this will provide better data on which tests flake, how often they flake, and whether the flakiness is platform-specific.

We did not install statediff because we couldn't install it successfully. It used to be used only in the conformance test job, which was the only job that gathered coverage data, and even that coverage seemed to be unused.

Questions

  1. Is statediff still needed for conformance?
  2. Is coverage still needed for conformance?

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • If the PR affects users (e.g., new feature, bug fix, system requirements change), update the CHANGELOG.md and add details to the UNRELEASED section.
  • New features have usage guidelines and / or documentation updates
  • Tests exist for new functionality or change in behavior
  • CI is green

@galargh force-pushed the ipdx-gha-test branch 3 times, most recently from c71a9e1 to dbdb080, on March 21, 2024 08:47
@galargh marked this pull request as ready for review on March 21, 2024 17:19
@galargh requested review from masih and Stebalien on March 21, 2024 17:19
@Stebalien (Member)

> This was caused by the self-hosted runners' setup being rate-limited by the AWS Parameter Store.

The only issue I'm seeing is failure to download the proof parameters, which should have been cached, right? See https://github.com/filecoin-project/lotus/actions/runs/8362652752/job/22895146629. Looking at the other jobs in that run, I'm seeing us download the parameters in every single job as well.

> We have also noticed one instance where make deps failed due to rate limiting applied by GitHub. Since then, we have provisioned the step with a proper GitHub token, which should make it less prone to failure, but we'll also watch whether this issue reappears during normal usage. If so, we'd propose restoring caching for the make deps result. This would add only <30 seconds of overhead to the workflow.

Personally, I would do this. Otherwise, can we pass a more restricted token? I think we can just disable all permissions (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions).
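
For reference, restricting the token is a small workflow-level (or job-level) change; this is a generic snippet rather than a quote from test.yml:

```yaml
# Either drop every default GITHUB_TOKEN permission for the workflow:
#   permissions: {}
# or keep only read access to repository contents (enough for actions/checkout):
permissions:
  contents: read
```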

@galargh (Contributor, Author) commented Mar 22, 2024

>> This was caused by the self-hosted runners' setup being rate-limited by the AWS Parameter Store.

> The only issue I'm seeing is failure to download the proof parameters, which should have been cached, right? See https://github.com/filecoin-project/lotus/actions/runs/8362652752/job/22895146629. Looking at the other jobs in that run, I'm seeing us download the parameters in every single job as well.

The problem with AWS Parameter Store shows up as jobs just hanging and never getting scheduled. We do have monitoring for that set up and, as mentioned, we will try to make sure that we address any issues with runner scheduling that we know of.

>> We have also noticed one instance where make deps failed due to rate limiting applied by GitHub. Since then, we have provisioned the step with a proper GitHub token, which should make it less prone to failure, but we'll also watch whether this issue reappears during normal usage. If so, we'd propose restoring caching for the make deps result. This would add only <30 seconds of overhead to the workflow.

> Personally, I would do this. Otherwise, can we pass a more restricted token? I think we can just disable all permissions (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions).

Sure, no problem. This makes sense to us too. We added caching for make deps outputs in #11773. We're still making GITHUB_TOKEN available in the env for that step. Is that OK? It is only used to make an authorised request to the filecoin-ffi releases, as opposed to an unauthorised one, which is more likely to get rate limited (though even that is quite unlikely now that we're caching the whole thing). On PRs, that token gets limited permissions by default.
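
For illustration (the step name is an assumption), the token is only exposed to the single step that needs it:

```yaml
# Hypothetical step: GITHUB_TOKEN is scoped to this step so make deps can make an
# authenticated request for filecoin-ffi release assets instead of an anonymous one.
- name: Build dependencies
  env:
    GITHUB_TOKEN: ${{ github.token }}
  run: make deps
```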

@galargh (Contributor, Author) commented Mar 22, 2024

The test failures seem consistent between CircleCI, GitHub Actions, and the default branch.

@Stebalien (Member)

> The problem with AWS Parameter Store shows up as jobs just hanging and never getting scheduled. We do have monitoring for that set up and, as mentioned, we will try to make sure that we address any issues with runner scheduling that we know of.

Ah, I see. So this wouldn't show up in the GitHub Actions output?

> Sure, no problem. This makes sense to us too. We added caching for make deps outputs in #11773. We're still making GITHUB_TOKEN available in the env for that step. Is that OK? It is only used to make an authorised request to the filecoin-ffi releases, as opposed to an unauthorised one, which is more likely to get rate limited (though even that is quite unlikely now that we're caching the whole thing). On PRs, that token gets limited permissions by default.

Yes, but can we remove all permissions? I think our GitHub tokens get read/write access by default.

Alternatively, we can change either the org default or the repo default to be read-only. Do you think that will break anything?

@galargh (Contributor, Author) commented Mar 25, 2024

>> The problem with AWS Parameter Store shows up as jobs just hanging and never getting scheduled. We do have monitoring for that set up and, as mentioned, we will try to make sure that we address any issues with runner scheduling that we know of.

> Ah, I see. So this wouldn't show up in the GitHub Actions output?

Not really - it would only show up as "The job is waiting for the runner startup" or something along those lines. The quick fix for that is to cancel the run and re-run the failed jobs, but hopefully, with all the alerting we have now, even that won't be necessary very often.

>> Sure, no problem. This makes sense to us too. We added caching for make deps outputs in #11773. We're still making GITHUB_TOKEN available in the env for that step. Is that OK? It is only used to make an authorised request to the filecoin-ffi releases, as opposed to an unauthorised one, which is more likely to get rate limited (though even that is quite unlikely now that we're caching the whole thing). On PRs, that token gets limited permissions by default.

> Yes, but can we remove all permissions? I think our GitHub tokens get read/write access by default.

> Alternatively, we can change either the org default or the repo default to be read-only. Do you think that will break anything?

Oh yeah! Sure! I don't think it should break anything with the workflows we have now, but I'm not 100% sure. I've just changed the default to read-only for the Lotus repository, and I'm going to monitor if everything's behaving as expected. If you don't mind, I'd put off changing the organization default until the work setting up the new workflows here is done. Then, we're going to change the org-wide default and monitor all the repositories for unexpected failures.

@galargh merged commit 3f4eaf0 into master on Mar 26, 2024
183 of 184 checks passed
@galargh deleted the ipdx-gha-test branch on March 26, 2024 15:06