feat: migrate to nox, run tests in parallel, revamp test infra #632

Open · wants to merge 35 commits into master

Conversation

@agriyakhetarpal (Collaborator) commented on Aug 24, 2024

Description

This PR makes several changes to the testing infrastructure:

  1. Migration from tox's DSL to Pythonic nox, with three sessions: one for linting, one for tests, and one that builds a source distribution plus a wheel and validates them. This mirrors the earlier tox environments with a few minor changes and one major one: SciPy is installed as a test dependency instead of having a separate environment for it. The linting session hasn't been activated yet, because it would raise several warnings that still need to be fixed throughout the codebase. A minimal sketch of the session layout follows this list.
  2. The workflow run now cancels pending/running jobs when new commits are pushed, thereby saving time on repeated pushes. This is enabled through the concurrency: key.
  3. The original testing configuration (as it was before this PR) is retained, without needing separate tox -e test and tox -e test-scipy runs. The macOS and Windows PyPy jobs are excluded from the GHA matrix instead, because SciPy does not yet provide wheels for PyPy and building it from source takes a long time (and, while still doable, is slightly trickier on Windows and macOS than on Linux – we could set up a Miniconda environment to test them if needed).
  4. uv is used in the nox sessions and in CI, via https://github.com/yezz123/setup-uv. We don't need extra caching for our dependencies, because downloading them is faster than retrieving them from a cache.
  5. A [test] extra has been added, with a minimal set of test dependencies.
  6. We now use pytest-xdist to run tests in parallel, which brings the suite from ~35 seconds in serial mode down to about 4 seconds on my machine – a roughly 9x speedup. We see similar gains in CI.
  7. I noticed that some tests failed intermittently, which suggested that the testing was not deterministic. I added a fixed NumPy random seed as a pytest fixture that runs before every test. This can be improved further when we refactor the tests in the future.
  8. nox -s nightly-tests uses environment variables understood by uv to download nightly builds of NumPy (and SciPy where applicable) and installs them before testing. Corresponding jobs have been added to the CI matrix; see the sketch below.
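As a rough sketch of how points 1, 6, and 8 fit together (the session names, dependency lists, index URL, and uv environment variables below are illustrative assumptions, not the exact contents of this PR's noxfile.py):

# noxfile.py – minimal sketch, not the exact file added in this PR
import nox

nox.options.default_venv_backend = "uv"  # assumed way of enabling the uv backend

NIGHTLY_INDEX = "https://pypi.anaconda.org/scientific-python-nightly-wheels/simple"


@nox.session
def tests(session):
    """Install the [test] extras plus SciPy and run the suite in parallel."""
    session.install("-e", ".[test]", "scipy")
    session.run("pytest", "-n", "auto", *session.posargs)


@nox.session(name="nightly-tests")
def nightly_tests(session):
    """Run the suite against NumPy/SciPy nightly wheels via uv environment variables."""
    session.install("-e", ".[test]")
    # Pull only NumPy/SciPy from the nightly index; the variables here are assumptions.
    session.install(
        "--upgrade", "numpy", "scipy",
        env={"UV_INDEX_URL": NIGHTLY_INDEX, "UV_PRERELEASE": "allow"},
    )
    session.run("pytest", "-n", "auto", *session.posargs)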

There are some more changes needed across the codebase, such as fixing some missing coverage and deciding between Codecov and a plain HTML coverage report (see https://hynek.me/articles/ditch-codecov-python/), so that the coverage collected by the tests is actually put to use. Some of the missing coverage comes from Python 2-era code, though, so it might be better to remove that code entirely once we get to linting and autoformatting the entire codebase.

Closes #631; related to #630.

@agriyakhetarpal (Collaborator, Author) commented on Aug 24, 2024

I now see why the PyPy tests were taking so long – SciPy is being built from its source distribution because it doesn't publish wheels for PyPy, while NumPy does, of course (so I've answered my own question). To keep CI time low, I could add another nox session for PyPy (edit: or I'll just keep them in the same session, but perhaps add a separate job for better separation of logs?).
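One way to keep everything in a single session while avoiding a slow SciPy source build on PyPy (just a sketch of the idea – the Python version list is illustrative, not necessarily what the noxfile ends up using) is to branch on the session's interpreter:

import nox


@nox.session(python=["3.10", "3.11", "pypy3.10"])
def tests(session):
    """Run the test suite; install SciPy only where wheels exist."""
    session.install("-e", ".[test]")
    if not str(session.python).startswith("pypy"):
        # SciPy publishes no PyPy wheels, so skip it there instead of building from source.
        session.install("scipy")
    session.run("pytest", "-n", "auto", *session.posargs)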

@agriyakhetarpal (Collaborator, Author) commented on Aug 24, 2024

It looks like some of the tests have not been entirely deterministic: failures due to precision differences between Intel and M-series macOS machines were being reported (only in CI, though – I didn't hit any of them locally). Setting a fixed random seed seems to have resolved the problem. This PR enables testing on both Intel and ARM macOS runners, which we can keep until the Intel runners reach their EoL date.
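For reference, the fixed seed can be applied before every test with an autouse fixture in conftest.py – the fixture name and seed value here are illustrative, not necessarily what this PR uses:

# conftest.py – illustrative sketch of a fixed-seed autouse fixture
import numpy as np
import pytest


@pytest.fixture(autouse=True)
def fixed_numpy_seed():
    """Seed NumPy's global RNG before every test so results are deterministic."""
    np.random.seed(42)  # the exact seed value is arbitrary here
    yield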

@agriyakhetarpal marked this pull request as ready for review on August 24, 2024 23:55
@agriyakhetarpal changed the title from "WIP: migrate to nox, parallel tests, revamp test infra" to "migrate to nox, run tests in parallel, revamp test infra" on Aug 25, 2024
@agriyakhetarpal changed the title from "migrate to nox, run tests in parallel, revamp test infra" to "feat: migrate to nox, run tests in parallel, revamp test infra" on Aug 28, 2024
@agriyakhetarpal (Collaborator, Author) commented

Just added some additional changes that I had missed – this is ready for review whenever you have the time, @fjosw and @j-towns! I separated out the SciPy tests so that they run on non-PyPy runners in 19dab2a, which gives us faster CI. The pre-commit check will fail until #634 gets merged, so we can safely ignore it.

@agriyakhetarpal (Collaborator, Author) commented

I have also added tests against the NumPy and SciPy nightlies. Both of them support Python 3.10–3.13, which explains the failures across platforms for Python 3.8 and 3.9. We will be dropping 3.8 in October anyway, but in the meantime we could explore ways to make these jobs fail more gracefully, or limit our nightly testing to the Python versions the nightlies support (or just the latest Python version). Please let me know your thoughts here.
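For instance (purely a sketch – the session name and version list are assumptions), one option is to restrict the session's python list to the supported versions; another, sketched below, is to keep the full matrix but skip gracefully:

import nox

NIGHTLY_PYTHONS = ("3.10", "3.11", "3.12", "3.13")  # assumed versions with nightly wheels


@nox.session(name="nightly-tests", python=["3.8", "3.9", "3.10", "3.11", "3.12", "3.13"])
def nightly_tests(session):
    """Bail out gracefully on interpreters the nightly wheels don't cover."""
    if str(session.python) not in NIGHTLY_PYTHONS:
        session.skip("NumPy/SciPy nightlies only ship wheels for Python 3.10+")
    ...  # install the nightlies and run pytest as in the regular session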

@fjosw (Collaborator) left a comment

Hey @agriyakhetarpal, I have no experience with nox, but I skimmed your code and things look good to me! Do you already have a feeling for how robust the nightly setup is? I think for a project like this, which is mostly in maintenance mode, we would want to prevent too many "false positives" from the nightly tests.

@agriyakhetarpal (Collaborator, Author) commented

Thanks for going through the changes, @fjosw! nox is essentially the same as tox, but I hope the Pythonic API makes it at least slightly easier to read and to maintain your own sessions.

I'm not sure I have a definitive answer for how robust the nightly tests will be – I think it depends a lot on our availability to land fixes at the right time, since breakages will be critical for our users. At the same time, an upper pin on NumPy could, in rare circumstances, conflict with our ability to run the nightly tests. Say we pin NumPy to <2.4 and NumPy 2.4.0-dev wheels get published to the nightly index: we won't be able to install those wheels, because dependency resolution will fail with an unsatisfied requirement. If we bump the pin to <2.5, we risk a new NumPy release landing on PyPI later and breaking our code (if we aren't able to notice and fix it in time). It is a very low-risk situation overall, though, since we are both actively making changes to the repository; plus, NumPy makes it easier for us by deprecating things with warnings at least two releases before they are removed, and we, as downstream maintainers, are urged to try out the (one, or in some cases two) release candidates when they become available.

However, without a thorough refactoring of the original code, we can expect problems in the coming months and years. For example, if something breaks in the NumPy 2.3.0 nightly wheels while 2.2.0 is the current stable release, we'll need to add code patterns like this:

import numpy as _np

if _np.lib.NumpyVersion(_np.__version__) >= "2.3.0":
    ...  # some logic for NumPy >= 2.3.0
elif _np.lib.NumpyVersion(_np.__version__) >= "2.2.0":
    ...  # some other logic for the 2.2.x series
else:
    ...  # yet more logic for older releases

Such branches are more of a code smell than a code pattern, and they will surely not be maintainable for us. At the same time, we don't want a lot of red in our CI either (or at least not too much when it does appear). Moreover, it is too early to support NumPy > 2, as we discussed in #628 (review).

With that extra context, I'm open to more suggestions on how we can improve our testing, since I'm out of ideas :) We still need to add testing against the minimum supported NumPy version (1.x) and decide on a lower bound, which I think can be handled with uv pip install --resolution lowest (see https://docs.astral.sh/uv/concepts/resolution/#lower-bounds). I will do that in another PR, of course.
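A sketch of what that could look like (assuming nox's uv backend, so that the --resolution flag is passed through to uv pip install; the session name is made up):

import nox

nox.options.default_venv_backend = "uv"  # assumed; --resolution is a uv flag, not a pip one


@nox.session(name="minimum-versions")
def minimum_versions(session):
    """Resolve dependencies to their lowest allowed versions, then run the tests."""
    session.install("--resolution", "lowest", "-e", ".[test]")
    session.run("pytest", "-n", "auto", *session.posargs)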

I just triggered CI again with the updated config. While I can disable the current nightly tests for Python 3.8 and 3.9, since there are no nightly wheels for them (we'll drop 3.8 in October anyway), I wonder if you have suggestions on the actual test failures? It could be a NumPy bug or a bug in our own codebase, but I'm not confident either way. I thought the test was flaky and re-triggered it in CI, but that didn't help – it fails with the same values.
