cluster EngineError running test_read_write_P_2D tests on one system #125
Comments
@minrk, do you have any idea? (Being the ipyparallel wizard!)
The bug reporter also reports that
So if I'm reading the error message right, openmpi is complaining because it's been asked to run 2 processes but thinks it only has 1 core (and it's ignoring the available hwthreads). I think we could allow for that in the debian build scripts by setting
Yeah, allowing oversubscribe should be the fix here. We have to set a bunch of environment variables to get openmpi to run tests reliably on CI, because it's very strict and makes a lot of assumptions by default. Oversubscribe is probably the main one for real user machines. You could probably set the oversubscribe environment variable in your conftest to make sure folks don't run into this one.
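For illustration, setting oversubscribe from a conftest could look roughly like the sketch below. The variable names are assumptions, not something confirmed in this thread: `OMPI_MCA_rmaps_base_oversubscribe` is the MCA parameter commonly used with Open MPI 4.x, and `PRTE_MCA_rmaps_default_mapping_policy` is the commonly used counterpart for Open MPI 5.x. The helper `allow_oversubscribe` and the dict `_OVERSUBSCRIBE_ENV` are hypothetical names made up for this example.

```python
# conftest.py (sketch, not the project's actual configuration)
import os

# Assumed Open MPI settings; the right name depends on the Open MPI version.
_OVERSUBSCRIBE_ENV = {
    # Assumption: MCA parameter read by Open MPI 4.x
    "OMPI_MCA_rmaps_base_oversubscribe": "1",
    # Assumption: PRRTE parameter read by Open MPI 5.x
    "PRTE_MCA_rmaps_default_mapping_policy": ":oversubscribe",
}


def allow_oversubscribe(environ=None):
    """Set oversubscribe defaults without overriding values the user already set."""
    env = os.environ if environ is None else environ
    for name, value in _OVERSUBSCRIBE_ENV.items():
        env.setdefault(name, value)
    return env


# Run at import time so that engines launched later by the tests
# inherit these variables from the pytest process.
allow_oversubscribe()
```

Using `setdefault` means a packager or user who exports a different value before running pytest keeps their own setting.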
Our bug reporter confirms
A Debian user is reporting a test failure when building adios4dolfinx 0.8.1.post0 on his system:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071722
https://people.debian.org/~sanvila/build-logs/202405/adios4dolfinx_0.8.1.post0-1_amd64-20240524T100158.350Z
The tests are passing on other debian project machines (and my own), so I figure the problem is related to the way openmpi distinguishes slot, hwthread, core, socket, etc when binding processes, which would be system-specific.
The error is happening in ipyparallel, so I'm not certain how much adios4dolfinx can do about it (likely the tests would need to know the specific available slots/cores/sockets). But perhaps there's a different way of configuring the test launch that's more robust.