
cluster EngineError running test_read_write_P_2D tests on one system #125

Open
drew-parsons opened this issue May 31, 2024 · 4 comments

@drew-parsons

A debian user is reporting a test failure when building adios4dolfinx 0.8.1.post0 on his system:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071722
https://people.debian.org/~sanvila/build-logs/202405/adios4dolfinx_0.8.1.post0-1_amd64-20240524T100158.350Z

The tests are passing on other debian project machines (and my own), so I figure the problem is related to the way Open MPI distinguishes slots, hwthreads, cores, sockets, etc. when binding processes, which would be system-specific.

The error is happening in ipyparallel, so I'm not certain how much adios4dolfinx can do about it (likely the tests would need to know the specific available slots/cores/sockets). But perhaps there's a different way of configuring the test launch that's more robust.

_ ERROR at setup of test_read_write_P_2D[create_2D_mesh0-True-1-Lagrange-True] _

    @pytest.fixture(scope="module")
    def cluster():
        cluster = ipp.Cluster(engine_launcher_class="mpi", n=2)
>       rc = cluster.start_and_connect_sync()

tests/conftest.py:15: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3/dist-packages/ipyparallel/_async.py:73: in _synchronize
    return _asyncio_run(async_f(*args, **kwargs))
/usr/lib/python3/dist-packages/ipyparallel/_async.py:19: in _asyncio_run
    return loop.run_sync(lambda: asyncio.ensure_future(coro))
/usr/lib/python3/dist-packages/tornado/ioloop.py:539: in run_sync
    return future_cell[0].result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Cluster(cluster_id='1716545058-rpjw', profile='default', controller=<running>, engine_sets=['1716545059'])>
n = 2, activate = False

    async def start_and_connect(self, n=None, activate=False):
        """Single call to start a cluster and connect a client
    
        If `activate` is given, a blocking DirectView on all engines will be created
        and activated, registering `%px` magics for use in IPython
    
        Example::
    
            rc = await Cluster(engines="mpi").start_and_connect(n=8, activate=True)
    
            %px print("hello, world!")
    
        Equivalent to::
    
            await self.start_cluster(n)
            client = await self.connect_client()
            await client.wait_for_engines(n, block=False)
    
        .. versionadded:: 7.1
    
        .. versionadded:: 8.1
    
            activate argument.
        """
        if n is None:
            n = self.n
        await self.start_cluster(n=n)
        client = await self.connect_client()
    
        if n is None:
            # number of engines to wait for
            # if not specified, derive current value from EngineSets
            n = sum(engine_set.n for engine_set in self.engines.values())
    
        if n:
>           await asyncio.wrap_future(
                client.wait_for_engines(n, block=False, timeout=self.engine_timeout)
            )
E           ipyparallel.error.EngineError: Engine set stopped: {'exit_code': 1, 'pid': 63936, 'identifier': 'ipengine-1716545058-rpjw-1716545059-59766'}

/usr/lib/python3/dist-packages/ipyparallel/cluster/cluster.py:759: EngineError
------------------------------ Captured log setup ------------------------------
INFO     ipyparallel.cluster.cluster.1716545058-rpjw:cluster.py:708 Starting 2 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
WARNING  ipyparallel.cluster.cluster.1716545058-rpjw:launcher.py:336 Output for ipengine-1716545058-rpjw-1716545059-59766:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  /usr/bin/python3.12

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

WARNING  ipyparallel.cluster.cluster.1716545058-rpjw:cluster.py:721 engine set stopped 1716545059: {'exit_code': 1, 'pid': 63936, 'identifier': 'ipengine-1716545058-rpjw-1716545059-59766'}
@jorgensd
Owner

@minrk, do you have any idea? (Being the ipyparallel wizard!)

@drew-parsons
Author

The bug reporter also reports that lscpu reports

    Thread(s) per core:   2
    Core(s) per socket:   1
    Socket(s):            1

So if I'm reading the error message right, openmpi is complaining because it's been asked to run 2 processes but thinks it only has 1 core (and it's ignoring the available hwthreads).

I think we could allow for that in the debian build scripts by setting OMPI_MCA_rmaps_base_oversubscribe=true, which might be the simplest resolution.

@minrk

minrk commented May 31, 2024

yeah, allowing oversubscribe should be the fix here. We have to set a bunch of env vars to get Open MPI to run tests reliably on CI, because it's very strict and makes a lot of assumptions by default. Oversubscribe is probably the main one for real user machines.

You could probably set the oversubscribe env in your conftest to make sure folks don't run into this one.
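A minimal sketch of that suggestion, assuming the environment variable only needs to be set before the cluster is launched. The variable name `OMPI_MCA_rmaps_base_oversubscribe` comes from this thread; placing it at the top of `tests/conftest.py` (the file shown in the traceback above) is an assumption about import order, and `setdefault` is used so a value the user has already exported wins:

```python
import os

# Allow Open MPI to oversubscribe cores: without this, starting 2 engines
# fails on machines where Open MPI only counts 1 processor core (the lscpu
# output above: 1 socket x 1 core, hwthreads ignored by default).
# setdefault keeps any value already exported in the environment.
os.environ.setdefault("OMPI_MCA_rmaps_base_oversubscribe", "true")
```

Child processes started by the MPI launcher inherit the test process's environment, so the engines should pick this up without any change to the fixture itself.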

@drew-parsons
Author

Our bug reporter confirms OMPI_MCA_rmaps_base_oversubscribe=true resolves the issue in the debian tests. I've now added it to the debian scripts.
