Check MPI error codes #2385

Merged
merged 18 commits into main from igor/mpi-error
Oct 11, 2022

IgorBaratta
Member

@IgorBaratta IgorBaratta commented Sep 28, 2022

In the following code, MPI_Allreduce fails silently: it is called on a communicator that has already been freed, and although it returns an error code, nothing checks it.

#include <dolfinx.h>
#include <dolfinx/fem/petsc.h>
#include <cassert>
#include <iostream>

using namespace dolfinx;

int main(int argc, char* argv[])
{
  dolfinx::init_logging(argc, argv);
  PetscInitialize(&argc, &argv, nullptr, nullptr);

  {
    // Duplicate MPI_COMM_WORLD so the copy can be freed independently.
    MPI_Comm comm{MPI_COMM_WORLD};
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);

    [[maybe_unused]] int rank = dolfinx::MPI::rank(comm);
    [[maybe_unused]] int ierr0 = MPI_Comm_free(&comm);
    // dolfinx::MPI::assert_and_throw(MPI_COMM_WORLD, ierr0);

    // Intentional misuse: comm was freed above, so this call fails and
    // returns an error code that nothing inspects.
    int size = 0;
    [[maybe_unused]] int ierr1
        = MPI_Allreduce(&rank, &size, 1, MPI_INT, MPI_SUM, comm);
    // dolfinx::MPI::assert_and_throw(MPI_COMM_WORLD, ierr1);

    assert(size == dolfinx::MPI::size(MPI_COMM_WORLD));
  }

  PetscFinalize();
  return 0;
}

This PR adds a function to check whether an error code returned by an MPI function is equal to MPI_SUCCESS. If the check fails then it prints a useful error message and aborts.

Fixes #2058.

@IgorBaratta IgorBaratta changed the title from "Igor/mpi error" to "Check MPI error messages" Sep 28, 2022
@IgorBaratta IgorBaratta added the proposal Suggested change or addition label Sep 28, 2022
@francesco-ballarin
Member

Related to #2058 ?

@IgorBaratta
Member Author

> Related to #2058 ?

Yes, it is. If merged, I think it would close #2058.

return size;
}
//-----------------------------------------------------------------------------
void dolfinx::MPI::assert_and_throw(MPI_Comm comm, int error_code)
Member

I don't think the function name is quite right. assert might imply that the check is skipped in non-debug mode, and the function doesn't 'throw' an exception; it just aborts.
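
To illustrate the naming concern: the standard assert macro expands to nothing when NDEBUG is defined, so a check written with it disappears from release builds. A small sketch contrasting it with a check that always runs; all names here are hypothetical, and fake_mpi_call merely stands in for a real MPI call.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for an MPI call: counts how often it is invoked
// and always reports success (0).
int call_count = 0;
int fake_mpi_call()
{
  ++call_count;
  return 0;
}

// Debug-only check: with NDEBUG defined, assert expands to nothing, so
// its argument, including fake_mpi_call(), is never evaluated.
int calls_made_with_assert()
{
  call_count = 0;
  assert(fake_mpi_call() == 0);
  return call_count;
}

// Always-on check: evaluated in every build type; aborts on failure.
int calls_made_with_check()
{
  call_count = 0;
  if (fake_mpi_call() != 0)
  {
    std::fprintf(stderr, "call failed\n");
    std::abort();
  }
  return call_count;
}
```

Here calls_made_with_check() returns 1 in every build, whereas calls_made_with_assert() returns 0 when the file is compiled with NDEBUG defined.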

@@ -74,6 +74,13 @@ int rank(MPI_Comm comm);
/// communicator
int size(MPI_Comm comm);

/// @brief Checks whether an error code returned by an MPI
/// function is equal to MPI_SUCCESS. If the check fails then
/// throw a runtime error.
Member

'runtime error' is a bit vague. Does it just 'abort'?

Member Author

@IgorBaratta IgorBaratta Sep 29, 2022

Yes, I need to review the documentation and the function name.
I was throwing a runtime error before, but that can cause a deadlock when MPI_Abort fails.
This can happen when the communicator is MPI_COMM_NULL (for example, when a communicator that has been freed is reused) and the failing call is followed by a barrier on MPI_COMM_WORLD.
So I think a sensible solution is to forcibly abort the execution while still printing the error message.

@garth-wells garth-wells changed the title from "Check MPI error messages" to "Check MPI error codes" Sep 29, 2022
@garth-wells garth-wells marked this pull request as ready for review October 11, 2022 09:51
@garth-wells garth-wells merged commit 78082a5 into main Oct 11, 2022
@garth-wells garth-wells deleted the igor/mpi-error branch October 11, 2022 10:13
Labels
proposal Suggested change or addition
Development

Successfully merging this pull request may close these issues.

Check MPI return codes
3 participants