Troubleshooting

InfiniBand

There seem to be some issues with some mpi4py features when they are used on a computing cluster with InfiniBand. This can cause cronus to hang in a multi-node InfiniBand setting.

Open MPI

If you are using Open MPI, you can try adding the following command to your job script.

export OMPI_MCA_pml=ob1

This selects the ob1 point-to-point messaging layer, which should disable the InfiniBand interface.
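
The same parameter can also be set from Python before mpi4py is imported, since Open MPI reads OMPI_MCA_* environment variables when MPI is initialized (which mpi4py does at import time by default). A minimal sketch, not an official cronus recipe:

# Set the Open MPI MCA parameter before mpi4py initializes MPI.
# Equivalent to "export OMPI_MCA_pml=ob1" in the job script.
import os
os.environ["OMPI_MCA_pml"] = "ob1"

from mpi4py import MPI  # MPI_Init runs here and picks up the setting

comm = MPI.COMM_WORLD
print("Rank %d of %d started" % (comm.Get_rank(), comm.Get_size()))

Launch the script with mpiexec (or your scheduler's launcher) as usual; the setting only affects how each rank initializes its point-to-point layer.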

Intel MPI

By default, the mpi4py package uses matching probes (MPI_Mprobe) for its receiving function recv() instead of regular MPI_Recv operations. However, these matching probes from the MPI 3.0 standard are not supported by all fabrics, which may cause the receiving function to hang.

Therefore, users are advised to use the OFI fabric instead of TMI on Omni-Path systems. For the Intel MPI Library, this can be configured with the following environment variable setting:

export I_MPI_FABRICS=ofi
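
If switching fabrics is not an option, mpi4py also lets you turn off matching probes entirely through its runtime configuration, so that recv() falls back to regular probe/receive operations. A minimal sketch, assuming a reasonably recent mpi4py version (the rc options must be set before the MPI submodule is imported):

# Disable matching probes (MPI_Mprobe) for recv(); this must happen
# before "from mpi4py import MPI" is executed.
import mpi4py
mpi4py.rc.recv_mprobe = False

from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0 and comm.Get_size() > 1:
    comm.send({"msg": "hello"}, dest=1, tag=0)
elif comm.Get_rank() == 1:
    data = comm.recv(source=0, tag=0)  # no longer relies on MPI_Mprobe
    print("Rank 1 received:", data)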