Troubleshooting
InfiniBand
There seem to be issues with some mpi4py features when they are used on a computing cluster with an InfiniBand interconnect. This can cause cronus to hang in a multi-node InfiniBand setting.
OpenMPI
If you are using OpenMPI, you can try adding the following command to your job script:
export OMPI_MCA_pml=ob1
This should disable the InfiniBand interface.
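For context, a minimal sketch of a job script using this setting (assuming a Slurm scheduler; the node counts, launch command, and script name are illustrative):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Force the ob1 point-to-point messaging layer so that the
# InfiniBand transport is not used.
export OMPI_MCA_pml=ob1

# Launch the multi-node run; the script name is illustrative.
mpirun -np $SLURM_NTASKS python run_cronus.py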
Intel MPI
By default, the mpi4py package uses matching probes (MPI_Mprobe) instead of regular MPI_Recv operations for the receiving function recv(). However, these matching probes, introduced in the MPI 3.0 standard, are not supported by all fabrics, which may lead to a hang in the receiving function.
Users are therefore advised to use the OFI fabric instead of TMI on Omni-Path systems. For the Intel MPI Library, the configuration could look like the following environment variable setting:
export I_MPI_FABRICS=ofi
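As with the OpenMPI workaround above, this variable can be exported in the job script before the run is launched; a minimal sketch (assuming a Slurm scheduler; the launch command and script name are illustrative):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Select the OFI fabric instead of TMI on Omni-Path systems.
export I_MPI_FABRICS=ofi

# Launch the multi-node run; the script name is illustrative.
mpirun -np $SLURM_NTASKS python run_cronus.py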