FS#79543 - [openmpi] Reading from file (MPI_File_open) gets stuck occasionally

Attached to Project: Arch Linux
Opened by Elkin (helq) - Saturday, 02 September 2023, 16:36 GMT
Last edited by Christian Heusel (gromit) - Wednesday, 06 September 2023, 23:00 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To David Runge (dvzrv)
Levente Polyak (anthraxx)
Christian Heusel (gromit)
Architecture x86_64
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
The binary gets stuck in the MPI_File_open call when executed twice in a row.

Additional info:
* openmpi 4.1.5-3
* Downgrading the system to Aug 02 2023 (where openmpi is 4.1.5-2) "solves" the issue (see the downgrade sketch below)
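
For reference, a minimal downgrade sketch using the Arch Linux Archive (the exact package URL is assumed from the archive's usual layout; adjust if it differs):
```
$ sudo pacman -U https://archive.archlinux.org/packages/o/openmpi/openmpi-4.1.5-2-x86_64.pkg.tar.zst

# optionally hold the package back so a regular -Syu does not pull 4.1.5-3 in again:
# add "IgnorePkg = openmpi" to /etc/pacman.conf
```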

Steps to reproduce:

1. Create a dummy file called `test-file.txt`.

2. Copy the following minimal code into a `test.c` file:
```
// Based on minimal code for bug report: https://bugs.archlinux.org/task/78786?project=1&string=openmpi
#include <stdio.h>
#include <mpi.h>

int main() {
    int rank;
    MPI_Init(NULL, NULL);
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_rank(comm, &rank);

    MPI_File fh;
    int err = MPI_File_open(comm, "test-file.txt", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    if (err != MPI_SUCCESS) {
        printf("Got error trying to open file\n");
    }

    printf("Hello, I am rank %d in the merged comm\n", rank);
    MPI_Barrier(comm);

    MPI_Finalize();
    return 0;
}
```

3. Compile the code: `gcc test.c -lmpi`.

4. Run the code twice: `mpirun -np 2 a.out && mpirun -np 2 a.out`

The expected output (order does not matter) is
```
Hello, I am rank 0 in the merged comm
Hello, I am rank 1 in the merged comm
Hello, I am rank 0 in the merged comm
Hello, I am rank 1 in the merged comm
```

Sadly, my output contains only the first two lines; the second run gets stuck without printing the other two. Debugging with gdb confirms that it is stuck somewhere in the MPI_File_open call.
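
For anyone reproducing this, the hang location can be confirmed by attaching gdb to one of the stuck ranks (the `<pid>` placeholder below is illustrative):
```
# find the PIDs of the stuck ranks
$ pgrep -a a.out

# attach to one of them and dump a backtrace; the top frames should be
# somewhere inside MPI_File_open (or the I/O component it calls into)
$ gdb -p <pid> -batch -ex bt
```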

Closed by  Christian Heusel (gromit)
Wednesday, 06 September 2023, 23:00 GMT
Reason for closing:  Fixed
Additional comments about closing:  Should be fixed by openmpi 4.1.5-5
Comment by Toolybird (Toolybird) - Saturday, 02 September 2023, 22:44 GMT
Could you please try openmpi-4.1.5-4 in [extra-testing]?
Comment by Elkin (helq) - Sunday, 03 September 2023, 06:06 GMT
I tried it. The bug is still present :S
Comment by Christian Heusel (gromit) - Sunday, 03 September 2023, 11:25 GMT
Hm, so I have invested some time and tried different versions and builds, and this seems to be a regression present from openmpi 4.1.5 onwards.
Also it seems like the processes get stuck at 100% CPU.

I tried:
- 5.0.0rc10 (didn't work)
- 4.1.6rc2 (didn't work)
- 4.1.5 (didn't work)
- 4.1.4 (worked)
- 4.1.4 with the same flags as the current build (worked)

So I guess this is an upstream bug...

If you need this, you can just build the old package yourself:
$ pkgctl repo clone --switch="4.1.4-4" openmpi
$ pkgctl build openmpi
Comment by David Runge (dvzrv) - Sunday, 03 September 2023, 12:44 GMT
It would be awesome if someone would bisect this between 4.1.4 and 4.1.5 and then report it upstream :)
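In case it helps whoever picks this up, a rough sketch of what such a bisect could look like (assuming the upstream tags v4.1.4 and v4.1.5; the build flags and reproducer setup here are simplified, not the packaged configuration):
```
$ git clone https://github.com/open-mpi/ompi.git && cd ompi
$ git bisect start v4.1.5 v4.1.4   # bad first, then good

# at each step proposed by git bisect: build this checkout, rebuild the
# reproducer against it, run it twice, and mark the commit accordingly
$ ./autogen.pl && ./configure --prefix="$PWD/_install" && make -j"$(nproc)" install
$ git bisect good   # or: git bisect bad, if the second run hangs
```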
Comment by Christian Heusel (gromit) - Sunday, 03 September 2023, 14:07 GMT
Will do!
Comment by Christian Heusel (gromit) - Wednesday, 06 September 2023, 19:18 GMT
So I raised the issue upstream, since all debugging on my side didn't help: https://github.com/open-mpi/ompi/issues/11913
Comment by loqs (loqs) - Wednesday, 06 September 2023, 21:11 GMT
@gromit can you reproduce the issue using the current PKGBUILD with both patches disabled? I could not, whereas adding pkgname-4.1.5-openpmix_4.2.3.patch back reintroduced the issue for me.
Comment by Christian Heusel (gromit) - Wednesday, 06 September 2023, 22:25 GMT
Oh my, I was so sure that I already did disable the patches, but indeed this fixes the issue for me! 🙈
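
For anyone who wants to verify this locally, a rough sketch of the check loqs describes (how exactly the patch is wired into the source array and prepare() depends on the PKGBUILD revision):
```
$ pkgctl repo clone --switch="4.1.5-4" openmpi && cd openmpi
# comment out the pkgname-4.1.5-openpmix_4.2.3.patch entries (the source array
# entry and the line that applies it), then refresh checksums and rebuild
$ updpkgsums
$ pkgctl build
```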
