FS#79670 - [openmpi] 4.1.5-5 mpirun fails when installed in remote host

Attached to Project: Arch Linux
Opened by Francisco J. Vazquez (Fran) - Tuesday, 12 September 2023, 15:52 GMT
Last edited by Buggy McBugFace (bugbot) - Saturday, 25 November 2023, 20:19 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To David Runge (dvzrv)
Levente Polyak (anthraxx)
Christian Heusel (gromit)
Architecture x86_64
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

I have two fully updated Arch systems, host1 and host2, both with openmpi 4.1.5-5 installed. Running:

$ mpirun -v -n 2 --hostfile hosts.txt bash -c 'echo $HOSTNAME'

on host1, where hosts.txt is:

host1 slots=1
host2 slots=1

fails with:

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
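
The first cause listed in the message (missing libraries or binaries on a node) can be checked by hand. The following is only a sketch: `unresolved_libs` and `check_remote_orted` are hypothetical helper names, and it assumes passwordless ssh to the remote host and that `orted` is on the remote PATH.

```shell
# Filter `ldd` output (read from stdin) down to the lines that report
# a shared library the dynamic linker could not resolve.
unresolved_libs() {
    grep 'not found' || true
}

# Hypothetical helper: run ldd on the remote orted daemon and print any
# unresolved libraries. Assumes passwordless ssh to the host in $1.
check_remote_orted() {
    ssh "$1" 'ldd "$(command -v orted)"' | unresolved_libs
}
```

Running `check_remote_orted host2` would print nothing if every library resolves; any output points at a library that the remote daemon cannot load.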

Downgrading the *remote* host to openmpi 4.1.5-4 solves the problem:

$ mpirun -v -n 2 --hostfile hosts.txt bash -c 'echo $HOSTNAME'
host2
host1

The local version of openmpi does not seem to influence the result.

The same thing happens with -n 1, even though the program is launched locally.
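
Since the result depends on the remote package version, it can help to confirm which openmpi build each host in the hostfile actually has. This is a minimal sketch, not part of the original report: `hostfile_hosts` and `openmpi_versions` are hypothetical helpers, and it assumes passwordless ssh and pacman on every host.

```shell
# List the host names in a hostfile (first field of each line),
# skipping blank lines and lines whose first field starts with '#'.
hostfile_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Hypothetical helper: print the installed openmpi package version
# for every host named in the hostfile given as $1.
openmpi_versions() {
    for h in $(hostfile_hosts "$1"); do
        printf '%s: ' "$h"
        # -n keeps ssh from reading stdin inside the loop
        ssh -n "$h" pacman -Q openmpi
    done
}
```

Calling `openmpi_versions hosts.txt` on host1 would then print one line per host with the package version reported by pacman.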

Closed by  Buggy McBugFace (bugbot)
Saturday, 25 November 2023, 20:19 GMT
Reason for closing:  Moved
Additional comments about closing:  https://gitlab.archlinux.org/archlinux/packaging/packages/openmpi/issues/1
Comment by loqs (loqs) - Tuesday, 12 September 2023, 17:11 GMT
Related  FS#79543 
Comment by Toolybird (Toolybird) - Tuesday, 12 September 2023, 21:08 GMT
Also, 4.1.5-5 dropped the patch that fixed  FS#78786 ?
Comment by loqs (loqs) - Wednesday, 13 September 2023, 13:43 GMT
I think this issue, along with FS#78786 and FS#78261, will be fixed by 4.1.6, which judging from [1] is close to release. However, 4.1.6 will reintroduce FS#79543, which has not been reported upstream?
Edit:
4.1.6 will also fix CVE-2023-41915.
Edit2:
Bisecting FS#79543 gives the following, which matches 4.1.5-2 adding 6e8e14f2c2f207d5fa51299cc67558697a5b7d63 as a patch:
$ git bisect bad
6e8e14f2c2f207d5fa51299cc67558697a5b7d63 is the first bad commit
commit 6e8e14f2c2f207d5fa51299cc67558697a5b7d63
Author: Gilles Gouaillardet <gilles@rist.or.jp>
Date: Wed Mar 8 10:48:00 2023 +0900

pmix3x: use PMIX_VALUE_LOAD() and PMIX_INFO_LOAD() macros

Refs. open-mpi/ompi#10416

bot:notacherrypick

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

 opal/mca/pmix/pmix3x/pmix3x.c        | 273 ++++++++++++++++++++++++++++-------
 opal/mca/pmix/pmix3x/pmix3x.h        |   6 +-
 opal/mca/pmix/pmix3x/pmix3x_client.c |  48 +++---
 3 files changed, 242 insertions(+), 85 deletions(-)

$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [33e63c4beaac85f2cce33140f44f9668051d3028] Merge pull request #11880 from jsquyres/pr/v4.1.x/news-update
git bisect bad 33e63c4beaac85f2cce33140f44f9668051d3028
# status: waiting for good commit(s), bad commit known
# good: [42b829b3b3190dd1987d113fd8c2810eb8584007] Merge pull request #11426 from bwbarrett/v4.1.x-release
git bisect good 42b829b3b3190dd1987d113fd8c2810eb8584007
# good: [4f127c1f21ca5bbbfb6ef9c868129048f4702757] Merge pull request #11684 from wzamazon/v4.1.x_btl_ofi_fix_flush_backport
git bisect good 4f127c1f21ca5bbbfb6ef9c868129048f4702757
# bad: [1ec3d9de2b7680d6766e0aa4007ecc8ae458bdf5] Merge pull request #11812 from wenduwan/backport_han_allreduce
git bisect bad 1ec3d9de2b7680d6766e0aa4007ecc8ae458bdf5
# bad: [12025000b00eed3bef5a9f907cdfab5b8f563010] pmix3x: update to handle PMIx v4.2.3
git bisect bad 12025000b00eed3bef5a9f907cdfab5b8f563010
# bad: [95514e0812804afaf867087ea21c130ab838abeb] Merge pull request #11752 from wzamazon/v4.1.x_fix_pml_cm_heavy_send_request
git bisect bad 95514e0812804afaf867087ea21c130ab838abeb
# bad: [c053cb822770738de5df0a3e998899398d35e728] Merge pull request #11472 from ggouaillardet/topic/v4.1.x/pmix3x_macros
git bisect bad c053cb822770738de5df0a3e998899398d35e728
# bad: [6e8e14f2c2f207d5fa51299cc67558697a5b7d63] pmix3x: use PMIX_VALUE_LOAD() and PMIX_INFO_LOAD() macros
git bisect bad 6e8e14f2c2f207d5fa51299cc67558697a5b7d63
# first bad commit: [6e8e14f2c2f207d5fa51299cc67558697a5b7d63] pmix3x: use PMIX_VALUE_LOAD() and PMIX_INFO_LOAD() macros

@gromit do you want to reopen with upstream or ask @helq to?

[1] https://github.com/open-mpi/ompi/pull/11930
Comment by loqs (loqs) - Saturday, 28 October 2023, 18:50 GMT
Documenting the findings from [1]: FS#79543 was caused by a bug in glibc [2], which is still unresolved. It can be worked around by disabling the use of sem_open. Setting the environment variable ac_cv_func_sem_open=no for the ./configure call makes the configure check for sem_open fail, so the build does not use it.

[1]: https://github.com/open-mpi/ompi/issues/11934
[2]: https://sourceware.org/bugzilla/show_bug.cgi?id=30789
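
The workaround described above could be sketched as follows. The wrapper name `configure_without_sem_open` is hypothetical, and any configure flags passed to it are placeholders; the only part taken from this comment is setting ac_cv_func_sem_open=no in the environment of the ./configure call.

```shell
# Hypothetical wrapper: run ./configure in the current directory with the
# cache variable for the sem_open check preset to "no", so the build
# treats sem_open as unavailable. All arguments are passed through.
configure_without_sem_open() {
    ac_cv_func_sem_open=no ./configure "$@"
}
```

From inside the unpacked source tree this would be called as, for example, `configure_without_sem_open --prefix=/usr`. Because the variable is set only for that one command, it does not leak into the rest of the build script.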
Comment by loqs (loqs) - Sunday, 29 October 2023, 19:15 GMT
See attached for a suggested fix. This updates the pkgver to 4.1.6, which includes the patches that Arch previously included for FS#78786 and FS#78261, and disables the use of sem_open until the issue in glibc is resolved, to avoid reintroducing FS#79543. See also the previous comment.
I did not update the pkgver to 5.0.0 as that is a major release and out of scope for this issue.
