FS#79670 - [openmpi] 4.1.5-5 mpirun fails when installed in remote host
Attached to Project:
Arch Linux
Opened by Francisco J. Vazquez (Fran) - Tuesday, 12 September 2023, 15:52 GMT
Last edited by Buggy McBugFace (bugbot) - Saturday, 25 November 2023, 20:19 GMT
Details
I have two fully updated Arch systems, host1 and host2, both
with openmpi 4.1.5-5 installed. Running:

$ mpirun -v -n 2 --hostfile hosts.txt bash -c 'echo $HOSTNAME'

on host1, where hosts.txt is:

host1 slots=1
host2 slots=1

fails with:

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
  (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static
  are required (e.g., on Cray). Please check your configure cmd
  line and consider using one of the contrib/platform definitions
  for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

Downgrading the *remote* host to openmpi 4.1.5-4 solves the problem:

$ mpirun -v -n 2 --hostfile hosts.txt bash -c 'echo $HOSTNAME'
host2
host1

The local version of openmpi does not seem to influence the result.
The same thing happens with -n 1, even though the program is launched
locally.
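Since the first cause in the ORTE banner (daemon binaries not found on a remote node) is the most common one, a quick sanity check is to inspect what a non-interactive ssh shell on the remote host actually sees. The hostname host2 comes from the report above; the commands themselves are a generic diagnostic sketch, not part of the original report:

```shell
# mpirun launches the ORTE daemon (orted) over ssh using a
# NON-interactive, non-login shell, whose PATH can differ from
# what an interactive login shows.
ssh host2 'command -v orted && orted --version' \
    || echo "orted not found in the remote non-interactive PATH"

# Compare the environment the remote daemon would inherit.
ssh host2 'echo "PATH=$PATH"; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"'
```

If orted resolves fine on both hosts with the same package version, PATH problems can be ruled out and the failure points at the package itself, as the bisect below confirms.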
This task depends upon
Closed by Buggy McBugFace (bugbot)
Saturday, 25 November 2023, 20:19 GMT
Reason for closing: Moved
Additional comments about closing: https://gitlab.archlinux.org/archlinux/packaging/packages/openmpi/issues/1
FS#79543? FS#78786?
FS#78786 and FS#78261 will be fixed by 4.1.6, which judging from [1] is close, while it will reintroduce FS#79543, which has not been reported upstream?

Edit: 4.1.6 will also fix CVE-2023-41915.

Edit2: bisecting FS#79543 gives the following, which matches 4.1.5-2 adding 6e8e14f2c2f207d5fa51299cc67558697a5b7d63 as a patch:

$ git bisect bad
6e8e14f2c2f207d5fa51299cc67558697a5b7d63 is the first bad commit
commit 6e8e14f2c2f207d5fa51299cc67558697a5b7d63
Author: Gilles Gouaillardet <gilles@rist.or.jp>
Date: Wed Mar 8 10:48:00 2023 +0900
pmix3x: use PMIX_VALUE_LOAD() and PMIX_INFO_LOAD() macros
Refs. open-mpi/ompi#10416
bot:notacherrypick
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
opal/mca/pmix/pmix3x/pmix3x.c | 273 ++++++++++++++++++++++++++++-------
opal/mca/pmix/pmix3x/pmix3x.h | 6 +-
opal/mca/pmix/pmix3x/pmix3x_client.c | 48 +++---
3 files changed, 242 insertions(+), 85 deletions(-)
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [33e63c4beaac85f2cce33140f44f9668051d3028] Merge pull request #11880 from jsquyres/pr/v4.1.x/news-update
git bisect bad 33e63c4beaac85f2cce33140f44f9668051d3028
# status: waiting for good commit(s), bad commit known
# good: [42b829b3b3190dd1987d113fd8c2810eb8584007] Merge pull request #11426 from bwbarrett/v4.1.x-release
git bisect good 42b829b3b3190dd1987d113fd8c2810eb8584007
# good: [4f127c1f21ca5bbbfb6ef9c868129048f4702757] Merge pull request #11684 from wzamazon/v4.1.x_btl_ofi_fix_flush_backport
git bisect good 4f127c1f21ca5bbbfb6ef9c868129048f4702757
# bad: [1ec3d9de2b7680d6766e0aa4007ecc8ae458bdf5] Merge pull request #11812 from wenduwan/backport_han_allreduce
git bisect bad 1ec3d9de2b7680d6766e0aa4007ecc8ae458bdf5
# bad: [12025000b00eed3bef5a9f907cdfab5b8f563010] pmix3x: update to handle PMIx v4.2.3
git bisect bad 12025000b00eed3bef5a9f907cdfab5b8f563010
# bad: [95514e0812804afaf867087ea21c130ab838abeb] Merge pull request #11752 from wzamazon/v4.1.x_fix_pml_cm_heavy_send_request
git bisect bad 95514e0812804afaf867087ea21c130ab838abeb
# bad: [c053cb822770738de5df0a3e998899398d35e728] Merge pull request #11472 from ggouaillardet/topic/v4.1.x/pmix3x_macros
git bisect bad c053cb822770738de5df0a3e998899398d35e728
# bad: [6e8e14f2c2f207d5fa51299cc67558697a5b7d63] pmix3x: use PMIX_VALUE_LOAD() and PMIX_INFO_LOAD() macros
git bisect bad 6e8e14f2c2f207d5fa51299cc67558697a5b7d63
# first bad commit: [6e8e14f2c2f207d5fa51299cc67558697a5b7d63] pmix3x: use PMIX_VALUE_LOAD() and PMIX_INFO_LOAD() macros
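The manual bisect above can also be automated with `git bisect run`, which drives the good/bad decisions from a script's exit code. The script below is hypothetical (the name test.sh, the install prefix, and the configure options are assumptions), sketching how the reported failure could serve as the test; the two commit hashes are the bad/good endpoints from the bisect log above:

```shell
# Hypothetical bisect driver: exit 0 = good, 1-124 = bad,
# 125 = cannot test this commit (e.g. build failure).
cat > test.sh <<'EOF'
#!/bin/sh
./autogen.pl >/dev/null 2>&1 || exit 125
./configure --prefix=/tmp/ompi-bisect >/dev/null 2>&1 || exit 125
make -j"$(nproc)" >/dev/null 2>&1 || exit 125
make install >/dev/null 2>&1 || exit 125
# Reproduce the reported failure; nonzero exit marks the commit bad.
/tmp/ompi-bisect/bin/mpirun -n 2 --hostfile hosts.txt bash -c 'echo $HOSTNAME'
EOF
chmod +x test.sh

# start with <bad> <good> endpoints, then let git drive the search
git bisect start 33e63c4beaac85f2cce33140f44f9668051d3028 \
                 42b829b3b3190dd1987d113fd8c2810eb8584007
git bisect run ./test.sh
```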
@gromit do you want to reopen with upstream or ask @helq to?
[1] https://github.com/open-mpi/ompi/pull/11930
FS#79543 was caused by a bug in glibc [2] which is still unresolved. The bug can be worked around by disabling the use of sem_open: make the configure check for it fail by setting the environment variable ac_cv_func_sem_open=no for the ./configure call.

[1]: https://github.com/open-mpi/ompi/issues/11934
[2]: https://sourceware.org/bugzilla/show_bug.cgi?id=30789
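In packaging terms the workaround amounts to seeding the autoconf cache before configure runs, so the sem_open feature test reports failure and Open MPI falls back to another semaphore implementation. This is a sketch of what the relevant PKGBUILD fragment could look like, not the actual packaging change (other configure options elided):

```shell
build() {
  cd openmpi-$pkgver
  # Work around glibc bug 30789 (FS#79543): ac_cv_func_sem_open is the
  # autoconf cache variable for the sem_open check; presetting it to "no"
  # makes configure behave as if sem_open were unavailable.
  ac_cv_func_sem_open=no ./configure --prefix=/usr
  make
}
```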
FS#78786 plus FS#78261, and disable use of sem_open until the issue in glibc is resolved, to avoid reintroducing FS#79543. See also the previous comment.

I did not update the pkgver to 5.0.0 as that is a major release and out of scope for this issue.