FS#73967 - glibc 2.35 new dependency resolver buggyness with dlclose()

Attached to Project: Arch Linux
Opened by Pieter-Jan Briers (PJB) - Saturday, 26 February 2022, 20:37 GMT
Last edited by freswa (frederik) - Tuesday, 20 September 2022, 14:09 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To freswa (frederik)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

(glibc clearly says that I should report to the distro first. I'm 99% sure this is not Arch's fault but I'm posting here first just to comply with that)

Description:

glibc 2.35 has a new dependency resolver for dynamic library loading, or something like that. I traced some bug reports we were getting in our game to it after a lot of debugging (I have a small C repro and the bug went away with GLIBC_TUNABLES=glibc.rtld.dynamic_sort=1 which disables the new system).

The core of the bug in our game appears to be that a roundabout dlclose(dlopen(NULL)) done in ALSA's libraries causes libbz2.so to unload randomly, which then breaks loading libgtk-3.so when we try to open a MIDI file dialog later. The sequence of events can be best summarized with the small C program I got to reproduce the behavior (just gcc main.c && ./a.out it. You will need to have fluidsynth installed, and probably PortAudio/ALSA which it loads to cascade into this):

#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

void main()
{
void* lib;

// Early on, we load freetype, which has a direct dependency on bzip2.
lib = dlopen("libfreetype.so.6", RTLD_NOW);
printf("#### freetype: %llX\n", lib);

// Then, much later (when the user opens the instrument menu), we load fluidsynth
lib = dlopen("libfluidsynth.so.3", RTLD_NOW);
printf("#### fluidsynth: %llX\n", lib);

// Using dlsym so we don't link directly against fluidsynth and it's dynamically loaded.
void* (*new_fluid_settings_fp)() = dlsym(lib, "new_fluid_settings");

// And initialize fluidsynth by calling new_fluid_settings()
printf("#### Initializing fluidsynth!\n");
void* settings = new_fluid_settings_fp();

// Fluidsynth will go through all its audio drivers, which ends up initializing ALSA (through PortAudio)
// alsa-libs/libasound.so ends up doing effectively dlclose(dlopen(NULL)) at config.c line 4017 running config hooks. This dlclose call, for some reason, results in libbz2.so getting fini'd.

printf("#### Initialized fluidsynth!\n");

// GTK3 will fail to load (be zero) if loaded after fluidsynth initialized, because libbz2.so was erroneously unloaded.
lib = dlopen("libgtk-3.so", RTLD_NOW);
printf("#### GTK3: %llX\n", lib);

return;
}

When I run this on my system right now (glibc 2.35), GTK will fail to be loaded, and LD_DEBUG=libs output shows libbz2.so getting unloaded during FluidSynth initialization. If I then run it with GLIBC_TUNABLES=glibc.rtld.dynamic_sort=1 (see glibc 2.35 release notes), this does not happen and GTK loads fine.

I understand that this C program has a *lot* of variables at play through FluidSynth, ALSA, etc... Still we got at least 3 people reproducing it (me, and the two other bug reports[1] we got that made me investigate this). I've attached LD_DEBUG=libs for my system to the bug as mine.log.

So this would be the end of it, except one of our maintainers had a problem where her game would fail to load (like, wouldn't even start, no ALSA/FluidSynth/GTK involved) in most cases with an error in ld.so: "Inconsistency detected by ld.so: dl-close.c: 277: _dl_close_worker: Assertion `imap->l_type == lt_loaded && !imap->l_nodelete_active' failed!"
Running in Steam Linux Runtime seemed to fix it, and we just tried with this new tunable and the problem went away completely. LD_DEBUG file for her scenario also attached as vera.log (our game is .NET though so there's quite a lot of noise)

Additional info:
* package version: glibc 2.35-2
* config and/or log files etc.
* link to upstream bug report, if any: none yet, since they tell me to report here first

Steps to reproduce:
See above, couldn't really figure out a good way to structure this.

[1]: https://github.com/space-wizards/RobustToolbox/issues/2563
   vera.log (39.3 KiB)
   mine.log (57.9 KiB)
This task depends upon

Closed by  freswa (frederik)
Tuesday, 20 September 2022, 14:09 GMT
Reason for closing:  Fixed
Additional comments about closing:  Fix commited to trunk revision 456508.
Comment by freswa (frederik) - Wednesday, 02 March 2022, 21:57 GMT
We're packaging close to upstream. I agree with you, that we can't do much about it.
I'll close this for, but please feel free to request a re-open in case you have a patch/fix we may apply.
Comment by Pieter-Jan Briers (PJB) - Friday, 11 March 2022, 16:54 GMT
  • Field changed: Percent Complete (100% → 0%)
I don't have a patch on-hand for this (not exactly familiar with compiling glibc), but I assume a decent and easy fix for now would be to change the compiled default for glibc.rtld.dynamic_sort to 1 (so that it uses the old system by default). The new sorting system is only a perf improvement if the release notes are to be believed, so I can't imagine this has any adverse effects?

I'm in the process of filing a bug on upstream, but I need approval to make a bugzilla account there. It'll take a bit I guess.
Comment by Jan Alexander Steffens (heftig) - Friday, 11 March 2022, 16:55 GMT
This test program actually sometimes segfaults for me in ld-linux when trying to load GTK. Valgrind isn't happy with it, either.

I think changing the default sort as suggested is a good idea.
Comment by Pieter-Jan Briers (PJB) - Friday, 11 March 2022, 18:38 GMT
Here is the upstream glibc bug, by the way: https://sourceware.org/bugzilla/show_bug.cgi?id=28937
Comment by AK (Andreaskem) - Friday, 11 March 2022, 18:56 GMT
I changed the output to stderr, and for me, the test code does not even reach:

fprintf(stderr, "#### Initialized fluidsynth!\n");

Could this not be a bug in fluidsynth? Some kind of memory corruption, perhaps?

(gdb) where
#0 0x00007f71d7bfe143 in dfs_traversal.part () at /lib64/ld-linux-x86-64.so.2
#1 0x00007f71d7bfe194 in dfs_traversal.part () at /lib64/ld-linux-x86-64.so.2
#2 0x00007f71d7bfe194 in dfs_traversal.part () at /lib64/ld-linux-x86-64.so.2
#3 0x00007f71d7bfe58d in _dl_sort_maps () at /lib64/ld-linux-x86-64.so.2
#4 0x00007f71d7bee56c in _dl_close_worker () at /lib64/ld-linux-x86-64.so.2
#5 0x00007f71d7bef05b in _dl_close () at /lib64/ld-linux-x86-64.so.2
#6 0x00007f71d7b1be18 in _dl_catch_exception () at /usr/lib/libc.so.6
#7 0x00007f71d7b1bee3 in _dl_catch_error () at /usr/lib/libc.so.6
#8 0x00007f71d7a4a24e in _dlerror_run () at /usr/lib/libc.so.6
#9 0x00007f71d7a49f68 in dlclose@GLIBC_2.2.5 () at /usr/lib/libc.so.6
#10 0x00007f71d5de851d in pw_impl_module_destroy () at /usr/lib/libpipewire-0.3.so.0
#11 0x00007f71d5dc8546 in pw_context_destroy () at /usr/lib/libpipewire-0.3.so.0
#12 0x00007f71d5e872fd in () at /usr/lib/alsa-lib/libasound_module_pcm_pipewire.so
#13 0x00007f71d5e873e3 in () at /usr/lib/alsa-lib/libasound_module_pcm_pipewire.so
#14 0x00007f71d732e475 in () at /usr/lib/libasound.so.2
#15 0x00007f71d72eb156 in snd_pcm_close () at /usr/lib/libasound.so.2
#16 0x00007f71d72701cf in () at /usr/lib/libportaudio.so.2
#17 0x00007f71d727087b in () at /usr/lib/libportaudio.so.2
#18 0x00007f71d727596c in () at /usr/lib/libportaudio.so.2
#19 0x00007f71d727625a in PaAlsa_Initialize () at /usr/lib/libportaudio.so.2
#20 0x00007f71d7269336 in Pa_Initialize () at /usr/lib/libportaudio.so.2
#21 0x00007f71d73ef272 in () at /usr/lib/libfluidsynth.so.3
#22 0x00007f71d73f0e38 in new_fluid_settings () at /usr/lib/libfluidsynth.so.3
#23 0x000055aa0cdec22c in main () at test.c:22
Comment by Jan Alexander Steffens (heftig) - Friday, 11 March 2022, 20:15 GMT
Here's valgrind running the test program with glibc built with debug symbols. Maybe upstream can use this.
Comment by Jan Alexander Steffens (heftig) - Friday, 11 March 2022, 20:18 GMT
And this is where it crashes when the test program segfaults:

#0 do_lookup_x (undef_name=undef_name@entry=0x7f5e0e188a6d "BZ2_hbMakeCodeLengths", new_hash=new_hash@entry=2100737866, old_hash=old_hash@entry=0x7ffd7039a000, ref=0x7f5e0e1887f8, result=result@entry=0x7ffd7039a010, scope=<optimized out>, i=2, version=0x0, flags=9, skip=<optimized out>, type_class=<optimized out>, undef_map=0x55a13367eda0) at dl-lookup.c:403
#1 0x00007f5e12e150e1 in _dl_lookup_symbol_x (undef_name=0x7f5e0e188a6d "BZ2_hbMakeCodeLengths", undef_map=undef_map@entry=0x55a13367eda0, ref=ref@entry=0x7ffd7039a128, symbol_scope=symbol_scope@entry=0x55a13367f110, version=0x0, type_class=4, flags=9, skip_map=0x0) at dl-lookup.c:860
#2 0x00007f5e12e1a050 in elf_machine_rela (skip_ifunc=0, reloc_addr_arg=0x7f5e0e199e70, version=0x55a13365c6b8, sym=0x7f5e0e1887f8, reloc=0x7f5e0e188ff0, scope=0x55a13367f110, map=0x55a13367eda0) at ../sysdeps/generic/ldsodefs.h:80
#3 elf_dynamic_do_Rela (skip_ifunc=0, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, scope=<optimized out>, map=0x55a13367eda0) at /usr/src/debug/glibc-2.35/elf/do-rel.h:142
#4 _dl_relocate_object (l=0x55a13367eda0, scope=<optimized out>, reloc_mode=reloc_mode@entry=0, consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:288
#5 0x00007f5e12e177de in dl_open_worker_begin (a=a@entry=0x7ffd7039a4d0) at dl-open.c:702
#6 0x00007f5e12d25e18 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd7039a330, operate=operate@entry=0x7f5e12e17510 <dl_open_worker_begin>, args=args@entry=0x7ffd7039a4d0) at /usr/src/debug/glibc-2.35/elf/dl-error-skeleton.c:208
#7 0x00007f5e12e16d7b in dl_open_worker (a=a@entry=0x7ffd7039a4d0) at dl-open.c:782
#8 0x00007f5e12d25e18 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd7039a4b0, operate=operate@entry=0x7f5e12e16d40 <dl_open_worker>, args=args@entry=0x7ffd7039a4d0) at /usr/src/debug/glibc-2.35/elf/dl-error-skeleton.c:208
#9 0x00007f5e12e1715d in _dl_open (file=0x7ffd7039a4b0 "", mode=<optimized out>, caller_dlopen=0x55a13175d232 <main+201>, nsid=-2, argc=1, argv=0x7ffd7039a8b8, env=0x7ffd7039a8c8) at dl-open.c:883
#10 0x00007f5e12c5474c in dlopen_doit (a=a@entry=0x7ffd7039a740) at dlopen.c:56
#11 0x00007f5e12d25e18 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd7039a6a0, operate=0x7f5e12c546f0 <dlopen_doit>, args=0x7ffd7039a740) at /usr/src/debug/glibc-2.35/elf/dl-error-skeleton.c:208
#12 0x00007f5e12d25ee3 in __GI__dl_catch_error (objname=0x7ffd7039a6f8, errstring=0x7ffd7039a700, mallocedp=0x7ffd7039a6f7, operate=<optimized out>, args=<optimized out>) at /usr/src/debug/glibc-2.35/elf/dl-error-skeleton.c:227
#13 0x00007f5e12c5424e in _dlerror_run (operate=operate@entry=0x7f5e12c546f0 <dlopen_doit>, args=args@entry=0x7ffd7039a740) at dlerror.c:138
#14 0x00007f5e12c547d8 in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>, file=<optimized out>) at dlopen.c:71
#15 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#16 0x000055a13175d232 in main ()
Comment by freswa (frederik) - Thursday, 05 May 2022, 23:29 GMT
The reproducer seems not enough according to upstream. Any news since that was posted on bugzilla?

Loading...