FS#80061 - [gnome-shell] Intermittent segmentation faults on boot

Attached to Project: Arch Linux
Opened by Frantisek Sumsal (mrc0mmand) - Monday, 23 October 2023, 09:45 GMT
Last edited by Toolybird (Toolybird) - Thursday, 26 October 2023, 20:54 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Fabian Bornschein (fabis_cafe)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
For the past couple of weeks/months I've been seeing occasional gnome-shell segfaults in our upstream systemd CI. We don't do anything special with it, the image just automatically boots into the graphical.target for some extra systemd-logind coverage.

After some tweaking, the latest crash also saved a full (and symbolized) stack trace:

#0 __GI_getenv (name=name@entry=0x7f213f472dda "EXPAT_ACCOUNTING_DEBUG")
#1 0x00007f213f4715e0 in getDebugLevel.constprop.0
#2 0x00007f213f457da8 in parserInit
#3 0x00007f213f4615b0 in parserCreate
#4 0x00007f213f46183b in XML_ParserCreate_MM
#5 0x00007f213f46184e in XML_ParserCreate
#6 0x00007f214277c843 in FcConfigParseAndLoadFromMemoryInternal
#7 0x00007f214277d297 in _FcConfigParse
#8 0x00007f214277d47a in FcConfigParseAndLoadDir
#9 _FcConfigParse
#10 0x00007f2142780476 in FcParseInclude (parse=0x7f2002ff53e0)
#11 FcEndElement (userData=0x7f2002ff53e0, name=<optimized out>)
#12 0x00007f213f45f63f in doContent
#13 0x00007f213f45cc14 in contentProcessor
#14 doProlog
#15 0x00007f213f45e7ed in prologProcessor
#16 0x00007f213f4628ea in XML_ParseBuffer
#17 0x00007f214277c945 in FcConfigParseAndLoadFromMemoryInternal
#18 0x00007f214277d297 in _FcConfigParse
#19 0x00007f2142765191 in IA__FcConfigParseAndLoad
#20 FcInitLoadOwnConfig (config=0x7f1fd8000b70)
#21 0x00007f214276015d in FcInitLoadOwnConfigAndFonts (config=0x0)
#22 IA__FcInitLoadConfigAndFonts () at ../fontconfig/src/fcinit.c:184
#23 FcConfigEnsure () at ../fontconfig/src/fccfg.c:96
#24 0x00007f214276548d in FcConfigInit () at ../fontconfig/src/fccfg.c:122
#25 IA__FcInit () at ../fontconfig/src/fcinit.c:193
#26 0x00007f21427a9412 in init_in_thread (task_data=<optimized out>)
#27 0x00007f2143b669a5 in g_thread_proxy (data=0x559b6f5d7e60)
#28 0x00007f21432aa9eb in start_thread (arg=<optimized out>)
#29 0x00007f214332e7cc in clone3 ()

See the attachment (or [0]) for the whole thing, as it's quite big. In case it's needed, there's also the full journal [1] from the machine, as well as the list of all installed packages [2].

[0] https://jenkins-systemd.apps.ocp.cloud.ci.centos.org/job/upstream-vagrant-archlinux-sanitizers/7059/artifact//systemd-centos-ci/artifacts_all/artifacts_dxv80ir8/vagrant-logs.uwh/vagrant-arch-sanitizers-clang-testsuite.9H0/coredumpctl_collect_boot_FAIL.log
[1] https://jenkins-systemd.apps.ocp.cloud.ci.centos.org/job/upstream-vagrant-archlinux-sanitizers/7059/artifact//systemd-centos-ci/artifacts_all/artifacts_dxv80ir8/vagrant-logs.uwh/vagrant-arch-sanitizers-clang-testsuite.9H0/journalctl-testsuite_PASS.log
[2] https://jenkins-systemd.apps.ocp.cloud.ci.centos.org/job/upstream-vagrant-archlinux-sanitizers/7059/artifact//systemd-centos-ci/artifacts_all/artifacts_dxv80ir8/vagrant-logs.uwh/vagrant-arch-sanitizers-clang-installed-pkgs.txt

Additional info:
* package version(s)
gnome-shell 1:45.0+r17+gebf2f8036-1

Closed by Toolybird (Toolybird)
Thursday, 26 October 2023, 20:54 GMT
Reason for closing: Upstream
Additional comments about closing: Please see comments
Comment by loqs (loqs) - Monday, 23 October 2023, 10:09 GMT
[1] suggests that the failing getenv call [2] might be a symptom of earlier memory corruption.
Edit:
Does rebuilding with the address sanitizer '-fsanitize=address' detect any issues?

[1]: https://bugs.launchpad.net/ubuntu/+bug/1979118/comments/3
[2]: https://github.com/libexpat/libexpat/blob/R_2_5_0/expat/lib/xmlparse.c#L8389
Comment by Frantisek Sumsal (mrc0mmand) - Monday, 23 October 2023, 14:01 GMT
I rebuilt the gnome-shell package with -Db_sanitize=address,undefined and after several reboots I got "just" this:

[ 4.162456] archlinux dbus-daemon[439]: [session uid=120 pid=439] Successfully activated service 'org.freedesktop.systemd1'
[ 4.295723] archlinux /usr/lib/gdm-wayland-session[442]: dbus-daemon[442]: [session uid=120 pid=442] Activating service name='org.freedesktop.systemd1' requested by ':1.2' (uid=120 pid=443 comm="/usr/lib/gnome-session-binary --autostart /usr/sha")
[ 4.299184] archlinux /usr/lib/gdm-wayland-session[442]: dbus-daemon[442]: [session uid=120 pid=442] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
[ 4.299562] archlinux gnome-session[443]: gnome-session-binary[443]: WARNING: Could not check if unit gnome-session-wayland@gnome-login.target is active: Error calling StartServiceByName for org.freedesktop.systemd1: Process org.freedesktop.systemd1 exited with status 1
[ 4.299989] archlinux gnome-session-binary[443]: WARNING: Could not check if unit gnome-session-wayland@gnome-login.target is active: Error calling StartServiceByName for org.freedesktop.systemd1: Process org.freedesktop.systemd1 exited with status 1
[ 4.312398] archlinux gnome-session[443]: gnome-session-binary[443]: WARNING: Desktop file /usr/share/gdm/greeter/autostart/orca-autostart.desktop for application orca-autostart.desktop could not be parsed or references a missing TryExec binary
[ 4.312555] archlinux gnome-session-binary[443]: WARNING: Desktop file /usr/share/gdm/greeter/autostart/orca-autostart.desktop for application orca-autostart.desktop could not be parsed or references a missing TryExec binary
[ 4.562029] archlinux gnome-shell[455]: Running GNOME Shell (using mutter 45.0) as a Wayland display server
[ 4.621103] archlinux gnome-shell[455]: Failed to make thread 'KMS thread' realtime scheduled: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name "org.freedesktop.RealtimeKit1" does not exist
[ 4.625626] archlinux org.gnome.Shell.desktop[455]: pci id for fd 12: 1013:00b8, driver (null)
[ 4.625962] archlinux org.gnome.Shell.desktop[455]: MESA-LOADER: failed to open cirrus: /usr/lib/dri/cirrus_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri, suffix _dri)
[ 4.906052] archlinux org.gnome.Shell.desktop[455]: pci id for fd 13: 1013:00b8, driver (null)
[ 4.906052] archlinux org.gnome.Shell.desktop[455]: kmsro: driver missing
[ 4.961439] archlinux gnome-shell[455]: Added device '/dev/dri/card0' (cirrus) using atomic mode setting.
[ 4.962980] archlinux gnome-shell[455]: Failed to initialize accelerated iGPU/dGPU framebuffer sharing: Not hardware accelerated
[ 4.963068] archlinux gnome-shell[455]: Created gbm renderer for '/dev/dri/card0'
[ 4.963184] archlinux gnome-shell[455]: Boot VGA GPU /dev/dri/card0 selected as primary
[ 5.183502] archlinux gnome-shell[455]: Disabling DMA buffer screen sharing (not hardware accelerated)
[ 5.193503] archlinux /usr/lib/gdm-wayland-session[442]: dbus-daemon[442]: [session uid=120 pid=442] Activating service name='org.a11y.Bus' requested by ':1.4' (uid=120 pid=455 comm="/usr/bin/gnome-shell")
[ 5.202405] archlinux /usr/lib/gdm-wayland-session[442]: dbus-daemon[442]: [session uid=120 pid=442] Successfully activated service 'org.a11y.Bus'
[ 5.214010] archlinux kernel: gnome-shell[455]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 5.216770] archlinux gnome-shell[455]: Using public X11 display :1024, (using :1025 for managed services)
[ 5.216932] archlinux gnome-shell[455]: Using Wayland display name 'wayland-0'
[ 5.219519] archlinux org.gnome.Shell.desktop[455]: AddressSanitizer:DEADLYSIGNAL
[ 5.219519] archlinux org.gnome.Shell.desktop[455]: =================================================================
[ 5.219710] archlinux org.gnome.Shell.desktop[455]: ==455==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000007c (pc 0x7fd6bc05f93d bp 0x612000027940 sp 0x7fd6864e8820 T28)
[ 5.219774] archlinux org.gnome.Shell.desktop[455]: ==455==The signal is caused by a READ memory access.
[ 5.219833] archlinux org.gnome.Shell.desktop[455]: ==455==Hint: address points to the zero page.
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #0 0x7fd6bc05f93d in getenv (/usr/lib/libc.so.6+0x4193d) (BuildId: 8bfe03f6bf9b6a6e2591babd0bbc266837d8f658)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #1 0x7fd6b73945df (/usr/lib/libexpat.so.1+0x1f5df) (BuildId: a98bfab551dfa3df6889c33d5fd2ccfa6d505608)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #2 0x7fd6b737add9 (/usr/lib/libexpat.so.1+0x5dd9) (BuildId: a98bfab551dfa3df6889c33d5fd2ccfa6d505608)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #3 0x7fd6b73845af (/usr/lib/libexpat.so.1+0xf5af) (BuildId: a98bfab551dfa3df6889c33d5fd2ccfa6d505608)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #4 0x7fd6b83eb842 (/usr/lib/libfontconfig.so.1+0x2d842) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #5 0x7fd6b83ec296 (/usr/lib/libfontconfig.so.1+0x2e296) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #6 0x7fd6b83ec479 (/usr/lib/libfontconfig.so.1+0x2e479) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #7 0x7fd6b83ef475 (/usr/lib/libfontconfig.so.1+0x31475) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #8 0x7fd6b738263e (/usr/lib/libexpat.so.1+0xd63e) (BuildId: a98bfab551dfa3df6889c33d5fd2ccfa6d505608)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #9 0x7fd6b737fc13 (/usr/lib/libexpat.so.1+0xac13) (BuildId: a98bfab551dfa3df6889c33d5fd2ccfa6d505608)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #10 0x7fd6b73817ec (/usr/lib/libexpat.so.1+0xc7ec) (BuildId: a98bfab551dfa3df6889c33d5fd2ccfa6d505608)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #11 0x7fd6b73858e9 in XML_ParseBuffer (/usr/lib/libexpat.so.1+0x108e9) (BuildId: a98bfab551dfa3df6889c33d5fd2ccfa6d505608)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #12 0x7fd6b83eb944 (/usr/lib/libfontconfig.so.1+0x2d944) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #13 0x7fd6b83ec296 (/usr/lib/libfontconfig.so.1+0x2e296) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #14 0x7fd6b83d4190 (/usr/lib/libfontconfig.so.1+0x16190) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #15 0x7fd6b83cf15c (/usr/lib/libfontconfig.so.1+0x1115c) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #16 0x7fd6b83d448c in FcInit (/usr/lib/libfontconfig.so.1+0x1648c) (BuildId: 2f7305d108d26daad426b3855fe9225ddfef356b)
[ 5.352442] archlinux org.gnome.Shell.desktop[455]: #17 0x7fd6b8416411 (/usr/lib/libpangoft2-1.0.so.0+0x9411) (BuildId: c4942d7c23fe50db42934220b31981cfbf464e48)
[ 5.353672] archlinux org.gnome.Shell.desktop[455]: #18 0x7fd6bcf3f9a4 (/usr/lib/libglib-2.0.so.0+0x8b9a4) (BuildId: 1916d89bc0f8f0932e584f87427c2fedfc8a293b)
[ 5.353672] archlinux org.gnome.Shell.desktop[455]: #19 0x7fd6bc0aa9ea (/usr/lib/libc.so.6+0x8c9ea) (BuildId: 8bfe03f6bf9b6a6e2591babd0bbc266837d8f658)
[ 5.353672] archlinux org.gnome.Shell.desktop[455]: #20 0x7fd6bc12e7cb (/usr/lib/libc.so.6+0x1107cb) (BuildId: 8bfe03f6bf9b6a6e2591babd0bbc266837d8f658)
[ 5.353672] archlinux org.gnome.Shell.desktop[455]: AddressSanitizer can not provide additional info.
[ 5.353672] archlinux org.gnome.Shell.desktop[455]: SUMMARY: AddressSanitizer: SEGV (/usr/lib/libc.so.6+0x4193d) (BuildId: 8bfe03f6bf9b6a6e2591babd0bbc266837d8f658) in getenv
[ 5.353672] archlinux org.gnome.Shell.desktop[455]: Thread T28 created by T0 here:
[ 5.394804] archlinux /usr/lib/gdm-wayland-session[494]: dbus-daemon[494]: Activating service name='org.a11y.atspi.Registry' requested by ':1.0' (uid=120 pid=455 comm="/usr/bin/gnome-shell")
[ 5.399525] archlinux dbus-daemon[370]: [system] Activating via systemd: service name='org.freedesktop.ColorManager' unit='colord.service' requested by ':1.16' (uid=120 pid=455 comm="/usr/bin/gnome-shell")
[ 5.404267] archlinux org.gnome.Shell.desktop[496]: Failed to initialize glamor, falling back to sw
[ 5.430379] archlinux org.gnome.Shell.desktop[455]: #0 0x7fd6bd04a497 in __interceptor_pthread_create /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_interceptors.cpp:208
[ 5.430379] archlinux org.gnome.Shell.desktop[455]: #1 0x7fd6bcf40f53 (/usr/lib/libglib-2.0.so.0+0x8cf53) (BuildId: 1916d89bc0f8f0932e584f87427c2fedfc8a293b)
[ 5.430379] archlinux org.gnome.Shell.desktop[455]: ==455==ABORTING
[ 5.430518] archlinux /usr/lib/gdm-wayland-session[498]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
[ 5.430758] archlinux /usr/lib/gdm-wayland-session[494]: dbus-daemon[494]: Successfully activated service 'org.a11y.atspi.Registry'
[ 5.430825] archlinux systemd[1]: Starting colord.service...
[ 5.431582] archlinux gnome-session[443]: gnome-session-binary[443]: WARNING: App 'org.gnome.Shell.desktop' exited with code 1

But I'm currently debugging a different issue that also involves rebooting a lot, so I'll keep an eye on the logs if something else (and possibly more helpful) pops up.
Comment by Frantisek Sumsal (mrc0mmand) - Monday, 23 October 2023, 17:19 GMT
FWIW, it can be relatively reliably reproduced using the official Vagrant box[0] with the following crude reproducer:

$ mkdir arch-debug
$ cd arch-debug
$ cat >Vagrantfile <<EOF
Vagrant.configure("2") do |config|
  config.vm.box = "archlinux/archlinux"
  config.vm.synced_folder ".", "/vagrant", disabled: true
end
EOF
$ vagrant up --provider=libvirt
$ vagrant ssh -c 'sudo bash -c "systemctl disable systemd-time-wait-sync; pacman --noconfirm -Sy gdm; systemctl set-default graphical.target; systemctl enable gdm"'
$ vagrant reload
$ while ! vagrant ssh -c 'systemctl --wait is-system-running; sleep 10; sudo journalctl -b --grep "[k]illed by signal"'; do vagrant reload; done
...
default: Warning: Connection refused. Retrying...
==> default: Machine booted and ready!
==> default: Creating shared folders metadata...
==> default: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> default: flag to force provisioning. Provisioners marked to run always will still run.
running
Oct 23 17:16:49 archlinux gnome-session-binary[356]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 11
Oct 23 17:16:49 archlinux gnome-session[356]: gnome-session-binary[356]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 11

"Relatively" meaning that it can take a couple dozen tries before gnome-shell crashes.

[0] https://gitlab.archlinux.org/archlinux/arch-boxes
Comment by Toolybird (Toolybird) - Monday, 23 October 2023, 18:56 GMT
It sounds like an upstream issue... assigning to the GNOME PMs for a look.
Comment by Jan Alexander Steffens (heftig) - Monday, 23 October 2023, 22:29 GMT
This sounds like another thread is using setenv/unsetenv/putenv at the same time as your crashing thread is using getenv. Manipulating the environment is not thread-safe.
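
For illustration only, here is a minimal sketch of the ordering that avoids this class of race. It is not code from Mutter, Pango, or fontconfig; the names `configure_environment`, `read_env_worker`, `run_readers`, and the variable `DEMO_BACKEND` are hypothetical. The assumption is that every setenv call happens while the process is still single-threaded, and that threads started afterwards only ever call getenv.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_VAR "DEMO_BACKEND"   /* hypothetical variable name */
#define MAX_READERS 16

/* Call this before any thread exists: setenv() may reallocate the
 * environ array that a concurrent getenv() would be walking. */
int configure_environment(const char *value)
{
    return setenv(DEMO_VAR, value, 1);
}

/* Worker thread: only reads the environment, never modifies it. */
void *read_env_worker(void *arg)
{
    int *saw_value = arg;
    const char *v = getenv(DEMO_VAR);
    *saw_value = (v != NULL && strcmp(v, "wayland") == 0);
    return NULL;
}

/* Starts n reader threads once the environment is final.
 * Returns how many of them saw the expected value, or -1 on error. */
int run_readers(int n)
{
    pthread_t threads[MAX_READERS];
    int results[MAX_READERS] = {0};
    int started = 0, seen = 0;

    if (n < 0 || n > MAX_READERS)
        return -1;

    for (; started < n; started++)
        if (pthread_create(&threads[started], NULL,
                           read_env_worker, &results[started]) != 0)
            break;

    /* Join whatever was started, even if creation failed part way. */
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        seen += results[i];
    }
    return started == n ? seen : -1;
}
```

Calling `configure_environment("wayland")` once at startup and then `run_readers(4)` is expected to return 4. The crash in this report would correspond to the opposite ordering, where a setenv runs while a reader thread is inside getenv.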
Comment by Toolybird (Toolybird) - Thursday, 26 October 2023, 00:30 GMT
> FWIW, it can be relatively reliably reproduced using the official Vagrant box[0] with the following crude reproducer:

I can reproduce it using your instructions. A backtrace with debug symbols is attached. It definitely seems fontconfig-related, but it still doesn't look like an Arch packaging issue...
   gdb.txt (1.6 KiB)
Comment by Jan Alexander Steffens (heftig) - Thursday, 26 October 2023, 00:35 GMT
If you can reproduce this, try to backtrace all threads (`t apply all bt`).
Comment by Toolybird (Toolybird) - Thursday, 26 October 2023, 01:00 GMT
I'm working in a memory-constrained VM, so it was an effort :)
   gdb.txt (30.8 KiB)
Comment by Jan Alexander Steffens (heftig) - Thursday, 26 October 2023, 01:23 GMT
Thanks for the effort!

Hm, Mutter is calling setenv at various points during startup so I wouldn't be surprised if this is racing with Pango's threaded initialization of FontConfig.
Comment by Jan Alexander Steffens (heftig) - Thursday, 26 October 2023, 11:53 GMT
This is a bug in Mutter and/or GNOME Shell. They shouldn't be using setenv after threads are spawned. Could you please file this upstream?
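
As a hedged sketch of one general way to avoid calling setenv in a threaded process: when a variable is only needed by a child process, it can be passed through the child's own environment array instead. This is not taken from Mutter or GNOME Shell and is not necessarily how upstream would fix it; `free_env`, `env_with_override`, `run_with_env`, and the variable `DEMO_CHILD_ONLY` are hypothetical names. The sketch assumes a POSIX system with `/bin/sh`.

```c
#define _POSIX_C_SOURCE 200809L
#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Frees an array produced by env_with_override(). */
void free_env(char **env)
{
    if (env == NULL)
        return;
    for (size_t i = 0; env[i] != NULL; i++)
        free(env[i]);
    free(env);
}

/* Returns a new NULL-terminated copy of environ with NAME set to VALUE.
 * The process-wide environ is only read here, never modified. */
char **env_with_override(const char *name, const char *value)
{
    size_t name_len = strlen(name);
    size_t count = 0, out = 0;

    while (environ != NULL && environ[count] != NULL)
        count++;

    /* Existing entries + the new entry + terminator; calloc zero-fills. */
    char **copy = calloc(count + 2, sizeof(*copy));
    if (copy == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        /* Skip an existing definition of NAME; it is replaced below. */
        if (strncmp(environ[i], name, name_len) == 0 &&
            environ[i][name_len] == '=')
            continue;
        if ((copy[out] = strdup(environ[i])) == NULL) {
            free_env(copy);
            return NULL;
        }
        out++;
    }

    size_t entry_len = name_len + strlen(value) + 2; /* '=' and '\0' */
    if ((copy[out] = malloc(entry_len)) == NULL) {
        free_env(copy);
        return NULL;
    }
    snprintf(copy[out], entry_len, "%s=%s", name, value);
    return copy;
}

/* Runs `/bin/sh -c SCRIPT` with NAME=VALUE visible to the child only.
 * Returns the child's exit status, or -1 on error. */
int run_with_env(const char *script, const char *name, const char *value)
{
    char **child_env = env_with_override(name, value);
    if (child_env == NULL)
        return -1;

    char *const argv[] = { "sh", "-c", (char *)script, NULL };
    pid_t pid;
    int status = -1;
    int rc = posix_spawn(&pid, "/bin/sh", NULL, NULL, argv, child_env);
    free_env(child_env);
    if (rc != 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With this approach the parent never writes to its own environment, so a getenv running in another thread of the parent cannot race with it.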
Comment by loqs (loqs) - Thursday, 26 October 2023, 17:13 GMT
Could it be the same issue as [1]?

[1]: https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/6974
Comment by Toolybird (Toolybird) - Thursday, 26 October 2023, 20:53 GMT
> Could it be the same issue

Thanks @loqs. It does indeed look very similar. I've linked this report there. After the analysis from @heftig, there is not much doubt about this being an upstream issue.
