FS#74814 - [linux-hardened] 5.17.9-hardened hangs during boot

Attached to Project: Arch Linux
Opened by James Hogan (jhogan) - Saturday, 21 May 2022, 11:28 GMT
Last edited by Levente Polyak (anthraxx) - Thursday, 02 June 2022, 19:50 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Levente Polyak (anthraxx)
Architecture x86_64
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 6
Private No

Details

Description:
After updating to linux-hardened 5.17.9.hardened1-1, my QEMU/KVM/libvirt VM no longer boots. After removing "quiet" from the kernel command line, it hangs after "Starting User Login Management.". The cursor stops flashing, and the VM doesn't respond to anything as far as I can tell.

linux-hardened 5.17.7 works fine.
linux 5.17.9.arch1-1 works fine.

Closed by  Levente Polyak (anthraxx)
Thursday, 02 June 2022, 19:50 GMT
Reason for closing:  Fixed
Additional comments about closing:  5.17.12.hardened2-1
Comment by N.T. (NikTo) - Saturday, 21 May 2022, 13:07 GMT
Comment by Levente Polyak (anthraxx) - Saturday, 21 May 2022, 13:29 GMT
Can't reproduce this either on hosts or under virtualization. People who are affected need to debug/bisect this. There have been zero kconfig or hardened patch changes between 5.17.7.a and 5.17.9.a, so the root cause must come from a vanilla commit in between v5.17.7 and v5.17.9.

Please take a look at dmesg on a different tty (CTRL-ALT-<F key>) or at the journalctl boot log.


Before bisecting, you could try:
1) Use the hardened `config` to compile a vanilla kernel PKGBUILD from source and test if it works

If that still works, you need to bisect between v5.17.7 and v5.17.9, applying the hardened patch set at each bisect step.
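A minimal sketch of what one such bisect step could look like, assuming the bisect is run directly in a kernel git tree and the hardened patch set is available as a single patch file (the file name below is a placeholder). The helper names `run`, `prepare_step` and `finish_step` are hypothetical, and the builds later in this thread were done through a PKGBUILD rather than this way.

```sh
#!/bin/sh
# Hypothetical helpers for bisecting with an out-of-tree patch set.
# Assumption: the current directory is a kernel git tree in which
# "git bisect start", "git bisect good <rev>" and "git bisect bad <rev>"
# have already been run.

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply the patch set on top of the revision git bisect checked out.
# --index also stages the changes, so files created by the patch are
# removed again by the reset in finish_step.
prepare_step() {
    run git apply --index "$1"
}

# After building and boot-testing: drop the patch, then record the
# verdict ("good" or "bad") so git bisect picks the next revision.
finish_step() {
    run git reset --hard HEAD &&
    run git bisect "$1"
}

# Example round (not executed here):
#   prepare_step ../linux-hardened.patch
#   ... build, install, boot the test machine ...
#   finish_step bad
```

Resetting before `git bisect good` or `git bisect bad` matters because git refuses to check out the next revision over conflicting local changes.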
Comment by loqs (loqs) - Saturday, 21 May 2022, 18:02 GMT
5.17.9.arch1-1 with config from 5.17.9-hardened1
https://drive.google.com/file/d/1FgLijZUrcOcHZTKAyHHjt2s0B2PDGSga/view?usp=sharing linux-5.17.9.arch1-1.1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1afoxkkfnscfMCfEXiUIMgT9fKgOdHMHB/view?usp=sharing linux-headers-5.17.9.arch1-1.1-x86_64.pkg.tar.zst

PKGBUILD.diff shows the one change needed to the PKGBUILD, as the hardened config does not enable DEBUG_INFO_BTF_MODULES, and also the differences between the configs.
Comment by Alec Trevelian (Trevelian) - Saturday, 21 May 2022, 18:53 GMT
https://drive.google.com/file/d/1FgLijZUrcOcHZTKAyHHjt2s0B2PDGSga/view?usp=sharing linux-5.17.9.arch1-1.1-x86_64.pkg.tar.zst

It's booting when I try it on a KVM VM.
Comment by loqs (loqs) - Saturday, 21 May 2022, 19:35 GMT
5.17.8 hardened config and patch set.
https://drive.google.com/file/d/1uvY39aqNjJiZP2s1fs2jqe6T_vDNHGLS/view?usp=sharing linux-hardened-5.17.8.hardened1-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1qalvqF3PbQYbBSlqZ2qAUAlYtcAv0z6e/view?usp=sharing linux-hardened-headers-5.17.8.hardened1-1-x86_64.pkg.tar.zst
Comment by Alec Trevelian (Trevelian) - Saturday, 21 May 2022, 19:58 GMT
https://drive.google.com/file/d/1uvY39aqNjJiZP2s1fs2jqe6T_vDNHGLS/view?usp=sharing linux-hardened-5.17.8.hardened1-1-x86_64.pkg.tar.zst

Boot ok.
Comment by loqs (loqs) - Saturday, 21 May 2022, 21:03 GMT
Just to confirm: the KVM VM hangs with linux-hardened 5.17.9.hardened1-1?

$ git bisect start
$ git bisect good v5.17.8
$ git bisect bad v5.17.9
Bisecting: 57 revisions left to test after this (roughly 6 steps)
[a1c27ea040e47cbe9bc03b703196a2b506c75905] ASoC: SOF: Fix NULL pointer exception in sof_pci_probe callback
a1c27ea040e47cbe9bc03b703196a2b506c75905 with 5.17.8 hardened patch set (5.17.9 did not apply cleanly) 5.17.9 hardened config

https://drive.google.com/file/d/1EZ9VyHbXyBn_-ESKEAkSknXEYcnrH2Zg/view?usp=sharing linux-hardened-5.17.8.r57.ga1c27ea040e4-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1qILEovH7hhScjgt--BgmTbYpvuSJHsTo/view?usp=sharing linux-hardened-headers-5.17.8.r57.ga1c27ea040e4-1-x86_64.pkg.tar.zst
Comment by Alec Trevelian (Trevelian) - Saturday, 21 May 2022, 21:10 GMT
https://drive.google.com/file/d/1EZ9VyHbXyBn_-ESKEAkSknXEYcnrH2Zg/view?usp=sharing linux-hardened-5.17.8.r57.ga1c27ea040e4-1-x86_64.pkg.tar.zst

Not booting, freeze after "Loading initial ramdisk" like the "5.17.9.hardened1-1" in the repo.
Comment by loqs (loqs) - Saturday, 21 May 2022, 21:40 GMT
git bisect bad
Bisecting: 28 revisions left to test after this (roughly 5 steps)
[a872f3bed07930fd7b10550c441c7b7f83749bb5] dim: initialize all struct fields

https://drive.google.com/file/d/1aTmSNyAEFuDdIsZSFBzYU10xOfQEteof/view?usp=sharing linux-hardened-5.17.8.r28.ga872f3bed079-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1urOTYXIT2JNhc9G1WqKP-sMuBQk9n35U/view?usp=sharing linux-hardened-headers-5.17.8.r28.ga872f3bed079-1-x86_64.pkg.tar.zst
Comment by Alec Trevelian (Trevelian) - Saturday, 21 May 2022, 21:55 GMT
https://drive.google.com/file/d/1aTmSNyAEFuDdIsZSFBzYU10xOfQEteof/view?usp=sharing linux-hardened-5.17.8.r28.ga872f3bed079-1-x86_64.pkg.tar.zst

Not booting, freeze after "Loading initial ramdisk" like the "5.17.9.hardened1-1" in the repo.
Comment by loqs (loqs) - Saturday, 21 May 2022, 22:13 GMT
git bisect bad
Bisecting: 13 revisions left to test after this (roughly 4 steps)
[5db0f897ea7cf807f9817a062ee074de5e9f15f1] platform/surface: aggregator: Fix initialization order when compiling as builtin module

https://drive.google.com/file/d/1nhRyXGBHt2_frP3DmaoHarkw5B2Oji0L/view?usp=sharing linux-hardened-5.17.8.r14.g5db0f897ea7c-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/16YuA2WeTAbCEuHZmpcvNcpqqioMAmfg4/view?usp=sharing linux-hardened-headers-5.17.8.r14.g5db0f897ea7c-1-x86_64.pkg.tar.zst
Comment by Alec Trevelian (Trevelian) - Saturday, 21 May 2022, 22:23 GMT
https://drive.google.com/file/d/1nhRyXGBHt2_frP3DmaoHarkw5B2Oji0L/view?usp=sharing linux-hardened-5.17.8.r14.g5db0f897ea7c-1-x86_64.pkg.tar.zst

Not booting, freeze after "Loading initial ramdisk" like the "5.17.9.hardened1-1" in the repo.

(next try from me will be for tomorrow)
Comment by loqs (loqs) - Saturday, 21 May 2022, 22:49 GMT
git bisect bad
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[ac0878d4d67b2158ccaecf420e9a31fa0270ccc0] net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted

https://drive.google.com/file/d/1_Tq75KplQIEJvfOux6pLLwxgRd5hqfio/view?usp=sharing linux-hardened-5.17.8.r7.gac0878d4d67b-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1RPNfxlYuBjrn0VrtbZIhdwDddY_lusbf/view?usp=sharing linux-hardened-headers-5.17.8.r7.gac0878d4d67b-1-x86_64.pkg.tar.zst
Comment by Alec Trevelian (Trevelian) - Sunday, 22 May 2022, 05:47 GMT
https://drive.google.com/file/d/1_Tq75KplQIEJvfOux6pLLwxgRd5hqfio/view?usp=sharing linux-hardened-5.17.8.r7.gac0878d4d67b-1-x86_64.pkg.tar.zst

Boot OK !

# cat /proc/version
Linux version 5.17.8-hardened1-1-hardened-00007-gac0878d4d67b (linux-hardened@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Sat, 21 May 2022 22:34:15 +0000
Comment by loqs (loqs) - Sunday, 22 May 2022, 13:35 GMT
git bisect good
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[cd30d7b1b4173a423685a58e9ad19a73b0cf3fbe] net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters

https://drive.google.com/file/d/1T7h96G4Fv2Q8-GxC7W9PZlSRtCk1q_b1/view?usp=sharing linux-hardened-5.17.8.r10.gcd30d7b1b417-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1Zi3GmAuHrOQ5A1ckyVnet69X5mXuH8-A/view?usp=sharing linux-hardened-headers-5.17.8.r10.gcd30d7b1b417-1-x86_64.pkg.tar.zst
Comment by Alec Trevelian (Trevelian) - Sunday, 22 May 2022, 15:47 GMT
https://drive.google.com/file/d/1T7h96G4Fv2Q8-GxC7W9PZlSRtCk1q_b1/view?usp=sharing linux-hardened-5.17.8.r10.gcd30d7b1b417-1-x86_64.pkg.tar.zst


Boot OK !
Comment by loqs (loqs) - Sunday, 22 May 2022, 16:42 GMT
git bisect good
Bisecting: 1 revision left to test after this (roughly 1 step)
[02109faee127f73bb27106394691c452c42a451e] fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove

https://drive.google.com/file/d/1TdeeGNA7Wptd_BkzzbrEBNe0gL69laDp/view?usp=sharing linux-hardened-5.17.8.r12.g02109faee127-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1Pc79OvyFPUXR6Tg3bfc0Lh1gcZyTaO8o/view?usp=sharing linux-hardened-headers-5.17.8.r12.g02109faee127-1-x86_64.pkg.tar.zst
Comment by Alec Trevelian (Trevelian) - Sunday, 22 May 2022, 17:08 GMT
https://drive.google.com/file/d/1TdeeGNA7Wptd_BkzzbrEBNe0gL69laDp/view?usp=sharing linux-hardened-5.17.8.r12.g02109faee127-1-x86_64.pkg.tar.zst

Boot OK !

# cat /proc/version
Linux version 5.17.8-hardened1-1-hardened-00012-g02109faee127 (linux-hardened@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Sun, 22 May 2022 16:27:31 +0000
Comment by loqs (loqs) - Sunday, 22 May 2022, 17:39 GMT
git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[a1aac13288de2935dc1a9330a93b1ac92f1e2b72] fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove

https://drive.google.com/file/d/1ZzASCcevbSJUwxjGTUm0ChVgiDEsAeQF/view?usp=sharing linux-hardened-5.17.8.r13.ga1aac13288de-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1qreXTyjKCiXyVTB2F-92Yz05zS45k6Tq/view?usp=sharing linux-hardened-headers-5.17.8.r13.ga1aac13288de-1-x86_64.pkg.tar.zst
Comment by Alec Trevelian (Trevelian) - Sunday, 22 May 2022, 17:43 GMT
https://drive.google.com/file/d/1ZzASCcevbSJUwxjGTUm0ChVgiDEsAeQF/view?usp=sharing linux-hardened-5.17.8.r13.ga1aac13288de-1-x86_64.pkg.tar.zst

Not booting, freeze after "Loading initial ramdisk" like the "5.17.9.hardened1-1" in the repo.
Comment by loqs (loqs) - Sunday, 22 May 2022, 17:50 GMT
git bisect bad
a1aac13288de2935dc1a9330a93b1ac92f1e2b72 is the first bad commit
commit a1aac13288de2935dc1a9330a93b1ac92f1e2b72
Author: Javier Martinez Canillas <javierm@redhat.com>
Date: Fri May 6 00:06:31 2022 +0200

fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove

[ Upstream commit b3c9a924aab61adbc29df110006aa03afe1a78ba ]

The driver is calling framebuffer_release() in its .remove callback, but
this will cause the struct fb_info to be freed too early. Since it could
be that a reference is still hold to it if user-space opened the fbdev.

This would lead to a use-after-free error if the framebuffer device was
unregistered but later a user-space process tries to close the fbdev fd.

To prevent this, move the framebuffer_release() call to fb_ops.fb_destroy
instead of doing it in the driver's .remove callback.

Strictly speaking, the code flow in the driver is still wrong because all
the hardware cleanupd (i.e: iounmap) should be done in .remove while the
software cleanup (i.e: releasing the framebuffer) should be done in the
.fb_destroy handler. But this at least makes to match the behavior before
commit 27599aacbaef ("fbdev: Hot-unplug firmware fb devices on forced removal").

Fixes: 27599aacbaef ("fbdev: Hot-unplug firmware fb devices on forced removal")
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20220505220631.366371-1-javierm@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>

drivers/video/fbdev/vesafb.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

git bisect log
git bisect start
# good: [039120668dacf48247c0760b12e3eacd6d6b08a2] Linux 5.17.8
git bisect good 039120668dacf48247c0760b12e3eacd6d6b08a2
# bad: [5c2fc53857eb993952e932da8222b11b063c2581] Linux 5.17.9
git bisect bad 5c2fc53857eb993952e932da8222b11b063c2581
# bad: [a1c27ea040e47cbe9bc03b703196a2b506c75905] ASoC: SOF: Fix NULL pointer exception in sof_pci_probe callback
git bisect bad a1c27ea040e47cbe9bc03b703196a2b506c75905
# bad: [a872f3bed07930fd7b10550c441c7b7f83749bb5] dim: initialize all struct fields
git bisect bad a872f3bed07930fd7b10550c441c7b7f83749bb5
# bad: [5db0f897ea7cf807f9817a062ee074de5e9f15f1] platform/surface: aggregator: Fix initialization order when compiling as builtin module
git bisect bad 5db0f897ea7cf807f9817a062ee074de5e9f15f1
# good: [ac0878d4d67b2158ccaecf420e9a31fa0270ccc0] net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted
git bisect good ac0878d4d67b2158ccaecf420e9a31fa0270ccc0
# good: [cd30d7b1b4173a423685a58e9ad19a73b0cf3fbe] net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters
git bisect good cd30d7b1b4173a423685a58e9ad19a73b0cf3fbe
# good: [02109faee127f73bb27106394691c452c42a451e] fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove
git bisect good 02109faee127f73bb27106394691c452c42a451e
# bad: [a1aac13288de2935dc1a9330a93b1ac92f1e2b72] fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
git bisect bad a1aac13288de2935dc1a9330a93b1ac92f1e2b72
# first bad commit: [a1aac13288de2935dc1a9330a93b1ac92f1e2b72] fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
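
The step estimates git printed during this bisect (57 revisions -> 6 steps, 28 -> 5, 13 -> 4, 6 -> 3, 3 -> 2, 1 -> 1, 0 -> 0) are what one would expect from a binary search, which halves the candidate range on every test. The sketch below reproduces those numbers with a simple halving count. The function name `bisect_steps` is made up for illustration, and this is an approximation that happens to match the figures in this thread, not necessarily git's exact estimator.

```sh
#!/bin/sh
# bisect_steps N: count how many times N can be halved (integer division)
# before reaching zero, i.e. floor(log2(N)) + 1 for N >= 1, and 0 for N = 0.
bisect_steps() {
    remaining=$1
    steps=0
    while [ "$remaining" -gt 0 ]; do
        remaining=$((remaining / 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Print the estimate for each "revisions left" value seen in this thread.
for left in 57 28 13 6 3 1 0; do
    echo "$left revisions left: roughly $(bisect_steps "$left") steps"
done
```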
Comment by loqs (loqs) - Sunday, 22 May 2022, 18:26 GMT
5.17.9 with a1aac13288de2935dc1a9330a93b1ac92f1e2b72 reverted, hardened patch set and config.

https://drive.google.com/file/d/1niCW55vFlx9prQgJK5vt9yurHUMI8m_f/view?usp=sharing linux-hardened-5.17.9-1.2-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1WQI5Z3nDO4jYssy36HwgfWGERa1Q_J54/view?usp=sharing linux-hardened-headers-5.17.9-1.2-x86_64.pkg.tar.zst
Comment by Levente Polyak (anthraxx) - Sunday, 22 May 2022, 18:50 GMT
Interesting: alongside one other commit, this one was already on my list of potential candidates from the changelog.

However, all three commits make similar changes to the same code architecture; it may just be that vesafb is the driver in use there.
If the fundamental assumption behind those commits is faulty, it must affect all of them (vesafb, efifb, simplefb) to the same degree:

* a1aac13288de2 - Javier Martinez Canillas - fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove (4 days ago)
* 02109faee127f - Javier Martinez Canillas - fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove (4 days ago)
* 8872a31f204b1 - Javier Martinez Canillas - fbdev: simplefb: Cleanup fb_info in .fb_destroy rather than .remove (4 days ago)


I will create a temporary release with that set of patches reverted. However, it would be great if you all could stick around for further debugging so we can get the patches addressed in the kernel.
I'll prepare some debugging patches and read into the architecture and API of the fbdev subsystem to understand the issue, but most likely some page verification leads to a panic in the hardened kernel that the regular kernel simply ignores; the vanilla kernel often prefers to carry on rather than deny further execution.

If nobody else (looking at loqs here :P) comes up with more ideas or debugging patch test releases, I'll try to hack them together. A reproducer that forces vesafb would be nice.
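
To see which framebuffer driver a machine actually uses, one option is to boot a kernel that still works (for example 5.17.8) and read /proc/fb, which lists one registered framebuffer per line as "<node> <id>". The sketch below strips the node number and prints the id. The function name `fb_driver_ids` is hypothetical, and the id strings mentioned in the comments ("VESA VGA" for vesafb, "EFI VGA" for efifb) are stated from memory of the driver sources, so treat them as an assumption.

```sh
#!/bin/sh
# fb_driver_ids [FILE]: print the id of every registered framebuffer.
# FILE defaults to /proc/fb; passing another path is only meant for testing.
# Assumed ids: "VESA VGA" for vesafb, "EFI VGA" for efifb.
fb_driver_ids() {
    # Each line looks like "0 VESA VGA"; drop the leading node number.
    sed 's/^[0-9][0-9]* //' "${1:-/proc/fb}"
}

# uses_vesafb [FILE]: succeed if one of the framebuffers is vesafb.
uses_vesafb() {
    fb_driver_ids "$@" | grep -qx 'VESA VGA'
}

# Example (not executed here):
#   fb_driver_ids
#   uses_vesafb && echo "vesafb is in use"
```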
Comment by Alec Trevelian (Trevelian) - Sunday, 22 May 2022, 19:34 GMT
I confirm that "5.17.9 with a1aac13288de2935dc1a9330a93b1ac92f1e2b72 reverted, hardened patch set and config." is working.

# cat /proc/version
Linux version 5.17.9-hardened1-1.2-hardened (linux-hardened@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Sun, 22 May 2022 18:13:27 +0000
Comment by N.T. (NikTo) - Sunday, 22 May 2022, 21:21 GMT
Comment by Pascal Ernster (hardfalcon) - Monday, 23 May 2022, 09:18 GMT
One of my machines is also affected by this (a Kaby Lake machine booting in BIOS mode), and I can confirm that using a hardened kernel with a1aac13288de2935dc1a9330a93b1ac92f1e2b72 reverted fixes the problem.

Two notes:

- The crash occurs rather late during boot - the initramfs definitely does get loaded and at least some of the mkinitcpio hooks get executed, because the "encryptssh" hook still works before the crash on my system. However, very shortly after that hook is finished, the kernel hangs. Since the other hooks after "encryptssh" don't do anything video-related, I assume that the crash occurs after pivoting from the initramfs to the actual rootfs on the SSD/HDD.

- Since vesafb is affected, this can probably only be triggered on machines booting in BIOS mode. However, since a second machine with a much older Centerton CPU does not crash although it also boots in BIOS mode with exactly the same mkinitcpio hooks, booting in BIOS mode does not seem to be the only factor involved in triggering this bug.


//EDIT: This bug can also be reproduced in a VM that boots in BIOS mode and uses the QXL GPU (interestingly, the VGA GPU doesn't trigger the bug). Perhaps add "console=ttyS0,115200 loglevel=7" to the kernel's boot parameters, wait till it crashes, then check the serial console of the VM. I've attached a bootlog that shows the bug.
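
A small sketch of how the suggested boot parameters could be combined with an existing kernel command line: it drops "quiet" (as the original report did) and appends the serial console settings from the comment above. The function name `debug_cmdline` and the example root device are made up for illustration, and the VM still needs a serial port attached for anything to show up on ttyS0.

```sh
#!/bin/sh
# debug_cmdline "CMDLINE": print CMDLINE without the word "quiet" and with
# "console=ttyS0,115200 loglevel=7" appended.
# The body runs in a subshell so that "set -f" does not leak to the caller.
debug_cmdline() (
    set -f  # no filename globbing while splitting the command line
    out=
    for word in $1; do
        if [ "$word" != quiet ]; then
            out="${out:+$out }$word"
        fi
    done
    echo "${out:+$out }console=ttyS0,115200 loglevel=7"
)

# Example with a hypothetical command line:
debug_cmdline 'root=/dev/vda1 rw quiet'
```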
Comment by Levente Polyak (anthraxx) - Wednesday, 25 May 2022, 19:48 GMT
I have released 5.17.11.hardened2-1 with the workaround to revert the fbdev changes. Let's keep using this bug to track down the root issue in the vanilla changes or related to the driver usage.
Comment by N.T. (NikTo) - Wednesday, 25 May 2022, 20:58 GMT
linux 5.18.arch1-1 from testing hangs during boot with many errors.
linux-hardened 5.17.11.hardened2-1 works fine. Thank you very much!
Comment by James Hogan (jhogan) - Thursday, 26 May 2022, 08:19 GMT
linux-hardened 5.17.11-hardened2 works for me now too. Thanks for bisecting, folks; I wouldn't have found the time to do so immediately.
Comment by Pascal Ernster (hardfalcon) - Thursday, 26 May 2022, 14:22 GMT
I've tried to hunt down this issue a little more, and I've come up with a shell script that builds a VM image that you can boot with qemu to reproduce the bug. Turns out this might actually be a bug in GRUB's "vbe" module. To be more precise, all of the following conditions seem to be required to trigger the bug:
- GRUB's "vbe" module gets loaded
- GRUB's "gfxterm" is enabled (line "terminal_output gfxterm" in grub.cfg)
- The menu entry in grub.cfg contains the line "set gfxpayload=keep"

The attached script needs to be run as root to be able to create the VM image, and outputs the command required to run the VM as a non-root user in qemu.

//EDIT: I should mention though that I'm not 100% sure if these are the only circumstances under which the bug occurs, since on the bare metal machine where I originally ran into this bug, the relevant GRUB menu entries do not contain the "set gfxpayload=keep" line (though other entries in the GRUB config do contain that line).
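
For anyone who wants to check whether their own GRUB configuration matches the three conditions listed above, a rough sketch follows. It assumes the vbe module is loaded through an explicit "insmod vbe" line in grub.cfg, which is an assumption about how the config is written, and the function name `grub_cfg_matches_trigger` is hypothetical. As the edit above notes, a config that does not match may still be affected.

```sh
#!/bin/sh
# grub_cfg_matches_trigger FILE: succeed if FILE contains all three lines
# named in the comment above: vbe loaded, gfxterm enabled, gfxpayload kept.
# Assumption: vbe is loaded via a literal "insmod vbe" line.
grub_cfg_matches_trigger() {
    cfg=$1
    grep -Eq '^[[:space:]]*insmod[[:space:]]+vbe([[:space:]]|$)' "$cfg" &&
    grep -Eq '^[[:space:]]*terminal_output[[:space:]]+gfxterm' "$cfg" &&
    grep -Eq '^[[:space:]]*set[[:space:]]+gfxpayload=keep' "$cfg"
}

# Example (not executed here; the usual path on Arch is /boot/grub/grub.cfg):
#   grub_cfg_matches_trigger /boot/grub/grub.cfg && echo "config matches"
```

The check looks at the whole file, so it cannot tell which menu entry sets gfxpayload.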
Comment by Pascal Ernster (hardfalcon) - Friday, 27 May 2022, 07:49 GMT
The following patch fixes the issue for me even when commit a1aac13288de2935dc1a9330a93b1ac92f1e2b72 and the other commits that were reverted in linux-hardened 5.17.11-hardened2 are not reverted (I've built my test kernel with linux-hardened-5.17.11-hardened1.patch):

https://marc.info/?l=linux-kernel&m=165359685517072&q=raw

I've confirmed this to work in both the test VM setup from my previous comment and on the actual bare metal machine where I originally discovered the issue.
Comment by Pascal Ernster (hardfalcon) - Saturday, 28 May 2022, 17:42 GMT
Comment by Pascal Ernster (hardfalcon) - Tuesday, 31 May 2022, 02:13 GMT
linux-hardened 5.17.12.hardened2-1 solves the issue for me as well. :)
