FS#74814 - [linux-hardened] 5.17.9-hardened hangs during boot
Attached to Project:
Arch Linux
Opened by James Hogan (jhogan) - Saturday, 21 May 2022, 11:28 GMT
Last edited by Levente Polyak (anthraxx) - Thursday, 02 June 2022, 19:50 GMT
Details
Description:
After updating to linux-hardened 5.17.9.hardened1-1, my QEMU/KVM/libvirt VM no longer boots. After removing "quiet" from the kernel command line, I can see that it hangs after "Starting User Login Management.". The cursor stops flashing, and as far as I can tell it doesn't respond to anything. linux-hardened 5.17.7 works fine. linux 5.17.9.arch1-1 works fine.
Closed by Levente Polyak (anthraxx)
Thursday, 02 June 2022, 19:50 GMT
Reason for closing: Fixed
Additional comments about closing: 5.17.12.hardened2-1
Please take a look at dmesg on a different TTY (Ctrl+Alt+F key) or at the journalctl boot log.
Before bisecting, you could try:
1) Use the hardened `config` to build a vanilla kernel from the PKGBUILD sources and test whether it boots.
If that still works, you need to bisect between v5.17.7 and v5.17.9, applying the hardened patch set at each bisect step.
https://drive.google.com/file/d/1FgLijZUrcOcHZTKAyHHjt2s0B2PDGSga/view?usp=sharing linux-5.17.9.arch1-1.1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1afoxkkfnscfMCfEXiUIMgT9fKgOdHMHB/view?usp=sharing linux-headers-5.17.9.arch1-1.1-x86_64.pkg.tar.zst
PKGBUILD.diff shows the one change needed in the PKGBUILD, because the hardened config does not enable DEBUG_INFO_BTF_MODULES, as well as the differences between the two configs.
It boots when I try it in a KVM VM.
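When comparing the hardened and vanilla configs, it can help to check individual options such as DEBUG_INFO_BTF_MODULES directly. The function below is a minimal sketch: `config_state` is a hypothetical helper name, and it only recognises the two standard Kconfig line forms (`CONFIG_X=y|m` and `# CONFIG_X is not set`). For a full comparison, the kernel source tree also ships `scripts/diffconfig`.

```sh
# Hypothetical helper: report whether a Kconfig option is enabled (=y or =m),
# explicitly disabled ("# CONFIG_X is not set") or absent in a .config file.
# Usage: config_state FILE OPTION   (OPTION without the CONFIG_ prefix)
config_state() {
    local file=$1 opt=$2
    if grep -q "^CONFIG_${opt}=[ym]$" "$file"; then
        echo "enabled"
    elif grep -q "^# CONFIG_${opt} is not set$" "$file"; then
        echo "disabled"
    else
        echo "absent"
    fi
}
```

For example, `config_state config DEBUG_INFO_BTF_MODULES` run next to the hardened `config` file would print one of the three states.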
https://drive.google.com/file/d/1uvY39aqNjJiZP2s1fs2jqe6T_vDNHGLS/view?usp=sharing linux-hardened-5.17.8.hardened1-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1qalvqF3PbQYbBSlqZ2qAUAlYtcAv0z6e/view?usp=sharing linux-hardened-headers-5.17.8.hardened1-1-x86_64.pkg.tar.zst
Boot ok.
$ git bisect start
$ git bisect good v5.17.8
$ git bisect bad v5.17.9
Bisecting: 57 revisions left to test after this (roughly 6 steps)
[a1c27ea040e47cbe9bc03b703196a2b506c75905] ASoC: SOF: Fix NULL pointer exception in sof_pci_probe callback
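Each good/bad verdict roughly halves the set of candidate commits, which is why 57 remaining revisions only need about 6 build-and-boot rounds. The sketch below uses the approximation ceil(log2(N + 1)) for "N revisions left to test after this"; it happens to reproduce every "roughly N steps" figure in this thread, but git computes its own estimate with a slightly different internal heuristic, so treat it as an estimate only. `bisect_rounds` is a hypothetical name.

```sh
# Hypothetical helper: approximate the remaining bisect rounds as the
# smallest k with 2^k >= N + 1, where N is git's "revisions left to test
# after this" count.
bisect_rounds() {
    local candidates=$(( $1 + 1 )) rounds=0 covered=1
    while [ "$covered" -lt "$candidates" ]; do
        covered=$(( covered * 2 ))
        rounds=$(( rounds + 1 ))
    done
    echo "$rounds"
}
```

For instance, `bisect_rounds 57` prints 6 and `bisect_rounds 13` prints 4, matching the bisect output in this report.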
a1c27ea040e47cbe9bc03b703196a2b506c75905, built with the 5.17.8 hardened patch set (the 5.17.9 one did not apply cleanly) and the 5.17.9 hardened config:
https://drive.google.com/file/d/1EZ9VyHbXyBn_-ESKEAkSknXEYcnrH2Zg/view?usp=sharing linux-hardened-5.17.8.r57.ga1c27ea040e4-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1qILEovH7hhScjgt--BgmTbYpvuSJHsTo/view?usp=sharing linux-hardened-headers-5.17.8.r57.ga1c27ea040e4-1-x86_64.pkg.tar.zst
Not booting: it freezes after "Loading initial ramdisk", like 5.17.9.hardened1-1 from the repo.
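The bisect package versions above (for example `5.17.8.r57.ga1c27ea040e4`) look like the usual Arch VCS pkgver form: tag, "r" plus the commit count since the tag, then "g" plus the abbreviated hash. The actual pkgver() used for these builds is not shown in this report, so the sketch below is an assumption that follows the common `git describe --long --tags` plus sed pattern; `describe_to_pkgver` is a hypothetical name.

```sh
# Hypothetical helper: turn `git describe --long --tags` output such as
# v5.17.8-57-ga1c27ea040e4 into 5.17.8.r57.ga1c27ea040e4.
#   1. drop the leading "v" of the tag,
#   2. prefix the commit count with "r",
#   3. replace remaining dashes with dots (pkgver may not contain "-").
describe_to_pkgver() {
    printf '%s\n' "$1" | sed 's/^v//; s/\([^-]*-g\)/r\1/; s/-/./g'
}
```

Inside a checkout of the bisect step, `describe_to_pkgver "$(git describe --long --tags)"` would print the version string to use for the package.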
Bisecting: 28 revisions left to test after this (roughly 5 steps)
[a872f3bed07930fd7b10550c441c7b7f83749bb5] dim: initialize all struct fields
https://drive.google.com/file/d/1aTmSNyAEFuDdIsZSFBzYU10xOfQEteof/view?usp=sharing linux-hardened-5.17.8.r28.ga872f3bed079-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1urOTYXIT2JNhc9G1WqKP-sMuBQk9n35U/view?usp=sharing linux-hardened-headers-5.17.8.r28.ga872f3bed079-1-x86_64.pkg.tar.zst
Not booting: it freezes after "Loading initial ramdisk", like 5.17.9.hardened1-1 from the repo.
Bisecting: 13 revisions left to test after this (roughly 4 steps)
[5db0f897ea7cf807f9817a062ee074de5e9f15f1] platform/surface: aggregator: Fix initialization order when compiling as builtin module
https://drive.google.com/file/d/1nhRyXGBHt2_frP3DmaoHarkw5B2Oji0L/view?usp=sharing linux-hardened-5.17.8.r14.g5db0f897ea7c-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/16YuA2WeTAbCEuHZmpcvNcpqqioMAmfg4/view?usp=sharing linux-hardened-headers-5.17.8.r14.g5db0f897ea7c-1-x86_64.pkg.tar.zst
Not booting: it freezes after "Loading initial ramdisk", like 5.17.9.hardened1-1 from the repo.
(My next attempt will be tomorrow.)
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[ac0878d4d67b2158ccaecf420e9a31fa0270ccc0] net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted
https://drive.google.com/file/d/1_Tq75KplQIEJvfOux6pLLwxgRd5hqfio/view?usp=sharing linux-hardened-5.17.8.r7.gac0878d4d67b-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1RPNfxlYuBjrn0VrtbZIhdwDddY_lusbf/view?usp=sharing linux-hardened-headers-5.17.8.r7.gac0878d4d67b-1-x86_64.pkg.tar.zst
Boot OK!
# cat /proc/version
Linux version 5.17.8-hardened1-1-hardened-00007-gac0878d4d67b (linux-hardened@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Sat, 21 May 2022 22:34:15 +0000
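The `-00007-gac0878d4d67b` suffix in /proc/version identifies the commit the running kernel was built from, which is a cheap way to confirm that the booted kernel really is the current bisect step. The sketch assumes the suffix comes from the kernel's `scripts/setlocalversion` (appended when CONFIG_LOCALVERSION_AUTO is set and the tree is a git checkout); `booted_commit` is a hypothetical name.

```sh
# Hypothetical helper: print the abbreviated commit hash ("-gXXXXXXXXXXXX")
# embedded in a kernel version string, or nothing if there is none.
booted_commit() {
    printf '%s\n' "$1" |
        sed -n 's/^Linux version [^ ]*-g\([0-9a-f]\{7,\}\) .*/\1/p'
}
```

Comparing `booted_commit "$(cat /proc/version)"` with `git rev-parse --short=12 HEAD` in the bisect tree should show the same hash; a release build such as 5.17.9-hardened1-1.2-hardened prints nothing.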
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[cd30d7b1b4173a423685a58e9ad19a73b0cf3fbe] net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters
https://drive.google.com/file/d/1T7h96G4Fv2Q8-GxC7W9PZlSRtCk1q_b1/view?usp=sharing linux-hardened-5.17.8.r10.gcd30d7b1b417-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1Zi3GmAuHrOQ5A1ckyVnet69X5mXuH8-A/view?usp=sharing linux-hardened-headers-5.17.8.r10.gcd30d7b1b417-1-x86_64.pkg.tar.zst
Boot OK!
Bisecting: 1 revision left to test after this (roughly 1 step)
[02109faee127f73bb27106394691c452c42a451e] fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove
https://drive.google.com/file/d/1TdeeGNA7Wptd_BkzzbrEBNe0gL69laDp/view?usp=sharing linux-hardened-5.17.8.r12.g02109faee127-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1Pc79OvyFPUXR6Tg3bfc0Lh1gcZyTaO8o/view?usp=sharing linux-hardened-headers-5.17.8.r12.g02109faee127-1-x86_64.pkg.tar.zst
Boot OK!
# cat /proc/version
Linux version 5.17.8-hardened1-1-hardened-00012-g02109faee127 (linux-hardened@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Sun, 22 May 2022 16:27:31 +0000
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[a1aac13288de2935dc1a9330a93b1ac92f1e2b72] fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
https://drive.google.com/file/d/1ZzASCcevbSJUwxjGTUm0ChVgiDEsAeQF/view?usp=sharing linux-hardened-5.17.8.r13.ga1aac13288de-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1qreXTyjKCiXyVTB2F-92Yz05zS45k6Tq/view?usp=sharing linux-hardened-headers-5.17.8.r13.ga1aac13288de-1-x86_64.pkg.tar.zst
Not booting: it freezes after "Loading initial ramdisk", like 5.17.9.hardened1-1 from the repo.
a1aac13288de2935dc1a9330a93b1ac92f1e2b72 is the first bad commit
commit a1aac13288de2935dc1a9330a93b1ac92f1e2b72
Author: Javier Martinez Canillas <javierm@redhat.com>
Date: Fri May 6 00:06:31 2022 +0200
fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
[ Upstream commit b3c9a924aab61adbc29df110006aa03afe1a78ba ]
The driver is calling framebuffer_release() in its .remove callback, but
this will cause the struct fb_info to be freed too early. Since it could
be that a reference is still hold to it if user-space opened the fbdev.
This would lead to a use-after-free error if the framebuffer device was
unregistered but later a user-space process tries to close the fbdev fd.
To prevent this, move the framebuffer_release() call to fb_ops.fb_destroy
instead of doing it in the driver's .remove callback.
Strictly speaking, the code flow in the driver is still wrong because all
the hardware cleanupd (i.e: iounmap) should be done in .remove while the
software cleanup (i.e: releasing the framebuffer) should be done in the
.fb_destroy handler. But this at least makes to match the behavior before
commit 27599aacbaef ("fbdev: Hot-unplug firmware fb devices on forced removal").
Fixes: 27599aacbaef ("fbdev: Hot-unplug firmware fb devices on forced removal")
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20220505220631.366371-1-javierm@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
drivers/video/fbdev/vesafb.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
git bisect log
git bisect start
# good: [039120668dacf48247c0760b12e3eacd6d6b08a2] Linux 5.17.8
git bisect good 039120668dacf48247c0760b12e3eacd6d6b08a2
# bad: [5c2fc53857eb993952e932da8222b11b063c2581] Linux 5.17.9
git bisect bad 5c2fc53857eb993952e932da8222b11b063c2581
# bad: [a1c27ea040e47cbe9bc03b703196a2b506c75905] ASoC: SOF: Fix NULL pointer exception in sof_pci_probe callback
git bisect bad a1c27ea040e47cbe9bc03b703196a2b506c75905
# bad: [a872f3bed07930fd7b10550c441c7b7f83749bb5] dim: initialize all struct fields
git bisect bad a872f3bed07930fd7b10550c441c7b7f83749bb5
# bad: [5db0f897ea7cf807f9817a062ee074de5e9f15f1] platform/surface: aggregator: Fix initialization order when compiling as builtin module
git bisect bad 5db0f897ea7cf807f9817a062ee074de5e9f15f1
# good: [ac0878d4d67b2158ccaecf420e9a31fa0270ccc0] net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted
git bisect good ac0878d4d67b2158ccaecf420e9a31fa0270ccc0
# good: [cd30d7b1b4173a423685a58e9ad19a73b0cf3fbe] net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters
git bisect good cd30d7b1b4173a423685a58e9ad19a73b0cf3fbe
# good: [02109faee127f73bb27106394691c452c42a451e] fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove
git bisect good 02109faee127f73bb27106394691c452c42a451e
# bad: [a1aac13288de2935dc1a9330a93b1ac92f1e2b72] fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
git bisect bad a1aac13288de2935dc1a9330a93b1ac92f1e2b72
# first bad commit: [a1aac13288de2935dc1a9330a93b1ac92f1e2b72] fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
https://drive.google.com/file/d/1niCW55vFlx9prQgJK5vt9yurHUMI8m_f/view?usp=sharing linux-hardened-5.17.9-1.2-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1WQI5Z3nDO4jYssy36HwgfWGERa1Q_J54/view?usp=sharing linux-hardened-headers-5.17.9-1.2-x86_64.pkg.tar.zst
However, all three commits make similar changes to the same code paths; it may just be that vesafb is the driver in use there.
If the fundamental assumption behind those commits is faulty, it must affect all three drivers (vesafb, efifb, simplefb) to the same degree:
* a1aac13288de2 - Javier Martinez Canillas - fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove (4 days ago)
* 02109faee127f - Javier Martinez Canillas - fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove (4 days ago)
* 8872a31f204b1 - Javier Martinez Canillas - fbdev: simplefb: Cleanup fb_info in .fb_destroy rather than .remove (4 days ago)
I will create a temporary release with that set of patches reverted. However, it would be great if you could all stick around for further debugging so we can get the patches fixed in the kernel.
I'll prepare some debugging patches and read up on the architecture and API of the fbdev subsystem to understand the issue. Most likely, though, some page verification leads to a panic that the regular kernel may simply ignore; the vanilla kernel often prefers that over refusing further execution.
If nobody else (looking at loqs here :P) comes up with more ideas or test releases with debugging patches, I'll try to hack them together. A reproducer that forces vesafb would be nice.
# cat /proc/version
Linux version 5.17.9-hardened1-1.2-hardened (linux-hardened@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Sun, 22 May 2022 18:13:27 +0000
https://fedoraproject.org/wiki/Changes/ReplaceFbdevDrivers
https://gitlab.com/cki-project/kernel-ark/-/commit/53d2c01aef5aaad8b8bb54e4254f6ae671c76ee5
https://blog.dowhile0.org/2022/04/22/fedora-36-a-brave-new-drm-kms-only-world/
Framebuffer drivers have been disabled in the kernel and replaced by simpledrm in Alpine Linux 3.15.0 (2021-11-24).
Two notes:
- The crash occurs rather late during boot: the initramfs definitely gets loaded and at least some of the mkinitcpio hooks are executed, because the "encryptssh" hook still works before the crash on my system. However, very shortly after that hook finishes, the kernel hangs. Since the hooks after "encryptssh" don't do anything video-related, I assume that the crash occurs after switching from the initramfs to the actual rootfs on the SSD/HDD.
- Since vesafb is affected, this can probably only be triggered on machines booting in BIOS mode. However, since a second machine with a much older Centerton CPU does not crash although it also boots in BIOS mode with exactly the same mkinitcpio hooks, booting in BIOS mode does not seem to be the only factor involved in triggering this bug.
//EDIT: This bug can also be reproduced in a VM that boots in BIOS mode and uses the QXL GPU (interestingly, the VGA GPU doesn't trigger the bug). Perhaps add "console=ttyS0,115200 loglevel=7" to the kernel's boot parameters, wait till it crashes, then check the serial console of the VM. I've attached a bootlog that shows the bug.
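Once the serial console output is captured to a file, the interesting part is usually the first kernel failure message. The sketch below assumes the VM's serial port is redirected to a file (for example with QEMU's `-serial file:serial.log`, together with `-vga qxl` to match the failing setup) and greps for common kernel failure markers; `scan_kernel_log` is a hypothetical name, and the marker list is not exhaustive.

```sh
# Hypothetical helper: print lines (with line numbers) from a captured
# serial-console log that contain common kernel failure markers.
# Assumes the guest booted with "console=ttyS0,115200 loglevel=7" and the
# host wrote the serial port to a file, e.g. qemu ... -serial file:serial.log
scan_kernel_log() {
    grep -n -E 'BUG:|Oops|Kernel panic|general protection fault|WARNING: CPU' "$1"
}
```

Running `scan_kernel_log serial.log` after the hang narrows a long boot log down to the lines worth attaching to the report.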
linux-hardened 5.17.11.hardened2-1 works fine. Thank you very much!
- GRUB's "vbe" module gets loaded
- GRUB's "gfxterm" is enabled (line "terminal_output gfxterm" in grub.cfg)
- The menu entry in grub.cfg contains the line "set gfxpayload=keep"
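To check whether an existing grub.cfg matches the three conditions above, a simple grep is enough. This is a sketch under two assumptions: the vbe module is loaded through an explicit `insmod vbe` line (configs that load it via `insmod all_video` are not detected), and, as noted in the edit below, these may not be the only circumstances that trigger the bug. `grub_trigger_report` is a hypothetical name.

```sh
# Hypothetical helper: report which of the three GRUB settings listed above
# appear in a grub.cfg. Only explicit "insmod vbe" lines are recognised.
grub_trigger_report() {
    local cfg=$1 pattern
    for pattern in 'insmod vbe' 'terminal_output gfxterm' 'set gfxpayload=keep'; do
        if grep -q -E "^[[:space:]]*${pattern}([[:space:]]|$)" "$cfg"; then
            echo "present: $pattern"
        else
            echo "missing: $pattern"
        fi
    done
}
```

For example, `grub_trigger_report /boot/grub/grub.cfg` prints one present/missing line per setting.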
The attached script needs to be run as root to be able to create the VM image, and outputs the command required to run the VM as a non-root user in qemu.
//EDIT: I should mention though that I'm not 100% sure if these are the only circumstances under which the bug occurs, since on the bare metal machine where I originally ran into this bug, the relevant GRUB menu entries do not contain the "set gfxpayload=keep" line (though other entries in the GRUB config do contain that line).
https://marc.info/?l=linux-kernel&m=165359685517072&q=raw
I've confirmed this to work in both the test VM setup from my previous comment and on the actual bare metal machine where I originally discovered the issue.
https://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev.git/commit/?h=for-next&id=acde4003efc16480375543638484d8f13f2e99a3