FS#74891 - kernel bug (upon boot) after upgrade to 5.18.arch1-1
Attached to Project:
Arch Linux
Opened by Christian Cwienk (dr1fter) - Sunday, 29 May 2022, 06:51 GMT
Last edited by Toolybird (Toolybird) - Thursday, 20 July 2023, 00:06 GMT
Details
Description:
After upgrading to linux-5.18-arch1-1 (from 5.17.9-arch1-1), booting no longer works on my machine (this seems to be specific to the CPU/chipset combination). Booting appears to start, but veeery slowly. After some tens of seconds, systemd displays a warning about the job "Load Kernel Modules" taking too long (eventually, this runs into a time-out). After waiting some more, booting seems to continue. Upon display mode change, my machine reproducibly freezes, displaying a black screen with only a white cursor in the very upper left. It does not accept any keyboard input (e.g. to switch to rescue-shell) at this point. During the active booting phase, it is still possible to e.g. use CTRL-ALT-DEL to trigger a reboot.

After downgrading back to 5.17.9, booting works again.

Additional info:

Excerpt from `journalctl --boot=-1` (full log attached):
```
May 29 08:04:12 arch kernel: traps: Missing ENDBR: _nv011430rm+0x0/0x10 [nvidia]
May 29 08:04:12 arch kernel: ------------[ cut here ]------------
May 29 08:04:12 arch kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 29 08:04:12 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
May 29 08:04:12 arch kernel: CPU: 10 PID: 279 Comm: modprobe Tainted: P OE 5.18.0-arch1-1 #1 b71a70fe104889aac2f32556bc52f649da2881d2
May 29 08:04:12 arch kernel: Hardware name: ASUS System Product Name/TUF GAMING Z690-PLUS WIFI, BIOS 0809 12/08/2021
May 29 08:04:12 arch kernel: RIP: 0010:exc_control_protection+0xc2/0xd0
May 29 08:04:12 arch kernel: Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 d3 ab 06 b2 e8 d1 01 5 ff e9 72 ff ff ff 48 c7 c7 ba ab 06 b2 e8 c7 31 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 0 90 66 0f 1f 00 55 53 48 89
```

I just recently upgraded my PC (had an AMD Ryzen 7 + B550 chipset before). After doing so, EFISTUB boot no longer worked, so I had to switch to rEFInd. I mention this as it seems to indicate there are some "quirks" w/ the new mainboard (hardware specs seem to be included in the attached log, so I will omit them here).

Steps to reproduce:
- run `pacman -Syu` (or `pacman -U /var/cache/pacman/pkg/linux-5.18.arch1-1-x86_64.pkg.tar.zs`)
- reboot using the updated kernel (will most likely only affect machines w/ hardware similar to mine)
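The `Missing ENDBR` line in the excerpt above names the offending symbol and, in brackets, the module it belongs to. A minimal sketch for pulling that information out of a saved journal, assuming the line format seen in this report; the function name `find_missing_endbr` is hypothetical and not part of any existing tool:

```python
import re

# Matches e.g. "traps: Missing ENDBR: _nv011430rm+0x0/0x10 [nvidia]".
# The pattern is an assumption derived from the log lines quoted in this report.
MISSING_ENDBR = re.compile(
    r"traps: Missing ENDBR: (?P<symbol>\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>[^\]]+)\])?"
)


def find_missing_endbr(log_text):
    """Return (symbol, module) pairs for every 'Missing ENDBR' trap in log_text.

    module is None when the line carries no trailing [module] suffix.
    """
    return [
        (match.group("symbol"), match.group("module"))
        for match in MISSING_ENDBR.finditer(log_text)
    ]
```

Feeding it the text of `journalctl --boot=-1` (for example read from a file) should list each module that tripped the check, which helps tell an `nvidia` problem apart from one in another out-of-tree module.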
Closed by Toolybird (Toolybird)
Thursday, 20 July 2023, 00:06 GMT
Reason for closing: Fixed
Additional comments about closing: See comments
I was able to boot into Linux 5.18 after blacklisting the `nvidia` kernel module.
Kernel: archlinux 5.18.0-zen1-1-zen
FS#74886
i7 12700
AMD 6600 XT
/proc/version: Linux version 5.18.0-arch1-1 (linux@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT_DYNAMIC Tue, 24 May 2022 22:00:36 +0000
Partial dmesg output when the issue occurs (without `ibt=off`):
```
May 29 17:21:26.199326 nmd-arch kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x19@0x178
May 29 17:21:26.199867 nmd-arch kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x1e@0x21c
May 29 17:21:31.502634 nmd-arch kernel: traps: Missing ENDBR: cmpl_eax_edx+0x0/0x10 [kvm]
May 29 17:21:31.503134 nmd-arch kernel: ------------[ cut here ]------------
May 29 17:21:31.503178 nmd-arch kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 29 17:21:31.503315 nmd-arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
May 29 17:21:31.503448 nmd-arch kernel: CPU: 11 PID: 12436 Comm: CPU 11/KVM Not tainted 5.18.0-arch1-1 #1 b71a70fe104889aac2f32556bc52f649da2881d2
May 29 17:21:31.503473 nmd-arch kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D43/PRO B660M-A DDR4 (MS-7D43), BIOS 1.10 02/25/2022
May 29 17:21:31.503497 nmd-arch kernel: RIP: 0010:exc_control_protection+0xc2/0xd0
May 29 17:21:31.503517 nmd-arch kernel: Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 d3 ab a6 8d e8 d1 01 50 ff e9 72 ff ff ff 48 c7 c7 ba ab a6 8d e8 c7 31 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
May 29 17:21:31.503561 nmd-arch kernel: RSP: 0018:ffffb3fa45007bf8 EFLAGS: 00010002
May 29 17:21:31.503594 nmd-arch kernel: RAX: 0000000000000031 RBX: ffffb3fa45007c18 RCX: 0000000000000000
May 29 17:21:31.503612 nmd-arch kernel: RDX: 0000000000000000 RSI: ffff925e502e16a0 RDI: ffff925e502e16a0
May 29 17:21:31.503637 nmd-arch kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: ffffb3fa45007a18
May 29 17:21:31.503653 nmd-arch kernel: R10: 0000000000000003 R11: ffff925e707ac1e8 R12: 0000000000000000
May 29 17:21:31.503676 nmd-arch kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 29 17:21:31.503689 nmd-arch kernel: FS: 00007f381e9ff640(0000) GS:ffff925e502c0000(0000) knlGS:0000000000000000
May 29 17:21:31.503704 nmd-arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 29 17:21:31.503714 nmd-arch kernel: CR2: ffffb68b46bdc000 CR3: 00000002ee2e6002 CR4: 0000000000f72ee0
May 29 17:21:31.503724 nmd-arch kernel: PKRU: 55555554
May 29 17:21:31.503748 nmd-arch kernel: Call Trace:
May 29 17:21:31.503759 nmd-arch kernel: <TASK>
May 29 17:21:31.503769 nmd-arch kernel: asm_exc_control_protection+0x22/0x30
...
```
See the attached file for the full dmesg.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6649fa876da4c505548b8e8945a6fc48e62e427c
CPU (according to `lscpu`):
```
BIOS Vendor ID: Intel(R) Corporation
Model name: 12th Gen Intel(R) Core(TM) i7-12700F
```
GPU (according to `glxinfo`):
```
OpenGL renderer string: NVIDIA GeForce GTX 1660 SUPER/PCIe/SSE2
```
[0] https://github.com/NVIDIA/open-gpu-kernel-modules/issues/256
FS#74886
Note it applies only to nvidia-open and those cards supported by that driver. In the nvidia package, the supplied precompiled blob is missing IBT and SLS, so the fix for that will have to come from Nvidia.
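For context on what "missing IBT" means at the instruction level: with IBT active, an indirect call or jump has to land on an `endbr64` instruction (bytes `f3 0f 1e fa`), otherwise the CPU raises the control-protection exception that shows up as `Missing ENDBR` in the logs. A minimal sketch of that check on raw function bytes; the helper name is hypothetical and any byte strings passed to it here are illustrative, not taken from a real driver:

```python
# Encoding of the x86-64 ENDBR64 instruction.
ENDBR64 = bytes.fromhex("f30f1efa")


def starts_with_endbr64(code):
    """Return True if the given machine-code bytes begin with ENDBR64."""
    return bytes(code[:4]) == ENDBR64
```

A function compiled without IBT support typically starts directly with its prologue (for example `55`, i.e. `push %rbp`), which is why a precompiled blob cannot be fixed from the packaging side.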
```
May 31 00:32:55 thinkpad kernel: pci 0000:00:07.0: DPC: RP PIO log size 0 is invalid
May 31 00:32:55 thinkpad kernel: pci 0000:00:07.2: DPC: RP PIO log size 0 is invalid
May 31 00:32:55 thinkpad kernel: traps: Missing ENDBR: init_module+0x0/0x1a0 [vmmon]
May 31 00:32:55 thinkpad kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 31 00:32:55 thinkpad systemd[1]: Failed to start Load Kernel Modules.
May 31 00:32:58 thinkpad kernel: Bluetooth: hci0: Malformed MSFT vendor event: 0x02
```
Do I understand correctly that this is not an issue w/ linux-5.18 as such, but rather a change in how Linux loads (or runs?) kernel modules when IBT is enabled, which causes the described issues if a module is loaded that has not been built accordingly? And that this is something Nvidia will have to do for their non-free/proprietary GPU driver module?
So available options for using linux-5.18 might be:
- disable ibt (as has been suggested as workaround)
- wait for nvidia to publish an updated kernel-module
- switch to nouveau or nvidia-open
- switch to non-nvidia GPU (+ uninstall nvidia-kernel-module)
As far as I understand, both nouveau and nvidia-open have drawbacks (performance, supported feature set, ...), so disabling IBT while staying w/ the proprietary nvidia kmod seems to be the best choice for now (comments on my assumptions and conclusions are much appreciated).
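To see where a given machine stands with respect to the first option, one can check whether `ibt=off` is already on the kernel command line and whether the CPU reports IBT at all. A rough sketch, assuming the CPU flag is listed as `ibt` in `/proc/cpuinfo` and that the kernel may hide that flag once IBT has been disabled (hence the command line is checked first); all helper names are hypothetical:

```python
def cpu_has_ibt(cpuinfo_text):
    """True if any 'flags' line in /proc/cpuinfo-style text lists 'ibt'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "ibt" in value.split():
            return True
    return False


def ibt_disabled_on_cmdline(cmdline_text):
    """True if the kernel command line contains the ibt=off parameter."""
    return "ibt=off" in cmdline_text.split()


def describe_ibt_state(cpuinfo_text, cmdline_text):
    """Summarise the IBT situation from the two procfs texts."""
    # Check the command line first: the CPU flag may be hidden when IBT is off.
    if ibt_disabled_on_cmdline(cmdline_text):
        return "ibt=off is set on the kernel command line"
    if cpu_has_ibt(cpuinfo_text):
        return "CPU flag 'ibt' present, no ibt=off on the command line"
    return "no 'ibt' CPU flag reported"
```

On a running system the two arguments would be the contents of `/proc/cpuinfo` and `/proc/cmdline`; keeping file access out of the helpers makes them easy to try against saved copies from another machine.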
Contrary to what I originally assumed when I created this ticket, the issue does not stem from the kernel image itself, but rather from Nvidia's proprietary kernel module not yet being compiled w/ support for IBT. This does not affect most other GNU/Linux distributions, as those either still use a kernel that does not feature IBT, or configure their kernel packages so that IBT is not enabled by default.
As mentioned in the gh-issue I referenced above, Linux will enable IBT by default w/ the upcoming 6.2 version. Thus, there is some likelihood that Nvidia will by then offer a fixed version of their kernel module. The open-source version of Nvidia's kernel module, however, has already been fixed.
So instead of disabling IBT, you might also opt for nvidia-open, or nouveau (which may or may not lead to other issues).
~ $ lsmod | grep nv
nvme 65536 3
nvme_core 221184 5 nvme
nvme_common 24576 1 nvme_core
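The `grep nv` filter above also matches the `nvme` modules, and those are the only entries listed, i.e. no `nvidia` module appears. A small sketch that checks for exact module names instead, assuming the module name is the first whitespace-separated field of each line (as in `lsmod` output and `/proc/modules`); the set of names to look for is an illustrative assumption:

```python
# Module names to look for; an illustrative assumption, adjust as needed.
GPU_MODULES = {"nvidia", "nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nouveau"}


def loaded_modules(modules_text):
    """Return the set of module names from lsmod- or /proc/modules-style text."""
    return {line.split()[0] for line in modules_text.splitlines() if line.strip()}


def loaded_gpu_modules(modules_text, wanted=GPU_MODULES):
    """Return the sorted list of wanted modules that are actually loaded."""
    return sorted(loaded_modules(modules_text) & set(wanted))
```

Matching whole names avoids the false positive from `nvme` and makes it obvious when neither the proprietary driver nor nouveau is loaded.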
The kernel/system is booting properly with IBT enabled using kernel 6.2.7 and the newly-released stable nVidia proprietary driver 530.41.03.
See:
https://www.virtualbox.org/ticket/21435
After all this time I'm finally able to run my system (nVidia graphics + Virtualbox host) with IBT enabled!