FS#74891 - kernel bug (upon boot) after upgrade to 5.18.arch1-1

Attached to Project: Arch Linux
Opened by Christian Cwienk (dr1fter) - Sunday, 29 May 2022, 06:51 GMT
Last edited by Toolybird (Toolybird) - Thursday, 20 July 2023, 00:06 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: Jan Alexander Steffens (heftig)
Architecture: x86_64
Severity: High
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 23
Private: No

Details

Description:

After upgrading to linux-5.18-arch1-1 (from 5.17.9-arch1-1), booting no longer works on my machine (this seems to be specific to the CPU/chipset combination).

Booting seems to work, but very slowly. After some tens of seconds, systemd displays a warning about the job "Load Kernel Modules" taking too long (eventually, this runs into a time-out). After waiting some more, booting seems to continue. Upon the display mode change, my machine reproducibly freezes, displaying a black screen with only a white cursor in the upper left corner. It does not accept any keyboard input (e.g. to switch to a rescue shell) at this point. During the active boot phase, it is still possible to use CTRL-ALT-DEL to trigger a reboot.

After downgrading back to 5.17.9, booting works again.

Additional info:

Excerpt from `journalctl --boot=-1` (full log attached):

```
May 29 08:04:12 arch kernel: traps: Missing ENDBR: _nv011430rm+0x0/0x10 [nvidia]
May 29 08:04:12 arch kernel: ------------[ cut here ]------------
May 29 08:04:12 arch kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 29 08:04:12 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
May 29 08:04:12 arch kernel: CPU: 10 PID: 279 Comm: modprobe Tainted: P OE 5.18.0-arch1-1 #1 b71a70fe104889aac2f32556bc52f649da2881d2
May 29 08:04:12 arch kernel: Hardware name: ASUS System Product Name/TUF GAMING Z690-PLUS WIFI, BIOS 0809 12/08/2021
May 29 08:04:12 arch kernel: RIP: 0010:exc_control_protection+0xc2/0xd0
May 29 08:04:12 arch kernel: Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 d3 ab 06 b2 e8 d1 01 50 ff e9 72 ff ff ff 48 c7 c7 ba ab 06 b2 e8 c7 31 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
```

I just recently upgraded my PC (had an AMD Ryzen 7 + B550 Chipset before). After doing so, EFISTUB-boot no longer worked, so I had to switch to rEFInd. I mention this as this seems to indicate there are some "quirks" w/ new mainboard (hardware specs seem to be included in attached log, so I will omit them here).

Steps to reproduce:

- run `pacman -Syu` (or `pacman -U /var/cache/pacman/pkg/linux-5.18.arch1-1-x86_64.pkg.tar.zst`)
- reboot using the updated kernel (will most likely only affect machines with hardware similar to mine)

Closed by  Toolybird (Toolybird)
Thursday, 20 July 2023, 00:06 GMT
Reason for closing:  Fixed
Additional comments about closing:  See comments
Comment by Abhijeet V (abhijeetviswa) - Sunday, 29 May 2022, 12:28 GMT
Reproducible on my machine. I have an Intel i5-11400H + RTX 3050 Mobile (It's an Asus Tuf F15 laptop).
I was able to boot into Linux 5.18 after blacklisting the `nvidia` kernel module.
Kernel: archlinux 5.18.0-zen1-1-zen
Comment by loqs (loqs) - Sunday, 29 May 2022, 13:39 GMT
 FS#74886 
Comment by Jozef Matus (beretis) - Sunday, 29 May 2022, 14:21 GMT
I'm having the same issue. I have a GTX 3060 and an i5 12600.
Comment by nmdanny (nmdanny) - Sunday, 29 May 2022, 15:17 GMT
I have a similar issue. I'm using KVM & GPU passthrough (the GPU is isolated from the host), and I have issues whenever I boot the VM (the VM hangs when it starts to boot, and the only way to stop it is to restart the host PC). The issue was fixed for me by adding the `ibt=off` kernel parameter.

i7 12700
AMD 6600 XT

/proc/version: Linux version 5.18.0-arch1-1 (linux@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT_DYNAMIC Tue, 24 May 2022 22:00:36 +0000

Partial dmesg output when the issue occurs (without `ibt=off`):
```
May 29 17:21:26.199326 nmd-arch kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x19@0x178
May 29 17:21:26.199867 nmd-arch kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x1e@0x21c
May 29 17:21:31.502634 nmd-arch kernel: traps: Missing ENDBR: cmpl_eax_edx+0x0/0x10 [kvm]
May 29 17:21:31.503134 nmd-arch kernel: ------------[ cut here ]------------
May 29 17:21:31.503178 nmd-arch kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 29 17:21:31.503315 nmd-arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
May 29 17:21:31.503448 nmd-arch kernel: CPU: 11 PID: 12436 Comm: CPU 11/KVM Not tainted 5.18.0-arch1-1 #1 b71a70fe104889aac2f32556bc52f649da2881d2
May 29 17:21:31.503473 nmd-arch kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D43/PRO B660M-A DDR4 (MS-7D43), BIOS 1.10 02/25/2022
May 29 17:21:31.503497 nmd-arch kernel: RIP: 0010:exc_control_protection+0xc2/0xd0
May 29 17:21:31.503517 nmd-arch kernel: Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 d3 ab a6 8d e8 d1 01 50 ff e9 72 ff ff ff 48 c7 c7 ba ab a6 8d e8 c7 31 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
May 29 17:21:31.503561 nmd-arch kernel: RSP: 0018:ffffb3fa45007bf8 EFLAGS: 00010002
May 29 17:21:31.503594 nmd-arch kernel: RAX: 0000000000000031 RBX: ffffb3fa45007c18 RCX: 0000000000000000
May 29 17:21:31.503612 nmd-arch kernel: RDX: 0000000000000000 RSI: ffff925e502e16a0 RDI: ffff925e502e16a0
May 29 17:21:31.503637 nmd-arch kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: ffffb3fa45007a18
May 29 17:21:31.503653 nmd-arch kernel: R10: 0000000000000003 R11: ffff925e707ac1e8 R12: 0000000000000000
May 29 17:21:31.503676 nmd-arch kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 29 17:21:31.503689 nmd-arch kernel: FS: 00007f381e9ff640(0000) GS:ffff925e502c0000(0000) knlGS:0000000000000000
May 29 17:21:31.503704 nmd-arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 29 17:21:31.503714 nmd-arch kernel: CR2: ffffb68b46bdc000 CR3: 00000002ee2e6002 CR4: 0000000000f72ee0
May 29 17:21:31.503724 nmd-arch kernel: PKRU: 55555554
May 29 17:21:31.503748 nmd-arch kernel: Call Trace:
May 29 17:21:31.503759 nmd-arch kernel: <TASK>
May 29 17:21:31.503769 nmd-arch kernel: asm_exc_control_protection+0x22/0x30
...
```

See the attached file for the full dmesg.
Comment by loqs (loqs) - Sunday, 29 May 2022, 15:26 GMT
@nmdanny please report your issue upstream. Unlike with the nvidia modules, your issue is in upstream code that has already been changed for IBT support [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6649fa876da4c505548b8e8945a6fc48e62e427c
Comment by Christian Cwienk (dr1fter) - Sunday, 29 May 2022, 19:03 GMT
Since other users shared their hardware specs, and the issue seems to be related to both CPU (Intel Alder Lake) and GPU (NVIDIA) [0], I would like to share my hardware specs as well (hoping this helps):

CPU (according to `lscpu`):
```
BIOS Vendor ID: Intel(R) Corporation
Model name: 12th Gen Intel(R) Core(TM) i7-12700F
```

GPU (according to `glxinfo`):
```
OpenGL renderer string: NVIDIA GeForce GTX 1660 SUPER/PCIe/SSE2
```



[0] https://github.com/NVIDIA/open-gpu-kernel-modules/issues/256
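Since the reports so far all come from recent Intel CPUs, it may help to check whether a given machine advertises IBT at all. A minimal Python sketch, assuming the kernel lists the feature as the flag `ibt` on the `flags` line of `/proc/cpuinfo` (the function names `cpu_flags` and `cpu_has_ibt` are made up for illustration, not taken from any tool):

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the plain "flags" key only, not e.g. "vmx flags".
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def cpu_has_ibt(cpuinfo_text: str) -> bool:
    # Assumption: the feature is exposed under the flag name "ibt".
    return "ibt" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("CPU advertises IBT:", cpu_has_ibt(cpuinfo.read_text()))
```

If this reports no IBT support, the "Missing ENDBR" trap should not be the cause of a boot problem on that machine.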
Comment by loqs (loqs) - Monday, 30 May 2022, 14:13 GMT
I have attached a test fix for the nvidia modules to FS#74886. Note that it applies only to nvidia-open and the cards supported by that driver.
In the nvidia package, the supplied precompiled blob is missing IBT and SLS support, so the fix for that will have to come from nvidia.
Comment by Iyan (iyanmv) - Monday, 30 May 2022, 22:39 GMT
I don't have an nvidia card on my laptop but I see four new errors after updating to linux 5.18.x

```
May 31 00:32:55 thinkpad kernel: pci 0000:00:07.0: DPC: RP PIO log size 0 is invalid
May 31 00:32:55 thinkpad kernel: pci 0000:00:07.2: DPC: RP PIO log size 0 is invalid
May 31 00:32:55 thinkpad kernel: traps: Missing ENDBR: init_module+0x0/0x1a0 [vmmon]
May 31 00:32:55 thinkpad kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 31 00:32:55 thinkpad systemd[1]: Failed to start Load Kernel Modules.
May 31 00:32:58 thinkpad kernel: Bluetooth: hci0: Malformed MSFT vendor event: 0x02
```
Comment by loqs (loqs) - Monday, 30 May 2022, 22:50 GMT
@iyanmv is the vmmon module from vmware? You can either disable IBT with `ibt=off`, investigate how to add the compiler options `-fcf-protection=branch -mindirect-branch-register` to the build of the vmmon module, or contact the module's author.
Comment by Iyan (iyanmv) - Monday, 30 May 2022, 22:58 GMT
@loqs thanks ;)
Comment by BettyMorlock (Bettymorlock) - Thursday, 02 June 2022, 03:14 GMT
Same issue as @iyanmv. Disabling IBT did nothing, but I ended up just uninstalling vmware and everything is fine now.
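When `ibt=off` appears to have no effect, one thing worth ruling out is that the parameter never reached the running kernel (for example, because the boot loader entry was not updated). A minimal Python sketch that inspects the kernel command line from `/proc/cmdline`; the helper names are hypothetical, and treating the last `ibt=` occurrence as the effective one is an assumption:

```python
from pathlib import Path


def ibt_setting(cmdline: str):
    """Return the value of the last ibt= parameter, or None if it is absent."""
    value = None
    for token in cmdline.split():
        if token.startswith("ibt="):
            # Assumption: a later occurrence overrides an earlier one.
            value = token[len("ibt="):]
    return value


def ibt_disabled_on_cmdline(cmdline: str) -> bool:
    return ibt_setting(cmdline) == "off"


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        print("ibt=off present:", ibt_disabled_on_cmdline(cmdline_file.read_text()))
```

This only confirms that the parameter was passed; it says nothing about other causes of a failing module.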
Comment by Christian Cwienk (dr1fter) - Saturday, 04 June 2022, 10:22 GMT
thanks to everybody for sharing additional insights, and the workaround of disabling ibt (which - unsurprisingly - also worked for me)

Do I understand correctly that this is not an issue with linux-5.18 as such, but rather a change in how linux loads (or runs?) kernel modules when IBT is enabled, which causes the described issues if a loaded module has not been built accordingly? That would be something nvidia has to fix for their nonfree/proprietary GPU driver module.

So available options for using linux-5.18 might be:

- disable ibt (as has been suggested as workaround)
- wait for nvidia to publish an updated kernel-module
- switch to nouveau or nvidia-open
- switch to non-nvidia GPU (+ uninstall nvidia-kernel-module)

As far as I understand, both nouveau and nvidia-open have drawbacks (performance, supported feature set, ...), so disabling IBT while staying with the proprietary nvidia-kmod seems to me to be the best choice for now (comments on my assumptions and conclusions are much appreciated).
Comment by Christian Cwienk (dr1fter) - Wednesday, 08 June 2022, 05:41 GMT
Comment by loqs (loqs) - Thursday, 09 June 2022, 19:55 GMT
Comment by hugo bini (poulpomancien) - Thursday, 16 June 2022, 07:50 GMT
Had to use ibt=off too to host my virtualbox VMs on an 11th gen intel laptop. Without that, virtualbox hung with the same "Missing ENDBR" error in the vboxdrv module.
Comment by skrat (skrat) - Tuesday, 10 January 2023, 10:41 GMT
Same here, ibt=off helped. Is there an explanation?
Comment by Christian Cwienk (dr1fter) - Tuesday, 10 January 2023, 11:10 GMT
@skrat: see https://github.com/NVIDIA/open-gpu-kernel-modules/issues/256

Contrary to what I originally assumed when I created this ticket, the issue does not stem from the kernel image itself, but rather from nvidia's proprietary kernel module not yet being compiled with support for IBT. This does not affect most other GNU/Linux distributions, as those either still use a kernel that does not feature IBT, or configure their kernel packages so that IBT is not enabled by default.

As mentioned in the gh-issue I referenced above, linux will enable IBT by default w/ the incoming 6.2 version. Thus, there is some likelihood Nvidia will by then offer a fixed version of their kernel-module. The open-source version of Nvidia's kernel-module, however, has already been fixed.

So instead of disabling IBT, you might also opt for nvidia-open, or nouveau (which may or may not lead to other issues).
Comment by skrat (skrat) - Tuesday, 10 January 2023, 11:14 GMT
@dr1fter does it matter that I don't have any nvidia HW or nvidia modules loaded (afaik), yet I'm still experiencing the issue?

~ $ lsmod | grep nv
nvme 65536 3
nvme_core 221184 5 nvme
nvme_common 24576 1 nvme_core
Comment by Christian Cwienk (dr1fter) - Tuesday, 10 January 2023, 11:27 GMT
@skrat: generally speaking, I would assume that any kernel module built without IBT support will have this issue. NVidia's is probably just one of the most common ones.
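To find out which module is responsible on a given system, the "Missing ENDBR" lines in the kernel log name it in square brackets. A rough Python sketch that extracts module and symbol from log text in the format quoted in this ticket (the regular expression and the function name `modules_missing_endbr` are illustrative assumptions based only on those excerpts):

```python
import re

# Matches e.g. "traps: Missing ENDBR: init_module+0x0/0x1a0 [vmmon]".
ENDBR_RE = re.compile(
    r"traps: Missing ENDBR: (?P<symbol>\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>[^\]]+)\])?"
)


def modules_missing_endbr(log_text: str) -> list:
    """Return unique (module, symbol) pairs in order of first appearance."""
    hits = []
    for match in ENDBR_RE.finditer(log_text):
        # Lines without a bracketed name are reported as "(no module)".
        entry = (match.group("module") or "(no module)", match.group("symbol"))
        if entry not in hits:
            hits.append(entry)
    return hits


if __name__ == "__main__":
    import sys

    # Reads log text from standard input and prints one pair per line.
    for module, symbol in modules_missing_endbr(sys.stdin.read()):
        print(f"{module}: {symbol}")
```

Saved under a name of your choice (here `endbr_modules.py`, a placeholder), it could be fed the previous boot's log with `journalctl --boot=-1 | python endbr_modules.py`.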
Comment by tekstryder (Tekstryder) - Saturday, 25 March 2023, 14:27 GMT
I can confirm this is finally resolved in regard to the nVidia binary blob.

The kernel/system is booting properly with IBT enabled using kernel 6.2.7 and the newly-released stable nVidia proprietary driver 530.41.03.
Comment by Max (qtmax) - Saturday, 25 March 2023, 14:29 GMT
How about VirtualBox? Does it still require ibt=off, or has it been fixed too?
Comment by tekstryder (Tekstryder) - Saturday, 25 March 2023, 14:33 GMT
@qtmax: Nope, VirtualBox is still busted with IBT enabled.

See:
https://www.virtualbox.org/ticket/21435
Comment by tekstryder (Tekstryder) - Wednesday, 19 July 2023, 21:35 GMT
Virtualbox 7.0.10 was released with IBT support.

After all this time I'm finally able to run my system (nVidia graphics + Virtualbox host) with IBT enabled!
