Arch Linux

FS#74891 - kernel bug (upon boot) after upgrade to 5.18.arch1-1

Attached to Project: Arch Linux
Opened by Christian Cwienk (dr1fter) - Sunday, 29 May 2022, 06:51 GMT
Last edited by Andreas Radke (AndyRTR) - Sunday, 29 May 2022, 08:05 GMT
Task Type: Bug Report
Category: Kernel
Status: Assigned
Assigned To: Jan Alexander Steffens (heftig)
Architecture: x86_64
Severity: High
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 0%
Votes: 17
Private: No

Details

Description:

After upgrading to linux-5.18-arch1-1 (from 5.17.9-arch1-1), booting no longer works on my machine (this seems to be specific to the CPU/chipset combination).

Booting seems to work, but very slowly. After some tens of seconds, systemd displays a warning that the job "Load Kernel Modules" is taking too long (eventually, it runs into a time-out). After waiting some more, booting seems to continue. Upon the display mode change, my machine reproducibly freezes, displaying a black screen with only a white cursor in the very upper left. It does not accept any keyboard input (e.g. to switch to a rescue shell) at this point. During the active boot phase, it is still possible to e.g. use CTRL-ALT-DEL to trigger a reboot.

After downgrading back to 5.17.9, booting works again.

Additional info:

Excerpt from `journalctl --boot=-1` (full log attached):

```
May 29 08:04:12 arch kernel: traps: Missing ENDBR: _nv011430rm+0x0/0x10 [nvidia]
May 29 08:04:12 arch kernel: ------------[ cut here ]------------
May 29 08:04:12 arch kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 29 08:04:12 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
May 29 08:04:12 arch kernel: CPU: 10 PID: 279 Comm: modprobe Tainted: P OE 5.18.0-arch1-1 #1 b71a70fe104889aac2f32556bc52f649da2881d2
May 29 08:04:12 arch kernel: Hardware name: ASUS System Product Name/TUF GAMING Z690-PLUS WIFI, BIOS 0809 12/08/2021
May 29 08:04:12 arch kernel: RIP: 0010:exc_control_protection+0xc2/0xd0
May 29 08:04:12 arch kernel: Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 d3 ab 06 b2 e8 d1 01 50 ff e9 72 ff ff ff 48 c7 c7 ba ab 06 b2 e8 c7 31 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
```
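
For anyone who wants to find out which module triggers the trap on their own machine, here is a minimal sketch that pulls the module names out of "Missing ENDBR" lines. It assumes the message format seen in the excerpt above (`symbol+0xOFF/0xSIZE [module]`); the function name `modules_missing_endbr` is made up for illustration.

```python
import re

# Assumed message format, taken from the log excerpt in this report:
#   traps: Missing ENDBR: _nv011430rm+0x0/0x10 [nvidia]
MISSING_ENDBR = re.compile(
    r"Missing ENDBR: (?P<symbol>\S+?)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
    r"(?: \[(?P<module>[^\]]+)\])?"
)


def modules_missing_endbr(log_text):
    """Return a sorted list of module names named in 'Missing ENDBR' lines."""
    found = set()
    for line in log_text.splitlines():
        match = MISSING_ENDBR.search(line)
        # Lines without a trailing [module] tag are skipped.
        if match and match.group("module"):
            found.add(match.group("module"))
    return sorted(found)


if __name__ == "__main__":
    # Sample line copied from the journal excerpt above.
    sample = (
        "May 29 08:04:12 arch kernel: traps: Missing ENDBR: "
        "_nv011430rm+0x0/0x10 [nvidia]"
    )
    print(modules_missing_endbr(sample))
```

The text of a saved journal (for example the output of `journalctl --boot=-1`) could be passed to the function as a single string.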

I only recently upgraded my PC (I had an AMD Ryzen 7 + B550 chipset before). After doing so, EFISTUB boot no longer worked, so I had to switch to rEFInd. I mention this because it seems to indicate that there are some "quirks" with the new mainboard (the hardware specs seem to be included in the attached log, so I will omit them here).

Steps to reproduce:

- run `pacman -Syu` (or `pacman -U /var/cache/pacman/pkg/linux-5.18.arch1-1-x86_64.pkg.tar.zst`)
- reboot using the updated kernel (this will most likely only affect machines with hardware similar to mine)

Comment by Abhijeet V (abhijeetviswa) - Sunday, 29 May 2022, 12:28 GMT
Reproducible on my machine. I have an Intel i5-11400H + RTX 3050 Mobile (It's an Asus Tuf F15 laptop).
I was able to boot into Linux 5.18 after blacklisting the `nvidia` kernel module.
Kernel: archlinux 5.18.0-zen1-1-zen
Comment by loqs (loqs) - Sunday, 29 May 2022, 13:39 GMT
Comment by Jozef Matus (beretis) - Sunday, 29 May 2022, 14:21 GMT
I'm having the same issue. I have a GTX 3060 and an i5 12600.
Comment by nmdanny (nmdanny) - Sunday, 29 May 2022, 15:17 GMT
I have a similar issue. I'm using KVM and GPU passthrough (the GPU is isolated from the host), and I have issues whenever I boot the VM (the VM hangs when it starts to boot, and the only way to stop it is to restart the host PC). The issue was fixed for me by adding the `ibt=off` kernel parameter.

i7 12700
AMD 6600 XT

/proc/version: Linux version 5.18.0-arch1-1 (linux@archlinux) (gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT_DYNAMIC Tue, 24 May 2022 22:00:36 +0000

Partial dmesg output when the issue occurs (without `ibt=off`):
```
May 29 17:21:26.199326 nmd-arch kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x19@0x178
May 29 17:21:26.199867 nmd-arch kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x1e@0x21c
May 29 17:21:31.502634 nmd-arch kernel: traps: Missing ENDBR: cmpl_eax_edx+0x0/0x10 [kvm]
May 29 17:21:31.503134 nmd-arch kernel: ------------[ cut here ]------------
May 29 17:21:31.503178 nmd-arch kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 29 17:21:31.503315 nmd-arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
May 29 17:21:31.503448 nmd-arch kernel: CPU: 11 PID: 12436 Comm: CPU 11/KVM Not tainted 5.18.0-arch1-1 #1 b71a70fe104889aac2f32556bc52f649da2881d2
May 29 17:21:31.503473 nmd-arch kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D43/PRO B660M-A DDR4 (MS-7D43), BIOS 1.10 02/25/2022
May 29 17:21:31.503497 nmd-arch kernel: RIP: 0010:exc_control_protection+0xc2/0xd0
May 29 17:21:31.503517 nmd-arch kernel: Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 d3 ab a6 8d e8 d1 01 50 ff e9 72 ff ff ff 48 c7 c7 ba ab a6 8d e8 c7 31 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
May 29 17:21:31.503561 nmd-arch kernel: RSP: 0018:ffffb3fa45007bf8 EFLAGS: 00010002
May 29 17:21:31.503594 nmd-arch kernel: RAX: 0000000000000031 RBX: ffffb3fa45007c18 RCX: 0000000000000000
May 29 17:21:31.503612 nmd-arch kernel: RDX: 0000000000000000 RSI: ffff925e502e16a0 RDI: ffff925e502e16a0
May 29 17:21:31.503637 nmd-arch kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: ffffb3fa45007a18
May 29 17:21:31.503653 nmd-arch kernel: R10: 0000000000000003 R11: ffff925e707ac1e8 R12: 0000000000000000
May 29 17:21:31.503676 nmd-arch kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 29 17:21:31.503689 nmd-arch kernel: FS: 00007f381e9ff640(0000) GS:ffff925e502c0000(0000) knlGS:0000000000000000
May 29 17:21:31.503704 nmd-arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 29 17:21:31.503714 nmd-arch kernel: CR2: ffffb68b46bdc000 CR3: 00000002ee2e6002 CR4: 0000000000f72ee0
May 29 17:21:31.503724 nmd-arch kernel: PKRU: 55555554
May 29 17:21:31.503748 nmd-arch kernel: Call Trace:
May 29 17:21:31.503759 nmd-arch kernel: <TASK>
May 29 17:21:31.503769 nmd-arch kernel: asm_exc_control_protection+0x22/0x30
...
```

see attached file for full dmesg
Comment by loqs (loqs) - Sunday, 29 May 2022, 15:26 GMT
@nmdanny please report your issue upstream. Unlike with the nvidia modules, your issue is in code from upstream that has already been changed for IBT support [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6649fa876da4c505548b8e8945a6fc48e62e427c
Comment by Christian Cwienk (dr1fter) - Sunday, 29 May 2022, 19:03 GMT
Since other users shared their hardware specs, and the issue seems to be related to both the CPU (Intel Alder Lake) and the GPU (NVIDIA) [0], I would like to share my hardware specs as well (hoping this helps):

CPU (according to `lscpu`):
```
BIOS Vendor ID: Intel(R) Corporation
Model name: 12th Gen Intel(R) Core(TM) i7-12700F
```

GPU (according to `glxinfo`):
```
OpenGL renderer string: NVIDIA GeForce GTX 1660 SUPER/PCIe/SSE2
```



[0] https://github.com/NVIDIA/open-gpu-kernel-modules/issues/256
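
Since the reports so far all come from recent Intel CPUs, it may help to check whether a given CPU advertises IBT at all. Below is a rough sketch that looks for a flag in `/proc/cpuinfo`-style text. Note the assumption: I expect the flag to be listed as `ibt` on kernels built with IBT support, but I have not verified this on every kernel version, and `cpu_has_flag` is just an illustrative helper name.

```python
def cpu_has_flag(cpuinfo_text, flag):
    """Return True if `flag` appears in the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Flags are whitespace-separated; compare whole tokens only.
            return flag in value.split()
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""
    # "ibt" is the assumed flag name, see the note above.
    print("CPU advertises IBT:", cpu_has_flag(text, "ibt"))
```

If the flag is absent, the CPU (or the running kernel) presumably does not use IBT, and the "Missing ENDBR" trap should not occur on that machine.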
Comment by loqs (loqs) - Monday, 30 May 2022, 14:13 GMT
I have attached a test fix for the nvidia modules to FS#74886. Note that it applies only to nvidia-open and the cards supported by that driver.
In the nvidia package, the supplied precompiled blob is missing IBT and SLS support, so the fix for that will have to come from NVIDIA.
Comment by Iyan (iyanmv) - Monday, 30 May 2022, 22:39 GMT
I don't have an nvidia card on my laptop but I see four new errors after updating to linux 5.18.x

```
May 31 00:32:55 thinkpad kernel: pci 0000:00:07.0: DPC: RP PIO log size 0 is invalid
May 31 00:32:55 thinkpad kernel: pci 0000:00:07.2: DPC: RP PIO log size 0 is invalid
May 31 00:32:55 thinkpad kernel: traps: Missing ENDBR: init_module+0x0/0x1a0 [vmmon]
May 31 00:32:55 thinkpad kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 31 00:32:55 thinkpad systemd[1]: Failed to start Load Kernel Modules.
May 31 00:32:58 thinkpad kernel: Bluetooth: hci0: Malformed MSFT vendor event: 0x02
```
Comment by loqs (loqs) - Monday, 30 May 2022, 22:50 GMT
@iyanmv is the vmmon module from VMware? You can either disable IBT with `ibt=off`, investigate how to add the compiler options `-fcf-protection=branch -mindirect-branch-register` to the build of the vmmon module, or contact the module's author.
Comment by Iyan (iyanmv) - Monday, 30 May 2022, 22:58 GMT
@loqs thanks ;)
Comment by BettyMorlock (Bettymorlock) - Thursday, 02 June 2022, 03:14 GMT
Same issue as @iyanmv. Disabling IBT did nothing, but I ended up just uninstalling VMware and everything is fine now.
Comment by Christian Cwienk (dr1fter) - Saturday, 04 June 2022, 10:22 GMT
Thanks to everybody for sharing additional insights and the workaround of disabling IBT (which, unsurprisingly, also worked for me).

Do I understand correctly that this is not an issue with linux-5.18 as such, but rather a change in how Linux loads (or runs?) kernel modules when IBT is enabled, which causes the described issues if a module is loaded that has not been built accordingly? That would be something NVIDIA has to address for their nonfree/proprietary GPU driver module.

So available options for using linux-5.18 might be:

- disable ibt (as has been suggested as workaround)
- wait for nvidia to publish an updated kernel-module
- switch to nouveau or nvidia-open
- switch to non-nvidia GPU (+ uninstall nvidia-kernel-module)

As far as I understand, both nouveau and nvidia-open have drawbacks (performance, supported feature set, ...), so disabling IBT while staying with the proprietary nvidia kernel module seems to be the best choice for now (comments on my assumptions and conclusions are very much appreciated).
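
For those going with the first option, a quick way to confirm that the running kernel was really booted with `ibt=off` is to look at `/proc/cmdline`. The following is only a sketch; the helper name `ibt_disabled` is made up, and how the parameter gets onto the command line depends on the boot loader in use.

```python
def ibt_disabled(cmdline):
    """Return True if the kernel command line contains the ibt=off parameter."""
    # Parameters are whitespace-separated; match the whole token only.
    return "ibt=off" in cmdline.split()


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""
    print("ibt=off set:", ibt_disabled(cmdline))
```

If this prints `False` after a reboot, the boot loader entry was most likely not updated.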
Comment by Christian Cwienk (dr1fter) - Wednesday, 08 June 2022, 05:41 GMT
Comment by loqs (loqs) - Thursday, 09 June 2022, 19:55 GMT
Comment by hugo bini (poulpomancien) - Thursday, 16 June 2022, 07:50 GMT
I had to use `ibt=off` too in order to host my VirtualBox VMs on an 11th-gen Intel laptop. Without it, VirtualBox hung with the same "Missing ENDBR" error in the vboxdrv module.
