FS#66824 - [linux] Kernel panic with kernel 5.6.14-arch1-1

Attached to Project: Arch Linux
Opened by Martin Dratva (raqua) - Friday, 29 May 2020, 10:14 GMT
Last edited by freswa (frederik) - Thursday, 23 July 2020, 22:18 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Tobias Powalowski (tpowa), Jan Alexander Steffens (heftig), Levente Polyak (anthraxx)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 10
Private No

Details

Description:
My headless server will not stay up for more than about 10 hours since the upgrade to kernel 5.6.14-arch1-1, resulting in a kernel panic and a freeze that requires a hard reset. Downgrading to 5.6.13.arch1-1 solves the problem.

I am attaching a screenshot of journalctl -f output, taken after I attached a monitor to the machine, and also the full journal log. I am also attaching my HW config (collected on the downgraded kernel).

I am not sure if this is Arch or upstream.


Steps to reproduce:
Start machine and let it run until it crashes...

Closed by freswa (frederik)
Thursday, 23 July 2020, 22:18 GMT
Reason for closing: Fixed
Additional comments about closing: linux 5.7.10.arch1-1 linux-lts 5.4.53-1
Comment by Martin Dratva (raqua) - Friday, 29 May 2020, 10:15 GMT
Not sure why the other files were not attached, so here it goes...

EDIT: Ah, they are too big. Well, this system could say something instead of silently ignoring it ...
Comment by Martin Dratva (raqua) - Friday, 29 May 2020, 10:18 GMT
Another attempt to add files..
Comment by Martin Dratva (raqua) - Friday, 29 May 2020, 10:18 GMT
and journal ...
Comment by Jan Alexander Steffens (heftig) - Friday, 29 May 2020, 23:20 GMT
Is .15 affected as well?
Comment by Mike Javorski (javmorin) - Saturday, 30 May 2020, 05:41 GMT
I experienced the same thing overnight last night (see image for the panic screen I woke up to this morning). I had upgraded to 5.6.14 on 5/27, so this morning I updated to 5.6.15.arch1-1, and it crashed again when I tried to access the server via NFS. I happened to be able to get the dmesg output via my open SSH session (see the txt attachment). I need this machine to run reliably for work, so I have downgraded to 5.6.13, which was stable for over a week.
Comment by Martin Dratva (raqua) - Saturday, 30 May 2020, 10:43 GMT
@Jan I have not tested the .15 version, and I would rather not if Mike's word is enough for you, as I also need this server to work. But if you think it is necessary, I will. Please let me know.
Comment by Jan Alexander Steffens (heftig) - Saturday, 30 May 2020, 13:54 GMT
If .15-arch1 is affected it looks like this is an issue with the .14 stable patches (.15-arch1 only adds the long-standing unprivileged_userns_clone patch).
Comment by Jan Alexander Steffens (heftig) - Saturday, 30 May 2020, 14:34 GMT
Did you enable any of the accounting options in /etc/systemd/system.conf?
Comment by Mike Javorski (javmorin) - Saturday, 30 May 2020, 14:49 GMT
@heftig On my machine that file has all its options commented out, and I don't recall ever manually enabling anything related to systemd and accounting, so I believe the answer is No for me.
Comment by Martin Dratva (raqua) - Saturday, 30 May 2020, 19:46 GMT
No, I don't think I have enabled any accounting. My conf is attached.
Comment by loqs (loqs) - Sunday, 07 June 2020, 14:26 GMT
Reverting [1] e2d928d5ee43f372618a9f98b0c73674717f2a2c fixed the issue for one user [2].
Possibly the same issue has been reported upstream [3].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e2d928d5ee43f372618a9f98b0c73674717f2a2c
[2] https://bbs.archlinux.org/viewtopic.php?pid=1908840#p1908840
[3] https://www.spinics.net/lists/netdev/msg658503.html
Comment by slip (slip) - Sunday, 07 June 2020, 18:32 GMT
I believe I'm suffering from the same issue, even on 5.6.15. Virtually identical crash output. I've just built 5.6.15 with the e2d928d5ee43f372618a9f98b0c73674717f2a2c patch reverted. If it lasts more than 24 hours, it will likely be a victory. My crashes have been very random, anywhere from 15 minutes to 10 hours, so I'll give it a day to make sure.
Comment by Zoé (zoe1337) - Monday, 08 June 2020, 19:17 GMT
I have the same issue. The LTS kernel is also affected. As recommended in the mailing list thread [3], I applied this workaround:
cd /usr/lib/systemd/system && sed -i.bak 's/^IPAddressDeny/#IPAddressDeny/g' *.service
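Before running the sed command above, it may help to see which unit files it would touch. The following is a minimal sketch, assuming a POSIX shell and grep; the function name list_ipaddressdeny_units is hypothetical and not part of systemd or of this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the .service files in a directory that contain
# an active (uncommented) IPAddressDeny line, i.e. the files the sed
# workaround would modify.
list_ipaddressdeny_units() {
    # Default to the directory used in the workaround if none is given.
    dir="${1:-/usr/lib/systemd/system}"
    # grep -l prints only the names of matching files; errors (for example
    # when no *.service files exist) are silenced.
    grep -l '^IPAddressDeny' "$dir"/*.service 2>/dev/null
}
```

Calling list_ipaddressdeny_units /usr/lib/systemd/system prints one path per line. Note that the unit files under /usr/lib/systemd/system are package-owned, so the edit is likely to be overwritten by a later package upgrade; the .bak copies left by sed -i.bak can be used to restore the originals.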
Comment by slip (slip) - Monday, 08 June 2020, 20:04 GMT
5.6.15 with e2d928d5ee43f372618a9f98b0c73674717f2a2c reverted has initially succeeded for me. The system has remained stable for considerably longer than it would have without the revert.

I can also see that the cgroup warning at boot, `kernel: cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation`, is no longer there and has not returned.
Comment by Amir Zarrinkafsh (nightah) - Monday, 15 June 2020, 11:39 GMT
I've also been experiencing this same issue usually anywhere from 14-24 hours after boot, with no real discernable logs other than:

May 26 21:45:04 nerv kernel: BUG: kernel NULL pointer dereference, address: 0000000000000010
May 26 21:45:04 nerv kernel: #PF: supervisor read access in kernel mode

These lock ups/crashes persisted through 5.6.14, 5.6.15 and most recently 5.7.2. Only running on 5.6.13 allows me to maintain proper uptime.
I'm trying to build 5.7.2 from source with e2d928d5ee43f372618a9f98b0c73674717f2a2c reverted, will report back shortly.
Comment by loqs (loqs) - Monday, 15 June 2020, 14:57 GMT
Comment by Amir Zarrinkafsh (nightah) - Friday, 26 June 2020, 05:44 GMT
So just to report back I have been running 5.7.2 with e2d928d5ee43f372618a9f98b0c73674717f2a2c reverted for 10 days without any issues.

I guess at this rate it's just a matter of waiting for the patch https://lore.kernel.org/netdev/CAM_iQpUKQJrj8wE+Qa8NGR3P0L+5Uz=qo-O5+k_P60HzTde6aw%40mail.gmail.com/ to be upstreamed?
Comment by Amir Zarrinkafsh (nightah) - Friday, 26 June 2020, 07:02 GMT
Deleted double post.
Comment by José Luis Salvador Rufo (jlsalvador) - Friday, 26 June 2020, 13:55 GMT
Same issue here.
I applied the patch (https://lore.kernel.org/netdev/CAM_iQpUKQJrj8wE+Qa8NGR3P0L+5Uz=qo-O5+k_P60HzTde6aw%40mail.gmail.com/) to 5.4.47-1-lts and have not had any issues so far, 4 days in a row. Without the patch, the kernel crashes within 12 hours.
Comment by loqs (loqs) - Wednesday, 22 July 2020, 21:40 GMT
Can you confirm linux 5.7.10.arch1-1 and linux-lts 5.4.53-1 currently in testing have resolved the issue?
Comment by slip (slip) - Wednesday, 22 July 2020, 22:51 GMT
I've installed 5.7.10.arch1-1. I do have the same `cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation` in dmesg that I was getting when I'd get the random crashes. To be fair, I never paid attention to that before, so I can't say if it's related. I'll update after 24 hours, or sooner if it crashes.
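One way to check whether a given boot logged this message is to filter the kernel log for it. This is a sketch only, assuming a POSIX shell and grep; has_cgroup2_warning is a hypothetical helper that reads log text on stdin, and whether the message relates to the crashes is not established in this thread.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the text on stdin contains the cgroup2
# socket-matching message quoted in this thread, non-zero otherwise.
has_cgroup2_warning() {
    grep -q 'disabling cgroup2 socket matching due to net_prio or net_cls activation'
}
```

For example, `journalctl -k -b | has_cgroup2_warning && echo present` checks the kernel messages of the current boot; `dmesg | has_cgroup2_warning` works similarly but may need root on systems that restrict dmesg.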
Comment by José Luis Salvador Rufo (jlsalvador) - Thursday, 23 July 2020, 20:24 GMT
I confirm that linux-lts 5.4.53-1 resolves this issue.
Comment by slip (slip) - Thursday, 23 July 2020, 22:07 GMT
My 24-hour test on 5.7.10.arch1-1 has been successful as well. This appears to be resolved for me.
