FS#59483 - [linux] Random full freezes on linux-4.17.10-1

Attached to Project: Arch Linux
Opened by f (bakgwailo) - Monday, 30 July 2018, 04:54 GMT
Last edited by Andreas Radke (AndyRTR) - Tuesday, 01 March 2022, 21:12 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: Tobias Powalowski (tpowa), Jan Alexander Steffens (heftig)
Architecture: x86_64
Severity: High
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 9
Private: No

Details

Description:

I am getting pretty frequent full system lockups/freezes on the latest stable kernel. Using the latest LTS kernel seems to be OK. X/Plasma lock up completely, and keyboard input no longer works (i.e. I can't switch to virtual terminals, and can't even do things like toggle num-lock). The system is currently up to date with stable (testing repos not enabled). I have attached my journalctl output.

Additional info:
* Ryzen 2700x with a GTX-1070 using the latest drivers (396.45-1)


Steps to reproduce:
Boot the computer and wait a few minutes.

   crash (187.6 KiB)

Closed by  Andreas Radke (AndyRTR)
Tuesday, 01 March 2022, 21:12 GMT
Reason for closing:  Fixed
Additional comments about closing:  Fixed upstream.
Comment by f (bakgwailo) - Monday, 30 July 2018, 05:01 GMT
Attaching another log where it froze earlier. I have another, too, where the journalctl just... abruptly ends mid line with no errors.
   crash2 (915.7 KiB)
Comment by Heinz Witt (HeinzDo57) - Monday, 30 July 2018, 19:38 GMT
Ryzen 2600x with Radeon 560.
With the stable kernel I have only problems.
The computer does not shut down anymore and I also get these messages from systemd-udevd.
https://imgur.com/a/2erltyr
Comment by loqs (loqs) - Monday, 30 July 2018, 19:51 GMT
@HeinzDo57 please do not post a screenshot of a text file. Please test with the most recent version of the linux package 4.17.11-1
@HeinzDo57 and @bakgwailo what was the last version of linux package without the issue and the first version with the issue?
Comment by f (bakgwailo) - Monday, 30 July 2018, 20:08 GMT
So, I generally keep things up to date, but apparently my grub was defaulting to the LTS kernel for a while. I want to say, though, that 4.17.10 was fine when I explicitly booted into it, but I can't be sure, unfortunately. If it helps, neither of my Intel-based laptops has this issue on 4.17.10.
Comment by f (bakgwailo) - Monday, 30 July 2018, 20:09 GMT
I can also boot back into 4.17.10 to try to get some more logs - or if you need anything let me know, I can try to grab things before the freeze.
Comment by loqs (loqs) - Monday, 30 July 2018, 21:01 GMT
Jul 30 00:26:26 desktop kernel: BUG: unable to handle kernel paging request at ffff9652c92f8368
Jul 30 00:26:26 desktop kernel: PGD 3a7660067 P4D 3a7660067 PUD 0
Jul 30 00:26:26 desktop kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jul 30 00:26:26 desktop kernel: Modules linked in: snd_hda_codec_hdmi nct6775 hwmon_vid usblp arc4 nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) iwlmvm mac80211 nls_iso8859_1 nls_cp437 vfat fat kvm iwlwifi drm_kms_helper snd_hda_codec_realtek btusb btrtl btbcm snd_hda_codec_generic btintel uvcvideo raid10 irqbypass crct10dif_pclmul videobuf2_vmalloc videobuf2_memops crc32_pclmul snd_hda_intel videobuf2_v4l2 md_mod snd_usb_audio wmi_bmof mxm_wmi cfg80211 bluetooth ghash_clmulni_intel drm snd_hda_codec videobuf2_common snd_usbmidi_lib pcbc videodev snd_rawmidi snd_hda_core snd_seq_device snd_hwdep igb agpgart input_leds media ipmi_devintf snd_pcm led_class ipmi_msghandler ecdh_generic joydev syscopyarea sysfillrect mousedev aesni_intel i2c_algo_bit dca snd_timer aes_x86_64 crypto_simd rfkill cryptd snd glue_helper sysimgblt
Jul 30 00:26:26 desktop kernel: fb_sys_fops ccp(+) sp5100_tco soundcore rng_core i2c_piix4 k10temp pcspkr shpchp evdev rtc_cmos wmi mac_hid pinctrl_amd gpio_amdpt pcc_cpufreq acpi_cpufreq crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sr_mod cdrom sd_mod hid_roccat_konepure hid_roccat hid_roccat_common hid_generic usbhid hid ahci xhci_pci crc32c_intel libahci xhci_hcd libata usbcore scsi_mod usb_common
Jul 30 00:26:26 desktop kernel: CPU: 2 PID: 627 Comm: Xorg Tainted: P O 4.17.10-1-ARCH #1
Jul 30 00:26:26 desktop kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470 Taichi, BIOS P1.50 07/03/2018
Jul 30 00:26:26 desktop kernel: RIP: 0010:select_idle_sibling+0x38d/0x460
Jul 30 00:26:26 desktop kernel: RSP: 0018:ffffa98583defa08 EFLAGS: 00010006
Jul 30 00:26:26 desktop kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000001
Jul 30 00:26:26 desktop kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff96520cb91538
Jul 30 00:26:26 desktop kernel: RBP: 0000000000000047 R08: 000000cac3c9a9b5 R09: 0000000000000002
Jul 30 00:26:26 desktop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff96520cb91538
Jul 30 00:26:26 desktop kernel: R13: ffff9652c92f8368 R14: ffff96520cb92e00 R15: 0000000000000001
Jul 30 00:26:26 desktop kernel: FS: 00007fc3e37ece00(0000) GS:ffff96521ec80000(0000) knlGS:0000000000000000
Jul 30 00:26:26 desktop kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 30 00:26:26 desktop kernel: CR2: ffff9652c92f8368 CR3: 00000003ef586000 CR4: 00000000003406e0
Jul 30 00:26:26 desktop kernel: Call Trace:
Jul 30 00:26:26 desktop kernel: select_task_rq_fair+0xbcb/0xc20
Jul 30 00:26:26 desktop kernel: ? preempt_count_add+0x49/0xa0
Jul 30 00:26:26 desktop kernel: ? memcg_kmem_get_cache+0x8c/0x1b0
Jul 30 00:26:26 desktop kernel: ? preempt_count_add+0x49/0xa0
Jul 30 00:26:26 desktop kernel: ? memcg_kmem_put_cache+0x3f/0x70
Jul 30 00:26:26 desktop kernel: ? __kmalloc_node_track_caller+0x210/0x2b0
Jul 30 00:26:26 desktop kernel: ? __alloc_skb+0x82/0x1d0
Jul 30 00:26:26 desktop kernel: try_to_wake_up+0x13a/0x490
Jul 30 00:26:26 desktop kernel: pollwake+0x74/0x90
Jul 30 00:26:26 desktop kernel: ? wake_up_q+0x70/0x70
Jul 30 00:26:26 desktop kernel: __wake_up_common+0x77/0x140
Jul 30 00:26:26 desktop kernel: __wake_up_common_lock+0x7c/0xc0
Jul 30 00:26:26 desktop kernel: sock_def_readable+0x41/0x80
Jul 30 00:26:26 desktop kernel: unix_stream_sendmsg+0x1b5/0x3c0
Jul 30 00:26:26 desktop kernel: sock_sendmsg+0x33/0x40
Jul 30 00:26:26 desktop kernel: sock_write_iter+0x8f/0xf0
Jul 30 00:26:26 desktop kernel: do_iter_readv_writev+0x12b/0x190
Jul 30 00:26:26 desktop kernel: do_iter_write+0x80/0x190
Jul 30 00:26:26 desktop kernel: vfs_writev+0x84/0xf0
Jul 30 00:26:26 desktop kernel: ? __vfs_read+0x36/0x170
Jul 30 00:26:26 desktop kernel: do_writev+0x5c/0xf0
Jul 30 00:26:26 desktop kernel: do_syscall_64+0x5b/0x170
Jul 30 00:26:26 desktop kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 30 00:26:26 desktop kernel: RIP: 0033:0x7fc3e3351744
Jul 30 00:26:26 desktop kernel: RSP: 002b:00007ffea1f88ba8 EFLAGS: 00003246 ORIG_RAX: 0000000000000014
Jul 30 00:26:26 desktop kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3e3351744
Jul 30 00:26:26 desktop kernel: RDX: 0000000000000001 RSI: 00007ffea1f88e80 RDI: 0000000000000041
Jul 30 00:26:26 desktop kernel: RBP: 0000561e8c8b0490 R08: 0000000000000001 R09: 0000000000000007
Jul 30 00:26:26 desktop kernel: R10: 0000000000000001 R11: 0000000000003246 R12: 0000000000000001
Jul 30 00:26:26 desktop kernel: R13: 00007ffea1f88e80 R14: 0000000000000020 R15: 0000561e8c93f500
Jul 30 00:26:26 desktop kernel: Code: 44 24 08 e8 66 1e 67 00 41 89 c7 3d 3f 01 00 00 77 48 48 8b 04 24 4c 8d a8 68 03 00 00 eb 09 83 ed 01 0f 84 db fe ff ff 44 89 f8 <49> 0f a3 45 00 73 0c 44 89 ff e8 34 82 ff ff 85 c0 75 4f 44 89
Jul 30 00:26:26 desktop kernel: RIP: select_idle_sibling+0x38d/0x460 RSP: ffffa98583defa08
Jul 30 00:26:26 desktop kernel: CR2: ffff9652c92f8368
Jul 30 00:26:26 desktop kernel: ---[ end trace ead38b84905aaf33 ]---
Normally I would suggest reporting it upstream but upstream does not support issues produced on tainted kernels. Can you reproduce it without the nvidia modules?
You might also try testing 4.18-rc7.
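For readers unfamiliar with the "Tainted: P O" marker in the oops above, here is a minimal Python sketch that decodes it. It only covers the two flags that appear in this report (P for a proprietary module, O for an out-of-tree module); the function name `decode_taint` and the flag table are illustrative, not part of any kernel tooling.

```python
# Subset of kernel taint flags: only the two letters seen in this report.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "O": "externally-built (out-of-tree) module was loaded",
}


def decode_taint(line: str) -> list[str]:
    """Return descriptions of the taint letters found in an oops 'CPU:' line."""
    if "Not tainted" in line or "Tainted:" not in line:
        return []
    # The flags sit between 'Tainted:' and the kernel release string.
    tail = line.split("Tainted:", 1)[1]
    flags = []
    for token in tail.split():
        # Stop at the first token that is not made of uppercase letters
        # (in the oops above that is the kernel release, 4.17.10-1-ARCH).
        if not all(ch.isalpha() and ch.isupper() for ch in token):
            break
        flags.extend(token)
    return [TAINT_FLAGS.get(f, f"unknown flag {f!r}") for f in flags]


if __name__ == "__main__":
    oops_line = "CPU: 2 PID: 627 Comm: Xorg Tainted: P O 4.17.10-1-ARCH #1"
    for description in decode_taint(oops_line):
        print(description)
```

Both flags in this trace are consistent with the nvidia modules, which are marked (PO) in the "Modules linked in" list.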
Comment by f (bakgwailo) - Tuesday, 31 July 2018, 01:18 GMT
Unfortunately, I don't think nouveau works with the 1070/1000 series? If it does, I can give it a try.
Comment by loqs (loqs) - Tuesday, 31 July 2018, 14:15 GMT
I believe it has some support for the 1070 from looking at https://nouveau.freedesktop.org/wiki/FeatureMatrix/ but I expect you would need to use the modesetting X driver.
Comment by f (bakgwailo) - Tuesday, 31 July 2018, 14:40 GMT
Ah, nice - I must have missed that. I know the firmware for the 900+ series wasn't released for a long time. I can try it out (and also the 4.11 kernel) maybe tonight, although I will be home pretty late. If not tonight, then I can play with this after work tomorrow.
Comment by schnilch (schnilch) - Saturday, 11 August 2018, 18:18 GMT
I am using a Ryzen 2700X and Vega 64 with 4.18.0-rc8 and have the same problem. As far as I can tell, this is the first relevant message in dmesg:

[ 64.610838] systemd-udevd[361]: seq 2656 '/devices/pci0000:00/0000:00:07.1/0000:31:00.2' is taking a long time
[ 64.610846] systemd-udevd[361]: seq 2776 '/devices/system/cpu/cpu0' is taking a long time
[...]
[ 184.608688] systemd-udevd[361]: seq 2656 '/devices/pci0000:00/0000:00:07.1/0000:31:00.2' killed
[ 184.608756] systemd-udevd[361]: seq 2761 '/devices/system/cpu/cpu0' killed
[...]
[ 246.764303] INFO: task systemd-udevd:375 blocked for more than 120 seconds.
[ 246.764305] Not tainted 4.18.0-rc8-g8efcf34a2639 #1
[ 246.764306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Comment by f (bakgwailo) - Saturday, 11 August 2018, 19:22 GMT
So, I haven't had time to try to get nouveau working (my one attempt wasn't... great). But, an update: the latest LTS is still going strong with no issues. The latest stable (4.17.14.arch1-1) fixes the random freezes for me, but completely breaks shutdown and suspend, and has other things in the log. It looks like systemd-udev can never be stopped under that kernel, so it just hangs on reboot/shutdown. On suspend it just goes to a virtual terminal and continuously prints that it can't stop systemd-udev. If the mouse is moved or a key is pressed, it actually goes back into X and continues on its merry way as if the suspend had worked.
Comment by Heinz Witt (HeinzDo57) - Sunday, 12 August 2018, 04:52 GMT
For me it was a BIOS update (MSI X470 Gaming Plus, BIOS 7B79vA4 2018-07-03)
After I flashed the old BIOS (7B79vA3 2018-05-10), the problems were gone.
I then contacted MSI.
They checked that and thought it was the new AGESA code.
Comment by schnilch (schnilch) - Sunday, 12 August 2018, 08:02 GMT
Downgrading to AGESA 1.0.0.2 Patch C fixed it for me as well. Thank you Heinz.
Comment by gandalf3 (gandalf3) - Monday, 13 August 2018, 21:06 GMT
I was having a similar crash on a Ryzen 2700X and Vega 64 (currently on 4.17.13), and it hasn't happened since I updated the BIOS (ASRock AB350 Pro4, updated to BIOS 5.00).
No more freezes, but I'm still getting udev output similar to that posted by schnilch (forum thread: https://bbs.archlinux.org/viewtopic.php?id=239539). I'm not completely sure, but I don't think I got that output before updating the BIOS.
Oh, in addition to the freezes I was also somehow missing one CPU core on the old BIOS. I had to use Ryzen Master on Windows to re-enable it.

EDIT: I'm blind, I read "downgrade" as "upgrade". I will try that, thanks.
Comment by f (bakgwailo) - Tuesday, 14 August 2018, 16:23 GMT
Looks like it's been reported in a few other places:

https://bugzilla.redhat.com/show_bug.cgi?id=1608242#c11

http://forum.asrock.com/forum_posts.asp?TID=9179&title=new-asrock-x470-taichi-uefi-150

Basically, it looks like a bug in the latest BIOS. Solutions are to roll back to a pre-1.0.0.4a AGESA version, to use the LTS kernel, or really any pre-4.16 kernel, since 4.16 is when code was added that interacts with the PSP, which is bugged in the latest BIOS.
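To check whether a given machine falls into the combination described above, something like the following Python sketch could help. It assumes the 4.16 threshold stated in this comment and treats a loaded `ccp` module as the indicator that the PSP code is active; the helper names are made up for illustration, and the result is only a hint, not a diagnosis.

```python
import re


def kernel_major_minor(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '4.17.10-1-ARCH'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def ccp_loaded(proc_modules_text: str) -> bool:
    """True if 'ccp' appears as a module name in /proc/modules content."""
    # Each /proc/modules line starts with the module name, then a space.
    return any(
        line.split(" ", 1)[0] == "ccp"
        for line in proc_modules_text.splitlines()
    )


def may_be_affected(release: str, ccp_is_loaded: bool) -> bool:
    """Assumption from this thread: kernels >= 4.16 with ccp loaded are candidates."""
    return ccp_is_loaded and kernel_major_minor(release) >= (4, 16)


if __name__ == "__main__":
    import platform

    with open("/proc/modules") as fh:
        loaded = ccp_loaded(fh.read())
    print(may_be_affected(platform.release(), loaded))
```

The BIOS/AGESA version is deliberately not checked here, since how it is exposed differs between boards.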
Comment by Jaume (zjaume) - Sunday, 19 August 2018, 16:07 GMT
On my desktop, udevd is also slowing down the boot process a lot after the upgrade to AGESA 1.0.0.4. Here is the stack trace: https://pastebin.com/9k6tUvJB
But after booting, the CPU seems to work well.
CPU: Ryzen 5 1600
Mobo: Gigabyte AB350M-Gaming3
Comment by Kevin (HarlemSquirrel) - Thursday, 30 August 2018, 04:37 GMT
Experiencing the same issue with an Aorus X470 Ultra Gaming Mobo with BIOS F3 with AGESA 1.0.0.4
Comment by t-ask (tAsk) - Thursday, 30 August 2018, 18:38 GMT
MSI X470 Gaming Plus BIOS update to "7B79vA4 2018-07-03" is broken ... downgrading back to E7B79AMS fixes this issue for me.
Comment by roqz (roqz) - Friday, 31 August 2018, 03:46 GMT
Just for the sake of documentation, I'm having the same issue with the Gigabyte GA-AB350M-D3H (rev. 1.0) motherboard, updated to BIOS F23, which states that the main/only change is "Update AGESA 1.0.0.4".

The issue is also documented here:

https://forum.level1techs.com/t/aorus-x399-gaming-7-new-bios-update/129389/5

It seems that the 'ccp' module is the culprit, and that the issue can be fixed by compiling the kernel with CONFIG_CRYPTO_DEV_SP_PSP=n.
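Before and after rebuilding, it can be useful to confirm how the running kernel was actually configured. The Python sketch below reads the option from `/proc/config.gz`, which is only present when the kernel was built with CONFIG_IKCONFIG_PROC (an assumption here); the function names are illustrative.

```python
import gzip


def config_value(config_text: str, option: str):
    """Return the value of a Kconfig option, 'n' if explicitly unset, else None."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options are written as a comment, not as OPTION=n.
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


def read_running_config(path: str = "/proc/config.gz") -> str:
    """Read the running kernel's config; requires CONFIG_IKCONFIG_PROC."""
    with gzip.open(path, "rt") as fh:
        return fh.read()


if __name__ == "__main__":
    text = read_running_config()
    print(config_value(text, "CONFIG_CRYPTO_DEV_SP_PSP"))
```

A result of `y` would mean the PSP support is built in, `n` that it was disabled at build time, and `None` that the option does not appear in the config at all.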
Comment by odites (odites) - Wednesday, 05 September 2018, 16:51 GMT
I have the same problem with a 2600X, RX 580 8G, and MSI B350M Mortar with the latest BIOS (AGESA 1.0.0.4C) and the latest kernel.
Comment by Matej Špindler (MatejSpindler) - Thursday, 20 September 2018, 09:20 GMT
Same problems with my computer:
X470 Aorus ultra gaming
Ryzen 2700x
F3 BIOS with AGESA 1.0.0.4

Recompiling arch kernel with CONFIG_CRYPTO_DEV_SP_PSP=n solves the problem.
Comment by Doug (neo-alquimista) - Saturday, 20 October 2018, 12:08 GMT
This is a piece of my logs from around the time of the freeze. It happened to me on Ubuntu and Fedora as well when running a recent 4.17+ kernel. I couldn't pinpoint the exact version where it starts.
Comment by Doug (neo-alquimista) - Tuesday, 13 November 2018, 14:28 GMT
I am running on Intel Graphics and this affects me.
Comment by mattia (nTia89) - Monday, 28 February 2022, 16:42 GMT
I cannot reproduce the issue. Is it still valid for you?
Comment by roqz (roqz) - Monday, 28 February 2022, 20:22 GMT
nTia89: issue not valid anymore, was fixed.