FS#74405 - [kernel][AMD][IOMMU] Bug in AMD IOMMU on Ryzen leads to suspend to RAM not resuming properly

Attached to Project: Arch Linux
Opened by Lahfa Samy (AkechiShiro) - Friday, 08 April 2022, 13:22 GMT
Last edited by Toolybird (Toolybird) - Thursday, 14 September 2023, 07:12 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
Arch Linux does not resume properly from suspend to RAM because of an AMD IOMMU bug: on resume, the kernel emits a warning/oops in the code that enables IOMMU interrupts.


Additional info:
* linux 5.17.1 mainline kernel
* ThinkPad T495, Ryzen 7 3700U with Radeon RX Vega 10 (iGPU)
* I'm planning to report this upstream on the Linux kernel Bugzilla, against the IOMMU driver.
* This issue started very recently with this kernel. I believe the last kernel that worked was 5.16.16, so the regression may have been introduced in 5.17.

Steps to reproduce:
- Boot on linux 5.17.1
- systemctl suspend
- Press the power button to resume.
- If an X11 graphical server was started before suspending, the system cannot resume from suspend to RAM (black screen) and a forced reboot is needed.

The dmesg output below was captured with the kernel parameters `no_console_suspend`, `initcall_debug` and `ignore_loglevel`.
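
For anyone trying to capture a comparable log, a minimal Python sketch for checking that these three parameters are actually on the kernel command line. The helper name `missing_debug_params` is hypothetical and not part of the original report; it only compares parameter names and ignores any `=value` suffix.

```python
# Hypothetical helper, not part of the original report: checks a kernel
# command line string for the three debug parameters mentioned above.
REQUIRED = ("no_console_suspend", "initcall_debug", "ignore_loglevel")


def missing_debug_params(cmdline):
    """Return the required parameters that are absent from cmdline."""
    # Split on whitespace and drop any "=value" part so that only the
    # parameter names are compared.
    present = {token.split("=", 1)[0] for token in cmdline.split()}
    return [name for name in REQUIRED if name not in present]
```

On a running system the string to pass in would be the contents of `/proc/cmdline`; an empty result means all three parameters are set.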

Here is the relevant output:
[ 82.540316] ACPI: PM: Preparing to enter system sleep state S3
[ 82.547782] ACPI: EC: event blocked
[ 82.547784] ACPI: EC: EC stopped
[ 82.547785] ACPI: PM: Saving platform NVS memory
[ 82.548228] Disabling non-boot CPUs ...
[ 82.550506] smpboot: CPU 1 is now offline
[ 82.553132] smpboot: CPU 2 is now offline
[ 82.555485] smpboot: CPU 3 is now offline
[ 82.557593] smpboot: CPU 4 is now offline
[ 82.559873] smpboot: CPU 5 is now offline
[ 82.561829] smpboot: CPU 6 is now offline
[ 82.563933] smpboot: CPU 7 is now offline
[ 82.565077] ACPI: PM: Low-level resume complete
[ 82.565107] ACPI: EC: EC started
[ 82.565108] ACPI: PM: Restoring platform NVS memory
[ 83.718277] ------------[ cut here ]------------
[ 83.718278] WARNING: CPU: 0 PID: 2572 at drivers/iommu/amd/init.c:851 amd_iommu_enable_interrupts+0x34d/0x420
[ 83.718290] Modules linked in: ccm cmac algif_hash algif_skcipher af_alg bnep lm92 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc btusb btrtl btbcm btintel btmtk bluetooth intel_rapl_msr ecdh_generic joydev mousedev intel_rapl_common crc16 edac_mce_amd snd_sof_amd_renoir snd_acp_config kvm_amd iwlmvm snd_sof_amd_acp kvm snd_sof_pci irqbypass snd_sof mac80211 snd_ctl_led snd_soc_acpi crct10dif_pclmul snd_hda_codec_realtek think_lmi crc32_pclmul libarc4 snd_hda_codec_hdmi snd_hda_codec_generic firmware_attributes_class crc32c_intel snd_soc_core ghash_clmulni_intel snd_hda_intel aesni_intel wmi_bmof snd_compress snd_intel_dspcfg iwlwifi snd_intel_sdw_acpi crypto_simd ac97_bus vfat snd_hda_codec snd_pcm_dmaengine cryptd iwlmei fat rapl snd_hda_core snd_pci_acp6x thinkpad_acpi snd_pci_acp5x snd_hwdep tpm_crb ledtrig_audio snd_pcm cfg80211 psmouse sp5100_tco platform_profile snd_rn_pci_acp3x ucsi_acpi zenpower(OE) snd_timer tpm_tis rfkill i2c_piix4
[ 83.718366] typec_ucsi snd ipmi_devintf typec snd_pci_acp3x tpm_tis_core ccp mei ipmi_msghandler r8168(OE) soundcore roles wmi tpm video rng_core i2c_scmi pinctrl_amd mac_hid acpi_cpufreq sg crypto_user acpi_call(OE) fuse bpf_preload ip_tables x_tables usbhid zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) serio_raw atkbd libps2 sdhci_pci cqhci sdhci xhci_pci xhci_pci_renesas mmc_core i8042 serio radeon amdgpu gpu_sched drm_ttm_helper ttm
[ 83.718413] CPU: 0 PID: 2572 Comm: systemd-sleep Tainted: P OE 5.17.1-arch1-1 #1 0ea933cb6bfe82a8dc16ab834a4bccdd297f98b7
[ 83.718418] Hardware name: LENOVO 20NKS28F00/20NKS28F00, BIOS R12ET55W(1.25 ) 07/06/2020
[ 83.718421] RIP: 0010:amd_iommu_enable_interrupts+0x34d/0x420
[ 83.718427] Code: ff ff 49 8b 7f 18 89 04 24 e8 9f 36 ee ff 8b 04 24 e9 4b fd ff ff 0f 0b 4d 8b 3f 49 81 ff 50 09 56 99 0f 85 05 fd ff ff eb 96 <0f> 0b 4d 8b 3f 49 81 ff 50 09 56 99 0f 85 f1 fc ff ff eb 82 31 f6
[ 83.718429] RSP: 0018:ffffa787405cbc68 EFLAGS: 00010046
[ 83.718432] RAX: 00000001262cdc89 RBX: 0000000000000000 RCX: 0000000000000000
[ 83.718434] RDX: 000000000000607e RSI: 00000000000059ae RDI: 00000001262c7c0b
[ 83.718436] RBP: 0000000080000000 R08: 0000000000000000 R09: 000000000000000f
[ 83.718437] R10: 0000000079726f6d R11: 000000006d656d20 R12: 000ffffffffffff8
[ 83.718439] R13: 0800000000000000 R14: ffffa787405cbc70 R15: ffff95d48004a800
[ 83.718441] FS: 00007fb3d354fe80(0000) GS:ffff95d76fa00000(0000) knlGS:0000000000000000
[ 83.718443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 83.718445] CR2: 00007f42204d6ad0 CR3: 000000012dbe8000 CR4: 00000000003506f0
[ 83.718447] Call Trace:
[ 83.718450] <TASK>
[ 83.718455] ? early_enable_iommus+0x1c5/0x300
[ 83.718460] ? enable_iommus_v2+0x8e/0x130
[ 83.718464] syscore_resume+0x4b/0x160
[ 83.718469] suspend_devices_and_enter+0x6d3/0x7d0
[ 83.718476] pm_suspend.cold+0x2fb/0x342
[ 83.718482] state_store+0x71/0xd0
[ 83.718487] kernfs_fop_write_iter+0x11c/0x1b0
[ 83.718493] new_sync_write+0x15c/0x1f0
[ 83.718500] vfs_write+0x1eb/0x280
[ 83.718503] ksys_write+0x67/0xe0
[ 83.718506] do_syscall_64+0x5c/0x80
[ 83.718511] ? do_syscall_64+0x69/0x80
[ 83.718513] ? exc_page_fault+0x72/0x170
[ 83.718517] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 83.718522] RIP: 0033:0x7fb3d3f44257
[ 83.718526] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 83.718528] RSP: 002b:00007ffeda5645a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 83.718531] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb3d3f44257
[ 83.718532] RDX: 0000000000000004 RSI: 00007ffeda564690 RDI: 0000000000000004
[ 83.718534] RBP: 00007ffeda564690 R08: 000055ba9c2d1230 R09: 0000000000000000
[ 83.718535] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004
[ 83.718536] R13: 000055ba9c2cd3c0 R14: 0000000000000004 R15: 00007fb3d403d7a0
[ 83.718540] </TASK>
[ 83.718541] ---[ end trace 0000000000000000 ]---
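
When comparing traces between kernel versions, it can help to extract the warning site from the log programmatically. The following is a minimal Python sketch, assuming the `WARNING: CPU: ... PID: ... at file:line symbol` layout seen in the log above; the function name `parse_warning` is hypothetical, and other kernel versions may format this line differently.

```python
import re

# Matches the "WARNING: CPU: N PID: N at file:line symbol+off/size" form
# seen in the log above. This layout is an assumption based on that log.
WARN_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<symbol>\S+)"
)


def parse_warning(log_line):
    """Return a dict describing the warning site, or None if no match."""
    match = WARN_RE.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "cpu": int(fields["cpu"]),
        "pid": int(fields["pid"]),
        "file": fields["file"],
        "line": int(fields["line"]),
        "symbol": fields["symbol"],
    }
```

Applied to the warning line above, this would report `drivers/iommu/amd/init.c`, line 851, in `amd_iommu_enable_interrupts`, which makes it easy to see whether a later kernel warns at the same source location.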

Closed by Toolybird (Toolybird)
Thursday, 14 September 2023, 07:12 GMT
Reason for closing: Fixed
Additional comments about closing: See comments
Comment by Andreas Radke (AndyRTR) - Friday, 08 April 2022, 14:25 GMT
Comment by Lahfa Samy (AkechiShiro) - Friday, 08 April 2022, 15:08 GMT
But wouldn't this be an upstream bug? How could Lenovo support help here?
Comment by Lahfa Samy (AkechiShiro) - Friday, 08 April 2022, 21:44 GMT
Upstream report on the kernel Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215821
Comment by loqs (loqs) - Saturday, 09 April 2022, 04:00 GMT
This might be the same bug as https://lore.kernel.org/lkml/20220125150832.1570-1-mike%40fireburn.co.uk/, which has a test fix you could try.
Edit:
See also FS#74285
Comment by Lahfa Samy (AkechiShiro) - Saturday, 07 May 2022, 04:08 GMT
Quick update: the bug is now mostly gone. I had been staying on `linux-lts` to avoid it, but I just tried the mainline linux 5.17.5-arch1-1 kernel. The oops from the kernel is now different, and resuming from suspend to RAM no longer fails on my laptop.

I've attached the new logs. The oops still seems to be in the same function, but at a different location.
