FS#69764 - [linux] Upgrade to 5.11 - desktop fails to wake from sleep

Attached to Project: Arch Linux
Opened by James (thx1138) - Wednesday, 24 February 2021, 20:05 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Wednesday, 09 March 2022, 02:10 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Architecture x86_64
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 6
Private No

Details

linux 5.11.1.arch1-1

After upgrading from 5.10 to 5.11 on a laptop with an Intel Core2 and ATI Mobility Radeon X1600, running the LXQt desktop, sleep seems to work normally, but on wake from sleep the desktop is frozen. The mouse still works and caps lock still works, but the tray clock is frozen and application windows do not respond. Non-desktop processes still work normally. For instance, ssh recovers from sleep and wake. After sleep and wake, `ps wax` shows a seemingly large number of kworker processes remaining in state I, "Idle kernel thread", of the form `[kworker/u4:21-events_unbound]`.
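
For anyone who wants to compare the number of idle kworker threads before and after a suspend cycle, a rough sketch follows. The function name `count_idle_kworkers` is made up for illustration, and it assumes the default `ps wax` column layout (PID, TTY, STAT, TIME, COMMAND).

```shell
# Count kernel worker threads that `ps wax` reports in state I (idle).
# Reads `ps wax`-style output on stdin; the helper name is hypothetical.
count_idle_kworkers() {
    # $3 is the STAT column, $5 is the start of the COMMAND column.
    awk '$3 ~ /^I/ && index($5, "[kworker/") == 1 { n++ } END { print n + 0 }'
}

# Possible usage on a live system, once before sleep and once after wake:
#   ps wax | count_idle_kworkers
```

A large jump in the count after wake would match the symptom described above, though idle kworker threads on their own are normal.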

Reverting to the lts kernel, linux-lts 5.10.18-1, desktop processes work as expected after sleep and wake.

Ideas? Suggestions?

Closed by  Sven-Hendrik Haase (Svenstaro)
Wednesday, 09 March 2022, 02:10 GMT
Reason for closing:  Fixed
Additional comments about closing:  2022-03-03: A task closure has been requested. Reason for request: All three separately mentioned and investigated issues got patched, merged and reported as fixed (by 5.11.12/5.12-rc4; then 5.12.4; then 5.13). Details/links in my last comment.
Comment by Crazy Frog (volvenstein) - Saturday, 27 February 2021, 10:17 GMT
The Ryzen 4500U RedmiBook is also affected.
The screen never turns on after sleep.
It doesn't respond to any input.
A USB-connected smartphone seems to establish tethering successfully, but if I select file sharing, the phone even stops charging.
   dmesg.txt (16.5 KiB)
Comment by James (thx1138) - Saturday, 27 February 2021, 13:48 GMT
Again using linux 5.11.2.arch1-1, I also upgraded a Toshiba Satellite with Intel Integrated Graphics Controller running the i915 driver, and no problem with sleep and wake.

Another LXQt user with a non-Radeon GPU did not see any problem with sleep and wake.

I see that the ryzen 4500u redmibook has the Radeon RX Vega 6 integrated GPU.

So, it seems that this may be a radeon driver issue, which would be consistent with the screen freezing, and most everything else still working.
Comment by James (thx1138) - Saturday, 27 February 2021, 14:34 GMT
From ssh, on the HP with the ATI Mobility Radeon X1600, a `sudo rmmod -v -f radeon;sudo modprobe radeon` restores the display and keyboard. The original desktop is lost, of course, leaving the original Xorg process a zombie. All of those Idle kernel threads also disappear. So, again, this seems to be a radeon driver issue.
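
The recovery step above could be wrapped in a small helper to run over ssh. This is only a sketch: `reload_gpu_module` and the `DRY_RUN` variable are hypothetical names, and unlike the one-liner above it uses `&&`, so `modprobe` only runs if the forced removal succeeded. As noted, the running desktop session is lost.

```shell
# Force-remove and reload a GPU kernel module (default: radeon).
# Set DRY_RUN=1 to print the commands instead of running them.
# Helper and variable names are illustrative, not from any existing tool.
reload_gpu_module() {
    module=${1:-radeon}
    run=${DRY_RUN:+echo}    # expands to "echo" only when DRY_RUN is set
    $run sudo rmmod -v -f "$module" && $run sudo modprobe "$module"
}

# Possible usage from an ssh session after a frozen wake:
#   DRY_RUN=1 reload_gpu_module radeon   # show what would run
#   reload_gpu_module radeon             # actually reload the driver
```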
Comment by James (thx1138) - Saturday, 27 February 2021, 19:03 GMT
Same issue in linux 5.11.2.arch1-1
I sent a note upstream for the radeon driver.
Comment by James (thx1138) - Monday, 01 March 2021, 19:28 GMT
Apparently, the X1600 and the RX Vega 6 use completely different drivers, radeon and amdgpu, so a kernel bisect is best. DRM Memory Management is a common element, so that's a possible area.
Comment by Stanimir (korikori) - Tuesday, 02 March 2021, 02:56 GMT
Can confirm on an Ideapad 5 laptop with the Ryzen 4500u CPU and the amdgpu driver. Going back to linux-lts resolves this.
Comment by James (thx1138) - Thursday, 04 March 2021, 17:02 GMT
Note sent to upstream.
```
$ git bisect bad
0b8793f6e7fc097c112f1848aa7dab60b9ede5a7 is the first bad commit
commit 0b8793f6e7fc097c112f1848aa7dab60b9ede5a7
Author: Christian König <christian.koenig@amd.com>
Date: Mon Sep 21 13:18:02 2020 +0200

drm/radeon: switch over to the new pin interface

Stop using TTM_PL_FLAG_NO_EVICT.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Link: https://patchwork.freedesktop.org/patch/391610/?series=81973&rev=1

drivers/gpu/drm/radeon/radeon.h | 1 -
drivers/gpu/drm/radeon/radeon_display.c | 9 ++------
drivers/gpu/drm/radeon/radeon_object.c | 37 ++++++++-------------------------
drivers/gpu/drm/radeon/radeon_object.h | 2 +-
drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
5 files changed, 13 insertions(+), 38 deletions(-)
```
and the system log is showing:
```
kernel: WARNING: CPU: 1 PID: 799 at include/drm/ttm/ttm_bo_api.h:608 radeon_bo_unpin+0x47/0x60 [radeon]
...
kernel: CPU: 1 PID: 799 Comm: kworker/u4:17 Not tainted 5.9.0-rc5-1 #11
kernel: Hardware name: Hewlett-Packard /309F, BIOS 68YAF Ver. F.1D 07/11/2008
kernel: Workqueue: events_unbound async_run_entry_fn
kernel: RIP: 0010:radeon_bo_unpin+0x47/0x60 [radeon]
...
kernel: Call Trace:
kernel: radeon_gart_table_vram_unpin+0x47/0xa0 [radeon]
kernel: r520_resume+0x74/0xb0 [radeon]
kernel: radeon_resume_kms+0x5c/0x350 [radeon]
kernel: ? pci_pm_restore+0xe0/0xe0
kernel: dpm_run_callback+0x4f/0x180
kernel: device_resume+0xa7/0x200
kernel: async_resume+0x19/0x30
kernel: async_run_entry_fn+0x37/0x140
kernel: process_one_work+0x1da/0x3d0
kernel: worker_thread+0x4d/0x3d0
kernel: ? rescuer_thread+0x410/0x410
kernel: kthread+0x133/0x150
kernel: ? __kthread_bind_mask+0x60/0x60
kernel: ret_from_fork+0x22/0x30
kernel: ---[ end trace 8908b03655c5613e ]---
```
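
To pull just the warning sites out of a long log, something like the following may help. `kernel_warning_sites` is a hypothetical helper name, and it assumes the `WARNING: CPU: N PID: N at <file:line> <function>` layout shown in the log above.

```shell
# Print the unique "file:line function+offset" pairs from kernel WARNING
# lines read on stdin. Assumes the two fields follow the word "at".
kernel_warning_sites() {
    awk '/WARNING: CPU:/ {
        for (i = 1; i < NF; i++)
            if ($i == "at") { print $(i + 1), $(i + 2); break }
    }' | sort -u
}

# Possible usage against the kernel messages of the current boot:
#   journalctl -k -b | kernel_warning_sites
```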

The commit is one of a series, 08/11, as you can see at the patchwork link. The amdgpu driver is addressed in 09/11. The amdgpu driver has similar functions, amdgpu_bo_unpin() and amdgpu_gart_table_vram_unpin(). I have only the radeon hardware to test. The patch set changes the functions radeon_bo_unpin() and amdgpu_bo_unpin() and changes their return type from `int` to `void`, but amdgpu_object.c still includes the comment:
```
* Returns:
* 0 for success or a negative error code on failure.
```
Comment by Hongpeng-Li (Hongpeng-Li) - Monday, 08 March 2021, 12:48 GMT
I ran into the same problem on my laptop (AMD R5 4600U, Radeon): it could not wake from suspend or hibernation, but everything else works well. The suggestions on the power management wiki page didn't help. I finally switched my kernel to 5.10.21-1-lts, and now the problems are solved. Probably there's something unstable in the new kernel.
Comment by Zach Smith (zaxmyth) - Monday, 08 March 2021, 14:31 GMT
Same problem on Intel Core i9-9880H with UHD Graphics 630. Booted 5.10.21-1-lts and resume from sleep works correctly.

I will also note that resume from hibernate on 5.11.2 works as expected.
Comment by Zach Smith (zaxmyth) - Monday, 08 March 2021, 17:11 GMT
Comment by James (thx1138) - Saturday, 13 March 2021, 18:07 GMT
There is a first draft patch - attached - which resolves the freeze on wake issue on the radeon driver, but still produces kernel warnings. It is not yet clear if the issue has the same cause on the amdgpu driver or the Intel driver. I don't have that hardware. If someone would test the effect of this patch against the amdgpu driver, or the Intel driver, or even just report kernel warnings in the system log after the freeze on wake, that could be helpful.
Comment by Stanimir (korikori) - Sunday, 14 March 2021, 04:14 GMT
Just applied your patch to 5.11.6 and I can confirm that wake from suspend works fine on an AMD 4500U CPU (amdgpu). Attaching a copy of my dmesg logs from the suspend cycle.
Comment by James (thx1138) - Sunday, 14 March 2021, 17:27 GMT
Thanks for your report! Note sent upstream.
Comment by loqs (loqs) - Sunday, 14 March 2021, 19:10 GMT
@thx1138 is there an upstream public bug report or are you working with the developers privately?
Comment by James (thx1138) - Sunday, 14 March 2021, 21:56 GMT
As far as I know, this is the only thread for this bug. I've been emailing the developers directly, those names on the source files.
Comment by James (thx1138) - Friday, 26 March 2021, 19:25 GMT
Latest update on Christian's patch, from Alex Deucher:

It landed in the kernel last week:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c5403173a13a08ff61dbdafa4c0ed4a9dedbfe0
Just needs to go to 5.11 stable.
Comment by James (thx1138) - Friday, 26 March 2021, 20:52 GMT
There seems to be some confusion. Christian's patch was applied and then reverted, I suspect after being confused with another patch. Patience.
Comment by loqs (loqs) - Friday, 26 March 2021, 21:11 GMT
@thx1138 it was Christian König who requested the patch be reverted [1]. May have to wait for 5.12 which will contain the whole patch series.

[1] https://lore.kernel.org/lkml/8c3da8bc-0bf3-496f-1fd6-4f65a07b2d13%40amd.com/
Comment by James (thx1138) - Saturday, 27 March 2021, 14:58 GMT
On Thu, Mar 25, 2021 at 5:16 AM Christian König <christian.koenig@amd.com> wrote:
> Am 25.03.21 um 10:01 schrieb Greg KH:
> > On Thu, Mar 25, 2021 at 09:57:04AM +0100, Christian König wrote:
> >> This one here can be kept. It is unrelated to the warning caused by the
> >> other patch.
> > It causes a revert issue with the other patch, which is why I dropped
> > both of them.
>
> Ah, of course.
>
> > I'll gladly take this one, if someone wants to provide a working
> > backport
>
> Going to add that to my TODO list.
Comment by loqs (loqs) - Saturday, 27 March 2021, 15:56 GMT
@thx1138 Attached is an attempt at backporting 6c5403173a13a08ff61dbdafa4c0ed4a9dedbfe0 to 5.11.10. Does it work for you?
Comment by James (thx1138) - Saturday, 27 March 2021, 18:01 GMT
Yes, that looks correct. Do you want to run that through Christian, since he has been handling this issue? Or, go straight to Greg and CC Christian?
Comment by loqs (loqs) - Saturday, 27 March 2021, 18:59 GMT
If you have tested it, can you send it upstream whichever way you think is more appropriate?
This version adds the commit message back. Upstream does not accept anonymous commits, which is perfectly understandable.
Comment by James (thx1138) - Sunday, 28 March 2021, 00:58 GMT
I'll forward the plain patch to Christian. That might move things along.
Comment by loqs (loqs) - Sunday, 04 April 2021, 15:04 GMT
Comment by Alexey Stukalov (alyst) - Thursday, 08 April 2021, 16:57 GMT
I don't know if it's the same issue, but since the 5.11 upgrade (including 5.11.12-arch1-1, which I have just tested) the screen backlight is not turned on when my Thinkpad T470s wakes from suspend while on battery.
Comment by James (thx1138) - Thursday, 08 April 2021, 19:29 GMT
This 5.11.12-arch1-1 needs testing.

The patch resolves the wake from sleep issue for the radeon driver and my ATI Mobility Radeon X1600, though there is still the "radeon_bo_unpin" warning, which is incidental and not catastrophic.

Is this update resolving the issue when using the amdgpu driver?

@zaxmyth - is there still an issue when using the Intel UHD Graphics 630? Or, is that a completely different issue?

If I understand, the source file patched, "include/drm/ttm/ttm_bo_api.h", is not just AMD/ATI specific.

@alyst, I don't know that it would affect the screen backlight, but then, I don't know that it would not. The Lenovo website says "Intel HD Graphics 520" for the Thinkpad T470s. You might check the 5.11.12 changelog for anything else that might be suspicious - https://lwn.net/Articles/851870/
Comment by loqs (loqs) - Thursday, 08 April 2021, 20:28 GMT
@alyst do you mean the issue was introduced by 5.11.12-arch1-1 and is not present in 5.11.11-arch1-1 or was introduced by the 5.11 series and has been present in every Arch release of that series?

@thx1138 is the warning present under 5.12-rc6?
Comment by Alexey Stukalov (alyst) - Thursday, 08 April 2021, 20:48 GMT
@loqs I have this issue with every 5.11.x release I have tested. I had hopes for the patch proposed here (retrospectively I'm not sure why :)), but it doesn't fix the behavior.
Comment by James (thx1138) - Thursday, 08 April 2021, 22:40 GMT
> is the warning present under 5.12-rc6?

I have not checked, but my impression has been that fixing buffer object pinning is something still on the "to-do" list for the radeon driver developers.
Comment by James (thx1138) - Friday, 09 April 2021, 14:53 GMT
Christian says that the Intel hardware problems are completely unrelated since they don't use TTM at all, and that the buffer object pinning problem is still an open issue.

Someone having sleep-wake problems with the Intel GPU may need to do a bisect.
Comment by Alexander Kaltsas (firewalker) - Thursday, 29 April 2021, 09:25 GMT
I have similar problems with my ThinkPad T15 (Ryzen 5, amdgpu). Up until linux-5.11.16. Everything is OK. With every linux-5.11.x version the laptop can't wake up. Black screen, and nothing works (unable to ssh, Caps Lock is dead, etc.). What is the conclusion of this bug? Waiting for 5.12.x?
Comment by James (thx1138) - Thursday, 29 April 2021, 22:45 GMT
A patch that resolves this issue for the devices tested made it into 5.11.12. What you are describing seems to be a different issue. You can look at https://gitlab.freedesktop.org/drm/amd/-/issues/1575 , "[amdgpu] kernel crash when trying to resume from suspend", to see if that appears to be your issue. If not, then open a new issue there, at https://gitlab.freedesktop.org/drm/amd/-/issues .
Comment by Stanimir (korikori) - Monday, 10 May 2021, 13:33 GMT
I am experiencing an even more "catastrophic" version of this issue with 5.12 - after waking up from suspend, the screen is completely blank, and the system is unresponsive (has to be hard-rebooted). From what I can see, the problem is still the amdgpu driver - these are the most relevant entries at the time of the wake request:

May 10 16:04:46 tempest kernel: pci 0000:00:00.2: can't derive routing for PCI INT A
May 10 16:04:46 tempest kernel: pci 0000:00:00.2: PCI INT A: no GSI
May 10 16:04:46 tempest kernel: nvme nvme0: 15/0/0 default/read/poll queues
May 10 16:04:46 tempest kernel: nvme nvme1: Shutdown timeout set to 8 seconds
May 10 16:04:46 tempest kernel: nvme nvme1: 12/0/0 default/read/poll queues
May 10 16:04:46 tempest kernel: amdgpu 0000:05:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
May 10 16:04:46 tempest kernel: amdgpu 0000:05:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
May 10 16:04:46 tempest kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
May 10 16:04:46 tempest kernel: amdgpu 0000:05:00.0: amdgpu: SMU is resuming...
May 10 16:04:46 tempest kernel: amdgpu 0000:05:00.0: amdgpu: dpm has been disabled
May 10 16:04:46 tempest kernel: amdgpu 0000:05:00.0: amdgpu: SMU is resumed successfully!
May 10 16:04:46 tempest kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
May 10 16:04:46 tempest kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
May 10 16:04:46 tempest kernel: amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
May 10 16:04:46 tempest kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x1c0 returns -110
May 10 16:04:46 tempest kernel: amdgpu 0000:05:00.0: PM: failed to resume async: error -110
May 10 16:04:46 tempest kernel: acpi LNXPOWER:08: Turning OFF
May 10 16:04:46 tempest kernel: acpi LNXPOWER:07: Turning OFF
May 10 16:04:46 tempest kernel: acpi LNXPOWER:05: Turning OFF
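
For picking the failing block out of a log like this one, a sketch along these lines might be useful. `failed_ip_blocks` is a hypothetical helper name, and it assumes the `resume of IP block <name> failed <code>` wording seen in the log above. On Linux, error code -110 corresponds to ETIMEDOUT.

```shell
# Print "<ip_block> <error_code>" for every amdgpu IP block that failed to
# resume, reading kernel log lines on stdin. Assumes the message wording
# "resume of IP block <name> failed <code>".
failed_ip_blocks() {
    sed -n 's/.*resume of IP block <\([^>]*\)> failed \(-\{0,1\}[0-9][0-9]*\).*/\1 \2/p'
}

# Possible usage against the kernel messages of the previous boot:
#   journalctl -k -b -1 | failed_ip_blocks
```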
Comment by Sefa Eyeoglu (Scrumplex) - Tuesday, 11 May 2021, 08:36 GMT
I have had a similar issue since 5.11. After some debugging using a serial port, I finally managed to capture a panic in amdgpu.

I patched the bug and submitted it to the amd-gfx mailing list. I don't know though when it will be merged into mainline.

See Mailing List: https://lists.freedesktop.org/archives/amd-gfx/2021-March/060754.html
Alex Deucher's (one of the AMDGPU Maintainers) drm-next branch: https://gitlab.freedesktop.org/agd5f/linux/-/commit/7df4ceb60fa9a3c5160cfd5b696657291934a2c9

So backporting that might fix the issue
Comment by Alexey Stukalov (alyst) - Tuesday, 11 May 2021, 21:15 GMT
Just to report that currently I'm on 5.12.2-arch1, and my issues with the T470s Thinkpad backlight not turning back on after the suspend seem to be gone.
Also the screen brightness control seems to be restored.
Comment by Peter Schröder (espresso) - Saturday, 05 February 2022, 06:01 GMT
Why is this bug report still open? It uses an ancient Linux kernel and has not been re-tested with the most recent one. Is this an actual Arch Linux bug, and has this been confirmed to work on other distributions, or is it a Linux kernel/amdgpu/linux-firmware/.. bug? I think it's time to close this, and many similar old bug reports here. PS: I'm not even going to mention the old hardware.
Comment by James (thx1138) - Saturday, 05 February 2022, 07:41 GMT
You may note that there is a "Request Closure" link, below the initial bug report, that you may use. I have no objection. I reported the original bug resolved in April 2021, but then, there are other users searching for guidance with similar-seeming issues, and that may have been the motivation for this issue being left open. I suppose that's on me. As it is, AMD is a bit lethargic when it comes to bug fixes in legacy drivers. In the end, the authority to close a bug is a little "fuzzy" with archlinux, and I might suggest appealing directly to the hierarchy.

By the way, as for the "old hardware", I am responsible for quite my share of fixes to regressions in the kernel running on that "old hardware". As long as other people are not having problems with their newer hardware, that's great. Software always "just works" - until it doesn't.
Comment by Marcell Meszaros (MarsSeed) - Thursday, 03 March 2022, 15:22 GMT
For OP issue (radeon, linux 5.11.1) by @thx1138: Patch "drm/ttm: make ttm_bo_unpin more defensive"
- Fix merged to Linux 5.11.12: https://lwn.net/Articles/851870/
- Fix merged to Linux 5.12-rc4: https://lwn.net/Articles/849985/

Next issue (Ryzen 5, amdgpu, linux-5.11.x) @firewalker: Patch "drm/amd/display: check fb of primary plane"
- Fix merged to Linux 5.12.4: https://lwn.net/Articles/856267/

New issue mentioned by @korikori (Ryzen 4500u, amdgpu, linux-5.12) (10 May 2021):
- Reported fixed by Linux 5.13 (7 Jul 2021): https://bbs.archlinux.org/viewtopic.php?id=266108
- Issue linked (fixed, closed): https://gitlab.freedesktop.org/drm/amd/-/issues/1230

All mentioned issues have been patched, merged and reported as fixed.
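
To check whether a given kernel is at or past one of the fixed versions listed above, a rough sketch follows. `version_ge` is a hypothetical helper name, it relies on the `-V` (version sort) option of GNU `sort`, and it strips everything after the first `-`, so release candidates such as 5.12-rc3 are not handled correctly.

```shell
# Succeed when kernel version $1 is at least $2 under version ordering.
# Drops any "-arch1-1"-style suffix from $1 first; -rc tags are NOT handled.
version_ge() {
    have=${1%%-*}
    want=$2
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Possible usage on a running system:
#   version_ge "$(uname -r)" 5.11.12 && echo "ttm_bo_unpin fix should be present"
```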
