FS#75138 - Suspend can hang if ath9k wireless interface is up but un-associated

Attached to Project: Arch Linux
Opened by Greg Erwin (Odysseus) - Wednesday, 22 June 2022, 04:08 GMT
Last edited by Toolybird (Toolybird) - Sunday, 13 November 2022, 05:33 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Jan Alexander Steffens (heftig)
David Runge (dvzrv)
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Suspend can hang if ath9k wireless interface is up but un-associated

Description:

Starting from kernel version 5.18.arch1-1 and onwards, I typically experience long delays in the suspend process. It can be anywhere from a few seconds to upwards of two minutes.

Additional info:

After including 'initcall_debug' to my kernel cmdline, I see the following sequence of journal messages when the hang occurs:
[ 1290.501909] ieee80211 phy0: PM: calling wiphy_suspend+0x0/0x2b0 [cfg80211] @ 86, parent: 0000:06:00.0
...
[ 1294.038295] queueing ieee80211 work while going to suspend
[ 1348.011752] hwrng: no data available
[ 1358.043585] ieee80211 phy0: PM: wiphy_suspend+0x0/0x2b0 [cfg80211] returned 0 after 67541642 usecs

Additionally, the 'hwrng: no data available' message appears in the log sporadically at other times wile the PC is running normally.
In particular, the suspend hang will not occur until the first time the message is reported in the log.

In contrast, a normal suspend process on kernel version 5.17.9.arch1-1 results in this sequence:
[ 28.654797] ieee80211 phy0: PM: calling wiphy_suspend+0x0/0x2b0 [cfg80211] @ 86, parent: 0000:06:00.0
...
[ 28.663307] ieee80211 phy0: PM: wiphy_suspend+0x0/0x2b0 [cfg80211] returned 0 after 8476 usecs

With effort, I bisected the issue down to two linux kernel commits that are both necessary to trigger the suspend hang:
[fcd09c90c3c5254b18ef34e30c57c65d34290a84] ath9k: use hw_random API instead of directly dumping into random.c
[c321e907aa4803d562d6e70ebed9444ad082f953] random: continually use hwgenerator randomness

I have a 'Qualcomm Atheros AR93xx Wireless Network Adapter' installed which is typically not in use and remains un-associated with any AP. While the interface is up but un-associated, the hardware random number generator provided by the ath9k driver will block and not yield any data. It is not until I start the wpa_supplicant service that 'dd if=/dev/hwrng count=1' will unblock and return data. While connected to wifi, I never see the 'hwrng: no data available' message or experience the suspend hang.

Similarly, if I explicitly set 'ip link wlp6s0 down' then /sys/class/misc/hw_random/rng_available is empty, and instead of blocking, 'dd if=/devv/hwrng count=1' returns 'dd: error reading '/dev/hwrng': No such device'. While the interface is down, I do not get any "hwrng: no data available" log messages or suspend hangs.

I am guessing that it is this available but blocking behavior of the ath9k hardware number generator that is problematic.

The ath9k driver is the only provider of a hwrng in my system. I would assume that having an additional hwrng available, like a TPM chip or a modern CPU would prevent this issue from triggering.

Steps to reproduce:

0) Have a wireless network adapter that uses the ath9k driver.
1) Install a kernel that includes the two commits called out above, such as 5.18.arch1-1 or later.
2) Include 'initcall_debug' in your kernel cmdline.
3) Disconnect the wireless interface, but leave it up.
4) Look in /sys/class/misc/hw_random/rng_available to blacklist or rmmod any drivers that provide a hwrng other than ath9k.
5) Wait for "demsg -w | grep hwrng" to show "hwrng: no data available".
6) Suspend.
7) The system will go unresponsive for a long period of time before finally entering the suspended state.
8) Resume.
9) Confirm the hang with 'dmesg | grep wiphy_suspend' which will show it returned 0 after a few million usecs.


This task depends upon

Closed by  Toolybird (Toolybird)
Sunday, 13 November 2022, 05:33 GMT
Reason for closing:  Fixed
Additional comments about closing:  Reporter says "Patched upstream."
Comment by loqs (loqs) - Wednesday, 22 June 2022, 18:02 GMT
This matches the findings from https://bbs.archlinux.org/viewtopic.php?id=277097
See also  FS#74899 

Have you contacted the upstream kernel developers?
Comment by Greg Erwin (Odysseus) - Wednesday, 22 June 2022, 22:47 GMT
No, this is the first and only place I've reported it.

Interesting how this seems to manifest differently for others.
I can't recall ever seeing my system hang except when entering suspend.
Comment by loqs (loqs) - Thursday, 23 June 2022, 00:22 GMT
I would suggest contacting Jason A. Donenfeld <Jason@zx2c4.com> not sure which list you would CC could be any of:
ath9k-devel <ath9k-devel@qca.qualcomm.com> "linux-wireless@vger.kernel.org" <linux-wireless@vger.kernel.org> "Linux Crypto Mailing List" <linux-crypto@vger.kernel.org>
Edit:
rereading https://lore.kernel.org/all/YgykjbhgdqeYyiY5%40owl.dominikbrodowski.net/ this analysis seems to match the issue.
Edit2:
That was on v2 and was addressed in v3.
Comment by Greg Erwin (Odysseus) - Thursday, 23 June 2022, 00:32 GMT
I'll give it a shot.

By the way, I went back and spent more time with a kernel that only includes the "ath9k: use hw_random API" commit, and was eventually able to reproduce the suspend hang. In this case, it took an hour for the first "hwrng: no data available" message to appear. I think the "random: continually use hwgenerator randomness" commit simply makes the issue more probable to occur.

Comment by David Haller (davidhaller) - Monday, 11 July 2022, 17:32 GMT
Thank you very much for reporting this! I'm also affected by this bug; now I know how to workaround it. The built-in wifi on my machine is not in use anyway.
Comment by Toolybird (Toolybird) - Monday, 01 August 2022, 07:37 GMT
Good work on the bisection.

> I'll give it a shot

Any luck?
Comment by Greg Erwin (Odysseus) - Tuesday, 02 August 2022, 23:26 GMT
There has been a fair bit of discussion on the mailing lists over this.
It looks like version 12 of the patch started by Jason Donenfeld and modified by Herbert Xu will get picked up.

I think these are all the relevant threads on LKML:
https://bit.ly/3PZmijo (My initial post, Jason's patches v1-v6)
https://bit.ly/3zSG4aD (Jason's patches v7-v10)
https://bit.ly/3JsaT97 (Herbert Xu's additions v11-v12)
Comment by loqs (loqs) - Wednesday, 31 August 2022, 22:16 GMT
You might want to add regzbot [1] or mention the issue on Linux kernel regressions mailing list to try and get the fix included in 6.0.

[1] https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
Comment by Ho Tuan Kiet (tuankiet65) - Sunday, 11 September 2022, 18:46 GMT
Has this been fixed in recent kernel releases? Doesn't seem to be the case. Is there anything I can do to advocate for it?
Comment by loqs (loqs) - Sunday, 11 September 2022, 19:21 GMT
The fix is now in the crypto tree [1]. I do not know when it will reach mainline / stable.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git/commit/?id=36cb6494429bd64b27b7ff8b4af56f8e526da2b4
Comment by Greg Erwin (Odysseus) - Saturday, 12 November 2022, 07:19 GMT

Loading...