FS#69426 - [linux] r8169 : wol broken

Attached to Project: Arch Linux
Opened by hamelg (hamelg) - Sunday, 24 January 2021, 18:56 GMT
Last edited by Andreas Radke (AndyRTR) - Tuesday, 30 March 2021, 12:05 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To No-one
Architecture x86_64
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

I don't remember the last time I successfully used wake-on-LAN on my LAN adapter, but I have noticed that WoL is currently broken:
When the machine is powered down, the link light stays off and nothing happens when the adapter receives magic packets.
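
For readers who want to reproduce the test, the magic packet mentioned in the report can be built and sent as in the sketch below. It assumes the usual magic packet layout (six 0xFF bytes followed by the target MAC address repeated 16 times) sent as a UDP broadcast. The MAC address, the broadcast address and port 9 are illustrative assumptions, not values taken from this report.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte magic packet for a MAC such as 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the target MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the packet over UDP. Port 9 is a common convention, not a requirement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling `send_magic_packet("00:11:22:33:44:55")` (a placeholder MAC) from another machine on the same LAN should wake the target if WoL is armed.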

Additional info:
linux 5.10.9.arch1-1
22:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

Base Board Information
Manufacturer: Micro-Star International Co., Ltd
Product Name: B450 GAMING PLUS (MS-7B86)
I flashed the BIOS; it is up to date.

There are two workarounds that make WoL work:

Install linux-lts
or
Install r8168 package


Closed by  Andreas Radke (AndyRTR)
Tuesday, 30 March 2021, 12:05 GMT
Reason for closing:  Fixed
Additional comments about closing:  Fixed with linux 5.11
Comment by loqs (loqs) - Monday, 25 January 2021, 02:33 GMT
Using the ALA [1] can you determine which kernel introduced the issue?
Then you could bisect between the last working kernel and the first bad release to find the causal commit.

[1] https://wiki.archlinux.org/index.php/Arch_Linux_Archive
Comment by AK (Andreaskem) - Monday, 25 January 2021, 08:58 GMT
What does

cat /sys/class/net/<your adapter>/device/power/wakeup

print on your system? Does

# echo enabled > /sys/class/net/<your adapter>/device/power/wakeup

enable WoL for you?

So, is it completely broken or just disabled on newer kernels for some reason?
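
The sysfs check asked for in this comment can also be scripted. The sketch below reads and writes the same `device/power/wakeup` attribute; the function names and the `sysfs_root` parameter (only there so the code can be tried against a scratch directory) are assumptions made for illustration.

```python
from pathlib import Path


def _wakeup_path(iface: str, sysfs_root: str) -> Path:
    # Same attribute as /sys/class/net/<your adapter>/device/power/wakeup
    return Path(sysfs_root) / "class" / "net" / iface / "device" / "power" / "wakeup"


def wakeup_state(iface: str, sysfs_root: str = "/sys") -> str:
    """Return the attribute content, e.g. 'enabled' or 'disabled'."""
    return _wakeup_path(iface, sysfs_root).read_text().strip()


def enable_wakeup(iface: str, sysfs_root: str = "/sys") -> None:
    """Write 'enabled' to the attribute; this needs root on a real system."""
    _wakeup_path(iface, sysfs_root).write_text("enabled\n")
```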
Comment by hamelg (hamelg) - Monday, 25 January 2021, 10:03 GMT
$ cat /sys/class/net/eth0/device/power/wakeup
enabled

Yes, my setting is correct. It just works fine by installing either linux-lts or r8168 package.

I found nothing about my issue at https://bugzilla.kernel.org.
Comment by Heiner Kallweit (kalle) - Monday, 25 January 2021, 17:33 GMT
How did you configure WoL?
Does "ethtool <if>" list WoL as enabled?
Does WoL work after "ethtool -s <if> wol g"?
Comment by hamelg (hamelg) - Monday, 25 January 2021, 17:42 GMT
$ cat /etc/netctl/eth0
Connection=ethernet
Description='eth0'
Interface=eth0
IP=static
IP6=stateless
ExecUpPost='ethtool -s eth0 wol g'

# ethtool eth0|grep Wake
Supports Wake-on: pumbg
Wake-on: g
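
In the output above, `g` stands for wake on MagicPacket and `pumbg` lists the modes the adapter supports. A small sketch for checking this automatically follows; it assumes the two `Wake-on` lines look like the ones quoted here, and the function names are hypothetical.

```python
import re


def parse_wol(ethtool_output: str) -> dict:
    """Extract the supported and the currently active Wake-on flags."""
    supported = re.search(r"^[ \t]*Supports Wake-on:[ \t]*(\S+)", ethtool_output, re.MULTILINE)
    current = re.search(r"^[ \t]*Wake-on:[ \t]*(\S+)", ethtool_output, re.MULTILINE)
    return {
        "supported": supported.group(1) if supported else None,
        "current": current.group(1) if current else None,
    }


def magic_packet_enabled(ethtool_output: str) -> bool:
    """True if the active Wake-on flags include 'g' (MagicPacket)."""
    current = parse_wol(ethtool_output)["current"]
    return current is not None and "g" in current
```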
Comment by Heiner Kallweit (kalle) - Monday, 25 January 2021, 19:43 GMT
This looks ok. Does WoL work from "Suspend to RAM" (e.g. after a "systemctl suspend")?
This can make a difference because WoL from S5 needs proper BIOS support. On my test system WoL only works after a shutdown if I disable "Deep Sleep (S5)" in BIOS.
Behavior is the same for r8169 and r8168. Having said that at least a part of the issue may be system-dependent.
As proposed before you could do a bisect between the LTS kernel version and 5.10.9.
Comment by hamelg (hamelg) - Monday, 25 January 2021, 20:14 GMT
>> Does WoL work from "Suspend to RAM" (e.g. after a "systemctl suspend")?
Yes, it works from a Suspend to RAM.
My BIOS doesn't have the option to disable S5.
Comment by loqs (loqs) - Monday, 25 January 2021, 23:27 GMT
Pick a 5.7 release from the ALA. If that works, try 5.8, then 5.9 if 5.8 worked. If 5.7 fails, try 5.5; if that works, try 5.6.
Hopefully after three kernel installs and three tests you should have located which series introduced the issue.
After that we can look at what changed in r8169 for that release or bisect that release to find the causal commit.
Comment by hamelg (hamelg) - Tuesday, 26 January 2021, 19:09 GMT
5.9.13 : OK
5.9.14 : OK
5.10.1 : OK
5.10.2 : BROKEN
5.10.3 : BROKEN
...
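
The regression boundary can be read off such a result list mechanically. The sketch below sorts the tested versions numerically and returns the last good and first bad release; the function names are hypothetical, and the data in the usage example are the results listed above.

```python
def version_key(version: str) -> tuple:
    """Turn '5.10.2' into (5, 10, 2) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def find_regression(results: dict):
    """results maps version -> True if WoL worked.

    Returns (last_good, first_bad), or None if no good-to-bad transition exists.
    """
    ordered = sorted(results, key=version_key)
    for previous, current in zip(ordered, ordered[1:]):
        if results[previous] and not results[current]:
            return previous, current
    return None


tested = {"5.9.13": True, "5.9.14": True, "5.10.1": True, "5.10.2": False, "5.10.3": False}
print(find_regression(tested))
```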
Comment by loqs (loqs) - Tuesday, 26 January 2021, 19:54 GMT
Not seeing what in 5.10.2 [1] or [2] could have triggered the issue in 5.10.2-arch1.
There were a lot of config changes between 5.10.1.arch1-1 and 5.10.2.arch1-1 [3].
If you build 5.10.2-arch1 with the config from 5.10.1.arch1-1 does that work?

[1] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.2
[2] https://git.archlinux.org/linux.git/log/?h=v5.10.2-arch1
[3] https://github.com/archlinux/svntogit-packages/commits/8f8cb52701cf1f2adbe06c1f19158c63c1a636ca
Comment by AK (Andreaskem) - Tuesday, 26 January 2021, 19:59 GMT
The actual kernel changes in 5.10.2 seem to be pretty innocuous but there were quite a few changes in Arch's kernel configuration between 5.10.1 and 5.10.2. Maybe there is some collateral damage?
edit: See what loqs said.
Comment by hamelg (hamelg) - Wednesday, 27 January 2021, 16:38 GMT
I tested 5.10.2 with the config from 5.10.1.
And yes, wol works !
The breakage is somewhere in the new kernel options.
Comment by Heiner Kallweit (kalle) - Wednesday, 27 January 2021, 19:49 GMT
As next step you could check which of the config change commits breaks WoL.
I think it should be one of the following, the others look quite unrelated.

"Pick some configuration options from Fedora's default kernel"
"Disable CONFIG_EXPERT"
and maybe "Disable OpenFirmware support"
Comment by hamelg (hamelg) - Friday, 29 January 2021, 16:01 GMT
The culprit is
--
EDIT: Not sure, I double check ...
Comment by hamelg (hamelg) - Friday, 29 January 2021, 19:16 GMT
OK, double check done.
The culprit is

--
r404491 | heftig | 2020-12-19 00:32:00 +0100 (Sat, 19 Dec 2020) | 12 lines
Pick some configuration options from Fedora's default kernel

Any ideas about the faulty option ?
Comment by Heiner Kallweit (kalle) - Friday, 29 January 2021, 19:47 GMT
Not really. My best guess would be CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON. You could try to disable this option.
Alternatively you could bisect the config changes:
- apply first half of the changes and see whether issue persists or not
- if yes, then proceed with first half of the first half
- if not, then proceed with first half of the second half
- ..
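
The halving procedure described in this comment can be sketched as follows. It assumes exactly one changed option is responsible and that options do not interact. `is_broken_with` is a hypothetical callback standing for "build a kernel with only these changes applied and test WoL", and the `OPTION_*` names are placeholders.

```python
def bisect_culprit(options, is_broken_with):
    """Find the single option that breaks WoL by repeated halving."""
    candidates = list(options)
    if not candidates:
        raise ValueError("no options to bisect")
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if is_broken_with(first_half):
            # Issue persists: the culprit is in the first half.
            candidates = first_half
        else:
            # Issue gone: the culprit must be in the second half.
            candidates = candidates[middle:]
    return candidates[0]


# Usage with a simulated test instead of real kernel builds.
changed = ["OPTION_A", "OPTION_B", "DEBUG_SHIRQ", "OPTION_C", "OPTION_D"]
print(bisect_culprit(changed, lambda enabled: "DEBUG_SHIRQ" in enabled))
```

Each call to the callback corresponds to one kernel build and one shutdown test, so the number of builds grows with the logarithm of the number of changed options.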
Comment by hamelg (hamelg) - Friday, 29 January 2021, 20:08 GMT
My hardware is an AMD platform. Is CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON relevant?
Comment by Heiner Kallweit (kalle) - Friday, 29 January 2021, 20:38 GMT
Not sure. Just try whether it changes the behavior. You can also start directly with bisecting the changes.
Comment by hamelg (hamelg) - Saturday, 30 January 2021, 17:40 GMT
I found the faulty option : DEBUG_SHIRQ
NOT SET => WOL works
SET to y => WOL broken

I checked with the latest kernel (5.10.11), I get the same behavior.
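
To check whether a given kernel was built with this option, its config can be inspected. The sketch below parses the standard `CONFIG_NAME=value` and `# CONFIG_NAME is not set` lines. Reading `/proc/config.gz` is an assumption: the file only exists if the kernel was built with CONFIG_IKCONFIG_PROC. The function names are hypothetical.

```python
import gzip


def option_state(config_text: str, name: str):
    """Return the value of CONFIG_<name> ('y', 'm', ...), or None if it is not set."""
    key = f"CONFIG_{name}"
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(f"{key}="):
            return line.split("=", 1)[1]
        if line == f"# {key} is not set":
            return None
    return None  # option not mentioned at all


def running_kernel_config(path: str = "/proc/config.gz") -> str:
    """Read the running kernel's config, if the kernel exposes it."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

For example, `option_state(running_kernel_config(), "DEBUG_SHIRQ")` would return `"y"` on an affected build.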
Comment by Heiner Kallweit (kalle) - Saturday, 30 January 2021, 22:20 GMT
Thanks! Interesting .. I just wonder why nobody else faced this issue yet.
Can you apply the following patch to the kernel sources, then build the kernel and re-test with DEBUG_SHIRQ enabled?

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 7ee974793..08e63ecb1 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4544,8 +4544,10 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
 	}
 
-	rtl_irq_disable(tp);
-	napi_schedule(&tp->napi);
+	if (napi_schedule_prep(&tp->napi)) {
+		rtl_irq_disable(tp);
+		__napi_schedule(&tp->napi);
+	}
 out:
 	rtl_ack_events(tp, status);

--
2.30.0

Comment by Heiner Kallweit (kalle) - Sunday, 31 January 2021, 13:35 GMT
In addition, please also test with a 5.11-rc version. From 5.11 on, the interrupt is no longer requested as shared if it is an MSI(-X) interrupt, so the behavior should be independent of the DEBUG_SHIRQ config option.
Comment by hamelg (hamelg) - Sunday, 31 January 2021, 17:07 GMT
Thanks for the patch. I tested it with 5.10.11.
Unfortunately, it doesn't work. It has no effect with DEBUG_SHIRQ enabled.
Comment by Heiner Kallweit (kalle) - Sunday, 31 January 2021, 17:28 GMT
OK, but it doesn't come as a surprise. What's left is testing with a 5.11-rc version.
If this also doesn't help, then the root cause is not in the r8169 driver.
Comment by hamelg (hamelg) - Monday, 01 February 2021, 17:24 GMT
5.11 has fixed the issue, no patch needed.
Thanks for all :)
Comment by Heiner Kallweit (kalle) - Monday, 01 February 2021, 18:58 GMT
Thanks for the testing efforts and good that the issue is fixed with 5.11.
I just still have no clue what could be the root cause of the issue.
Comment by Heiner Kallweit (kalle) - Monday, 01 February 2021, 20:28 GMT
I think I got it. Subtle bug:
So far phy_disconnect() is called before free_irq().
If DEBUG_SHIRQ is set and irq is shared, then free_irq() creates an "artificial" interrupt.
The "link change" flag is set in the irq status register, as a consequence phy_suspend() is called.
Because the net_device is detached from the PHY already, the PHY driver can't recognize that WoL is set
and powers down the PHY.
The following should fix WoL also under 5.10 and prior kernel versions.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 4253d51a9..53c2079c7 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4667,10 +4667,10 @@ static int rtl8169_close(struct net_device *dev)

 	cancel_work_sync(&tp->wk.work);
 
-	phy_disconnect(tp->phydev);
-
 	free_irq(pci_irq_vector(pdev, 0), tp);
 
+	phy_disconnect(tp->phydev);
+
 	dma_free_coherent(&pdev->dev, R8169_RX_RING_BYTES, tp->RxDescArray,
 			  tp->RxPhyAddr);
 	dma_free_coherent(&pdev->dev, R8169_TX_RING_BYTES, tp->TxDescArray,
--
2.30.0
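
The ordering problem explained in this comment can be illustrated with a toy model. This is not kernel code: the class and function names are hypothetical and only mirror the kernel functions named above, and the model covers nothing beyond the sequence described in the comment (later comments show that behavior on the reporter's system was more complicated).

```python
class ToyPhy:
    """Hypothetical stand-in for the PHY; a simplification, not the kernel's logic."""

    def __init__(self, wol_enabled: bool):
        self.wol_enabled = wol_enabled  # WoL setting held by the net device
        self.attached = True            # is the net device still attached?
        self.powered = True

    def disconnect(self) -> None:
        # Mirrors phy_disconnect(): the net device is detached from the PHY.
        self.attached = False

    def suspend(self) -> None:
        # In this model the WoL setting is only visible via the attached net device.
        if self.attached and self.wol_enabled:
            return  # keep the PHY powered so it can see magic packets
        self.powered = False


def close_device(phy: ToyPhy, free_irq_first: bool, debug_shirq: bool = True) -> bool:
    """Run the two teardown steps in the given order; return True if the PHY stays powered."""

    def free_irq() -> None:
        # With DEBUG_SHIRQ and a shared IRQ, freeing the IRQ fires one artificial
        # interrupt; in this model that always ends in a suspend request.
        if debug_shirq:
            phy.suspend()

    if free_irq_first:
        free_irq()
        phy.disconnect()
    else:
        phy.disconnect()
        free_irq()
    return phy.powered


print(close_device(ToyPhy(wol_enabled=True), free_irq_first=False))  # old order
print(close_device(ToyPhy(wol_enabled=True), free_irq_first=True))   # patched order
```

In the model, the old order powers the PHY down and the patched order keeps it up, which is the effect the patch aims for.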

Comment by hamelg (hamelg) - Tuesday, 02 February 2021, 17:46 GMT
I tested the last patch with 5.10 and odd things happen.
It doesn't work every time, and when it works the console displays a GPF message in module r8169 just before powering down.
It doesn't matter; we can wait for the imminent 5.11.
Comment by Heiner Kallweit (kalle) - Tuesday, 02 February 2021, 18:21 GMT
I see. Can you post the error? Should be accessible by e.g. journalctl.
Comment by hamelg (hamelg) - Tuesday, 02 February 2021, 19:21 GMT
No, journald is stopped when the error happens.
It's the last message displayed on the console before powering down and it stays on screen less than 1 second.
I rebooted 5 times and I have not been able to get the GPF error message again.
Comment by hamelg (hamelg) - Wednesday, 24 February 2021, 16:37 GMT
As the bug is still open, I am adding some clues.
I notice that WoL sometimes works.
The patch makes no difference, except that I don't see the GPF error without it.
It seems the bug is triggered by a race condition.
If I run the kernel with the parameter maxcpus=1, WoL works every time.
Comment by Heiner Kallweit (kalle) - Wednesday, 24 February 2021, 21:52 GMT
To get the GPF trace you could add a delay at the end of rtl8169_close(), e.g. ssleep(10), and take a picture of the error message.
Comment by hamelg (hamelg) - Thursday, 25 February 2021, 20:17 GMT
I have just added the ssleep(10) with linux 5.10.16.
Now the FAULT errors appear in the journal.

Feb 25 20:58:23 xxxxxxxx systemd[1]: Stopping eth0...
Feb 25 20:58:23 xxxxxxxx network[2424]: Stopping network profile 'eth0'...
Feb 25 20:58:23 xxxxxxxx kernel: r8169 0000:22:00.0 eth0: Link is Down
Feb 25 20:58:27 xxxxxxxx kernel: r8169 0000:22:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xdf602c50 flags=0x0020]
Feb 25 20:58:27 xxxxxxxx kernel: r8169 0000:22:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xdf602c60 flags=0x0000]
Feb 25 20:58:27 xxxxxxxx kernel: r8169 0000:22:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xde6b4000 flags=0x0020]
Feb 25 20:58:33 xxxxxxxx network[2424]: Stopped network profile 'eth0'
Feb 25 20:58:33 xxxxxxxx systemd[1]: netctl@eth0.service: Succeeded.
Feb 25 20:58:33 xxxxxxxx systemd[1]: Stopped eth0.
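
To compare runs, the IO_PAGE_FAULT events can be pulled out of a journal extract like the one above. The sketch below assumes the exact message format shown in this log (other kernel versions may format these events differently); the function name is hypothetical.

```python
import re

# Matches lines such as:
#   r8169 0000:22:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xdf602c50 flags=0x0020]
FAULT_RE = re.compile(
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AMD-Vi: Event logged "
    r"\[IO_PAGE_FAULT domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_page_faults(journal_text: str) -> list:
    """Return one dict (device, domain, address, flags) per IO_PAGE_FAULT event."""
    return [match.groupdict() for match in FAULT_RE.finditer(journal_text)]
```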
Comment by Heiner Kallweit (kalle) - Thursday, 25 February 2021, 20:44 GMT
Thanks. You'll find lots of reports regarding AMD IOMMU errors. I can't see any obvious problem in r8169, so it may be an AMD IOMMU issue. Try to switch it off in BIOS.
Comment by hamelg (hamelg) - Thursday, 25 February 2021, 21:43 GMT
It makes no difference with IOMMU switched off in BIOS.
I get the same errors with linux 5.11.1. Perhaps these errors have always been there, but I didn't notice them until now.
The difference is with 5.11 wol works every time, but not with 5.10.
In fact, WOL works only when these errors are displayed.
Comment by Heiner Kallweit (kalle) - Thursday, 25 February 2021, 22:03 GMT
You could try updating the BIOS. Maybe the AMD IOMMU issue is a BIOS bug.
Comment by hamelg (hamelg) - Friday, 26 February 2021, 07:17 GMT
My BIOS is up to date.
The IO_PAGE_FAULT error was not happening before the commit:
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
Looking at the journal log extract, how can it be explained that the FAULT happens 5s after calling ssleep(10)?
Comment by Heiner Kallweit (kalle) - Friday, 26 February 2021, 07:34 GMT
The IO_PAGE_FAULT occurring after this commit doesn't necessarily mean it's wrong, it could have triggered some issue in the AMD IOMMU driver.
No clue where the system is spending these 5s, I don't see this behavior on my Intel-based systems.
Comment by Heiner Kallweit (kalle) - Friday, 26 February 2021, 08:24 GMT
As a further idea, could you please check whether the following makes a difference:

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 1be07e45d..2be1736aa 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1168,7 +1168,8 @@ void phy_state_machine(struct work_struct *work)
 void phy_mac_interrupt(struct phy_device *phydev)
 {
 	/* Trigger a state machine change */
-	phy_trigger_machine(phydev);
+	if (phy_is_started(phydev))
+		phy_trigger_machine(phydev);
 }
 EXPORT_SYMBOL(phy_mac_interrupt);

--
2.30.1

Comment by hamelg (hamelg) - Friday, 26 February 2021, 19:14 GMT
Thanks for the patch.
I tested it and unfortunately the behavior stays identical.
Comment by Heiner Kallweit (kalle) - Saturday, 27 February 2021, 13:39 GMT
Any change in behavior if you set kernel options iommu=soft or iommu=pt ?
Comment by hamelg (hamelg) - Saturday, 27 February 2021, 17:07 GMT
Here are the test results:

5.10.16: works 1 time out of 10; when it works, iommu=soft or iommu=pt fixes the FAULT (no error message)
5.10.16 with ssleep(10): works every time, with IO_PAGE_FAULT messages; iommu=soft or iommu=pt fixes the FAULT (no error message)
5.11.1: works every time, with IO_PAGE_FAULT messages; iommu=soft or iommu=pt fixes the FAULT (no error message)
Comment by Heiner Kallweit (kalle) - Sunday, 28 February 2021, 16:04 GMT
In order to get an idea what could race with closing the network device: Which network management tool are you using (e.g. netctl, systemd-networkd, NetworkManager)?
Ah, just see it. According to an earlier comment you're using netctl.
Comment by Heiner Kallweit (kalle) - Monday, 22 March 2021, 10:52 GMT
The following fix, which just made it into linux-next, may also help in your case:
f658b90977d2 ("r8169: fix DMA being used after buffer free if WoL is enabled")
It will be backported to 5.10 and 5.11, but this may take 1-2 weeks.
