FS#69426 - [linux] r8169 : wol broken
Attached to Project:
Arch Linux
Opened by hamelg (hamelg) - Sunday, 24 January 2021, 18:56 GMT
Last edited by Andreas Radke (AndyRTR) - Tuesday, 30 March 2021, 12:05 GMT
Details
I don't remember the last time wake-on-LAN worked on my LAN adapter, but I have noticed that WoL is currently broken:
When the machine is powered down, the link light stays off and nothing happens when the adapter receives magic packets.

Additional info:
linux 5.10.9.arch1-1
22:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
Base Board Information
Manufacturer: Micro-Star International Co., Ltd
Product Name: B450 GAMING PLUS (MS-7B86)

I flashed the BIOS. It is up to date.

There are 2 workarounds to get WoL working:
- Install linux-lts, or
- Install the r8168 package
Closed by Andreas Radke (AndyRTR)
Tuesday, 30 March 2021, 12:05 GMT
Reason for closing: Fixed
Additional comments about closing: Fixed with linux 5.11
Then you could bisect between the last working kernel and the first bad release to find the causal commit.
[1] https://wiki.archlinux.org/index.php/Arch_Linux_Archive
What does
cat /sys/class/net/<your adapter>/device/power/wakeup
print on your system? Does
# echo enabled > /sys/class/net/<your adapter>/device/power/wakeup
enable WoL for you?
So, is it completely broken or just disabled on newer kernels for some reason?
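For reference, a minimal sketch of that sysfs check as a shell helper. The function name wakeup_state and its optional second argument (an alternate sysfs root, only useful for dry runs) are made up for illustration; the path is the one quoted above.

```shell
# wakeup_state IFACE [SYSFS_ROOT]
# Prints the wakeup state of the device behind IFACE ("enabled" or
# "disabled" on a real system), or "unknown" if the attribute is missing.
wakeup_state() {
    iface=$1
    root=${2:-/sys}   # hypothetical override; defaults to the real sysfs
    file="$root/class/net/$iface/device/power/wakeup"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unknown"
        return 1
    fi
}
```

Something like "wakeup_state eth0" should then print the current state; changing it with echo still needs root.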
enabled
Yes, my setting is correct. It just works fine by installing either linux-lts or r8168 package.
I found nothing about my issue at https://bugzilla.kernel.org.
Does "ethtool <if>" list WoL as enabled?
Does WoL work after "ethtool -s <if> wol g"?
Connection=ethernet
Description='eth0'
Interface=eth0
IP=static
IP6=stateless
ExecUpPost='ethtool -s eth0 wol g'
# ethtool eth0|grep Wake
Supports Wake-on: pumbg
Wake-on: g
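If you want to script this check, here is a rough sketch that reads the ethtool output on stdin and succeeds only when the active "Wake-on:" line contains the g (magic packet) flag. The function name wol_magic_enabled is an assumption for illustration, not part of ethtool.

```shell
# wol_magic_enabled: reads `ethtool IFACE` output on stdin.
# Exit status 0 if the "Wake-on:" line (not "Supports Wake-on:")
# includes the "g" flag, non-zero otherwise.
wol_magic_enabled() {
    awk '
        $1 == "Wake-on:" { found = 1; if (index($2, "g") > 0) ok = 1 }
        END { exit !(found && ok) }
    '
}

# Usage sketch (needs ethtool and a real interface):
#   ethtool eth0 | wol_magic_enabled && echo "magic-packet wake is armed"
```

The "Supports Wake-on:" line is skipped on purpose, because it only lists what the hardware can do, not what is currently set.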
This can make a difference because WoL from S5 needs proper BIOS support. On my test system WoL only works after a shutdown if I disable "Deep Sleep (S5)" in BIOS.
Behavior is the same for r8169 and r8168. Having said that, at least part of the issue may be system-dependent.
As proposed before, you could bisect between the LTS kernel version and 5.10.9.
Yes, it works from a Suspend to RAM.
My BIOS doesn't have the option to disable S5.
Hopefully after three kernel installs and three tests you should have located which series introduced the issue.
After that we can look at what changed in r8169 for that release or bisect that release to find the causal commit.
5.9.14 : OK
5.10.1 : OK
5.10.2 : BROKEN
5.10.3 : BROKEN
...
There were a lot of config changes between 5.10.1.arch1-1 and 5.10.2.arch1-1 [3].
If you build 5.10.2-arch1 with the config from 5.10.1.arch1-1, does that work?
[1] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.2
[2] https://git.archlinux.org/linux.git/log/?h=v5.10.2-arch1
[3] https://github.com/archlinux/svntogit-packages/commits/8f8cb52701cf1f2adbe06c1f19158c63c1a636ca
edit: See what loqs said.
And yes, wol works !
The breakage is somewhere in the new kernel options.
I think it should be one of the following, the others look quite unrelated.
"Pick some configuration options from Fedora's default kernel"
"Disable CONFIG_EXPERT"
and maybe "Disable OpenFirmware support"
--
EDIT: Not sure, I will double-check ...
The culprit is
--
r404491 | heftig | 2020-12-19 00:32:00 +0100 (Sat, 19 Dec 2020) | 12 lines
Pick some configuration options from Fedora's default kernel
Any idea which option is at fault?
Alternatively you could bisect the config changes:
- apply first half of the changes and see whether issue persists or not
- if yes, then proceed with first half of the first half
- if not, then proceed with first half of the second half
- ..
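The halving steps above can be sketched as a small shell loop. This is only an illustration under two assumptions: exactly one changed option causes the issue, and the changes can be applied independently of each other. The names bisect_changes, the changes file (one changed option per line) and the test command are hypothetical. In practice the test is manual: build a kernel with the listed changes, shut down, send a magic packet, and report whether WoL is still broken.

```shell
# bisect_changes CHANGES_FILE TEST_CMD
# CHANGES_FILE: one config change per line.
# TEST_CMD: command or function called with a file holding the subset of
#           changes to apply; must return 0 if the issue persists with
#           that subset, non-zero if WoL works.
# Prints the single change that triggers the issue.
bisect_changes() {
    changes=$1
    test_cmd=$2
    lo=1
    hi=$(( $(wc -l < "$changes") ))
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        # Apply only the first half of the remaining candidates.
        sed -n "${lo},${mid}p" "$changes" > "$changes.try"
        if "$test_cmd" "$changes.try"; then
            hi=$mid              # issue persists: culprit is in this half
        else
            lo=$(( mid + 1 ))    # issue gone: culprit is in the other half
        fi
    done
    sed -n "${lo}p" "$changes"
    rm -f "$changes.try"
}
```

With N changed options this needs roughly log2(N) kernel builds, which is why it is worth doing even though each round is slow.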
NOT SET => WOL works
SET to y => WOL broken
I checked with the latest kernel (5.10.11) and I get the same behavior.
Can you apply the following patch to the kernel sources, then build the kernel and re-test with DEBUG_SHIRQ enabled?
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 7ee974793..08e63ecb1 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4544,8 +4544,10 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
}
- rtl_irq_disable(tp);
- napi_schedule(&tp->napi);
+ if (napi_schedule_prep(&tp->napi)) {
+ rtl_irq_disable(tp);
+ __napi_schedule(&tp->napi);
+ }
out:
rtl_ack_events(tp, status);
--
2.30.0
Unfortunately, it doesn't work. It has no effect with DEBUG_SHIRQ enabled.
If this also doesn't help, then the root cause is not in the r8169 driver.
Thanks for everything :)
I just still have no clue what could be the root cause of the issue.
So far phy_disconnect() is called before free_irq().
If DEBUG_SHIRQ is set and irq is shared, then free_irq() creates an "artificial" interrupt.
The "link change" flag is set in the irq status register, as a consequence phy_suspend() is called.
Because the net_device is detached from the PHY already, the PHY driver can't recognize that WoL is set
and powers down the PHY.
The following should also fix WoL under 5.10 and earlier kernel versions.
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 4253d51a9..53c2079c7 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4667,10 +4667,10 @@ static int rtl8169_close(struct net_device *dev)
cancel_work_sync(&tp->wk.work);
- phy_disconnect(tp->phydev);
-
free_irq(pci_irq_vector(pdev, 0), tp);
+ phy_disconnect(tp->phydev);
+
dma_free_coherent(&pdev->dev, R8169_RX_RING_BYTES, tp->RxDescArray,
tp->RxPhyAddr);
dma_free_coherent(&pdev->dev, R8169_TX_RING_BYTES, tp->TxDescArray,
--
2.30.0
It doesn't work every time, and when it works the console displays a GPF message in module r8169 just before powering down.
It doesn't matter, we can wait for the imminent 5.11.
It's the last message displayed on the console before powering down and it stays on screen less than 1 second.
I rebooted 5 times and I have not been able to get the GPF error message again.
I notice that sometimes wol works.
The patch makes no difference, except that I don't see the GPF error without it.
It seems the bug is triggered by a race condition.
If I run the kernel with the parameter maxcpus=1, wol works every time.
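When repeating such tests it is easy to lose track of which boot had which parameter. A small sketch to confirm that the running kernel was really booted with a given parameter; the helper name has_kernel_param and the optional file argument are assumptions for illustration.

```shell
# has_kernel_param PARAM [CMDLINE_FILE]
# Exit status 0 if PARAM (e.g. "maxcpus=1") appears as a whole word
# on the kernel command line, non-zero otherwise.
has_kernel_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}   # hypothetical override for dry runs
    tr ' ' '\n' < "$cmdline_file" | grep -Fqx -- "$param"
}

# Usage sketch:
#   has_kernel_param maxcpus=1 && echo "single-CPU boot"
```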
Now, the FAULT errors appear in the journal.
Feb 25 20:58:23 xxxxxxxx systemd[1]: Stopping eth0...
Feb 25 20:58:23 xxxxxxxx network[2424]: Stopping network profile 'eth0'...
Feb 25 20:58:23 xxxxxxxx kernel: r8169 0000:22:00.0 eth0: Link is Down
Feb 25 20:58:27 xxxxxxxx kernel: r8169 0000:22:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xdf602c50 flags=0x0020]
Feb 25 20:58:27 xxxxxxxx kernel: r8169 0000:22:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xdf602c60 flags=0x0000]
Feb 25 20:58:27 xxxxxxxx kernel: r8169 0000:22:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xde6b4000 flags=0x0020]
Feb 25 20:58:33 xxxxxxxx network[2424]: Stopped network profile 'eth0'
Feb 25 20:58:33 xxxxxxxx systemd[1]: netctl@eth0.service: Succeeded.
Feb 25 20:58:33 xxxxxxxx systemd[1]: Stopped eth0.
I get the same errors with linux 5.11.1. Perhaps these errors have always been there, but I didn't notice them until now.
The difference is with 5.11 wol works every time, but not with 5.10.
In fact, WOL works only when these errors are displayed.
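To correlate WoL results with these faults across several shutdowns, the journal can be filtered per device. A minimal sketch, assuming a persistent journal so that the previous boot is still available; the function name count_io_page_faults is made up for illustration.

```shell
# count_io_page_faults PCI_DEV
# Reads kernel log lines on stdin and prints how many of them mention
# both PCI_DEV (e.g. 0000:22:00.0) and IO_PAGE_FAULT.
count_io_page_faults() {
    awk -v dev="$1" '
        index($0, dev) && index($0, "IO_PAGE_FAULT") { n++ }
        END { print n + 0 }
    '
}

# Usage sketch (kernel messages of the previous boot):
#   journalctl -k -b -1 | count_io_page_faults 0000:22:00.0
```

A non-zero count for a boot where WoL worked, and zero where it did not, would match the observation above.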
The IO_PAGE_FAULT error was not happening before the commit:
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
If you look at the journal log extract, how can we explain that the FAULT happens 5s after calling sleep(10)?
No clue where the system is spending these 5s, I don't see this behavior on my Intel-based systems.
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 1be07e45d..2be1736aa 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1168,7 +1168,8 @@ void phy_state_machine(struct work_struct *work)
void phy_mac_interrupt(struct phy_device *phydev)
{
/* Trigger a state machine change */
- phy_trigger_machine(phydev);
+ if (phy_is_started(phydev))
+ phy_trigger_machine(phydev);
}
EXPORT_SYMBOL(phy_mac_interrupt);
--
2.30.1
I tested it and unfortunately the behavior stays identical.
5.10.16 : works 1 time in 10; when it works, iommu=soft or pt fixes the FAULT : no error message
5.10.16 with ssleep(10) : works every time with IO_PAGE_FAULT message, iommu=soft or pt fixes the FAULT : no error message
5.11.1 : works every time with IO_PAGE_FAULT message, iommu=soft or pt fixes the FAULT : no error message
Ah, I just saw it. According to an earlier comment, you're using netctl.
f658b90977d2 ("r8169: fix DMA being used after buffer free if WoL is enabled")
It will be backported to 5.10 and 5.11, but this may take 1-2 weeks.