FS#54922 - [linux] 4.12 - bonding module not working with wireless

Attached to Project: Arch Linux
Opened by James (thx1138) - Monday, 24 July 2017, 19:02 GMT
Last edited by Evangelos Foutras (foutrelis) - Sunday, 20 August 2017, 06:39 GMT
Task Type Bug Report
Category Packages: Testing
Status Closed
Assigned To Tobias Powalowski (tpowa)
Jan Alexander Steffens (heftig)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 4
Private No

Details

Going from linux 4.11 to 4.12, currently 4.12.3-1, the bonding module received a patch,
[next] bonding: fix active-backup transition
https://patchwork.ozlabs.org/patch/746683/

which requires correct reporting of link speed using the kernel net/core/ethtool.c
__ethtool_get_link_ksettings()
presumably giving an error at
581 err = dev->ethtool_ops->get_settings(dev, &cmd);

Apparently, this function does not play nicely with wireless drivers, at least for the Atheros ath5k and ath9k, and for several Realtek wireless drivers I have tried. The consequences are two-fold.

Note:
drivers/net/bonding/bond_main.c

if (bond_update_speed_duplex(slave)) {
    slave->link = BOND_LINK_DOWN;
    netdev_warn(bond->dev,
                "failed to get link speed/duplex for %s\n",
                slave->dev->name);
    continue;
}

1) The wireless drivers work perfectly well on their own, and wired network interfaces continue to work with the bonding module. A wireless interface enslaved to a bond, however, is put into the "down" state and will not work with the bonding module.

2) Apparently, this "bond_update_speed_duplex(slave)" check executes 10 times per second, so a) the log file is continuously "spammed" with "failed to get link speed/duplex for blah" warnings, and b) these messages may also be sent to the console at the same rate, effectively creating a "Denial of Service" at the console. A remote terminal is then needed to reconfigure networking and remove the wireless slave from the bond.
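
As a rough way to quantify the log spam, the warnings can be counted per bond and slave. The following is only a minimal Python sketch: the function name count_speed_warnings is made up for illustration, and the regular expression assumes log lines shaped like "bond0: failed to get link speed/duplex for wlan0", which is the format the netdev_warn() call quoted above appears to produce.

```python
import re
from collections import Counter

# Assumed message shape: "<bond>: failed to get link speed/duplex for <slave>".
# Any prefix (timestamp, host name, "kernel:") before the bond name is ignored.
WARNING = re.compile(r"(\S+): failed to get link speed/duplex for (\S+)")


def count_speed_warnings(log_lines):
    """Count speed/duplex warnings per (bond, slave) pair.

    log_lines is any iterable of strings, for example the lines of a
    saved kernel log. Lines that do not match the pattern are skipped.
    """
    counts = Counter()
    for line in log_lines:
        match = WARNING.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

The function could be fed the lines of a saved kernel log (for instance the output of "journalctl -k" written to a file). Dividing a count by the time span the log covers gives an approximate message rate to compare against the 10 per second mentioned above.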

The problem has been communicated privately to the bonding module developers:
Andy Gospodarek <andy@greyhouse.net>
Mahesh Bandewar <mahesh@bandewar.net>
Thomas Davis <tadavis@lbl.gov>

I am not certain whether to blame the kernel ethtool or the wireless drivers for the "get_settings()" error, but the bonding module can be blamed for spamming the log file.
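
One way to see what the kernel reports for a given interface, without involving the bonding module at all, is to read the "speed" and "duplex" attributes under /sys/class/net/<interface>/. Below is a minimal Python sketch. It assumes those attributes are exposed on the running kernel, and it treats an unreadable attribute, a non-positive speed, or a duplex other than "half" or "full" as "not reported". The helper names and that interpretation are illustrative assumptions, not something taken from the bonding code.

```python
from pathlib import Path


def read_link_attr(iface, attr, sysfs_root="/sys/class/net"):
    """Return a sysfs attribute as a stripped string, or None if unreadable."""
    try:
        return (Path(sysfs_root) / iface / attr).read_text().strip()
    except OSError:
        # Covers a missing file as well as a read that the kernel rejects.
        return None


def reports_speed_duplex(iface, sysfs_root="/sys/class/net"):
    """Return True only if both speed and duplex look usable (assumed criteria)."""
    speed = read_link_attr(iface, "speed", sysfs_root)
    duplex = read_link_attr(iface, "duplex", sysfs_root)
    if speed is None or duplex is None:
        return False
    try:
        speed_ok = int(speed) > 0  # an unknown speed is assumed to show up as -1
    except ValueError:
        speed_ok = False
    return speed_ok and duplex in ("half", "full")
```

Calling reports_speed_duplex("wlan0") and reports_speed_duplex("eth0") on an affected machine would show whether the wireless interface differs from the wired one in this respect. The sysfs_root parameter exists only so the function can be exercised against a scratch directory.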

No course of action has yet been determined. For the moment, the options are:
1) revert the patch, or
2) downgrade the kernel

Closed by  Evangelos Foutras (foutrelis)
Sunday, 20 August 2017, 06:39 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 4.12.8-2
Comment by loqs (loqs) - Monday, 24 July 2017, 19:38 GMT
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit?id=3f3c278c94dd994fe0d9f21679ae19b9c0a55292
Edit:
This commit is a fix for another issue, so reverting it will reintroduce that issue, correct?
Is the issue still present in linux 4.13-rc2? Is there any public discussion of this issue / bug report?
Comment by James (thx1138) - Monday, 24 July 2017, 21:16 GMT
Yes, it would reintroduce that issue. But then, how important is that issue? And how important is breaking wireless bonding, or DoS'ing the console?

Andy commented, privately:
"To me it's a bit of an interesting problem both technically and
politically. Mahesh's patches were written to address issues where
link-speed could not be calculated/collected and this was (I'm
guessing) causing issues with 802.3ad mode (4) since link-speed is
used to choose the active aggregator."

Clearly, the problem has not been thought through entirely. I have also seen other problems where wireless connection speeds are reported incorrectly or not at all, both by wireless utilities and by the bonding module when it tries to determine the "better" network interface with "primary_reselect" set to "better". There may be a more general problem with wireless drivers failing to report connection speeds properly.

As far as I know, no additional work has been done on the bonding module since April 2017 to address this issue.

There is not yet any public discussion or bug report, mainly because I do not know where else to report this, and the developers have not suggested any venue. So far, only Andy has responded to my emails. And, I have not had much luck using the LKML as a general forum for these kinds of issues. I did send a note to Matthew Wilcox <matthew@wil.cx>, the name listed in net/core/ethtool.c, asking about the relationship between the kernel ethtool and these wireless drivers, but I don't know if Matthew is still involved, since the original date for ethtool.c was 2003.
Comment by loqs (loqs) - Monday, 24 July 2017, 21:55 GMT
What about Signed-off-by: David S. Miller <davem@davemloft.net> the maintainer of the network subsystem? or https://bugzilla.kernel.org
I am not sure Arch will do a revert that fixes one thing but breaks another when neither the upstream linux-stable queue nor its own upstream, linux-mainline, has taken it, and upstream has given no position on what course of action should be taken.
Third option would be disable bonding for the 4.12 series as 4.11 is now EOL until the issue is resolved upstream.
Comment by James (thx1138) - Monday, 24 July 2017, 22:30 GMT
> What about Signed-off-by: David S. Miller...

Andy commented:
"Unfortunately I'm not sure how many are really using bonding and
wireless with this driver, so this might not be a case that has been
tested much."

How many people have automatic wired and wireless switching on their laptops? I like it, but I had to build a custom solution to make it work. So maybe not much testing.

> Third option would be disable bonding for the 4.12 series as 4.11 is now EOL until the issue is resolved upstream.

That may be the most practical. At least, I wanted to make a note about the issue, in case anyone else is running into this.

I can update this thread when I hear back more from the developers.
Comment by Giancarlo Razzolini (grazzolini) - Monday, 31 July 2017, 00:26 GMT
This happens to me too. My card is an Intel 7265 and, judging by the patches, this would happen in any case where a wired interface is bonded with a wireless one. The wireless interface is fully functional; it's just the bonding module that spams the logs. I haven't noticed any issue with the functionality, just this annoying spam. Maybe it is time for me to revisit NetworkManager with teamd.
Comment by Michael Gwin (oksijun) - Monday, 31 July 2017, 09:09 GMT
With an "Intel Corporation Centrino Ultimate-N 6300", the interface is not functional. Logs are filled with "kernel: bond0: failed to get link speed/duplex for wlan0" messages, and there is no connectivity (DHCP configuration fails and manual IP configuration does not help).
Comment by Giancarlo Razzolini (grazzolini) - Monday, 31 July 2017, 11:33 GMT
I was too quick to claim that it works. The wireless interface itself does work and does connect to networks. But in the context of the bond, as soon as the wireless interface is the only active one, the bonding interface itself goes down and the log is spammed with this error. So the wireless interface is functional; it is the bonding interface that isn't.
Comment by James (thx1138) - Monday, 31 July 2017, 13:33 GMT
> What about Signed-off-by: David S. Miller <davem@davemloft.net> the maintainer of the network subsystem? or https://bugzilla.kernel.org

Please follow at:

Bug 196547 - Since 4.12 - bonding module not working with wireless drivers
https://bugzilla.kernel.org/show_bug.cgi?id=196547

As Andy mentioned, this bug may have a political aspect, so please make your voices heard at kernel.org.
