FS#20542 - [kernel26] iwlagn driver broken with kernel 2.6.35

Attached to Project: Arch Linux
Opened by Can Celasun (dcelasun) - Friday, 20 August 2010, 19:35 GMT
Last edited by Tobias Powalowski (tpowa) - Tuesday, 14 February 2012, 14:31 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 4
Private No

Details

Description:

With the 2.6.35 upgrade in [core], the iwlagn driver drops the connection every ~5 minutes and auto-reconnects. dmesg and kernel.log are filled with:

iwlagn 0000:04:00.0: BA scd_flow 0 does not match txq_id 10
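
To see how often the message above appears, a minimal sketch along these lines may help. The function name count_ba_errors is hypothetical, and the log path in the usage note is an assumption about the local syslog setup.

```shell
# Count block-ack mismatch errors of the kind quoted in this report.
# $1: kernel log file to scan (the location depends on your syslog setup)
count_ba_errors() {
    # grep -c prints the number of matching lines
    grep -c 'BA scd_flow [0-9]* does not match txq_id [0-9]*' "$1"
}
```

It would be called as, for example, count_ba_errors /var/log/kernel.log, assuming that is where the kernel log is written.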

The WiFi card is Intel 5100.

Also, this might be related: https://patchwork.kernel.org/patch/112837/

I've set the severity to "high", since I had to downgrade to 2.6.34 for wifi to become stable again and having the kernel in IgnorePkg will eventually break something.

Additional info:
* package version(s)
* config and/or log files etc.


Steps to reproduce:

Upgrade to 2.6.35 kernel and connect to a wifi using an Intel 5100 wifi link.

Closed by  Tobias Powalowski (tpowa)
Tuesday, 14 February 2012, 14:31 GMT
Reason for closing:  Fixed
Comment by Can Celasun (dcelasun) - Friday, 20 August 2010, 19:36 GMT
Comment by Can Celasun (dcelasun) - Wednesday, 25 August 2010, 14:10 GMT
Intel confirmed the problem and they are working on a fix.
Comment by Alphazo (alphazo) - Thursday, 09 September 2010, 21:25 GMT
Don't know if the following is related but after a while I get disconnected with no way to reconnect. Attached is my kernel.log that also includes some traces related to the wifi.

[EDIT] To avoid any confusion I'm creating a new bug report
Comment by Thomas Bächler (brain0) - Friday, 10 September 2010, 09:21 GMT
Yes, alphazo's report is a different wireless device using a different firmware file. The error is also different.

Can, can you please post here when Intel post the new firmware, we can then include it in the linux-firmware package early.
Comment by Can Celasun (dcelasun) - Friday, 10 September 2010, 09:27 GMT
@Thomas: Sure, I'll post a link here.

Also, there is a workaround that I can confirm. I've fetched the iwlagn driver in 2.6.34.3 and replaced the driver in 2.6.35.4 with it. The kernel builds successfully with that configuration. It could be useful for those who can't wait for a firmware update.
Comment by Can Celasun (dcelasun) - Saturday, 11 September 2010, 17:28 GMT
Update:

For those who don't want to wait for the new firmware, Intel has posted a workaround. Apparently, the bug in the firmware only causes problems when 802.11n is enabled. So, if you are not using an "n" network (most people don't), you can use kernel 2.6.35 and disable 802.11n like this:

- Check the name of the 802.11n module parameter ($ modinfo iwlagn)
- Remove the iwlagn module and load it again with 802.11n disabled (# modprobe iwlagn 11n_disable=1)

Hope this helps.
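
To confirm that the reloaded module picked up the option, something like the sketch below could be used. It assumes the driver exposes the parameter as a readable file under /sys/module/iwlagn/parameters/, which I have not verified for every kernel version; the function name n_disabled is hypothetical.

```shell
# Succeed if the given module was loaded with 11n_disable=1.
# $1: module name
# $2: sysfs root (defaults to /sys; overridable so the check can be tried
#     against a scratch directory)
n_disabled() {
    param="${2:-/sys}/module/$1/parameters/11n_disable"
    # Assumption: the parameter is exported read-only via sysfs
    [ -r "$param" ] && [ "$(cat "$param")" = "1" ]
}
```

Usage would be along the lines of: n_disabled iwlagn && echo "802.11n is disabled".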
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Thursday, 16 September 2010, 22:36 GMT
The fix listed by Can Celasun seems to work here on my 5100. For those who want to load this at boot time and are not familiar with rc.conf/modprobe (see the wiki), I believe that adding the following line to /etc/modprobe.d/modprobe.conf should work:

options iwlagn 11n_disable=1

and add or uncomment the following line from /etc/mkinitcpio.conf (note that the current default mkinitcpio.conf does not have the modprobe.d path in the example):

FILES="/etc/modprobe.d/modprobe.conf"

followed by a:

# mkinitcpio -p kernel26

If this is incorrect please note. If this is an inappropriate venue for this information, please let me know and I'll post to a more appropriate area next time.
Comment by Can Celasun (dcelasun) - Friday, 17 September 2010, 06:05 GMT
As an alternative method to the one given by Ethan, you can add the following to rc.local:

rmmod iwlagn
modprobe iwlagn 11n_disable=1
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Friday, 17 September 2010, 20:15 GMT
I recommend Can's method, as it's lighter weight for what should (hopefully) be a temporary workaround.

Also, as an addendum to the modprobe.conf method listed above, I believe that editing /etc/mkinitcpio.conf to include the modprobe path is actually unnecessary now. All files within /etc/modprobe.d/ are now apparently included by default without explicit path definition.
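
If every file in /etc/modprobe.d/ is indeed read, the option could also live in its own drop-in file, which is easy to delete once the bug is fixed. A minimal sketch, where the file name iwlagn.conf and the function name are hypothetical choices:

```shell
# Write the workaround option into a dedicated drop-in file.
# $1: target directory (defaults to /etc/modprobe.d; writing there needs root)
write_iwlagn_option() {
    dir="${1:-/etc/modprobe.d}"
    # iwlagn.conf is an arbitrary name; only the .conf suffix is assumed to matter
    printf 'options iwlagn 11n_disable=1\n' > "$dir/iwlagn.conf"
}
```

The option takes effect the next time the module is loaded, so a reboot or a manual rmmod/modprobe is still needed.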
Comment by Can Celasun (dcelasun) - Tuesday, 05 October 2010, 19:15 GMT
Update:

It seems the problem lies in both the ucode and the iwlwifi driver. I'm currently in touch with a developer from Intel, and he confirmed that the future (unreleased) ucode fixes the issue. For the problem within iwlwifi, we are working on a fix. I'm currently testing a patch; if it turns out to be useful, I'll post it here.
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Tuesday, 05 October 2010, 19:53 GMT
Can, I'm happy to test as well, but I'd need to establish a reproducible failure case. I am still seeing dropouts but I haven't tried to nail down a repro scenario yet (though I'm sure it's there).
Comment by Can Celasun (dcelasun) - Tuesday, 05 October 2010, 20:03 GMT
Ethan, using the iwlagn driver without the workaround seems to be enough to reproduce the issue. We have managed to reproduce it on both machines we've tried.

The guy from Intel prepared the patch against 2.6.36rc6 (since that's what I'm using) so if you'd like to test the patch, you should fetch kernel26-mainline from AUR.

I'll post the patch (and the kernel config) here, hopefully tomorrow morning.
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Tuesday, 05 October 2010, 21:04 GMT
Can, I'm actually using it without the workaround right now, but I did upgrade to the bleeding-edge compat-wireless release. Right now I'm connecting via 802.11n and it's very stable. It definitely does still fail, but it's much more intermittent. I am trying to induce failure now (running iperf to monitor throughput, dropping into powersave mode on the interface, connected at 802.11n). Hopefully it will drop at the 5 or 10 minute mark.

BTW, I did have greater trouble with the driver when I was on clocksource=hpet. jiffies eliminated a whole class of apparent powersave related issues that I had with iwlagn.

Anyhow, I'm happy to test against 2.6.36rc6 with patch if it seems to be working...
Comment by Can Celasun (dcelasun) - Tuesday, 05 October 2010, 21:15 GMT
Ethan, I do not use compat-wireless so maybe something already got fixed upstream. However, I don't think this is the case since Intel would have informed us if it were.

I don't have an n-router available so I can't test 802.11n, but based on your experience if it still fails from time to time, the problem is still there.

BTW, which kernel are you testing this with?
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Tuesday, 05 October 2010, 21:34 GMT
I'm on Kernel 2.6.35.7-1, straight from Arch Core repo.

And I agree, problem is definitely not fixed, though perhaps attenuated with the compat-wireless bleeding edge drivers. I actually notice the problem most on one particular AP (a B/G only access point) and will try to test against that first to see if I can consistently get it to fail. I'd really like to be able to get a solid lock on the failure conditions so that I can test the patch with certainty. Will report further in a couple hours if I'm able to do so.
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Tuesday, 05 October 2010, 22:42 GMT
Looks like I can consistently reproduce failure by connecting to a B/G-only access point with the 5100/iwlagn in bgn-mode (i.e. no workarounds). Ready to test the patch anytime.

EDIT: scratch that... I was looking at the wrong MAC address and was connected to a remote AP. When connecting to a good signal, no dropouts or failures yet. Remote access point does seem to occasionally result in some kind of dropped connection, but not sure if it's related to the total failure noted before I upgraded to current kernel and bleeding-edge compat-wireless drivers.
Comment by Can Celasun (dcelasun) - Wednesday, 06 October 2010, 10:26 GMT
Well, when you can consistently reproduce the issue, let me know and I'll email you the patch (I don't want to post it publicly before talking to Intel).
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Wednesday, 06 October 2010, 16:17 GMT
Can, will do. I think my previous intermittent outages were due to the 5100 connecting to a remote AP. I do think it handles the low quality signal badly, which could be related, but that's just anecdotal and a guess on my part. I'll keep pushing it and trying alternate access configurations to see if I can trigger bad behavior. For what it's worth, 2.6.35.7-1 and the bleeding-edge compat-wireless seem to be a good combination.
Comment by Can Celasun (dcelasun) - Thursday, 07 October 2010, 13:28 GMT
Ethan, I've sent you an email with a patch for iwlwifi (against 2.6.36-rc6) and a kernel config.
Comment by Can Celasun (dcelasun) - Saturday, 09 October 2010, 11:34 GMT
Ethan, were you able to test the patch?
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Monday, 11 October 2010, 21:16 GMT
Just so I don't appear to have disappeared, this is an update to confirm patch tested and looking good :)
Comment by Serge Buglakov (redetection) - Wednesday, 27 October 2010, 17:10 GMT
smth really broken, look:

[rd@rdbook ~]$ uname -r && iwconfig wlan0 | head -n 1
2.6.32-ARCH
wlan0 IEEE 802.11abgn ESSID:"rdwp"

[rd@rdbook ~]$ uname -r && iwconfig wlan0 | head -n 1
2.6.35-ARCH
wlan0 IEEE 802.11abg ESSID:"rdwp"

my wifi card is
[rd@rdbook ~]$ lspci -nn | grep WiFi
02:00.0 Network controller [0280]: Intel Corporation WiMAX/WiFi Link 5150 Series [8086:423c]

and it really does support 802.11n. In kernel 2.6.32 I get about ten megabytes per second, but only three in 2.6.35.
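
The comparison above can be scripted so that it is easy to repeat after each kernel change. A minimal sketch, assuming iwconfig keeps printing the mode as "IEEE 802.11" followed by the band letters on its first line; the function name advertises_11n is hypothetical.

```shell
# Succeed if the first line of `iwconfig <iface>` output, read from stdin,
# lists "n" among the supported 802.11 modes.
advertises_11n() {
    # e.g. "802.11abgn" matches, "802.11abg" does not
    head -n 1 | grep -q 'IEEE 802\.11[a-z]*n'
}
```

For example: iwconfig wlan0 | advertises_11n || echo "no 802.11n on this kernel".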
Comment by Ethan Schoonover (new acct: altercation to match irc) (Thinkpol) - Friday, 29 October 2010, 21:05 GMT
Can's fix will probably solve this, but I wanted to note some other details for the time being:

On current 2.6.36, the following setterm induces immediate iwlagn / 5100 failure on the console, and the equivalent xset commands below result in immediate failure while in X:

# setterm -blank 1 (for example)

# xset dpms force standby
# xset dpms force suspend
# xset dpms force off

I assume there is some powersave function that is ganged to the screen powersaving function resulting in the iwlagn failure correlation with display blanking.

EDIT: n.b. that the above only applies with clocksource=hpet. on my x100e, clocksource=jiffies results in NO FAILURES of iwlagn/5100 combo. hpet is preferable as clocksource but I'll use jiffies till this is sorted.
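
For anyone trying to reproduce this, it may help to record which clocksource is actually active, since the behaviour described above depends on it. A minimal sketch reading the kernel's sysfs entry; the function name is hypothetical and the optional root argument exists only so the function can be tried against a scratch directory.

```shell
# Print the clocksource the kernel is currently using (e.g. hpet, jiffies).
# $1: sysfs root (defaults to /sys)
current_clocksource() {
    cat "${1:-/sys}/devices/system/clocksource/clocksource0/current_clocksource"
}
```

Including the output of current_clocksource alongside the kernel version should make reports here easier to compare.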
Comment by Can Celasun (dcelasun) - Friday, 18 February 2011, 12:35 GMT
@Thomas: You said I should post here when Intel publishes new firmware so you can include it in linux-firmware.

Here's the experimental ucode: http://www.intellinuxwireless.org/?n=experimental
On that page, the ucode for 5xxx devices fixes all problems except those using n-networks. If you can include this in linux-firmware, it would help a lot of users with 5xxx cards.
Comment by Thomas Bächler (brain0) - Friday, 18 February 2011, 12:45 GMT
They certainly took their time, but glad to hear it. Considering that the firmware is 'experimental', I am not comfortable including it into linux-firmware. You could however send feedback to them.

I would prefer if we could wait until they submit these images to the linux-firmware maintainers (http://git.kernel.org/?p=linux/kernel/git/dwmw2/linux-firmware.git;a=summary). In the meantime, you can replace the firmware file on your machine with the experimental ones and prevent pacman from overwriting them on update.
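
Protecting a replaced file from pacman is normally done with a NoUpgrade entry in the [options] section of /etc/pacman.conf, with the path written without a leading slash. The sketch below only checks whether such an entry exists; the function name is hypothetical, and the firmware file name in the usage note is an assumption about how the file is named on your system.

```shell
# Succeed if the given pacman.conf has a NoUpgrade line listing the path.
# $1: path to pacman.conf
# $2: file path relative to / (no leading slash)
is_no_upgrade() {
    # Keep only NoUpgrade lines, then look for the path as a fixed string
    grep -E '^[[:space:]]*NoUpgrade[[:space:]]*=' "$1" | grep -qF -- "$2"
}
```

For example: is_no_upgrade /etc/pacman.conf lib/firmware/iwlwifi-5000-2.ucode || echo "pacman will overwrite the firmware on update".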
Comment by Can Celasun (dcelasun) - Friday, 18 February 2011, 12:50 GMT
I know it's marked as 'experimental', but Intel says the only changes are relevant to this particular problem and as you can see from the countless comments on my original upstream report [1], no one has experienced any side effects with this. Still, I'll understand if you decide to wait, but can we at least have it in [testing]? Otherwise I think I'll put it up on AUR.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=16691
Comment by Can Celasun (dcelasun) - Friday, 18 February 2011, 12:58 GMT
I've just checked the README for this new ucode, and it requires 'experimental ucode support' to be enabled in kconfig, so it would require a rebuild of the kernel.

I think this rules out any chance of including it in the repos, not even in [testing], right?
Comment by Thomas Bächler (brain0) - Friday, 18 February 2011, 14:11 GMT
My guess is the "experimental ucode support" only affects the file name: It will look under a different file name (...-exp...) first. If we rename this ucode file to the standard name, we could probably use it.
Comment by Can Celasun (dcelasun) - Friday, 18 February 2011, 14:16 GMT
OK then, I'll try it and report back. Also, should I just put this on AUR or would you consider pulling it into [testing]?
Comment by Serge Buglakov (redetection) - Friday, 18 February 2011, 15:14 GMT
>>fixes all problems except those using n-networks.
so what should I do to get n-networks working?
Comment by Can Celasun (dcelasun) - Friday, 18 February 2011, 15:17 GMT
Unfortunately there is not much you can do. N networks are still problematic, and even with the experimental ucode people are experiencing connection drops, high latency and packet loss. See the upstream bug to follow the progress regarding n networks.
Comment by Serge Buglakov (redetection) - Wednesday, 23 February 2011, 07:11 GMT
I haven't seen these problems in 2.6.32, so I just copied the contents of iwl5150_agn_cfg into iwl5150_abg_cfg in iwl-5000.c and got n-networks working again. :)
Comment by Thomas Bächler (brain0) - Sunday, 27 February 2011, 14:47 GMT
Comment by Thomas Bächler (brain0) - Sunday, 27 February 2011, 15:02 GMT
Hrm, this new firmware needs a new iwlwifi version to be used. So this bug might only be fixed after a kernel update. The linux git tree still has '#define IWL5000_UCODE_API_MAX 2', but it would need 5.
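
To illustrate why the API limit matters: as far as I understand it, the driver builds the firmware file name from a prefix plus the API number and tries versions from the maximum downwards, so a driver capped at 2 would never request a version-5 file. The sketch below is only a model of that behaviour, not the driver code; the naming scheme, the prefix and the minimum of 1 are assumptions.

```shell
# List the firmware file names a driver would try, highest API first.
# $1: file name prefix, $2: highest supported API, $3: lowest supported API
candidate_ucode_names() {
    api="$2"
    while [ "$api" -ge "$3" ]; do
        printf '%s%s.ucode\n' "$1" "$api"
        api=$((api - 1))
    done
}
```

Calling candidate_ucode_names iwlwifi-5000- 2 1 lists only the -2 and -1 files, which is the situation described above.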
Comment by Jelle van der Waa (jelly) - Thursday, 14 April 2011, 21:45 GMT
is this issue still around?
Comment by Can Celasun (dcelasun) - Friday, 15 April 2011, 04:15 GMT
Yes, it's still not resolved. See the upstream bug [1].

[1] https://bugzilla.kernel.org/show_bug.cgi?id=16691
Comment by JM (fijam) - Tuesday, 03 May 2011, 10:12 GMT
Status with 2.6.38?
Comment by Can Celasun (dcelasun) - Thursday, 12 May 2011, 11:29 GMT
This is partially fixed in git head [1]. Connection losses on all hardware using iwlwifi are fixed (errors like "Received BA when not expected" or "BA scd_flow 0 does not match txq_id 10"). The remaining issues are related to performance regressions on 49xx and 5xxx wifi cards. We can now close this bug, as the fix will be included upstream. I'm not sure if the merge window is still open for 2.6.39, but it will definitely be included in 2.6.40.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bfd36103ec26599557c2bd3225a1f1c9267f8fcb
Comment by Jelle van der Waa (jelly) - Thursday, 16 June 2011, 10:36 GMT
status with .39? Else we will get it with 3.0 ;)
Comment by Can Celasun (dcelasun) - Thursday, 16 June 2011, 13:43 GMT
No idea. It probably didn't get merged in time, but I'm too lazy to check :)
Comment by Victor (goldeelox) - Friday, 26 August 2011, 13:50 GMT
This bug still affects 3.0. The 11n_disable workaround works. Another workaround, for me at least, is to downgrade networkmanager to 0.8.3-0.
