FS#26847 - [linux] lots of stalls and finally freeze
Attached to Project:
Arch Linux
Opened by Mariusz Libera (mar04) - Friday, 11 November 2011, 12:35 GMT
Last edited by Tobias Powalowski (tpowa) - Saturday, 07 April 2012, 06:51 GMT
Details
Description:
Since updating to 3.1 my laptop has frozen 3 times already. I don't know what is causing it, but this time I checked the kernel log and there are lots of "INFO: rcu_preempt_state detected stalls on CPUs/tasks:" messages. They start to appear an hour before the system freezes. Meanwhile I only noticed that mp3 playback was a little choppy.
Additional info:
* up to date stock Arch packages
Steps to reproduce:
This task depends upon
https://bugs.archlinux.org/task/26674
Collected some dmesgs: https://gist.github.com/1363432
All from ZEN kernel, but hopefully still useful. The Arch kernel hung, as well.
Also running Sandy Bridge.
Don't really know how to report it upstream, bugzilla.kernel.org is down.
I don't find any Sandy Bridge:
00:00.0 Host bridge: Intel Corporation Core Processor DRAM Controller (rev 12)
00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 (rev 05)
00:1c.1 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 2 (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation Mobile 5 Series Chipset LPC Interface Controller (rev 05)
ff:00.0 Host bridge: Intel Corporation Core Processor QuickPath Architecture Generic Non-core Registers (rev 02)
ff:00.1 Host bridge: Intel Corporation Core Processor QuickPath Architecture System Address Decoder (rev 02)
ff:02.0 Host bridge: Intel Corporation Core Processor QPI Link 0 (rev 02)
ff:02.1 Host bridge: Intel Corporation Core Processor QPI Physical 0 (rev 02)
ff:02.2 Host bridge: Intel Corporation Core Processor Reserved (rev 02)
ff:02.3 Host bridge: Intel Corporation Core Processor Reserved (rev 02)
01:00.0 Ethernet controller: Broadcom Corporation NetLink BCM57780 Gigabit Ethernet PCIe (rev 01)
02:00.0 Network controller: Atheros Communications Inc. AR928X Wireless Network Adapter (PCI-Express) (rev 01)
With Linux kernel 3.0.7, wlan0 disconnects some of the time (I am using wpa_supplicant 0.7.3-4) but resumes after 2 or 3 seconds. However, with linux-3.1-4, it stalls after a few minutes and I have to kill wpa_supplicant and reload it to reconnect the wifi. So there must be some miscommunication between wpa_supplicant and the Linux kernel. My wpa configuration is as follows:
network={
ssid=something
proto=WPA2
key_mgmt=WPA-PSK
pairwise=TKIP
group=TKIP
psk=something
wpa_ptk_rekey=600
}
But I don't think it's a problem since previous versions of Linux worked! Even Linux-2.6.3x. Someone said it's a problem with acer_wmi, but I find it is of no harm to the wifi connection for Linux-3.0.7 (just the minor problem of disconnection, it resumes very fast though by itself).
messages in your dmesg output? Because you're probably talking about some different issue.
@Jan Steffens: Sandy Bridge here
my laptop CPU is T2050 yonah, not sandy bridge
update:
Hi man, every time I go back home and use a network cable, the stalls and freeze problem is gone. In other words, the problem only appears when I use wireless. My wireless card is an Intel® PRO/Wireless 3945ABG.
wpa_ptk_rekey=600
However, I think Linux kernel 3.1 does have a problem, since it hangs the wifi connection.
It seems to be somehow network related but I'm not sure how.
Most of the time I'm using ethernet connection and my wireless is switched off,
so it's unlikely that a specific driver is broken.
One thing I can easily reproduce every time is that once stall messages start
to appear switching from X to tty always result in a lockup, from there only sysrq+reisub.
Other issues:
NetworkManager gets broken (at least the gnome applet) - on/off doesn't work
and it always says I'm connected. Also I cannot quit firefox properly - next time
I run it, it says it's already running. Suspend only locks the screen.
I'm attaching one more log from 3.1.1 without vboxdrv and mei modules so it's not tainted.
Also found this:
https://lkml.org/lkml/2011/4/19/588
https://lkml.org/lkml/2011/8/2/415
I also get:
iwl4965 0000:0c:00.0: Error sending REPLY_LEDS_CMD: enqueue_hcmd failed: -5
spammed in /var/log/errors.log when the crash occurs.
Also, when the stalls happen I can't even do a simple "cat" on the logfiles. I have to restart to be able to even read the logs.
I am not on sandy bridge hardware. I have an old T7500 CPU.
* The behaviour is somehow strange. Firefox refuses to load new page, while ping on the command line is still working.
* Cannot logout/reboot from Gnome; System freezes on logout
* Switching to console using CTRL+ALT+F1 freezes the system.
Also confirming this statement:
"One thing I can easily reproduce every time is that once stall messages start
to appear switching from X to tty always result in a lockup,"
errors.log:
Nov 22 15:55:24 localhost kernel: [ 600.592028] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 3, t=126066 jiffies)
Nov 22 15:58:24 localhost kernel: [ 780.248711] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=180098 jiffies)
Nov 22 16:01:24 localhost kernel: [ 959.905340] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=234130 jiffies)
Nov 22 16:04:24 localhost kernel: [ 1139.561976] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 3, t=288162 jiffies)
Is there a workaround? Something like installing an old kernel version? I am definitely unproductive since this bug started to occur about one week ago.
lspci (2.6 KiB)
http://www.kernel.org/doc/Documentation/RCU/stallwarn.txt
Seems relevant, especially to include in an upstream bug report.
This means that in order to avoid this, I need to use the hardware switch to turn off my wifi when using an ethernet cable.
Can anyone else confirm this?
My notebook is Thinkpad Edge E320 1298-5VG. I've solved this problem also on another notebook (Thinkpad R61) the same way.
//EDIT
No, downgrading wasn't the way; details in the attached file.
kill_hang.log (19.5 KiB)
It happens with an ethernet connection also. Are you guys using IPv6?
Random idea...
WiFi at home: IPv4. - no problems
Cable network at school: IPv4 only (realized that I haven't setup IPv6) - no problems
Seems like it has something to do with IPv6.
I can reproduce this bug on my netbook too (EeePC 901), which has a ralink wireless card, so it's definitely not intel related.
I can only reproduce this issue when using my university's wireless network - eduroam.
The issues happen regardless of the connection being on IPv4 or IPv6.
Current settings of eduroam:
Security: WPA2 enterprise
Authentication: Tunnelled TLS
Inner authentication: MSCHAPv2
No CA certificate
I am using networkmanager to connect.
Is anyone experiencing this issue with different settings?
NetworkManager
Security: WPA2 enterprise
Authentication: PEAP (PEAP version: auto)
Inner authentication: MSCHAPv2
CA Certificate
My hardware: Lenovo X120e. Processor is AMD E350 (Zacate) and the wifi card uses the rtl8192ce driver.
I've attached the output of dmesg from the few seconds before and after the error occurred (this is after resuming from suspend and trying to reconnect to the wifi).
Dec 1 07:10:18 ondra-laptop NetworkManager[1237]: <info> Activation (wlan0) Stage 4 of 5 (IP6 Configure Get) scheduled...
Dec 1 07:10:18 ondra-laptop NetworkManager[1237]: <info> Activation (wlan0) Stage 4 of 5 (IP6 Configure Get) started...
Dec 1 07:10:18 ondra-laptop NetworkManager[1237]: <info> Activation (wlan0) Stage 5 of 5 (IP Configure Commit) scheduled...
Dec 1 07:10:18 ondra-laptop NetworkManager[1237]: <info> Activation (wlan0) Stage 4 of 5 (IP6 Configure Get) complete.
Dec 1 07:10:18 ondra-laptop NetworkManager[1237]: <info> Activation (wlan0) Stage 5 of 5 (IP Configure Commit) started...
But I have blacklisted the ipv6 kernel module and so far so good, even on eduroam (my school's wifi). So I am running IPv4 only now.
//EDIT
So I apologise for the misinformation: eduroam is running dual-stack IPv4 and IPv6 (not only IPv6 as I stated before).
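For anyone repeating the IPv4-only test, it may help to confirm what the kernel itself reports rather than relying on the NetworkManager setting. A minimal sketch, assuming the usual Linux layout where `/proc/sys/net/ipv6/conf/all/disable_ipv6` exists only while the IPv6 stack is present. The function name `ipv6_state` and its `proc_root` parameter are made up for illustration.

```python
from pathlib import Path


def ipv6_state(proc_root="/proc"):
    """Return 'absent', 'disabled' or 'enabled' for the running kernel.

    proc_root is a hypothetical parameter so the check can be pointed
    at a copy of /proc instead of the live one.
    """
    sysctl = Path(proc_root) / "sys/net/ipv6/conf/all/disable_ipv6"
    if not sysctl.exists():
        # No IPv6 sysctl tree: the ipv6 module is not loaded
        # (for example because it is blacklisted).
        return "absent"
    if sysctl.read_text().strip() == "1":
        # Stack is loaded but switched off through the sysctl.
        return "disabled"
    return "enabled"


print(ipv6_state())
```

With the ipv6 module blacklisted as described above, this should report `absent`; `enabled` would mean the blacklist did not take effect.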
-> No crash since Nov. 22nd
Configuration:
NetworkManager
WPA2 Enterprise
Tunneled TLS
PAP
Certificate
Asus Sabertooth 990FX, UEFI boot, Firmware 813
Phenom II 1100T
This issue also appears network related, as I can generally trigger a cascade of them by starting Deluge (a bittorrent client). The issue seems to be more frequent when using the in-kernel r8169 module; blacklisting it and using r8168 from testing seems to be marginally more stable (maybe 40% less crashy).
This issue does not appear to be related to the GPU despite the X issues. The same issue appears with either a GTX 560 Ti and Nvidia's blob or an AMD Radeon 5770 with either Catalyst 11.11 or the radeon driver.
I am on openSUSE, but as I am pretty sure this
is an upstream kernel bug, I think this
information might be relevant.
On my system, as on many others, the bug is triggered
by a particular WiFi network. I *never* get
"rcu_preempt_state detected stalls on CPUs/tasks"
in my home WiFi network, but I *always*
get it in my university's WiFi network.
This problematic WiFi network does *not* have any encryption enabled,
but as far as I know some MAC-filtering and some roaming
features are enabled. Interesting thing is that
I see "rcu_preempt_state detected stalls on CPUs/tasks"
in /var/log/messages exactly every three minutes
when my notebook is in university's WiFi network.
So grep'ing for "rcu_preempt" gives this curious picture
(notice that the message always appears exactly on the 42nd second):
12/05/11 04:36:42 PM Starnote kernel [ 311.988158] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=60002 jiffies)
12/05/11 04:39:42 PM Starnote kernel [ 492.020162] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=240034 jiffies)
12/05/11 04:42:42 PM Starnote kernel [ 672.052094] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=420066 jiffies)
12/05/11 04:45:42 PM Starnote kernel [ 852.084161] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=600098 jiffies)
12/05/11 04:48:42 PM Starnote kernel [ 1032.116160] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=780130 jiffies)
12/05/11 04:51:42 PM Starnote kernel [ 1212.148111] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 0, t=960162 jiffies)
12/05/11 04:54:42 PM Starnote kernel [ 1392.180148] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=1140194 jiffies)
I have no idea what it means, since most of the time between
these messages there is just silence in /var/log/messages.
Probably I should look at some other logs, not sure which exactly.
Just a wild guess: can it somehow be connected with
some DHCP lease times or some other network timeout?
There has to be a reason why it happens
exactly every three minutes.
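The three-minute spacing can be checked directly from the lines above. A minimal sketch in Python (the helper names `parse_stalls` and `intervals` are made up for illustration) that pulls the kernel timestamp and the `t=` value out of each stall message and prints the gap between consecutive messages:

```python
import re

# The seven stall messages quoted above, copied verbatim.
LOG = """\
12/05/11 04:36:42 PM Starnote kernel [ 311.988158] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=60002 jiffies)
12/05/11 04:39:42 PM Starnote kernel [ 492.020162] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=240034 jiffies)
12/05/11 04:42:42 PM Starnote kernel [ 672.052094] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=420066 jiffies)
12/05/11 04:45:42 PM Starnote kernel [ 852.084161] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=600098 jiffies)
12/05/11 04:48:42 PM Starnote kernel [ 1032.116160] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=780130 jiffies)
12/05/11 04:51:42 PM Starnote kernel [ 1212.148111] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 0, t=960162 jiffies)
12/05/11 04:54:42 PM Starnote kernel [ 1392.180148] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=1140194 jiffies)
"""

# Kernel timestamp in brackets, then the t=<n> jiffies counter.
STALL = re.compile(
    r"\[\s*(\d+\.\d+)\] INFO: rcu_preempt_state detected stalls.*t=(\d+) jiffies"
)


def parse_stalls(text):
    """Return a list of (seconds_since_boot, jiffies) per stall message."""
    return [(float(m.group(1)), int(m.group(2))) for m in STALL.finditer(text)]


def intervals(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


stalls = parse_stalls(LOG)
time_deltas = intervals([t for t, _ in stalls])
jiffy_deltas = intervals([j for _, j in stalls])

for dt, dj in zip(time_deltas, jiffy_deltas):
    print(f"{dt:.3f} s  {dj} jiffies  ~{dj / dt:.0f} jiffies/s")
```

On these seven lines every gap comes out at about 180.03 seconds and exactly 180032 jiffies, so the messages are evenly spaced and this kernel appears to count 1000 jiffies per second.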
Not sure what information I should provide in this case,
so here is some information about my system:
- Kernel version: 3.1.4-1
- I'm using NetworkManager (within KDE)
- WiFi card is Intel 4965AGN
- IPv4 and IPv6 enabled
- When I try to switch to a different virtual terminal (e.g. Alt+Ctrl+F1)
after getting the error and the subsequent kernel taint, the system hangs
and responds, in the best case, only to "magic sysrq" combinations.
Did not try connecting with wired Ethernet
yet with new kernel, but will try that later too.
It's using kernel 3.1.2 and so far no problems. Same hardware, same network connection,
same desktop environment and apps. Any chance this is Arch specific?
1. In kernel 3.1.4. the bug still exists.
2. The bug is related to wireless, but not limited to a specific driver. With the same machine(M4600) I tried two adaptors, intel 6300(iwlagn) comes with the machine, and atheros AR5212(madwifi, ath5k). Both wireless cards will cause the bug when using Transmission.
3. The bug is not related to encryption(WPA2) since I have also experienced the bug with an unencrypted wireless network.
4. It's a kernel bug not specific to Arch. I have used fedora 16 live cd(32 bit) on the M4600 with intel 6300 card, the same situation.
5. It seems that wired network is not influenced. Everything is OK if I am not using the wireless network.
6. I have not experienced the bug when using fedora 16 on a Thinkpad T60 with intel 3945 adaptor.
7. My workaround is to use KVM, passing the wireless adaptor through to WinXP and using the WinXP guest in KVM as a gateway. Everything is fine with this approach.
I'm running libvirt with QEMU/KVM myself, which sets up a virtual network (virbr0) and a NAT. Though I only sometimes run a VM.
Also, I set net.core.bpf_jit_enable=1 .
I suggest you try the old kernel26-lts, which may be a good workaround for you. A few hours ago I tried the old kernel in a house which easily triggered the bug (hangs, frequent disconnects, inability to associate with the AP) on kernel 3.x. After about an hour's test with bittorrent and ftp transfers, I find that the system is fairly stable.
For my Atheros card, ath5k works flawlessly.
For my intel 6300 card, there are still some error messages, but not fatal. What is more, I did not lose connection even if the error message occurs.
The error message is more useful than that of kernel 3.x, as pasted below
----------------
[ 498.520520] ------------[ cut here ]------------
[ 498.520535] WARNING: at include/net/mac80211.h:2206 rate_control_send_low+0xc1/0xe0 [mac80211]()
[ 498.520537] Hardware name: Precision M4600
[ 498.520538] Modules linked in: ipt_MASQUERADE xt_state ipt_REJECT xt_tcpudp iptable_filter nf_nat_h323 nf_conntrack_h323 nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables iwlagn iwlcore aesni_intel cryptd aes_x86_64 aes_generic ipv6 nvidia(P) fuse uvcvideo videodev v4l1_compat v4l2_compat_ioctl32 btusb bluetooth arc4 ecb snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep mac80211 serio_raw snd_pcm i2c_i801 snd_timer psmouse snd soundcore snd_page_alloc iTCO_wdt cfg80211 iTCO_vendor_support i2c_core led_class evdev dell_laptop rfkill ppdev pcspkr parport_pc parport dcdbas wmi video button output ac thermal battery cpufreq_powersave cpufreq_ondemand acpi_cpufreq freq_table processor ext4 mbcache jbd2 crc16 sd_mod ahci libata xhci ehci_hcd scsi_mod usbcore [last unloaded: ath]
[ 498.520574] Pid: 1813, comm: phy1 Tainted: P W 2.6.32.49-1-lts #1
[ 498.520575] Call Trace:
[ 498.520581] [<ffffffff8105f728>] warn_slowpath_common+0x78/0xb0
[ 498.520583] [<ffffffff8105f774>] warn_slowpath_null+0x14/0x20
[ 498.520587] [<ffffffffa022d311>] rate_control_send_low+0xc1/0xe0 [mac80211]
[ 498.520590] [<ffffffffa0ef459b>] rs_get_rate+0x6b/0x280 [iwlagn]
[ 498.520594] [<ffffffffa022d864>] rate_control_get_rate+0xc4/0xe0 [mac80211]
[ 498.520598] [<ffffffffa0233fe6>] invoke_tx_handlers+0x636/0xea0 [mac80211]
[ 498.520601] [<ffffffffa0234a78>] ? ieee80211_tx_prepare+0x178/0x3d0 [mac80211]
[ 498.520604] [<ffffffffa0235445>] ieee80211_tx+0x75/0x230 [mac80211]
[ 498.520608] [<ffffffffa0235727>] ieee80211_xmit+0x127/0x280 [mac80211]
[ 498.520611] [<ffffffffa02364b1>] ieee80211_tx_skb+0x61/0x70 [mac80211]
[ 498.520615] [<ffffffffa0238b2a>] ieee80211_send_probe_req+0x12a/0x180 [mac80211]
[ 498.520618] [<ffffffffa0229e30>] ? ieee80211_sta_work+0x0/0x1150 [mac80211]
[ 498.520621] [<ffffffffa0228480>] ieee80211_mgd_probe_ap_send+0x40/0x70 [mac80211]
[ 498.520624] [<ffffffffa0229e30>] ? ieee80211_sta_work+0x0/0x1150 [mac80211]
[ 498.520627] [<ffffffffa022ab12>] ieee80211_sta_work+0xce2/0x1150 [mac80211]
[ 498.520630] [<ffffffffa0229e30>] ? ieee80211_sta_work+0x0/0x1150 [mac80211]
[ 498.520633] [<ffffffff8107e99d>] worker_thread+0x14d/0x2a0
[ 498.520636] [<ffffffff810848f0>] ? autoremove_wake_function+0x0/0x40
[ 498.520638] [<ffffffff8107e850>] ? worker_thread+0x0/0x2a0
[ 498.520639] [<ffffffff81084188>] kthread+0x88/0x90
[ 498.520642] [<ffffffff8104d478>] ? finish_task_switch+0x48/0xd0
[ 498.520644] [<ffffffff810130aa>] child_rip+0xa/0x20
[ 498.520646] [<ffffffff81084100>] ? kthread+0x0/0x90
[ 498.520648] [<ffffffff810130a0>] ? child_rip+0x0/0x20
[ 498.520649] ---[ end trace 1e87ce0ddc8dcfab ]---
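When reading a trace like this one, it can help to list which module each frame belongs to. A small sketch, assuming the usual `[<address>] symbol+0xoffset/0xlength [module]` frame layout, where a leading `?` marks a frame the unwinder is not sure about. The names `FRAME` and `parse_frames` are made up for illustration, and the sample is a few lines copied from the trace above.

```python
import re

# One stack frame: address, optional "?", symbol+offset/length, optional module.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\][ \t]+(\?[ \t]+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(\w+)\])?"
)


def parse_frames(trace):
    """Return (symbol, module, reliable) for each frame line in a trace."""
    frames = []
    for m in FRAME.finditer(trace):
        unsure, symbol, module = m.groups()
        # Frames without a [module] suffix belong to the core kernel image.
        frames.append((symbol, module or "kernel", unsure is None))
    return frames


SAMPLE = """\
[ 498.520581] [<ffffffff8105f728>] warn_slowpath_common+0x78/0xb0
[ 498.520587] [<ffffffffa022d311>] rate_control_send_low+0xc1/0xe0 [mac80211]
[ 498.520590] [<ffffffffa0ef459b>] rs_get_rate+0x6b/0x280 [iwlagn]
[ 498.520601] [<ffffffffa0234a78>] ? ieee80211_tx_prepare+0x178/0x3d0 [mac80211]
"""

for symbol, module, reliable in parse_frames(SAMPLE):
    print(f"{module:10} {symbol}{'' if reliable else ' (?)'}")
```

Run over the full trace, this shows the warning being raised in mac80211 after a call from iwlagn's `rs_get_rate`, on a path that starts in `ieee80211_sta_work`.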
I've had some trouble with NetworkManager and the handling of IPv6 router advertisements. Now, those advertisements are sent every 5 minutes, and the stalls happen every 3 minutes, so they are probably not related. I haven't looked into that further, because Wireshark stops working when the stalls start. But I AM wondering if anybody is experiencing this without using NetworkManager?
Also, I've noticed that the stalls start suspiciously often when I'm browsing Youtube. Might be because Youtube is on IPv6, or maybe Flash is triggering some bug in the IPv6 stack? At least that last option would explain why it doesn't show up very often.
(x64 on r8169)
Here is the backtrace from my computer:
kernel: [ 4052.963054] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 1, t=72034 jiffies)
kernel: [ 4052.963068] sending NMI to all CPUs:
kernel: [ 4052.963081] NMI backtrace for cpu 0
kernel: [ 4052.963088] CPU 0
kernel: [ 4052.963093] Modules linked in: fuse rfcomm bnep cryptd aes_x86_64 aes_generic snd_hda_codec_realtek nvidia(P) snd_hda_intel snd_hda_codec ecb btusb snd_hwdep snd_pcm firewire_ohci bluetooth joydev snd_timer snd edac_core r8169 soundcore rfkill firewire_core snd_page_alloc psmouse i2c_piix4 serio_raw sp5100_tco mii wmi crc_itu_t edac_mce_amd k10temp evdev button pcspkr it87 cn adt7475 hwmon_vid i2c_core cpufreq_ondemand powernow_k8 freq_table processor mperf ipv6 autofs4 hid_microsoft usbhid hid ext4 mbcache jbd2 crc16 sr_mod cdrom sd_mod pata_acpi ohci_hcd ahci libahci pata_atiixp pata_jmicron libata ehci_hcd xhci_hcd scsi_mod usbcore
kernel: [ 4052.963200]
kernel: [ 4052.963206] Pid: 0, comm: swapper Tainted: P 3.1.4-1-ARCH #1 Gigabyte Technology Co., Ltd. GA-770TA-UD3/GA-770TA-UD3
kernel: [ 4052.963219] RIP: 0010:[<ffffffff8103ae5b>] [<ffffffff8103ae5b>] native_safe_halt+0xb/0x10
kernel: [ 4052.963237] RSP: 0018:ffffffff81801e58 EFLAGS: 00000246
kernel: [ 4052.963243] RAX: 0000000000000000 RBX: ffffffff81801ea4 RCX: 0000000000000020
kernel: [ 4052.963249] RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0000000000000086
kernel: [ 4052.963254] RBP: ffffffff81801e58 R08: ffffffff818a5300 R09: 0000000000000000
kernel: [ 4052.963260] R10: 000000000336e391 R11: 0000000000000001 R12: ffffffff81934220
kernel: [ 4052.963265] R13: 0000000000000000 R14: ffffffffffffffff R15: 000000000008c000
kernel: [ 4052.963273] FS: 00007f97c5756880(0000) GS:ffff88012fc00000(0000) knlGS:0000000000000000
kernel: [ 4052.963279] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
kernel: [ 4052.963285] CR2: 00007ff0bd3be000 CR3: 0000000114366000 CR4: 00000000000006f0
kernel: [ 4052.963290] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [ 4052.963296] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: [ 4052.963303] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8189d020)
kernel: [ 4052.963308] Stack:
kernel: [ 4052.963313] ffffffff81801e88 ffffffff8101d313 ffffffff81801ea4 ffffffff81934220
kernel: [ 4052.963324] ffffffff81800000 ffffffffffffffff ffffffff81801eb8 ffffffff8101db3a
kernel: [ 4052.963334] ffffffff81934220 0000000081800000 ffffffff81801fd8 ffffffff81934220
kernel: [ 4052.963343] Call Trace:
kernel: [ 4052.963353] [<ffffffff8101d313>] default_idle+0x53/0x2a0
kernel: [ 4052.963362] [<ffffffff8101db3a>] amd_e400_idle+0x9a/0x120
kernel: [ 4052.963370] [<ffffffff81013236>] cpu_idle+0xd6/0x120
kernel: [ 4052.963380] [<ffffffff813e6912>] rest_init+0x96/0xa4
kernel: [ 4052.963389] [<ffffffff8194fc15>] start_kernel+0x3bf/0x3cc
kernel: [ 4052.963398] [<ffffffff8194f347>] x86_64_start_reservations+0x132/0x136
kernel: [ 4052.963407] [<ffffffff8194f140>] ? early_idt_handlers+0x140/0x140
kernel: [ 4052.963415] [<ffffffff8194f44d>] x86_64_start_kernel+0x102/0x111
kernel: [ 4052.963420] Code: 55 48 89 e5 66 66 66 66 90 fa 5d c3 0f 1f 40 00 55 48 89 e5 66 66 66 66 90 fb 5d c3 0f 1f 40 00 55 48 89 e5 66 66 66 66 90 fb f4 <5d> c3 0f 1f 00 55 48 89 e5 66 66 66 66 90 f4 5d c3 0f 1f 40 00
kernel: [ 4052.963490] Call Trace:
kernel: [ 4052.963497] [<ffffffff8101d313>] default_idle+0x53/0x2a0
kernel: [ 4052.963504] [<ffffffff8101db3a>] amd_e400_idle+0x9a/0x120
kernel: [ 4052.963512] [<ffffffff81013236>] cpu_idle+0xd6/0x120
kernel: [ 4052.963520] [<ffffffff813e6912>] rest_init+0x96/0xa4
kernel: [ 4052.963528] [<ffffffff8194fc15>] start_kernel+0x3bf/0x3cc
kernel: [ 4052.963537] [<ffffffff8194f347>] x86_64_start_reservations+0x132/0x136
kernel: [ 4052.963545] [<ffffffff8194f140>] ? early_idt_handlers+0x140/0x140
kernel: [ 4052.963553] [<ffffffff8194f44d>] x86_64_start_kernel+0x102/0x111
... repeated for other cpus ...
It also happens with nouveau loaded.
But for kernel 3.x the things are different.
1. With wireless network, I am *always* getting stalls, and I don't think the backtraces that appear after the stall message are useful. After a stall, the system does not hang immediately, but I can no longer start new programs. Sometimes the applications I already have open begin to behave strangely. Switching to a console (ctrl+alt+Fx) or running the shutdown command will hang the system immediately.
2. I never get stalls if I am not using wireless network.
3. I have ipv6 enabled on both wired and wireless network, and on the wired network I never get stalls. But I still cannot exclude ipv6 as the cause, because I also get an ipv6 address when using wireless.
Here's a package of Linux 3.2.0rc5 to try (named linux-zen).
For me, it completely breaks WiFi (refuses to associate with my AP), but maybe other people have more luck.
PS:
What a bad time for our build server to vanish. :(
Here's a mirror of the package: http://paste.xinu.at/pur/
So maybe this could prove that this problem was introduced in 3.1.x? Or at least on the Dell 1390 wireless card? Hope this could help~
00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:01.0 PCI bridge: Intel Corporation Mobile 4 Series Chipset PCI Express Graphics Port (rev 07)
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1c.4 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 03)
00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 03)
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
01:00.0 VGA compatible controller: ATI Technologies Inc M92 [Mobility Radeon HD 4500/5100 Series]
01:00.1 Audio device: ATI Technologies Inc RV710/730 HDMI Audio [Radeon HD 4000 series]
02:00.0 System peripheral: JMicron Technology Corp. SD/MMC Host Controller
02:00.2 SD Host controller: JMicron Technology Corp. Standard SD Host Controller
02:00.3 System peripheral: JMicron Technology Corp. MS Host Controller
02:00.4 System peripheral: JMicron Technology Corp. xD Host Controller
05:00.0 Network controller: Intel Corporation Centrino Wireless-N 1000
08:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
After a while my WLAN connection gets really slow or halts completely. If I then try to change the network or use ifconfig wlan0 down, the system freezes and I have to switch the laptop off with the power button.
Finally someone who understands my pain... I've been having this bug for about 2 months now (since I installed Arch) and had no idea what is happening.
As mentioned before, the lockups happen usually when connecting to my university Wireless infrastructure: eduroam and jacobs wireless networks (attached config from Network-Manager).
I am running this on my laptop, a Dell N5010 with a Broadcom BCM4313 wireless card, an i7 - 740qm (not Sandy Bridge), 8GB RAM.
I am running the following kernel:
Linux archidea 3.1.5-1-pae #1 SMP PREEMPT Tue Dec 13 11:15:08 EST 2011 i686 Intel(R) Core(TM) i7 CPU Q 740 @ 1.73GHz GenuineIntel GNU/Linux
but the freeze also happens in the non-PAE kernel.
If I can help any more, please let me know. I can trigger the bug whenever I am at my university.
Just a random thought, can this be a bug in NetworkManager? I haven't tried Wicd yet, but I have a friend who claims to have had the same bug and installed Wicd.
Catalin
kernel.log (32.2 KiB)
lsmod (4 KiB)
lspci (42.7 KiB)
eduroam (0.4 KiB)
jacobs (0.2 KiB)
The configuration is identical to the one from NetworkManager, connected to the same AP.
So, switching from NetworkManager to Wicd seems to be a workaround.
Catalin
Attached the patch I used to fix the association. It's a combination of two patches from the linux-wireless list.
* no stalls
* no kernel taints
* Internet connection works
I tried with NetworkManager and without it (using ifup).
Either variant works fine if IPv6 is turned off.
http://pkgbuild.com/~heftig/linux-zen/linux-zen-3.2.0rc5-2-x86_64.pkg.tar.xz
http://pkgbuild.com/~heftig/linux-zen/linux-zen-3.2.0rc5-2-i686.pkg.tar.xz
Attached dmesg.
1. Newest Arch 3.1.5 kernel with ipv6 disabled-------> no trouble.
2. Newest Arch 3.1.5 kernel with ipv6 enabled--------> stall as always.
3. Newest Fedora 16 kernel (in update repo, version 3.1.5-6.fc16.x86_64) with ipv6 enabled----->no trouble....
I remember that the initial fedora 16 release(livecd) brought stalls when wireless is enabled. So perhaps Fedora has fixed the bug during these days, but the patches are not merged to mainline kernel.
Disabling IPv6 fixes this issue, on Fedora IPv6 doesn't cause problems.
@ruijiangli: do you happen to know what version of the fedora kernel was the broken one? The only commit that I found which might be related seems to be to make ipv6 built-in rather than a module: http://lists.fedoraproject.org/pipermail/kernel/2011-June/003105.html, which was done in 3.0.
However, I am not sure the bug is directly associated with ipv6. Perhaps the bug still exists in the upstream kernel; it may simply be masked by Fedora's specific kernel configuration, whereas with Arch's kernel configuration it is not.
It seems to be aggravated by bulk downloads from IPv6 addresses, causing task stalls followed by complete kernel freeze requiring hard reboot, within 20 minutes of power-on.
File: relevant dmesg output just before complete freeze, boot messages have been taken off.
No, with IPv6 set to Ignore it was stable for an hour (then I had to go), with no relevant kernel messages. So the problem is only when IPv6 is set to 'Automatic' and the wireless network gives it a v4 and a v6 address.
The funny thing is - even if I set IPv6 to Ignore in NM, the interface will still receive an IPv6 address (via radvd) and the system works without problems. Lockups occur only when the IPv6 setting in NM is set to IPv6/Automatic.
> wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 metric 1
> inet 152.78.163.209 netmask 255.255.254.0 broadcast 152.78.163.255
> inet6 fec0::b:762f:68ff:fe2c:5d3 prefixlen 64 scopeid 0x40<site>
> inet6 2002:984e:cc34:b:762f:68ff:fe2c:5d3 prefixlen 64 scopeid 0x0<global>
However, there is no IPv6 route. Accessing IPv6 sites does not work for me like this.
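The two inet6 addresses in that output can be classified with Python's standard `ipaddress` module, which shows what kind of IPv6 connectivity the network is handing out. A minimal sketch; the `describe` helper is made up for illustration:

```python
import ipaddress


def describe(addr):
    """Return a short label for an IPv6 address string."""
    ip = ipaddress.IPv6Address(addr)
    if ip.sixtofour is not None:
        # 2002::/16 addresses embed the IPv4 address of the 6to4 router.
        return f"6to4 via {ip.sixtofour}"
    if ip.is_site_local:
        return "site-local (deprecated fec0::/10)"
    if ip.is_link_local:
        return "link-local"
    return "other"


for addr in ("fec0::b:762f:68ff:fe2c:5d3",
             "2002:984e:cc34:b:762f:68ff:fe2c:5d3"):
    print(addr, "->", describe(addr))
```

The first address is site-local and the second is a 6to4 address whose prefix embeds 152.78.204.52, in the same 152.78.x.x range as the IPv4 address on the interface.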
When NM is set to "Automatic", IPv6 will be handled by NM instead. Having NM handle IPv6 is probably the proper way. Together with dhclient, it also understands DHCPv6 and takes other networks (e.g. VPNs) into account when setting up the route table.
NM talks to the kernel via netlink. Maybe the issue is here? Also check syslog/journal for messages from NM.
TODO: Check if NetworkManager with libnl3 is any better.
Versions of the packages (64 bit distribution):
linux-3.2.5-1
networkmanager 0.9.2.0-1
kdeplasma-applets-networkmanagement 1:0.9.0rc4-1
Maybe this problem has been 'adopted' by the changes between 3.0.20 & 3.0.21, this should be a clue for those who are working on this problem.
https://bugzilla.kernel.org/show_bug.cgi?id=42780
Packages at http://pkgbuild.com/~heftig/linux-zen/
Eric Dumazet (3):
net: bpf_jit: fix BPF_S_LDX_B_MSH compilation
net: fix napi_reuse_skb() skb reserve
net: fix a potential rcu_read_lock() imbalance in rt6_fill_node()
The last commit listed for Eric Dumazet is the fix for this bug. Can't wait to test linux 3.3.1