FS#10497 : Kernel bug in 2.6.25 with XFS and device-mapper

Attached to Project: Arch Linux
Opened by Jeroen Maris (jealma) - Sunday, 25 May 2008, 21:59 GMT
Last edited by Aaron Griffin (phrakture) - Sunday, 09 November 2008, 05:28 GMT

Task Type	Bug Report
Category	Packages: Core
Status	Closed
Assigned To	Tobias Powalowski (tpowa) Thomas Bächler (brain0)
Architecture	x86_64
Severity	Critical
Priority	Normal
Reported Version	2007.08-2
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	1 john mcullan (mullman) (2008-07-15)
Private	No

Details

Description:
Just installed Arch64 with kernel 2.6.25 on my server (moved from Debian Etch) and got the errors below:
This is with copying lots and lots of data from an internal RAID5-array (Areca ARC1220) with XFS to another external SATA-hdd connected with Infiniband. The external SATA-disk connected with Infiniband is encrypted with LUKS (AES) and also contains an XFS-filesystem, approximately 1TB. I was using rsync to copy stuff from the raid-array to the external disk. The XFS-filesystem on the external disk is corrupt by now.
-----------------------------
00000000: 34 49 4e 97 b1 26 0f 24 fc 3c 45 37 79 77 02 ec 4IN..&.$.<E7yw..
Filesystem "dm-0": XFS internal error xfs_ialloc_read_agi at line 1384 of file fs/xfs/xfs_ialloc.c. Caller 0xffffffff881b4c2a
Pid: 17060, comm: rsync Tainted: G D 2.6.25-ARCH #1

Call Trace:
[<ffffffff881b4c2a>] :xfs:xfs_ialloc_ag_select+0x22a/0x330
[<ffffffff881b3f81>] :xfs:xfs_ialloc_read_agi+0xe1/0x140
[<ffffffff881b4c2a>] :xfs:xfs_ialloc_ag_select+0x22a/0x330
[<ffffffff881b4c2a>] :xfs:xfs_ialloc_ag_select+0x22a/0x330
[<ffffffff881b56dc>] :xfs:xfs_dialloc+0x30c/0xa30
[<ffffffff881a89b0>] :xfs:xfs_dir_lookup+0x160/0x1b0
[<ffffffff881c4f80>] :xfs:xfs_log_release_iclog+0x10/0x40
[<ffffffff80333501>] __up_read+0x21/0xb0
[<ffffffff881bde0d>] :xfs:xfs_ialloc+0x6d/0x6a0
[<ffffffff881d4ea8>] :xfs:xfs_dir_ialloc+0xa8/0x370
[<ffffffff804605d2>] __down_write_nested+0xb2/0xc0
[<ffffffff881d280b>] :xfs:xfs_trans_reserve+0xab/0x240
[<ffffffff881d9937>] :xfs:xfs_mkdir+0x437/0x560
[<ffffffff881e534b>] :xfs:xfs_vn_mknod+0x20b/0x310
[<ffffffff802ae399>] vfs_mkdir+0xe9/0x130
[<ffffffff802b0b9b>] sys_mkdirat+0xeb/0x140
[<ffffffff802bf1cf>] mntput_no_expire+0x1f/0x90
[<ffffffff802a1c7b>] filp_close+0x5b/0x90
[<ffffffff802a1d61>] sys_close+0xb1/0x120
[<ffffffff8020c59a>] system_call_after_swapgs+0x8a/0x8f

00000000: 34 49 4e 97 b1 26 0f 24 fc 3c 45 37 79 77 02 ec 4IN..&.$.<E7yw..
Filesystem "dm-0": XFS internal error xfs_iunlink at line 1949 of file fs/xfs/xfs_inode.c. Caller 0xffffffff881da156
Pid: 17060, comm: rsync Tainted: G D 2.6.25-ARCH #1

Call Trace:
[<ffffffff881da156>] :xfs:xfs_remove+0x366/0x3f0
[<ffffffff881bd0a3>] :xfs:xfs_iunlink+0x113/0x220
[<ffffffff881da156>] :xfs:xfs_remove+0x366/0x3f0
[<ffffffff881e5cd2>] :xfs:xfs_ichgtime+0x22/0xe0
[<ffffffff881da156>] :xfs:xfs_remove+0x366/0x3f0
[<ffffffff881e5075>] :xfs:xfs_vn_unlink+0x25/0x60
[<ffffffff802adc00>] vfs_unlink+0x100/0x150
[<ffffffff802b0881>] do_unlinkat+0x121/0x1d0
[<ffffffff802bf1cf>] mntput_no_expire+0x1f/0x90
[<ffffffff802a1c7b>] filp_close+0x5b/0x90
[<ffffffff802a1d61>] sys_close+0xb1/0x120
[<ffffffff8020c59a>] system_call_after_swapgs+0x8a/0x8f

Filesystem "dm-0": XFS internal error xfs_trans_cancel at line 1163 of file fs/xfs/xfs_trans.c. Caller 0xffffffff881da177
Pid: 17060, comm: rsync Tainted: G D 2.6.25-ARCH #1

Call Trace:
[<ffffffff881da177>] :xfs:xfs_remove+0x387/0x3f0
[<ffffffff881d1c40>] :xfs:xfs_trans_cancel+0xf0/0x110
[<ffffffff881da177>] :xfs:xfs_remove+0x387/0x3f0
[<ffffffff881e5075>] :xfs:xfs_vn_unlink+0x25/0x60
[<ffffffff802adc00>] vfs_unlink+0x100/0x150
[<ffffffff802b0881>] do_unlinkat+0x121/0x1d0
[<ffffffff802bf1cf>] mntput_no_expire+0x1f/0x90
[<ffffffff802a1c7b>] filp_close+0x5b/0x90
[<ffffffff802a1d61>] sys_close+0xb1/0x120
[<ffffffff8020c59a>] system_call_after_swapgs+0x8a/0x8f

xfs_force_shutdown(dm-0,0x8) called from line 1164 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff881d1c59
Filesystem "dm-0": Corruption of in-memory data detected. Shutting down filesystem: dm-0
Please umount the filesystem, and rectify the problem(s)
-----------------------------
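
A note on the hex dump lines in the log above: xfs_ialloc_read_agi raises this internal error when the AGI header it has just read fails validation. As far as I know, a valid AGI block begins with the magic bytes "XAGI" (0x58414749). That value comes from the XFS on-disk format as I remember it, not from this report, so treat it as an assumption. A minimal Python sketch that compares the dumped bytes against it (the function name is made up for illustration):

```python
# Assumption: the XFS AGI header magic is the ASCII string "XAGI".
EXPECTED_AGI_MAGIC = b"XAGI"

def leading_bytes(dump_line, count=4):
    """Return the first `count` bytes from a kernel hex dump line.

    Expects the form "OFFSET: hh hh hh ... ascii-rendering".
    """
    _, _, rest = dump_line.partition(":")
    return bytes(int(token, 16) for token in rest.split()[:count])

line = "00000000: 34 49 4e 97 b1 26 0f 24 fc 3c 45 37 79 77 02 ec 4IN..&.$.<E7yw.."
magic = leading_bytes(line)
print(magic.hex(), magic == EXPECTED_AGI_MAGIC)
```

The dumped block does not start with that magic, which is consistent with the kernel's complaint. It says nothing about where the bad bytes came from (disk, dm-crypt, or memory).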

Additional info:
* package version(s)
cryptsetup 1.0.6-1
kernel26 2.6.25.4-1
xfsprogs 2.9.7-1
xfsdump 2.2.46-3

* config and/or log files etc.

Steps to reproduce:
- That can be a problem...

Closed by Aaron Griffin (phrakture)
Sunday, 09 November 2008, 05:28 GMT
Reason for closing: None

Comment by Jan de Groot (JGC) - Monday, 26 May 2008, 07:01 GMT

Are you sure this is not caused by a corrupted XFS filesystem? XFS is known to generate lots of bad backtraces and error messages to dmesg when there's something wrong with the filesystem.

Comment by Jeroen Maris (jealma) - Monday, 26 May 2008, 07:53 GMT

Yes. The filesystems on both the RAID5-array and the external SATA-hdd were cleanly mounted and unmounted and worked fine without any problems and notifications of errors on the Debian installation (with kernel 2.6.18) some hours before. As soon as I started some intensive operations on the external SATA-hdd (the one encrypted with LUKS, via device-mapper), I got these errors.

Comment by Jan de Groot (JGC) - Monday, 26 May 2008, 09:00 GMT

"Cleanly mounted and unmounted" with XFS doesn't say anything about the internal state of the filesystem. I had servers crashing when upgrading from 2.6.18 to 2.6.22 because of filesystem corruption: 2.6.18 would just skip the I/O operation without any error or warning, while 2.6.22 detected the inconsistency and called xfs_shutdown() to prevent further damage to the filesystem.

Comment by Jeroen Maris (jealma) - Thursday, 29 May 2008, 09:55 GMT

An update on my findings: since encountering this bug, I temporarily switched back to 32-bit Debian (I had an image as backup), and yesterday I installed the 64-bit version of Debian Etch. After mounting the external drives, encrypted with LUKS, I started rsync to back up some stuff and again got the same problem as mentioned above. Isn't it very coincidental that this problem happens on two very different distros with very different kernels (2.6.18 and 2.6.25), but both 64-bit? Anyone got a clue as to what's going on here?

Comment by Hannes Rist (hrist) - Tuesday, 03 June 2008, 16:05 GMT

Just a guess, which only applies if you have more than 4GB of RAM in the faulty machine: could it be that one of the RAM modules is broken, and that the broken part (region, whatever you call it) is only addressed by the 64-bit kernel?

I came to that idea because your trace mentions "Corruption of in-memory data detected."

I recommend running memtest to check if it's the RAM.

Comment by Jeroen Maris (jealma) - Tuesday, 03 June 2008, 16:47 GMT

No sorry, it only has 1GB of ram. I'm pretty sure the ram is OK, but I'll run a memtest anyway, just to be sure.

Comment by Glenn Matthys (RedShift) - Tuesday, 17 June 2008, 11:44 GMT

Have you run your memtest and an fsck?

Comment by Jeroen Maris (jealma) - Tuesday, 17 June 2008, 12:57 GMT

No, but I've been able to reproduce the problem on other computers that I have memtested. I've been a little busy lately, but every test I did indicates that the problems I experience are caused by rsync. I've switched to another computer, stopped using device-mapper, and stopped using external storage with Infiniband, and I can still reproduce the problem. The problem arises when I use rsync on a client to fetch data from an rsync daemon running on another PC. My dmesg fills up with all kinds of errors, mostly swapper failures. All these errors mentioned "e1000", the driver for my network card. After using a completely different network card, I still get the same errors, but now with the "r8169"-driver mentioned. When copying data from the same server with the same client, using an NFS share instead of the rsync daemon, the errors do not appear. I've tried recompiling rsync from ABS to make sure the package is not corrupted, but to no avail. Using the same rsync version on other Linux distros (namely Debian and Ubuntu), both with kernel 2.6.24, I cannot reproduce the problem. I definitely think rsync triggers something in Arch Linux's kernel that generates these messages, and these problems probably led to the XFS errors mentioned above. The exact errors that happen now are these:

Call Trace:
<IRQ> [<ffffffff8027d86e>] __alloc_pages+0x2ee/0x3c0
[<ffffffff8020cb4d>] ret_from_intr+0x0/0x19
[<ffffffff8029fabc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff802a10e6>] __kmalloc_track_caller+0xe6/0x140
[<ffffffff803ca932>] __alloc_skb+0x72/0x160
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff8820d4db>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260
[<ffffffff8820d96f>] :e1000e:e1000_clean_rx_irq+0x26f/0x410
[<ffffffff8820b16b>] :e1000e:e1000_clean+0x16b/0x240
[<ffffffff802552cb>] hrtimer_get_next_event+0xdb/0xf0
[<ffffffff803cec61>] net_rx_action+0x131/0x290
[<ffffffff80240d2a>] __do_softirq+0x7a/0xf0
[<ffffffff8020d9dc>] call_softirq+0x1c/0x30
[<ffffffff8020fc1d>] do_softirq+0x4d/0x90
[<ffffffff80240ab5>] irq_exit+0xa5/0xb0
[<ffffffff8020feb1>] do_IRQ+0x81/0x100
[<ffffffff8020a060>] mwait_idle+0x0/0x50
[<ffffffff8020b570>] default_idle+0x0/0x70
[<ffffffff8020cb4d>] ret_from_intr+0x0/0x19
<EOI> [<ffffffff8020a09c>] mwait_idle+0x3c/0x50
[<ffffffff8020b4e0>] cpu_idle+0x90/0x120

Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 31
CPU 1: hi: 186, btch: 31 usd: 160
Active:231493 inactive:242557 dirty:9526 writeback:2714 unstable:0
free:10637 slab:16691 mapped:18348 pagetables:3198 bounce:0
DMA free:8044kB min:28kB low:32kB high:40kB active:704kB inactive:1224kB present:11092kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:34504kB min:5712kB low:7140kB high:8568kB active:925268kB inactive:969004kB present:2051184kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 51*4kB 51*8kB 117*16kB 48*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 8052kB
DMA32: 3240*4kB 1702*8kB 480*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 34480kB
354912 total pagecache pages
Swap cache: add 1391017, delete 1348713, find 517305/600384
Free swap = 2686640kB
Total swap = 2931852kB
Free swap: 2686640kB
524000 pages of RAM
9353 reserved pages
94037 pages shared
42304 pages swap cached
swapper: page allocation failure. order:3, mode:0x4020
Pid: 0, comm: swapper Tainted: P 2.6.25-ARCH #1

Call Trace:
<IRQ> [<ffffffff8027d86e>] __alloc_pages+0x2ee/0x3c0
[<ffffffff8020d402>] apic_timer_interrupt+0x72/0x80
[<ffffffff8029fabc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff802a10e6>] __kmalloc_track_caller+0xe6/0x140
[<ffffffff803ca932>] __alloc_skb+0x72/0x160
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff8820d4db>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260
[<ffffffff8820d96f>] :e1000e:e1000_clean_rx_irq+0x26f/0x410
[<ffffffff8820b16b>] :e1000e:e1000_clean+0x16b/0x240
[<ffffffff802552cb>] hrtimer_get_next_event+0xdb/0xf0
[<ffffffff803cec61>] net_rx_action+0x131/0x290
[<ffffffff80240d2a>] __do_softirq+0x7a/0xf0
[<ffffffff8020d9dc>] call_softirq+0x1c/0x30
[<ffffffff8020fc1d>] do_softirq+0x4d/0x90
[<ffffffff80240ab5>] irq_exit+0xa5/0xb0
[<ffffffff8020feb1>] do_IRQ+0x81/0x100
[<ffffffff8020a060>] mwait_idle+0x0/0x50
[<ffffffff8020b570>] default_idle+0x0/0x70
[<ffffffff8020cb4d>] ret_from_intr+0x0/0x19
<EOI> [<ffffffff8020a09c>] mwait_idle+0x3c/0x50
[<ffffffff8020b4e0>] cpu_idle+0x90/0x120

Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 31
CPU 1: hi: 186, btch: 31 usd: 159
Active:231493 inactive:242557 dirty:9526 writeback:2714 unstable:0
free:10637 slab:16691 mapped:18348 pagetables:3198 bounce:0
DMA free:8044kB min:28kB low:32kB high:40kB active:704kB inactive:1224kB present:11092kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:34504kB min:5712kB low:7140kB high:8568kB active:925268kB inactive:969004kB present:2051184kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 51*4kB 51*8kB 117*16kB 48*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 8052kB
DMA32: 3240*4kB 1702*8kB 480*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 34480kB
354912 total pagecache pages
Swap cache: add 1391017, delete 1348713, find 517305/600384
Free swap = 2686640kB
Total swap = 2931852kB
Free swap: 2686640kB
524000 pages of RAM
9353 reserved pages
94037 pages shared
42304 pages swap cached
printk: 1161 messages suppressed.
swapper: page allocation failure. order:3, mode:0x4020
Pid: 0, comm: swapper Tainted: P 2.6.25-ARCH #1
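
For anyone reading the dump above: "order:3" means the kernel wanted 2^3 = 8 physically contiguous pages, which is 32kB with 4kB pages. The per-zone lines such as "DMA32: 3240*4kB 1702*8kB ..." list how many free blocks of each size exist. A rough Python sketch (helper names are mine, not from any tool) that sums the free memory sitting in blocks large enough for such a request:

```python
import re

PAGE_KB = 4  # page size implied by the smallest "*4kB" entry in the dump

def parse_buddy_line(line):
    """Map block size in kB to number of free blocks, from a 'DMA32: 3240*4kB ...' line."""
    return {int(size): int(count) for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}

def free_kb_at_or_above(blocks, order):
    """Free memory (kB) held in blocks of at least 2**order contiguous pages."""
    min_kb = PAGE_KB << order
    return sum(size * count for size, count in blocks.items() if size >= min_kb)

dma32 = parse_buddy_line(
    "DMA32: 3240*4kB 1702*8kB 480*16kB 1*32kB 1*64kB 1*128kB "
    "0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 34480kB"
)
total_kb = free_kb_at_or_above(dma32, 0)   # everything that is free
order3_kb = free_kb_at_or_above(dma32, 3)  # only blocks of 32kB or larger
print(total_kb, order3_kb)
```

Of the 34480kB free in DMA32, only 224kB sits in blocks of 32kB or more, which points at fragmentation rather than a plain shortage of memory. The kernel's real decision also involves per-zone watermarks, which this sketch ignores.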

Comment by Glenn Matthys (RedShift) - Tuesday, 17 June 2008, 13:26 GMT

Have you tried with a new XFS filesystem on both the client and the server?

Comment by Jeroen Maris (jealma) - Tuesday, 17 June 2008, 14:31 GMT

I've reinstalled the client recently and it has a new xfs filesystem. The problem also occurs when copying the data to a non-xfs filesystem (my client is multi-boot linux with both jfs and xfs). I am unable to put a new filesystem on the server's disks, as it is continuously in use. This is also the reason that I switched to another pc to test and track down this bug, as I can't afford to have the server unavailable for more than a few hours. Besides, there is far too much data to even be able to back it all up.

Comment by Glenn Matthys (RedShift) - Tuesday, 17 June 2008, 14:38 GMT

Can you try and disable APIC on the client computer? (The one that produces the kernel messages in this http://bugs.archlinux.org/task/10497#comment29411 comment). Can you also try putting the NIC in another PCI slot?

Comment by Jeroen Maris (jealma) - Monday, 07 July 2008, 19:45 GMT

The NIC is not the problem, as using another NIC or an onboard NIC gives exactly the same problem. Since there's a new 2.6.25.10 and a new rsync 3.0.3 out, I'll check to see if the problem is still there.

Comment by Jeroen Maris (jealma) - Sunday, 03 August 2008, 19:17 GMT

Today I tested again, now with kernel 2.6.25.11 and rsync 3.0.3, but the problem's still there. I first tested with a partition formatted with xfs, then reformatted it with ext3 (default settings) and both times, the errors appear. I think the bug-report title should be changed to reflect the issue, as both XFS and device-mapper have been eliminated as a possible problem in this situation. This is clearly a bug in the kernel or in rsync.

These are the errors I keep getting:
Call Trace:
<IRQ> [<ffffffff8027d996>] __alloc_pages+0x2e6/0x3c0
[<ffffffff8041992d>] tcp_v4_do_rcv+0xdd/0x250
[<ffffffff8029fcfc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff802a1326>] __kmalloc_track_caller+0xe6/0x140
[<ffffffff803ce9f2>] __alloc_skb+0x72/0x160
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff881044bb>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260
[<ffffffff8810494f>] :e1000e:e1000_clean_rx_irq+0x26f/0x410
[<ffffffff8810213b>] :e1000e:e1000_clean+0x16b/0x240
[<ffffffff803d2d31>] net_rx_action+0x131/0x290
[<ffffffff80240d9a>] __do_softirq+0x7a/0xf0
[<ffffffff8020d9ec>] call_softirq+0x1c/0x30
[<ffffffff8020fc2d>] do_softirq+0x4d/0x90
[<ffffffff80240b25>] irq_exit+0xa5/0xb0
[<ffffffff8020fec1>] do_IRQ+0x81/0x100
[<ffffffff8020b580>] default_idle+0x0/0x70
[<ffffffff8020cb5d>] ret_from_intr+0x0/0x19
<EOI> [<ffffffff802206c0>] lapic_next_event+0x0/0x10
[<ffffffff80225cc2>] native_safe_halt+0x2/0x10
[<ffffffff8020b5bb>] default_idle+0x3b/0x70
[<ffffffff8020b580>] default_idle+0x0/0x70
[<ffffffff8020b4f0>] cpu_idle+0x90/0x120

Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 152
CPU 1: hi: 186, btch: 31 usd: 93
Active:60734 inactive:421961 dirty:47123 writeback:1758 unstable:0
free:2528 slab:19953 mapped:12653 pagetables:2331 bounce:0
DMA free:8020kB min:28kB low:32kB high:40kB active:0kB inactive:784kB present:11072kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:2092kB min:5712kB low:7140kB high:8568kB active:242936kB inactive:1687060kB present:2051184kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 5*4kB 2*8kB 97*16kB 11*32kB 13*64kB 5*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8020kB
DMA32: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 0*128kB 2*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2072kB
435419 total pagecache pages
Swap cache: add 0, delete 0, find 0/0
Free swap = 3903752kB
Total swap = 3903752kB
Free swap: 3903752kB
524000 pages of RAM
9359 reserved pages
487348 pages shared
0 pages swap cached
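
A side note on this dump: the DMA32 zone line reports free memory below the zone's "min" watermark. A small Python sketch (the helper name is illustrative) that pulls the kB fields out of such a line and compares them:

```python
import re

def zone_stats(line):
    """Extract the 'name:<number>kB' fields from a zone summary line as integers (kB)."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)kB", line)}

dma32 = zone_stats(
    "DMA32 free:2092kB min:5712kB low:7140kB high:8568kB active:242936kB "
    "inactive:1687060kB present:2051184kB pages_scanned:0 all_unreclaimable? no"
)
below_min = dma32["free"] < dma32["min"]
# Share of the zone sitting on the active and inactive lists.
lru_share = (dma32["active"] + dma32["inactive"]) / dma32["present"]
print(below_min, round(lru_share, 2))
```

Free memory in DMA32 is under the min watermark while roughly 94% of the zone is on the active and inactive lists. An allocation made from interrupt context, as in the trace above, cannot wait for that memory to be reclaimed, so a failure message under heavy copying is at least plausible. This is only a reading of the numbers, not a diagnosis.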

Comment by Glenn Matthys (RedShift) - Sunday, 03 August 2008, 19:23 GMT

It's still crashing on your ethernet card. Can you try another ethernet card that is not intel?

Comment by Jan de Groot (JGC) - Sunday, 03 August 2008, 19:42 GMT

New NIC didn't help as posted above. It looks like there's something wrong with IRQ processing. Could you try the 2.6.26 kernel from testing? Other option is to turn off APIC.

Comment by Glenn Matthys (RedShift) - Sunday, 03 August 2008, 19:45 GMT

JGC: if you read the kernel crash output you'll notice e1000 always pops up. If it really is not the ethernet card, I wanna see crash output where the e1000 driver is not present.
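
One way to check which drivers actually show up in a pasted trace: in the traces in this thread, frames that belong to a loadable module are printed as ":module:symbol+0x...", while built-in kernel symbols have no such prefix. A short Python sketch under that assumption (the pattern and function name are mine):

```python
import re

# Matches frames such as "[<ffffffff881044bb>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260"
MODULE_FRAME = re.compile(r"\]\s+:(\w+):\w+\+0x")

def modules_in_trace(trace_text):
    """Return the set of module names that appear in ':module:symbol+0x..' frames."""
    return {match.group(1) for match in MODULE_FRAME.finditer(trace_text)}

sample = """\
<IRQ> [<ffffffff8027d996>] __alloc_pages+0x2e6/0x3c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff881044bb>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260
[<ffffffff8810494f>] :e1000e:e1000_clean_rx_irq+0x26f/0x410
[<ffffffff803d2d31>] net_rx_action+0x131/0x290
"""
print(sorted(modules_in_trace(sample)))
```

Running this over a full dmesg capture would list every module that appears in any trace, which makes it easy to see whether a NIC driver other than e1000e is ever involved.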

Comment by Jan de Groot (JGC) - Sunday, 03 August 2008, 19:49 GMT

From one of the comments:
All these errors mentioned "e1000", the driver for my network card. After using a completely different network card, I still get the same errors, but now with the "r8169"-driver mentioned.

A different PCI(-e) slot could make a difference still though.

Comment by Glenn Matthys (RedShift) - Sunday, 03 August 2008, 20:00 GMT

I still want to see the output of the kernel with the r8169 crash.

Comment by Jeroen Maris (jealma) - Sunday, 03 August 2008, 20:20 GMT

All right, I've been a little slow the last few weeks, but I'll get back within hours with test results from r8169, r8169 + noapic, r8169 + kernel26 2.6.26 from testing.

Comment by Jeroen Maris (jealma) - Sunday, 03 August 2008, 23:55 GMT

Well, I've been testing some hours now, and I can't seem to reproduce the problem with the r8169-driver the way I did before. With the e1000e-driver, I can reproduce it, but only after maybe 100GB of syncing and that takes a while. I'll be testing some more and post it here soon.

Comment by Jeroen Maris (jealma) - Monday, 04 August 2008, 00:21 GMT

So, here is the same problem as mentioned before, but now with the r8169-driver. I will test with noapic and with the new 2.6.26-kernel from testing as well.

Call Trace:
<IRQ> [<ffffffff8027d996>] __alloc_pages+0x2e6/0x3c0
[<ffffffff8029fcfc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff8029f9d5>] __slab_alloc+0x165/0x7c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff802a1326>] __kmalloc_track_caller+0xe6/0x140
[<ffffffff803ce9f2>] __alloc_skb+0x72/0x160
[<ffffffff803ff8da>] ip_queue_xmit+0x24a/0x460
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff88b713fe>] :r8169:rtl8169_rx_fill+0xbe/0x1f0
[<ffffffff88b71867>] :r8169:rtl8169_rx_interrupt+0x337/0x490
[<ffffffff88b72cc7>] :r8169:rtl8169_interrupt+0x297/0x4e0
[<ffffffff802701ec>] handle_IRQ_event+0x3c/0x80
[<ffffffff8027188a>] handle_fasteoi_irq+0x8a/0x100
[<ffffffff8020febc>] do_IRQ+0x7c/0x100
[<ffffffff8020b580>] default_idle+0x0/0x70
[<ffffffff8020cb5d>] ret_from_intr+0x0/0x19
<EOI> [<ffffffff802206c0>] lapic_next_event+0x0/0x10
[<ffffffff80225cc2>] native_safe_halt+0x2/0x10
[<ffffffff8020b5bb>] default_idle+0x3b/0x70
[<ffffffff8020b580>] default_idle+0x0/0x70
[<ffffffff8020b4f0>] cpu_idle+0x90/0x120

Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 153
CPU 1: hi: 186, btch: 31 usd: 160
Active:59844 inactive:422839 dirty:48724 writeback:68 unstable:0
free:2553 slab:19899 mapped:12657 pagetables:2336 bounce:0
DMA free:8040kB min:28kB low:32kB high:40kB active:4kB inactive:1392kB present:11072kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:2172kB min:5712kB low:7140kB high:8568kB active:239372kB inactive:1689964kB present:2051184kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB 10*8kB 91*16kB 51*32kB 8*64kB 2*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8040kB
DMA32: 1*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2164kB
436207 total pagecache pages
Swap cache: add 0, delete 0, find 0/0
Free swap = 3903752kB
Total swap = 3903752kB
Free swap: 3903752kB
524000 pages of RAM
9359 reserved pages
488168 pages shared
0 pages swap cached

Comment by Jeroen Maris (jealma) - Monday, 04 August 2008, 11:05 GMT

I've just tested again with R8169 driver and again the same errors. Difference with previous test with r8169 is that this time, I removed the e1000e-driver by blacklisting it and I removed the nvidia binary driver and resorted to using only the vc's. Still the same errors. I've also tried booting with the `noapic`-parameter to the kernel, but then I couldn't get my interface up. When issuing `ifconfig eth0 up`, I get the message "SIOCSIFFLAGS: Device or resource busy". Attached is my dmesg with some more error stuff when rsyncing using the r8169-driver and not having e1000e and nvidia loaded.

dmesg-errors-r8169-nonvidia-n... (161.7 KiB)

Comment by Glenn Matthys (RedShift) - Monday, 04 August 2008, 12:53 GMT

Have you tried another PCI slot?

Comment by Jeroen Maris (jealma) - Monday, 04 August 2008, 13:16 GMT

Well, the e1000e is an Intel Pro1000PT PCI-e x1, and I could move it to another slot, although it won't matter, because I can reproduce the same problem on three completely different computers. The r8169 is an onboard card, so that rules out the slot issue.

I've done some testing with kernel 2.6.26-2 from [testing] and the problems are still present. Here are dmesg's from testing with 2.6.26-2, with both e1000e and r8169, using ext3 filesystem and not using nvidia driver (console only).
If you need more information or test results, please let me know.

dmesg-errors-e1000e-nonvidia-... (50.5 KiB)

dmesg-errors-r8169-nonvidia-n... (27.7 KiB)

lspci-vv.log (22.5 KiB)

Comment by Jeroen Maris (jealma) - Wednesday, 27 August 2008, 18:24 GMT

Just needed to sync my backup again. I used an up-to-date Debian Lenny installation with kernel 2.6.25 and rsync 3.0.3, XFS filesystems and the onboard r8169 network chip, and transferred 270+GB of data, and there was not a single error in dmesg on the client. The server I leeched from runs an up-to-date Arch Linux installation with kernel26-2.6.26.2, NFS and an Intel Pro1000 with the e1000e driver. The server has not given any errors either. Dare I say that the problem is with Arch Linux and not necessarily the kernel or rsync (because the same versions of kernel and rsync are also used in Debian)? Does anyone have any clue as to what the problem might be? So far we've ruled out LUKS, XFS, the e1000e and r8169 drivers, the network cards, rsync 3.0.2 and rsync 3.0.3, kernel 2.6.25 and kernel 2.6.26 (unless the problem is caused by a patch that Arch applied but Debian did not, or vice versa). Any help is appreciated.

Comment by Jonathan Ross (jonathanross) - Tuesday, 23 September 2008, 12:07 GMT

Hi,

You're not alone ! I have a similar problem on a SPARC box running Gentoo:

Linux Loopy 2.6.25-gentoo-r7 #4 SMP Sun Sep 7 19:22:55 BST 2008 sparc64 sun4u TI UltraSparc IIe (Hummingbird) GNU/Linux

It's a new build as you can see and I've had multiple page allocation errors on three different occasions. The last two occasions were firstly during an "emerge --sync" and secondly today when doing an rsync of the differences of a disk partition.

I suspect rsync is at least a little to blame so I've just downgraded to net-misc/rsync-3.0.2 from 3.0.3 (SPARC is a little behind with its packages and kernel numbers for obvious reasons).

I admin two other identical Servers running different kernels but the new version of rsync without any errors so I suspect it's a combination of rsync and the newer kernel logging more than older kernels.

Thankfully it doesn't make the box unstable but just fills logs up a bit.

Swapper, kswapd0 and qmail-smtpd as well as rsync have reported page allocation errors. I'm convinced the hardware is working fine as it's been in production running older Gentoo versions for a couple of years without errors.

[99642.452773] Call Trace:
[99642.452786] [00000000004b7574] __slab_alloc+0x1b4/0x5f4
[99642.452829] [00000000004b95d0] __kmalloc_track_caller+0x98/0xf0
[99642.452853] [0000000000625a38] __alloc_skb+0x5c/0x108
[99642.452884] [00000000100003f0] tulip_interrupt+0x2a8/0xd94 [tulip]
[99642.452945] [000000000048fd58] handle_IRQ_event+0x34/0x74
[99642.452972] [000000000049145c] handle_fasteoi_irq+0xe0/0x13c
[99642.452997] [000000000042db94] handler_irq+0x8c/0xb4
[99642.453033] [00000000004208b4] tl0_irq5+0x1c/0x20
[99642.453058] [00000000004cbc44] prune_dcache+0xc0/0x1ec
[99642.453088] [00000000004cbd9c] shrink_dcache_memory+0x2c/0x60
[99642.453113] [000000000049f7cc] shrink_slab+0xcc/0x164
[99642.453139] [000000000049fbd4] kswapd+0x370/0x4e8
[99642.453160] [000000000047cf04] kthread+0x4c/0x78
[99642.453185] [00000000004271f8] kernel_thread+0x38/0x48
[99642.453207] [000000000047cd40] kthreadd+0xb8/0x180

The tulip driver mentioned is a somewhat notorious Gentoo/Sun problem where the NICs use an 'unsupported' driver. Ifconfig shows just 11 packet drops in a GB of traffic though and from other tulip NICs I've seen that means the NIC is doing pretty well !

I can give you eye-strain with more debug info if you need it.

Any help from your side appreciated too :-)

JR

Comment by Jeroen Maris (jealma) - Wednesday, 24 September 2008, 17:56 GMT

The errors don't make my box unstable either, and the files I transferred were transferred correctly (did an sha256sum on all files). It is very irritating though and I think the problem is with some patch from archlinux, as I don't get the error with Debian Lenny, that also uses kernel 2.6.25, 2.6.26 and rsync 3.0.3.
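
For anyone who wants to repeat that kind of check, here is a minimal Python sketch of comparing SHA-256 digests across a source tree and its copy. It is not the command used above, and the function names are made up for illustration:

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def tree_digests(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {str(p.relative_to(root)): sha256_of(p) for p in root.rglob("*") if p.is_file()}

def mismatches(source_root, copy_root):
    """Relative paths whose digests differ, or that exist on only one side."""
    src, dst = tree_digests(source_root), tree_digests(copy_root)
    return sorted(name for name in src.keys() | dst.keys() if src.get(name) != dst.get(name))
```

Calling `mismatches("/mnt/source", "/mnt/backup")` (both paths are placeholders) returns an empty list when every file matches.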

Comment by Jonathan Ross (jonathanross) - Wednesday, 24 September 2008, 18:26 GMT

Hi :)

Well, so far so good ...

I had 2.6.23-gentoo-r9 on that box too and booted into that, leaving rsync-3.0.2 installed.

All seems to be going good in terms of dmesg and syslog errors. That's been up about 36 hours.

Unless someone knows of exploits in 2.6.23-gentoo-r9 I think I'll stick with that kernel for now and hope this is fixed later on in newer kernels !

JR

Comment by Tobias Powalowski (tpowa) - Saturday, 11 October 2008, 21:16 GMT

status on .27 kernel?

Comment by Jonathan Ross (jonathanross) - Sunday, 12 October 2008, 08:54 GMT

FYI I haven't had any errors at all on the 2.6.23-gentoo-r9 kernel since the last posts.

JR

Comment by Jeroen Maris (jealma) - Sunday, 12 October 2008, 09:53 GMT

I don't want to use a 2.6.23.x kernel just because newer kernels have errors when using rsync. That would be ridiculous. Monday or Tuesday I'll have time to test rsync with a 2.6.27 kernel; I very much hope this issue is resolved in 2.6.27.

Comment by Jonathan Ross (jonathanross) - Sunday, 12 October 2008, 13:25 GMT

What might be ridiculous is assuming that rsync is the problem and not the kernel version.

Also what might be considered ridiculous is that without looking them up you can't even tell us what extra features the newer kernel has that you explicitly need.

Comment by Jeroen Maris (jealma) - Sunday, 12 October 2008, 19:20 GMT

Jonathan, I take no offence, but I do want to mention the reasons why I want to use a recent kernel. I want to use a pretty new kernel version, because the onboard Intel gbit NIC on my Intel DG45ID is only supported by the e1000e driver since 2.6.26. Besides that, since 2.6.24 (if I remember correctly), there were some important changes to the iwl driver and the wlan stack in general, also in 2.6.27. These are the most important reasons that I want a newer kernel than 2.6.23.x.

About rsync vs. kernel being the problem, I don't know for sure, but I have a vague feeling that one of Arch's patches causes the problems, as Debian Lenny's kernel is now also 2.6.26 and also has rsync 3.0.3, but doesn't suffer from the problems that my Arch installation does.

Comment by Thomas Bächler (brain0) - Sunday, 12 October 2008, 21:54 GMT

The only thing responsible could be the patches aufs needs, as those are the only filesystem related things. However, they only add small features (that only squashfs and aufs use) and don't really change anything. Apart from that, we add squashfs and change two things in ACPI and that's it. If you find the patch responsible, please share it with me.

It may also be a problem in kernel configuration and not the patches. If you could build a kernel with the -ARCH patch removed (just comment out one line in the PKGBUILD) and try that, that might be helpful.
