Please read this before reporting a bug:
https://wiki.archlinux.org/title/Bug_reporting_guidelines
Do NOT report bugs when a package is just outdated, or it is in the AUR. Use the 'flag out of date' link on the package page, or the Mailing List.
REPEAT: Do NOT report bugs for outdated packages!
FS#10497 - Kernel bug in 2.6.25 with XFS and device-mapper
Attached to Project:
Arch Linux
Opened by Jeroen Maris (jealma) - Sunday, 25 May 2008, 21:59 GMT
Last edited by Aaron Griffin (phrakture) - Sunday, 09 November 2008, 05:28 GMT
Details
Description:
Just installed Arch64 with kernel 2.6.25 on my server (moved from Debian Etch) and got the errors below: This is with copying lots and lots of data from an internal RAID5-array (Areca ARC1220) with XFS to another external SATA-hdd connected with Infiniband. The external SATA-disk connected with Infiniband is encrypted with LUKS (AES) and also contains an XFS-filesystem, approximately 1TB. I was using rsync to copy stuff from the raid-array to the external disk. The XFS-filesystem on the external disk is corrupt by now.
-----------------------------
00000000: 34 49 4e 97 b1 26 0f 24 fc 3c 45 37 79 77 02 ec 4IN..&.$.<E7yw..
Filesystem "dm-0": XFS internal error xfs_ialloc_read_agi at line 1384 of file fs/xfs/xfs_ialloc.c. Caller 0xffffffff881b4c2a
Pid: 17060, comm: rsync Tainted: G D 2.6.25-ARCH #1
Call Trace:
[<ffffffff881b4c2a>] :xfs:xfs_ialloc_ag_select+0x22a/0x330
[<ffffffff881b3f81>] :xfs:xfs_ialloc_read_agi+0xe1/0x140
[<ffffffff881b4c2a>] :xfs:xfs_ialloc_ag_select+0x22a/0x330
[<ffffffff881b4c2a>] :xfs:xfs_ialloc_ag_select+0x22a/0x330
[<ffffffff881b56dc>] :xfs:xfs_dialloc+0x30c/0xa30
[<ffffffff881a89b0>] :xfs:xfs_dir_lookup+0x160/0x1b0
[<ffffffff881c4f80>] :xfs:xfs_log_release_iclog+0x10/0x40
[<ffffffff80333501>] __up_read+0x21/0xb0
[<ffffffff881bde0d>] :xfs:xfs_ialloc+0x6d/0x6a0
[<ffffffff881d4ea8>] :xfs:xfs_dir_ialloc+0xa8/0x370
[<ffffffff804605d2>] __down_write_nested+0xb2/0xc0
[<ffffffff881d280b>] :xfs:xfs_trans_reserve+0xab/0x240
[<ffffffff881d9937>] :xfs:xfs_mkdir+0x437/0x560
[<ffffffff881e534b>] :xfs:xfs_vn_mknod+0x20b/0x310
[<ffffffff802ae399>] vfs_mkdir+0xe9/0x130
[<ffffffff802b0b9b>] sys_mkdirat+0xeb/0x140
[<ffffffff802bf1cf>] mntput_no_expire+0x1f/0x90
[<ffffffff802a1c7b>] filp_close+0x5b/0x90
[<ffffffff802a1d61>] sys_close+0xb1/0x120
[<ffffffff8020c59a>] system_call_after_swapgs+0x8a/0x8f
00000000: 34 49 4e 97 b1 26 0f 24 fc 3c 45 37 79 77 02 ec 4IN..&.$.<E7yw..
Filesystem "dm-0": XFS internal error xfs_iunlink at line 1949 of file fs/xfs/xfs_inode.c. Caller 0xffffffff881da156
Pid: 17060, comm: rsync Tainted: G D 2.6.25-ARCH #1
Call Trace:
[<ffffffff881da156>] :xfs:xfs_remove+0x366/0x3f0
[<ffffffff881bd0a3>] :xfs:xfs_iunlink+0x113/0x220
[<ffffffff881da156>] :xfs:xfs_remove+0x366/0x3f0
[<ffffffff881e5cd2>] :xfs:xfs_ichgtime+0x22/0xe0
[<ffffffff881da156>] :xfs:xfs_remove+0x366/0x3f0
[<ffffffff881e5075>] :xfs:xfs_vn_unlink+0x25/0x60
[<ffffffff802adc00>] vfs_unlink+0x100/0x150
[<ffffffff802b0881>] do_unlinkat+0x121/0x1d0
[<ffffffff802bf1cf>] mntput_no_expire+0x1f/0x90
[<ffffffff802a1c7b>] filp_close+0x5b/0x90
[<ffffffff802a1d61>] sys_close+0xb1/0x120
[<ffffffff8020c59a>] system_call_after_swapgs+0x8a/0x8f
Filesystem "dm-0": XFS internal error xfs_trans_cancel at line 1163 of file fs/xfs/xfs_trans.c. Caller 0xffffffff881da177
Pid: 17060, comm: rsync Tainted: G D 2.6.25-ARCH #1
Call Trace:
[<ffffffff881da177>] :xfs:xfs_remove+0x387/0x3f0
[<ffffffff881d1c40>] :xfs:xfs_trans_cancel+0xf0/0x110
[<ffffffff881da177>] :xfs:xfs_remove+0x387/0x3f0
[<ffffffff881e5075>] :xfs:xfs_vn_unlink+0x25/0x60
[<ffffffff802adc00>] vfs_unlink+0x100/0x150
[<ffffffff802b0881>] do_unlinkat+0x121/0x1d0
[<ffffffff802bf1cf>] mntput_no_expire+0x1f/0x90
[<ffffffff802a1c7b>] filp_close+0x5b/0x90
[<ffffffff802a1d61>] sys_close+0xb1/0x120
[<ffffffff8020c59a>] system_call_after_swapgs+0x8a/0x8f
xfs_force_shutdown(dm-0,0x8) called from line 1164 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff881d1c59
Filesystem "dm-0": Corruption of in-memory data detected. Shutting down filesystem: dm-0
Please umount the filesystem, and rectify the problem(s)
-----------------------------
Additional info:
* package version(s)
cryptsetup 1.0.6-1
kernel26 2.6.25.4-1
xfsprogs 2.9.7-1
xfsdump 2.2.46-3
* config and/or log files etc.
Steps to reproduce:
- That can be a problem...
What if one of the RAM modules is broken, and the broken region is only addressed by the 64-bit kernel?
I got that idea because your trace mentions "Corruption of in-memory data detected."
I recommend running memtest to check if it's the RAM.
Call Trace:
<IRQ> [<ffffffff8027d86e>] __alloc_pages+0x2ee/0x3c0
[<ffffffff8020cb4d>] ret_from_intr+0x0/0x19
[<ffffffff8029fabc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff802a10e6>] __kmalloc_track_caller+0xe6/0x140
[<ffffffff803ca932>] __alloc_skb+0x72/0x160
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff8820d4db>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260
[<ffffffff8820d96f>] :e1000e:e1000_clean_rx_irq+0x26f/0x410
[<ffffffff8820b16b>] :e1000e:e1000_clean+0x16b/0x240
[<ffffffff802552cb>] hrtimer_get_next_event+0xdb/0xf0
[<ffffffff803cec61>] net_rx_action+0x131/0x290
[<ffffffff80240d2a>] __do_softirq+0x7a/0xf0
[<ffffffff8020d9dc>] call_softirq+0x1c/0x30
[<ffffffff8020fc1d>] do_softirq+0x4d/0x90
[<ffffffff80240ab5>] irq_exit+0xa5/0xb0
[<ffffffff8020feb1>] do_IRQ+0x81/0x100
[<ffffffff8020a060>] mwait_idle+0x0/0x50
[<ffffffff8020b570>] default_idle+0x0/0x70
[<ffffffff8020cb4d>] ret_from_intr+0x0/0x19
<EOI> [<ffffffff8020a09c>] mwait_idle+0x3c/0x50
[<ffffffff8020b4e0>] cpu_idle+0x90/0x120
Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 31
CPU 1: hi: 186, btch: 31 usd: 160
Active:231493 inactive:242557 dirty:9526 writeback:2714 unstable:0
free:10637 slab:16691 mapped:18348 pagetables:3198 bounce:0
DMA free:8044kB min:28kB low:32kB high:40kB active:704kB inactive:1224kB present:11092kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:34504kB min:5712kB low:7140kB high:8568kB active:925268kB inactive:969004kB present:2051184kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 51*4kB 51*8kB 117*16kB 48*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 8052kB
DMA32: 3240*4kB 1702*8kB 480*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 34480kB
354912 total pagecache pages
Swap cache: add 1391017, delete 1348713, find 517305/600384
Free swap = 2686640kB
Total swap = 2931852kB
Free swap: 2686640kB
524000 pages of RAM
9353 reserved pages
94037 pages shared
42304 pages swap cached
swapper: page allocation failure. order:3, mode:0x4020
Pid: 0, comm: swapper Tainted: P 2.6.25-ARCH #1
Call Trace:
<IRQ> [<ffffffff8027d86e>] __alloc_pages+0x2ee/0x3c0
[<ffffffff8020d402>] apic_timer_interrupt+0x72/0x80
[<ffffffff8029fabc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff802a10e6>] __kmalloc_track_caller+0xe6/0x140
[<ffffffff803ca932>] __alloc_skb+0x72/0x160
[<ffffffff803cb837>] __netdev_alloc_skb+0x17/0x40
[<ffffffff8820d4db>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260
[<ffffffff8820d96f>] :e1000e:e1000_clean_rx_irq+0x26f/0x410
[<ffffffff8820b16b>] :e1000e:e1000_clean+0x16b/0x240
[<ffffffff802552cb>] hrtimer_get_next_event+0xdb/0xf0
[<ffffffff803cec61>] net_rx_action+0x131/0x290
[<ffffffff80240d2a>] __do_softirq+0x7a/0xf0
[<ffffffff8020d9dc>] call_softirq+0x1c/0x30
[<ffffffff8020fc1d>] do_softirq+0x4d/0x90
[<ffffffff80240ab5>] irq_exit+0xa5/0xb0
[<ffffffff8020feb1>] do_IRQ+0x81/0x100
[<ffffffff8020a060>] mwait_idle+0x0/0x50
[<ffffffff8020b570>] default_idle+0x0/0x70
[<ffffffff8020cb4d>] ret_from_intr+0x0/0x19
<EOI> [<ffffffff8020a09c>] mwait_idle+0x3c/0x50
[<ffffffff8020b4e0>] cpu_idle+0x90/0x120
Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 31
CPU 1: hi: 186, btch: 31 usd: 159
Active:231493 inactive:242557 dirty:9526 writeback:2714 unstable:0
free:10637 slab:16691 mapped:18348 pagetables:3198 bounce:0
DMA free:8044kB min:28kB low:32kB high:40kB active:704kB inactive:1224kB present:11092kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:34504kB min:5712kB low:7140kB high:8568kB active:925268kB inactive:969004kB present:2051184kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 51*4kB 51*8kB 117*16kB 48*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 8052kB
DMA32: 3240*4kB 1702*8kB 480*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 34480kB
354912 total pagecache pages
Swap cache: add 1391017, delete 1348713, find 517305/600384
Free swap = 2686640kB
Total swap = 2931852kB
Free swap: 2686640kB
524000 pages of RAM
9353 reserved pages
94037 pages shared
42304 pages swap cached
printk: 1161 messages suppressed.
swapper: page allocation failure. order:3, mode:0x4020
Pid: 0, comm: swapper Tainted: P 2.6.25-ARCH #1
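A note for reading the dumps above: "order:3" in the failure message means the request was for 2^3 = 8 contiguous pages, which is 32kB if pages are 4kB (an assumption here, taken from the smallest block size in the Mem-info lists). The following is only a rough sketch, not a kernel tool: the helper names `parse_buddy_line` and `free_kb_for_order` are made up for illustration. It tallies how much of the free memory on one of the "DMA32:" lines above sits in blocks big enough for such a request.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, matching the smallest block size in the log


def parse_buddy_line(line):
    """Return (zone, {block_size_kB: free_block_count}) for one Mem-info line."""
    zone, rest = line.split(":", 1)
    # Matches entries such as "3240*4kB"; the trailing "= 34480kB" has no '*'
    # and is therefore skipped.
    blocks = {int(size): int(count)
              for count, size in re.findall(r"(\d+)\*(\d+)kB", rest)}
    return zone.strip(), blocks


def free_kb_for_order(blocks, order):
    """Free kB held in blocks large enough for a 2**order-page request."""
    need_kb = PAGE_KB << order
    return sum(size * count for size, count in blocks.items() if size >= need_kb)


# Line copied from the dump above.
line = ("DMA32: 3240*4kB 1702*8kB 480*16kB 1*32kB 1*64kB 1*128kB "
        "0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 34480kB")
zone, blocks = parse_buddy_line(line)
total_kb = free_kb_for_order(blocks, 0)   # everything that is free
usable_kb = free_kb_for_order(blocks, 3)  # blocks of 32 kB or larger
print(zone, total_kb, usable_kb)          # DMA32 34480 224
```

So of the 34480kB reported free in DMA32, only 224kB (three blocks) is in pieces of 32kB or more. That points to fragmented free memory, though on its own it does not prove why this particular allocation failed.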
These are the errors I keep getting:
Call Trace:
<IRQ> [<ffffffff8027d996>] __alloc_pages+0x2e6/0x3c0
[<ffffffff8041992d>] tcp_v4_do_rcv+0xdd/0x250
[<ffffffff8029fcfc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff802a1326>] __kmalloc_track_caller+0xe6/0x140
[<ffffffff803ce9f2>] __alloc_skb+0x72/0x160
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff881044bb>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260
[<ffffffff8810494f>] :e1000e:e1000_clean_rx_irq+0x26f/0x410
[<ffffffff8810213b>] :e1000e:e1000_clean+0x16b/0x240
[<ffffffff803d2d31>] net_rx_action+0x131/0x290
[<ffffffff80240d9a>] __do_softirq+0x7a/0xf0
[<ffffffff8020d9ec>] call_softirq+0x1c/0x30
[<ffffffff8020fc2d>] do_softirq+0x4d/0x90
[<ffffffff80240b25>] irq_exit+0xa5/0xb0
[<ffffffff8020fec1>] do_IRQ+0x81/0x100
[<ffffffff8020b580>] default_idle+0x0/0x70
[<ffffffff8020cb5d>] ret_from_intr+0x0/0x19
<EOI> [<ffffffff802206c0>] lapic_next_event+0x0/0x10
[<ffffffff80225cc2>] native_safe_halt+0x2/0x10
[<ffffffff8020b5bb>] default_idle+0x3b/0x70
[<ffffffff8020b580>] default_idle+0x0/0x70
[<ffffffff8020b4f0>] cpu_idle+0x90/0x120
Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 152
CPU 1: hi: 186, btch: 31 usd: 93
Active:60734 inactive:421961 dirty:47123 writeback:1758 unstable:0
free:2528 slab:19953 mapped:12653 pagetables:2331 bounce:0
DMA free:8020kB min:28kB low:32kB high:40kB active:0kB inactive:784kB present:11072kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:2092kB min:5712kB low:7140kB high:8568kB active:242936kB inactive:1687060kB present:2051184kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 5*4kB 2*8kB 97*16kB 11*32kB 13*64kB 5*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8020kB
DMA32: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 0*128kB 2*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2072kB
435419 total pagecache pages
Swap cache: add 0, delete 0, find 0/0
Free swap = 3903752kB
Total swap = 3903752kB
Free swap: 3903752kB
524000 pages of RAM
9359 reserved pages
487348 pages shared
0 pages swap cached
All these errors mentioned "e1000", the driver for my network card. After using a completely different network card, I still get the same errors, but now with the "r8169"-driver mentioned.
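For anyone comparing the Mem-info dumps in this report, the per-zone lines also carry the zone's min/low/high watermarks next to the free figure. A minimal sketch for lining them up follows; the function names `parse_zone_line` and `watermark_state` are hypothetical, and the two input lines are copied from the dumps posted in this report.

```python
import re


def parse_zone_line(line):
    """Return (zone, {field: kB}) for a per-zone Mem-info line."""
    zone = line.split()[0]
    # Only "name:123kB" fields are collected; "pages_scanned:0" has no kB
    # suffix and is ignored.
    fields = {name: int(kb) for name, kb in re.findall(r"(\w+):(\d+)kB", line)}
    return zone, fields


def watermark_state(fields):
    """Describe where 'free' sits relative to the min/low/high watermarks."""
    free = fields["free"]
    for mark in ("min", "low", "high"):
        if free < fields[mark]:
            return "below " + mark
    return "above high"


earlier = ("DMA32 free:34504kB min:5712kB low:7140kB high:8568kB "
           "active:925268kB inactive:969004kB present:2051184kB "
           "pages_scanned:0 all_unreclaimable? no")
later = ("DMA32 free:2092kB min:5712kB low:7140kB high:8568kB "
         "active:242936kB inactive:1687060kB present:2051184kB "
         "pages_scanned:0 all_unreclaimable? no")

for line in (earlier, later):
    zone, fields = parse_zone_line(line)
    print(zone, fields["free"], watermark_state(fields))
```

With these two lines, the first dump has DMA32 free memory well above the high watermark, while the dump with 2092kB free is below min, so the failures were logged under quite different amounts of free memory.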
A different PCI(-e) slot could still make a difference, though.
Call Trace:
<IRQ> [<ffffffff8027d996>] __alloc_pages+0x2e6/0x3c0
[<ffffffff8029fcfc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff8029f9d5>] __slab_alloc+0x165/0x7c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff802a1326>] __kmalloc_track_caller+0xe6/0x140
[<ffffffff803ce9f2>] __alloc_skb+0x72/0x160
[<ffffffff803ff8da>] ip_queue_xmit+0x24a/0x460
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff88b713fe>] :r8169:rtl8169_rx_fill+0xbe/0x1f0
[<ffffffff88b71867>] :r8169:rtl8169_rx_interrupt+0x337/0x490
[<ffffffff88b72cc7>] :r8169:rtl8169_interrupt+0x297/0x4e0
[<ffffffff802701ec>] handle_IRQ_event+0x3c/0x80
[<ffffffff8027188a>] handle_fasteoi_irq+0x8a/0x100
[<ffffffff8020febc>] do_IRQ+0x7c/0x100
[<ffffffff8020b580>] default_idle+0x0/0x70
[<ffffffff8020cb5d>] ret_from_intr+0x0/0x19
<EOI> [<ffffffff802206c0>] lapic_next_event+0x0/0x10
[<ffffffff80225cc2>] native_safe_halt+0x2/0x10
[<ffffffff8020b5bb>] default_idle+0x3b/0x70
[<ffffffff8020b580>] default_idle+0x0/0x70
[<ffffffff8020b4f0>] cpu_idle+0x90/0x120
Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 153
CPU 1: hi: 186, btch: 31 usd: 160
Active:59844 inactive:422839 dirty:48724 writeback:68 unstable:0
free:2553 slab:19899 mapped:12657 pagetables:2336 bounce:0
DMA free:8040kB min:28kB low:32kB high:40kB active:4kB inactive:1392kB present:11072kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:2172kB min:5712kB low:7140kB high:8568kB active:239372kB inactive:1689964kB present:2051184kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB 10*8kB 91*16kB 51*32kB 8*64kB 2*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8040kB
DMA32: 1*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2164kB
436207 total pagecache pages
Swap cache: add 0, delete 0, find 0/0
Free swap = 3903752kB
Total swap = 3903752kB
Free swap: 3903752kB
524000 pages of RAM
9359 reserved pages
488168 pages shared
0 pages swap cached
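To make the comparison between the e1000e and r8169 traces less of an eyeball exercise, here is a small sketch that splits each frame into module and function and shows which core-kernel functions both traces share. It assumes the ":module:function+0x..." layout seen in the x86_64 traces in this report (it would not handle a "[module]" suffix style), and `parse_frame` and `summarize` are illustrative names only. The sample frames are a few lines copied from the two traces.

```python
import re

# "] " followed by an optional ":module:" prefix, then the function name.
FRAME = re.compile(r"\]\s+(?::(\w+):)?([\w.]+)\+0x")


def parse_frame(line):
    """Return (module or None, function) for one call-trace line, else None."""
    m = FRAME.search(line)
    return (m.group(1), m.group(2)) if m else None


def summarize(trace):
    """Return (set of module names, set of core-kernel function names)."""
    frames = [f for f in map(parse_frame, trace.splitlines()) if f]
    modules = {mod for mod, _ in frames if mod}
    core = {fn for mod, fn in frames if mod is None}
    return modules, core


e1000e_trace = """\
<IRQ> [<ffffffff8027d996>] __alloc_pages+0x2e6/0x3c0
[<ffffffff8029fcfc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff803ce9f2>] __alloc_skb+0x72/0x160
[<ffffffff881044bb>] :e1000e:e1000_alloc_rx_buffers+0x20b/0x260
"""
r8169_trace = """\
<IRQ> [<ffffffff8027d996>] __alloc_pages+0x2e6/0x3c0
[<ffffffff8029fcfc>] __slab_alloc+0x48c/0x7c0
[<ffffffff803cf8f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff803ce9f2>] __alloc_skb+0x72/0x160
[<ffffffff88b713fe>] :r8169:rtl8169_rx_fill+0xbe/0x1f0
"""

mods_a, core_a = summarize(e1000e_trace)
mods_b, core_b = summarize(r8169_trace)
print(sorted(mods_a), sorted(mods_b))
print(sorted(core_a & core_b))
```

In these excerpts the module differs (e1000e vs. r8169) but the allocation path underneath (`__netdev_alloc_skb`, `__alloc_skb`, `__slab_alloc`, `__alloc_pages`) is the same, which fits the observation that swapping the network card did not change the errors.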
I've done some testing with kernel 2.6.26-2 from [testing] and the problems are still present. Here are dmesg outputs from testing with 2.6.26-2, with both e1000e and r8169, using an ext3 filesystem and without the nvidia driver (console only).
If you need more information or test results, please let me know.
You're not alone ! I have a similar problem on a SPARC box running Gentoo:
Linux Loopy 2.6.25-gentoo-r7 #4 SMP Sun Sep 7 19:22:55 BST 2008 sparc64 sun4u TI UltraSparc IIe (Hummingbird) GNU/Linux
It's a new build as you can see and I've had multiple page allocation errors on three different occasions. The last two occasions were firstly during an "emerge --sync" and secondly today when doing an rsync of the differences of a disk partition.
I suspect rsync is at least a little to blame so I've just downgraded to net-misc/rsync-3.0.2 from 3.0.3 (SPARC is a little behind with its packages and kernel numbers for obvious reasons).
I admin two other identical servers that run different kernels with the new version of rsync and show no errors, so I suspect it's a combination of rsync and the newer kernel logging more than older kernels did.
Thankfully it doesn't make the box unstable but just fills logs up a bit.
Swapper, kswapd0 and qmail-smtpd as well as rsync have reported page allocation errors. I'm convinced the hardware is working fine as it's been in production running older Gentoo versions for a couple of years without errors.
[99642.452773] Call Trace:
[99642.452786] [00000000004b7574] __slab_alloc+0x1b4/0x5f4
[99642.452829] [00000000004b95d0] __kmalloc_track_caller+0x98/0xf0
[99642.452853] [0000000000625a38] __alloc_skb+0x5c/0x108
[99642.452884] [00000000100003f0] tulip_interrupt+0x2a8/0xd94 [tulip]
[99642.452945] [000000000048fd58] handle_IRQ_event+0x34/0x74
[99642.452972] [000000000049145c] handle_fasteoi_irq+0xe0/0x13c
[99642.452997] [000000000042db94] handler_irq+0x8c/0xb4
[99642.453033] [00000000004208b4] tl0_irq5+0x1c/0x20
[99642.453058] [00000000004cbc44] prune_dcache+0xc0/0x1ec
[99642.453088] [00000000004cbd9c] shrink_dcache_memory+0x2c/0x60
[99642.453113] [000000000049f7cc] shrink_slab+0xcc/0x164
[99642.453139] [000000000049fbd4] kswapd+0x370/0x4e8
[99642.453160] [000000000047cf04] kthread+0x4c/0x78
[99642.453185] [00000000004271f8] kernel_thread+0x38/0x48
[99642.453207] [000000000047cd40] kthreadd+0xb8/0x180
The tulip driver mentioned is a somewhat notorious Gentoo/Sun problem where the NICs use an 'unsupported' driver. Ifconfig shows just 11 packet drops in a GB of traffic though and from other tulip NICs I've seen that means the NIC is doing pretty well !
I can give you eye-strain with more debug info if you need it.
Any help from your side appreciated too :-)
JR
Well, so far so good ...
I had 2.6.23-gentoo-r9 on that box too and booted into that, leaving rsync-3.0.2 installed.
All seems to be going well in terms of dmesg and syslog errors. That's been up about 36 hours.
Unless someone knows of exploits in 2.6.23-gentoo-r9 I think I'll stick with that kernel for now and hope this is fixed later on in newer kernels !
JR
Also what might be considered ridiculous is that without looking them up you can't even tell us what extra features the newer kernel has that you explicitly need.
About rsync vs. kernel being the problem: I don't know for sure, but I have a vague feeling that one of Arch's patches causes the problems. Debian Lenny's kernel is now also 2.6.26 and it also has rsync 3.0.3, but it doesn't suffer from the problems that my Arch installation does.
It may also be a problem in kernel configuration and not the patches. If you could build a kernel with the -ARCH patch removed (just comment out one line in the PKGBUILD) and try that, that might be helpful.