FS#63965 - Intel Gigabit Ethernet is unstable. e1000e: Detected Hardware Unit Hang

Attached to Project: Arch Linux
Opened by Tomasz Jankowski (goofy) - Monday, 30 September 2019, 19:17 GMT
Last edited by freswa (frederik) - Friday, 21 February 2020, 21:53 GMT
Task Type: Bug Report
Category: Packages: Core
Status: Closed
Assigned To: No-one
Architecture: All
Severity: Medium
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 1
Private: No

Details

Description:
I run an up-to-date Arch Linux on an Intel NUC D54250WYK with an additional Ethernet-USB adapter, in the following setup:

-----------
|         |---\Internal NIC\----------- ISP
|   NUC   |
|         |---\Ethernet USB adapter\--- LAN
-----------

The "internal NIC" breaks when I generate traffic (e.g. send 1 GB+ files using Samba) between the Intel NUC and another LAN node. The "internal NIC" loses its assigned IP address and logs the following messages:

Sep 29 21:16:51 tusiec kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
TDH <c0>
TDT <c8>
next_to_use <c8>
next_to_clean <c0>
buffer_info[next_to_clean]:
time_stamp <10005fd8e>
next_to_watch <c0>
jiffies <10005ff00>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
Sep 29 21:16:53 tusiec kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
TDH <c0>
TDT <c8>
next_to_use <c8>
next_to_clean <c0>
buffer_info[next_to_clean]:
time_stamp <10005fd8e>
next_to_watch <c0>
jiffies <100060180>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
Sep 29 21:16:56 tusiec kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
TDH <c0>
TDT <c8>
next_to_use <c8>
next_to_clean <c0>
buffer_info[next_to_clean]:
time_stamp <10005fd8e>
next_to_watch <c0>
jiffies <100060400>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
Sep 29 21:16:56 tusiec kernel: ------------[ cut here ]------------
Sep 29 21:16:56 tusiec kernel: NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
Sep 29 21:16:56 tusiec kernel: WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x26a/0x280
Sep 29 21:16:56 tusiec kernel: Modules linked in: tcp_diag udp_diag raw_diag inet_diag netlink_diag wireguard(OE) ip6_udp_tunnel udp_tunnel xt_conntrack xt_tcpudp iptable_filter xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv>
Sep 29 21:16:56 tusiec kernel: i2c_i801 lpc_ich evdev pcspkr mac_hid wmi ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid hid sd_mod ahci libahci libata xhci_pci scsi_mod crc32c_intel xhci_hcd
Sep 29 21:16:56 tusiec kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 5.3.1-arch1-1-ARCH #1
Sep 29 21:16:56 tusiec kernel: Hardware name: /D54250WYK, BIOS WYLPT10H.86A.0052.2019.0528.1756 05/28/2019
Sep 29 21:16:56 tusiec kernel: RIP: 0010:dev_watchdog+0x26a/0x280
Sep 29 21:16:56 tusiec kernel: Code: ec d5 84 ff eb 88 4c 89 f7 c6 05 30 fd b5 00 01 e8 bb 04 fb ff 44 89 e9 4c 89 f6 48 c7 c7 68 93 96 a1 48 89 c2 e8 65 72 8d ff <0f> 0b e9 66 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
Sep 29 21:16:56 tusiec kernel: RSP: 0018:ffffb1b1c00fcdf8 EFLAGS: 00010282
Sep 29 21:16:56 tusiec kernel: RAX: 0000000000000000 RBX: ffff96adc8c5f200 RCX: 0000000000000000
Sep 29 21:16:56 tusiec kernel: RDX: 0000000000000103 RSI: 0000000000000082 RDI: 00000000ffffffff
Sep 29 21:16:56 tusiec kernel: RBP: ffff96adc84ec45c R08: 0000000000000317 R09: 0000000000000001
Sep 29 21:16:56 tusiec kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff96adc84ec480
Sep 29 21:16:56 tusiec kernel: R13: 0000000000000000 R14: ffff96adc84ec000 R15: ffff96adc8c5f280
Sep 29 21:16:56 tusiec kernel: FS: 0000000000000000(0000) GS:ffff96adcf880000(0000) knlGS:0000000000000000
Sep 29 21:16:56 tusiec kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 29 21:16:56 tusiec kernel: CR2: 00007f60bd230710 CR3: 000000032660a001 CR4: 00000000001606e0
Sep 29 21:16:56 tusiec kernel: Call Trace:
Sep 29 21:16:56 tusiec kernel: <IRQ>
Sep 29 21:16:56 tusiec kernel: ? qdisc_put_unlocked+0x30/0x30
Sep 29 21:16:56 tusiec kernel: call_timer_fn+0x2d/0x160
Sep 29 21:16:56 tusiec kernel: ? qdisc_put_unlocked+0x30/0x30
Sep 29 21:16:56 tusiec kernel: expire_timers+0xa7/0x120
Sep 29 21:16:56 tusiec kernel: run_timer_softirq+0xb5/0x1b0
Sep 29 21:16:56 tusiec kernel: __do_softirq+0x114/0x332
Sep 29 21:16:56 tusiec kernel: irq_exit+0xd4/0xf0
Sep 29 21:16:56 tusiec kernel: smp_apic_timer_interrupt+0xa6/0x1b0
Sep 29 21:16:56 tusiec kernel: apic_timer_interrupt+0xf/0x20
Sep 29 21:16:56 tusiec kernel: </IRQ>
Sep 29 21:16:56 tusiec kernel: RIP: 0010:cpuidle_enter_state+0xc4/0x480
Sep 29 21:16:56 tusiec kernel: Code: e8 a1 e5 9b ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 93 03 00 00 31 ff e8 03 76 a2 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 be 02 00 00 49 63 cc 4c 2b 6c 24 10 48 8d 04 49 48
Sep 29 21:16:56 tusiec kernel: RSP: 0018:ffffb1b1c00a7e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Sep 29 21:16:56 tusiec kernel: RAX: ffff96adcf880000 RBX: ffffffffa1abcd20 RCX: 000000000000001f
Sep 29 21:16:56 tusiec kernel: RDX: 0000000000000000 RSI: 0000000043863ce7 RDI: 0000000000000000
Sep 29 21:16:56 tusiec kernel: RBP: ffff96adcf8b4118 R08: 00000177ecac783d R09: 000000007fffffff
Sep 29 21:16:56 tusiec kernel: R10: ffff96adcf8a9344 R11: ffff96adcf8a9324 R12: 0000000000000001
Sep 29 21:16:56 tusiec kernel: R13: 00000177ecac783d R14: 0000000000000001 R15: ffff96adcde89e40
Sep 29 21:16:56 tusiec kernel: ? cpuidle_enter_state+0x9f/0x480
Sep 29 21:16:56 tusiec kernel: cpuidle_enter+0x29/0x40
Sep 29 21:16:56 tusiec kernel: do_idle+0x1ec/0x270
Sep 29 21:16:56 tusiec kernel: cpu_startup_entry+0x19/0x20
Sep 29 21:16:56 tusiec kernel: start_secondary+0x185/0x1d0
Sep 29 21:16:56 tusiec kernel: secondary_startup_64+0xa4/0xb0
Sep 29 21:16:56 tusiec kernel: ---[ end trace c995baa2cf2b2eb3 ]---
Sep 29 21:16:56 tusiec kernel: e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
Sep 29 21:17:04 tusiec kernel: e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

After a few seconds it recovers, gets an IP address from the ISP, and continues working "normally". The issue happens only when I generate traffic between the Intel NUC and the LAN, and it is 100% reproducible. I can transfer large amounts of data between the Intel NUC and the internet without any harm to the "internal NIC".
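
The register dumps in the log above can be compared mechanically to see whether the transmit ring is really stuck. The following is a minimal Python sketch, not part of the original report. It assumes the dump format exactly as pasted here (a header line containing "Detected Hardware Unit Hang" followed by `NAME <hex>` lines). The function names `parse_hang_dumps` and `summarize` are hypothetical, and the `ring_size` default of 256 is an assumption that should be checked against the actual ring size (for example with `ethtool -g`).

```python
import re

# Matches "NAME <hexvalue>" lines as they appear in the pasted dump, e.g. "TDH <c0>".
FIELD_RE = re.compile(r"^\s*([A-Za-z][\w .\-]*?)\s*<([0-9a-fA-F]+)>\s*$")


def parse_hang_dumps(log_text):
    """Return one dict of integer fields per 'Detected Hardware Unit Hang' report."""
    dumps = []
    current = None
    for line in log_text.splitlines():
        if "Detected Hardware Unit Hang" in line:
            current = {}
            dumps.append(current)
            continue
        if current is None:
            continue
        match = FIELD_RE.match(line)
        if match:
            current[match.group(1)] = int(match.group(2), 16)
        elif not line.strip().endswith(":"):
            # Any other line (e.g. the watchdog warning) ends the current dump.
            current = None
    return dumps


def summarize(dumps, ring_size=256):
    """Return (TDH, pending descriptors, jiffies since time_stamp) per dump.

    ring_size is an ASSUMED value, used only to handle index wraparound.
    """
    rows = []
    for d in dumps:
        pending = (d["TDT"] - d["TDH"]) % ring_size
        age = d["jiffies"] - d["time_stamp"]
        rows.append((d["TDH"], pending, age))
    return rows


# Two of the dumps from the report, shortened to the fields used here.
SAMPLE = """\
Sep 29 21:16:51 tusiec kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
TDH <c0>
TDT <c8>
buffer_info[next_to_clean]:
time_stamp <10005fd8e>
jiffies <10005ff00>
Sep 29 21:16:53 tusiec kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
TDH <c0>
TDT <c8>
buffer_info[next_to_clean]:
time_stamp <10005fd8e>
jiffies <100060180>
"""

if __name__ == "__main__":
    for tdh, pending, age in summarize(parse_hang_dumps(SAMPLE)):
        print(f"TDH={tdh:#x} pending={pending} jiffies_since_time_stamp={age}")
```

If TDH stays the same across consecutive dumps while the jiffies delta keeps growing, as it does in the sample, the hardware has not fetched any of the queued descriptors in between, which is consistent with the transmit queue timeout that follows in the log.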

Additional info:
- Linux tusiec 5.3.1-arch1-1-ARCH #1 SMP PREEMPT Sat Sep 21 11:33:49 UTC 2019 x86_64 GNU/Linux
- I searched Google and tried many suggested solutions, including those from a similar Arch bug report: https://bugs.archlinux.org/task/62699

Please let me know if you need more information, configuration details etc.

Closed by freswa (frederik)
Friday, 21 February 2020, 21:53 GMT
Reason for closing: No response
Comment by Tomasz Jankowski (goofy) - Monday, 30 September 2019, 19:20 GMT
The ASCII diagram in my original description is displayed incorrectly, so let me explain my setup here. The Intel NUC has two NICs: the built-in one and an additional Ethernet-USB adapter. The internal/built-in NIC is connected to my ISP and the Ethernet-USB adapter is connected to the LAN.
Comment by loqs (loqs) - Tuesday, 01 October 2019, 09:44 GMT
Is the issue also present under linux-lts? Has the issue always been present, or was it introduced by some update?
Comment by Tomasz Jankowski (goofy) - Tuesday, 01 October 2019, 20:58 GMT
I run the regular Linux kernel on Arch (the "linux" package). It is hard to say; I noticed the problem once I installed Samba and started transferring data to/from the LAN.
Comment by Gerald H. (ArchAny) - Sunday, 29 December 2019, 23:25 GMT
Did this start with kernel 5.3 for you? It sounds a lot like the bug introduced with kernel 5.3 which is discussed here:
https://bugzilla.kernel.org/show_bug.cgi?id=205047

For me the failure is quite similar; I think it happens when traffic is sent to WireGuard interfaces. So it is at least somewhat similar to your setup with a second USB-LAN interface...
