FS#65138 - r8168/r8169 Issues bringing network online/massive packet loss kernel trace
Attached to Project:
Arch Linux
Opened by Daniel Gray (dngray) - Tuesday, 14 January 2020, 07:31 GMT
Last edited by freswa (frederik) - Thursday, 20 February 2020, 22:01 GMT
Details
Hi,
I have a B450 TOMAHAWK MAX (MS-7C02) which has a RTL8111H network interface on linux-5.4.11-arch1-1

22:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
    Subsystem: Micro-Star International Co., Ltd. [MSI] RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 60
    Region 0: I/O ports at f000 [size=256]
    Region 2: Memory at fcc04000 (64-bit, non-prefetchable) [size=4K]
    Region 4: Memory at fcc00000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: <access denied>
    Kernel driver in use: r8168
    Kernel modules: r8169, r8168

When using the r8168 8.047.05-14 driver sometimes when the machine boots no route can be found:

r8168 Gigabit Ethernet driver 8.047.05-NAPI loaded
r8168 0000:22:00.0: enabling device (0000 -> 0003)
r8168: This product is covered by one or more of the following patents: US6,570,884, US6,115,776, and US6,327,625.
r8168 Copyright (C) 2019 Realtek NIC software team <nicfae@realtek.com>
r8168 0000:22:00.0 enp34s0: renamed from eth0
r8168: enp34s0: link up

PING 192.168.3.1 (192.168.3.1) 56(84) bytes of data.
From 192.168.3.26 icmp_seq=1 Destination Host Unreachable
From 192.168.3.26 icmp_seq=2 Destination Host Unreachable
From 192.168.3.26 icmp_seq=3 Destination Host Unreachable
From 192.168.3.26 icmp_seq=4 Destination Host Unreachable
From 192.168.3.26 icmp_seq=5 Destination Host Unreachable
--- 192.168.3.1 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 4058ms

Despite having an ip address and a route:

default via 192.168.3.1 dev enp34s0.3 proto dhcp metric 20400
192.168.3.0/24 dev enp34s0.3 proto kernel scope link src 192.168.3.26 metric 400

When using the in kernel driver, ie if r8168 is removed, we see a massive amount of packet loss:

64 bytes from 192.168.3.1: icmp_seq=136 ttl=64 time=5753 ms
64 bytes from 192.168.3.1: icmp_seq=137 ttl=64 time=4740 ms
64 bytes from 192.168.3.1: icmp_seq=138 ttl=64 time=3726 ms
64 bytes from 192.168.3.1: icmp_seq=139 ttl=64 time=2714 ms
From 192.168.3.26 icmp_seq=143 Destination Host Unreachable
From 192.168.3.26 icmp_seq=144 Destination Host Unreachable
From 192.168.3.26 icmp_seq=145 Destination Host Unreachable
From 192.168.3.26 icmp_seq=146 Destination Host Unreachable
From 192.168.3.26 icmp_seq=147 Destination Host Unreachable
64 bytes from 192.168.3.1: icmp_seq=141 ttl=64 time=8713 ms
64 bytes from 192.168.3.1: icmp_seq=142 ttl=64 time=7700 ms
64 bytes from 192.168.3.1: icmp_seq=159 ttl=64 time=5420 ms
64 bytes from 192.168.3.1: icmp_seq=160 ttl=64 time=4406 ms
64 bytes from 192.168.3.1: icmp_seq=161 ttl=64 time=3393 ms

Even saw this come up once:

ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available

In the kernel logs I observed this:

kernel: ------------[ cut here ]------------
kernel: NETDEV WATCHDOG: enp34s0 (r8169): transmit queue 0 timed out
kernel: WARNING: CPU: 6 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x26a/0x280
kernel: Modules linked in: fuse 8021q garp mrp stp llc rfkill amdgpu snd_hda_codec_realtek gpu_sched i2c_algo_bit snd_hda_codec_generic ttm ledtrig_audio snd_hda_codec_hdmi nls_iso8859_1 drm_kms_helper snd_hda_intel nls_cp437 snd_i>
kernel: CPU: 6 PID: 0 Comm: swapper/6 Tainted: G OE 5.4.11-arch1-1 #1
kernel: Hardware name: Micro-Star International Co., Ltd MS-7C02/B450 TOMAHAWK MAX (MS-7C02), BIOS 3.50 11/07/2019
kernel: RIP: 0010:dev_watchdog+0x26a/0x280
kernel: Code: 1c 3d 82 ff eb 88 4c 89 f7 c6 05 1b 6f b3 00 01 e8 fb c6 fa ff 44 89 e9 4c 89 f6 48 c7 c7 70 34 57 b4 48 89 c2 e8 04 f9 8a ff <0f> 0b e9 66 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
kernel: RSP: 0018:ffffad6f803a8e60 EFLAGS: 00010286
kernel: RAX: 0000000000000000 RBX: ffff9bdb02772400 RCX: 0000000000000000
kernel: RDX: 0000000000000103 RSI: ffff9bdb0e997708 RDI: 00000000ffffffff
kernel: RBP: ffff9bdb01c0a45c R08: 0000000000000546 R09: 0000000000000004
kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff9bdb01c0a480
kernel: R13: 0000000000000000 R14: ffff9bdb01c0a000 R15: ffff9bdb02772480
kernel: FS: 0000000000000000(0000) GS:ffff9bdb0e980000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000330f8905ec20 CR3: 00000003f2916000 CR4: 0000000000340ee0
kernel: Call Trace:
kernel: <IRQ>
kernel: ? qdisc_put_unlocked+0x30/0x30
kernel: call_timer_fn+0x2d/0x160
kernel: run_timer_softirq+0x1ad/0x510
kernel: ? qdisc_put_unlocked+0x30/0x30
kernel: __do_softirq+0x111/0x34d
kernel: irq_exit+0xac/0xd0
kernel: smp_apic_timer_interrupt+0xa6/0x1b0
kernel: apic_timer_interrupt+0xf/0x20
kernel: </IRQ>
kernel: RIP: 0010:cpuidle_enter_state+0xc4/0x480
kernel: Code: e8 31 fa 99 ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 93 03 00 00 31 ff e8 03 51 a0 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 be 02 00 00 49 63 cc 4c 2b 6c 24 10 48 8d 04 49 48
kernel: RSP: 0018:ffffad6f80177e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
kernel: RAX: ffff9bdb0e980000 RBX: ffffffffb46c1680 RCX: 000000000000001f
kernel: RDX: 0000000000000000 RSI: 00000000238e3a2f RDI: 0000000000000000
kernel: RBP: ffff9bdb04eb2400 R08: 000000069cf8bcb4 R09: 00000000000163bb
kernel: R10: ffff9bdb0e9a97e0 R11: ffff9bdb0e9a97c0 R12: 0000000000000002
kernel: R13: 000000069cf8bcb4 R14: 0000000000000002 R15: ffff9bdb0c4adac0
kernel: ? cpuidle_enter_state+0x9f/0x480
kernel: cpuidle_enter+0x29/0x40
kernel: do_idle+0x1de/0x260
kernel: cpu_startup_entry+0x19/0x20
kernel: start_secondary+0x186/0x1d0
kernel: secondary_startup_64+0xb6/0xc0
kernel: ---[ end trace f98dd501aa1b8460 ]---
kernel: r8169 0000:22:00.0 enp34s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).
NetworkManager[1184]: <info> [1578977660.9164] dhcp4 (enp34s0.3): option dhcp_lease_time => '43200'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option domain_name_servers => '192.168.3.1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option expiry => '1579020860'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option host_name => 'XXXXXXX'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option ip_address => '192.168.3.26'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option ntp_servers => '192.168.3.1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_broadcast_address => '1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_dhcp_server_identifier => '1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_domain_name => '1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_domain_name_servers => '1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_domain_search => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_host_name => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_interface_mtu => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_ms_classless_static_routes => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_nis_domain => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_nis_servers => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_ntp_servers => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_rfc3442_classless_static_routes => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_root_path => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_routers => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_static_routes => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_subnet_mask => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_time_offset => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_wpad => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option routers => '192.168.3.1'
NetworkManager[1184]: <info> [1578977660.9167] dhcp4 (enp34s0.3): option subnet_mask => '255.255.255.0'

The problem seems to be intermittent. If I reboot, the problem may not happen. Other devices on the switch work just fine, as does that patch cable, so I have determined it is not an upstream issue in my network.

When I am having issues with the 1000+ms lag I see this repeated in the kernel logs:

r8169 0000:22:00.0 enp34s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).

If anyone could provide any further assistance so that I may get to the bottom of this I'd appreciate it. I'd like to report this upstream if it's a bug.
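For anyone trying to put numbers on the packet loss described above, here is a minimal Python sketch that summarizes saved ping output. The function name `summarize_ping` and the 1000 ms "slow" threshold are my own assumptions for illustration; the regular expressions only cover the two line shapes quoted in this report (replies and "Destination Host Unreachable").

```python
import re

# Matches reply lines such as:
#   64 bytes from 192.168.3.1: icmp_seq=139 ttl=64 time=2714 ms
REPLY = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")
# Matches error lines such as:
#   From 192.168.3.26 icmp_seq=143 Destination Host Unreachable
UNREACHABLE = re.compile(r"icmp_seq=(\d+) Destination Host Unreachable")


def summarize_ping(lines, slow_ms=1000.0):
    """Count replies, unreachable errors and slow replies in ping output."""
    rtts = []
    unreachable = 0
    for line in lines:
        reply = REPLY.search(line)
        if reply:
            rtts.append(float(reply.group(2)))
        elif UNREACHABLE.search(line):
            unreachable += 1
    return {
        "replies": len(rtts),
        "unreachable": unreachable,
        "max_rtt_ms": max(rtts) if rtts else None,
        "slow_replies": sum(1 for rtt in rtts if rtt > slow_ms),
    }


# Three lines copied from the output in this report.
sample = [
    "64 bytes from 192.168.3.1: icmp_seq=139 ttl=64 time=2714 ms",
    "From 192.168.3.26 icmp_seq=143 Destination Host Unreachable",
    "64 bytes from 192.168.3.1: icmp_seq=141 ttl=64 time=8713 ms",
]
print(summarize_ping(sample))
```

Feeding it the full ping log from a good boot and a bad boot should make the difference between the two easier to compare than eyeballing the raw output.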
I should mention that during my testing I substituted the switch and the issue still persists. Additionally, I actually have two systems, both with B450 TOMAHAWK MAX (MS-7C02) motherboards. The issue is reproducible on both systems, so it's not a hardware issue such as a malfunctioning NIC.
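To check whether a second machine is hitting the same condition, a rough Python sketch like the following can count the two kernel log signatures quoted in the report. The names `SIGNATURES` and `count_signatures` are hypothetical, and the patterns are taken only from the messages pasted above, so other kernel versions may word them differently.

```python
import re

# Patterns copied from the kernel messages quoted in this task.
SIGNATURES = {
    "tx_watchdog": re.compile(
        r"NETDEV WATCHDOG: \S+ \(\w+\): transmit queue \d+ timed out"
    ),
    "txcfg_empty": re.compile(r"rtl_txcfg_empty_cond == 0"),
}


def count_signatures(lines):
    """Return how many log lines match each known signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Lines copied from the kernel log in this report.
sample_log = [
    "kernel: NETDEV WATCHDOG: enp34s0 (r8169): transmit queue 0 timed out",
    "kernel: r8169 0000:22:00.0 enp34s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).",
    "kernel: r8169 0000:22:00.0 enp34s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).",
]
print(count_signatures(sample_log))
```

In practice the input could be the saved output of `journalctl -k -b` from each machine, read line by line from a file.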
Connecting the RTL8125 to the 82571GB causes frequent problems.
Connecting the RTL8125 to the RTL8111E, the problem occurs about once a day.
Both machines run Arch Linux. One has RTL8125 x2 (one on-board, one on a separate PCIe card); the other has 82571GB x4 (stand-alone PCIe card) and an RTL8111E (on-board).
The third is a laptop connected to a Dell USB-C Mobile Adapter - DA300 https://www.dell.com/en-us/shop/dell-usb-c-mobile-adapter-da300/apd/470-acwn/pc-accessories
I have observed that while the laptop is off, but the dock is on, there must be some kind of bad frame being sent from the dock. I hear coil whine as soon as the dock is plugged into the powered off laptop.
I had two of these boards giving me this error simultaneously. In fact, in the 3-4 times I have observed it, it always happens on both Tomahawks at the same time.
As soon as I powered on the laptop, the problem disappeared on both machines. I've been able to reproduce this 3 times now; unplugging the dock seems to fix it too.
Additionally, the issue mentioned above also affects Windows 10, so it seems like something that dock is doing to my switch (I've tested two switches and replaced all cables). The issue definitely seems to have something to do with the dock.
So it turns out it wasn't a bad switch, but it must be a bug in the switch firmware. I tried multiple switches, all of them the older model Cisco SG 100D-08 Gigabit Switch https://www.cisco.com/c/en/us/obsolete/switches/cisco-sg-100d-08-8-port-gigabit-switch.html
When I changed those switches to a Ubiquiti ES‑8‑150W the problem went away. Certainly seemed to be something to do with my laptop being plugged into the dock, but powered off.
When I enabled port mirroring I didn't see any unusual packets coming from the laptop.