FS#65138 - r8168/r8169 Issues bringing network online/massive packet loss kernel trace

Attached to Project: Arch Linux
Opened by Daniel Gray (dngray) - Tuesday, 14 January 2020, 07:31 GMT
Last edited by freswa (frederik) - Thursday, 20 February 2020, 22:01 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: No-one
Architecture: x86_64
Severity: Medium
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 1
Private: No

Details

Hi,

I have a B450 TOMAHAWK MAX (MS-7C02), which has an RTL8111H network interface, running linux-5.4.11-arch1-1:

22:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
Subsystem: Micro-Star International Co., Ltd. [MSI] RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 60
Region 0: I/O ports at f000 [size=256]
Region 2: Memory at fcc04000 (64-bit, non-prefetchable) [size=4K]
Region 4: Memory at fcc00000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: r8168
Kernel modules: r8169, r8168
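
For anyone comparing the two drivers, it helps to confirm which one is actually bound before each test. The following is a minimal sketch, assuming the text comes from `lspci -k` (or `lspci -vv`) output shaped like the lines above; the function name `parse_lspci_drivers` is hypothetical and not part of any tool.

```python
import re


def parse_lspci_drivers(text):
    """Return (driver_in_use, candidate_modules) from lspci-style text.

    Assumes the "Kernel driver in use:" / "Kernel modules:" line format
    shown in the report above. Returns (None, []) if neither line is found.
    """
    in_use = None
    modules = []
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Kernel driver in use:\s*(\S+)", line)
        if m:
            in_use = m.group(1)
            continue
        m = re.match(r"Kernel modules:\s*(.+)", line)
        if m:
            modules = [name.strip() for name in m.group(1).split(",")]
    return in_use, modules


# Sample taken from the lspci output in this report.
sample = """\
Kernel driver in use: r8168
Kernel modules: r8169, r8168
"""
driver, modules = parse_lspci_drivers(sample)
print(driver, modules)
```

Here the out-of-tree r8168 module is bound, while r8169 is listed only as an available module.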

With the r8168 8.047.05-14 driver, the gateway is sometimes unreachable after the machine boots:

r8168 Gigabit Ethernet driver 8.047.05-NAPI loaded
r8168 0000:22:00.0: enabling device (0000 -> 0003)
r8168: This product is covered by one or more of the following patents: US6,570,884, US6,115,776, and US6,327,625.
r8168 Copyright (C) 2019 Realtek NIC software team <nicfae@realtek.com>
r8168 0000:22:00.0 enp34s0: renamed from eth0
r8168: enp34s0: link up


PING 192.168.3.1 (192.168.3.1) 56(84) bytes of data.
From 192.168.3.26 icmp_seq=1 Destination Host Unreachable
From 192.168.3.26 icmp_seq=2 Destination Host Unreachable
From 192.168.3.26 icmp_seq=3 Destination Host Unreachable
From 192.168.3.26 icmp_seq=4 Destination Host Unreachable
From 192.168.3.26 icmp_seq=5 Destination Host Unreachable

--- 192.168.3.1 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 4058ms

This happens despite the interface having an IP address and a route:

default via 192.168.3.1 dev enp34s0.3 proto dhcp metric 20400
192.168.3.0/24 dev enp34s0.3 proto kernel scope link src 192.168.3.26 metric 400

When using the in-kernel driver (r8169), i.e. with r8168 removed, I see very high latency and a massive amount of packet loss:

64 bytes from 192.168.3.1: icmp_seq=136 ttl=64 time=5753 ms
64 bytes from 192.168.3.1: icmp_seq=137 ttl=64 time=4740 ms
64 bytes from 192.168.3.1: icmp_seq=138 ttl=64 time=3726 ms
64 bytes from 192.168.3.1: icmp_seq=139 ttl=64 time=2714 ms
From 192.168.3.26 icmp_seq=143 Destination Host Unreachable
From 192.168.3.26 icmp_seq=144 Destination Host Unreachable
From 192.168.3.26 icmp_seq=145 Destination Host Unreachable
From 192.168.3.26 icmp_seq=146 Destination Host Unreachable
From 192.168.3.26 icmp_seq=147 Destination Host Unreachable
64 bytes from 192.168.3.1: icmp_seq=141 ttl=64 time=8713 ms
64 bytes from 192.168.3.1: icmp_seq=142 ttl=64 time=7700 ms
64 bytes from 192.168.3.1: icmp_seq=159 ttl=64 time=5420 ms
64 bytes from 192.168.3.1: icmp_seq=160 ttl=64 time=4406 ms
64 bytes from 192.168.3.1: icmp_seq=161 ttl=64 time=3393 ms
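
To put numbers on output like this, a saved ping log can be summarized. This is a rough sketch, assuming the iputils `ping` line format shown above; `summarize_ping` is a hypothetical helper, and the sample contains only a few lines copied from this report.

```python
import re

# Patterns for the two kinds of lines seen in the ping output above.
REPLY = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")
UNREACHABLE = re.compile(r"icmp_seq=(\d+) Destination Host Unreachable")


def summarize_ping(text):
    """Count replies and unreachable errors, and compute round-trip stats."""
    rtts = []
    unreachable = 0
    for line in text.splitlines():
        m = REPLY.search(line)
        if m:
            rtts.append(float(m.group(2)))
        elif UNREACHABLE.search(line):
            unreachable += 1
    return {
        "replies": len(rtts),
        "unreachable": unreachable,
        "min_ms": min(rtts) if rtts else None,
        "max_ms": max(rtts) if rtts else None,
        "mean_ms": sum(rtts) / len(rtts) if rtts else None,
    }


sample = """\
64 bytes from 192.168.3.1: icmp_seq=136 ttl=64 time=5753 ms
64 bytes from 192.168.3.1: icmp_seq=137 ttl=64 time=4740 ms
From 192.168.3.26 icmp_seq=143 Destination Host Unreachable
From 192.168.3.26 icmp_seq=144 Destination Host Unreachable
64 bytes from 192.168.3.1: icmp_seq=141 ttl=64 time=8713 ms
"""
summary = summarize_ping(sample)
print(summary)
```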

I even saw this come up once:

ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available

In the kernel logs I observed this:

kernel: ------------[ cut here ]------------
kernel: NETDEV WATCHDOG: enp34s0 (r8169): transmit queue 0 timed out
kernel: WARNING: CPU: 6 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x26a/0x280
kernel: Modules linked in: fuse 8021q garp mrp stp llc rfkill amdgpu snd_hda_codec_realtek gpu_sched i2c_algo_bit snd_hda_codec_generic ttm ledtrig_audio snd_hda_codec_hdmi nls_iso8859_1 drm_kms_helper snd_hda_intel nls_cp437 snd_i>
kernel: CPU: 6 PID: 0 Comm: swapper/6 Tainted: G OE 5.4.11-arch1-1 #1
kernel: Hardware name: Micro-Star International Co., Ltd MS-7C02/B450 TOMAHAWK MAX (MS-7C02), BIOS 3.50 11/07/2019
kernel: RIP: 0010:dev_watchdog+0x26a/0x280
kernel: Code: 1c 3d 82 ff eb 88 4c 89 f7 c6 05 1b 6f b3 00 01 e8 fb c6 fa ff 44 89 e9 4c 89 f6 48 c7 c7 70 34 57 b4 48 89 c2 e8 04 f9 8a ff <0f> 0b e9 66 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
kernel: RSP: 0018:ffffad6f803a8e60 EFLAGS: 00010286
kernel: RAX: 0000000000000000 RBX: ffff9bdb02772400 RCX: 0000000000000000
kernel: RDX: 0000000000000103 RSI: ffff9bdb0e997708 RDI: 00000000ffffffff
kernel: RBP: ffff9bdb01c0a45c R08: 0000000000000546 R09: 0000000000000004
kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff9bdb01c0a480
kernel: R13: 0000000000000000 R14: ffff9bdb01c0a000 R15: ffff9bdb02772480
kernel: FS: 0000000000000000(0000) GS:ffff9bdb0e980000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000330f8905ec20 CR3: 00000003f2916000 CR4: 0000000000340ee0
kernel: Call Trace:
kernel: <IRQ>
kernel: ? qdisc_put_unlocked+0x30/0x30
kernel: call_timer_fn+0x2d/0x160
kernel: run_timer_softirq+0x1ad/0x510
kernel: ? qdisc_put_unlocked+0x30/0x30
kernel: __do_softirq+0x111/0x34d
kernel: irq_exit+0xac/0xd0
kernel: smp_apic_timer_interrupt+0xa6/0x1b0
kernel: apic_timer_interrupt+0xf/0x20
kernel: </IRQ>
kernel: RIP: 0010:cpuidle_enter_state+0xc4/0x480
kernel: Code: e8 31 fa 99 ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 93 03 00 00 31 ff e8 03 51 a0 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 be 02 00 00 49 63 cc 4c 2b 6c 24 10 48 8d 04 49 48
kernel: RSP: 0018:ffffad6f80177e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
kernel: RAX: ffff9bdb0e980000 RBX: ffffffffb46c1680 RCX: 000000000000001f
kernel: RDX: 0000000000000000 RSI: 00000000238e3a2f RDI: 0000000000000000
kernel: RBP: ffff9bdb04eb2400 R08: 000000069cf8bcb4 R09: 00000000000163bb
kernel: R10: ffff9bdb0e9a97e0 R11: ffff9bdb0e9a97c0 R12: 0000000000000002
kernel: R13: 000000069cf8bcb4 R14: 0000000000000002 R15: ffff9bdb0c4adac0
kernel: ? cpuidle_enter_state+0x9f/0x480
kernel: cpuidle_enter+0x29/0x40
kernel: do_idle+0x1de/0x260
kernel: cpu_startup_entry+0x19/0x20
kernel: start_secondary+0x186/0x1d0
kernel: secondary_startup_64+0xb6/0xc0
kernel: ---[ end trace f98dd501aa1b8460 ]---
kernel: r8169 0000:22:00.0 enp34s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).
NetworkManager[1184]: <info> [1578977660.9164] dhcp4 (enp34s0.3): option dhcp_lease_time => '43200'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option domain_name_servers => '192.168.3.1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option expiry => '1579020860'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option host_name => 'XXXXXXX'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option ip_address => '192.168.3.26'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option ntp_servers => '192.168.3.1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_broadcast_address => '1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_dhcp_server_identifier => '1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_domain_name => '1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_domain_name_servers => '1'
NetworkManager[1184]: <info> [1578977660.9165] dhcp4 (enp34s0.3): option requested_domain_search => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_host_name => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_interface_mtu => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_ms_classless_static_routes => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_nis_domain => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_nis_servers => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_ntp_servers => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_rfc3442_classless_static_routes => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_root_path => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_routers => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_static_routes => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_subnet_mask => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_time_offset => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option requested_wpad => '1'
NetworkManager[1184]: <info> [1578977660.9166] dhcp4 (enp34s0.3): option routers => '192.168.3.1'
NetworkManager[1184]: <info> [1578977660.9167] dhcp4 (enp34s0.3): option subnet_mask => '255.255.255.0'

The problem seems to be intermittent. If I reboot, the problem may not happen. Other devices on the switch work just fine, as does that patch cable, so I have determined it is not an upstream issue in my network.

When I am having issues with the 1000+ ms lag, I see this repeated in the kernel logs:

r8169 0000:22:00.0 enp34s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).
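
To check how often these messages recur and whether they line up with the lag, the kernel log can be scanned for them. The sketch below assumes the log text uses the two message formats quoted in this report; `count_tx_events` and its counter keys are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Message formats copied from the kernel log excerpts in this report.
TX_STALL = re.compile(
    r"r8169 \S+ (\S+): rtl_txcfg_empty_cond == 0 \(loop: (\d+), delay: (\d+)\)"
)
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit queue (\d+) timed out"
)


def count_tx_events(log_text):
    """Count r8169 TX-empty timeouts and netdev watchdog timeouts in a log."""
    counts = Counter()
    for line in log_text.splitlines():
        if TX_STALL.search(line):
            counts["txcfg_empty_timeout"] += 1
        elif WATCHDOG.search(line):
            counts["watchdog_timeout"] += 1
    return counts


sample_log = """\
kernel: NETDEV WATCHDOG: enp34s0 (r8169): transmit queue 0 timed out
kernel: r8169 0000:22:00.0 enp34s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).
kernel: r8169 0000:22:00.0 enp34s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).
"""
counts = count_tx_events(sample_log)
print(dict(counts))
```

In practice the input could be the saved output of `journalctl -k` from a boot where the problem occurred.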

If anyone could provide further assistance so that I can get to the bottom of this, I'd appreciate it. I'd like to report this upstream if it's a bug.

Closed by freswa (frederik)
Thursday, 20 February 2020, 22:01 GMT
Reason for closing: Not a bug
Comment by Daniel Gray (dngray) - Tuesday, 14 January 2020, 15:46 GMT
> Other devices on the switch work just fine, as does that patch cable, so I have determined it is not an upstream issue in my network.

I should mention that during my testing I substituted the switch and the issue still persisted. Additionally, I have two systems with B450 TOMAHAWK MAX (MS-7C02) motherboards. The issue is reproducible on both, so it is not a hardware fault in a single malfunctioning NIC.
Comment by KCC (KCC) - Wednesday, 15 January 2020, 13:01 GMT
I encountered a similar problem.

An RTL8125 connected to an 82571GB has frequent problems.
An RTL8125 connected to an RTL8111E has the problem about once a day.

Both machines run Arch Linux. One has two RTL8125 interfaces (one on-board, one on a separate PCIe card); the other has a four-port 82571GB (stand-alone PCIe card) and an on-board RTL8111E.
Comment by Daniel Gray (dngray) - Wednesday, 15 January 2020, 17:55 GMT
I have observed some strange behavior. The switch has 3 devices connected to it. Two of those devices are B450 TOMAHAWK MAX (MS-7C02) motherboards.

The third is a laptop connected to a Dell USB-C Mobile Adapter - DA300 https://www.dell.com/en-us/shop/dell-usb-c-mobile-adapter-da300/apd/470-acwn/pc-accessories

While the laptop is off but the dock is powered, the dock seems to be sending some kind of bad frame. I hear coil whine as soon as the dock is plugged into the powered-off laptop.

I had two of these boards giving me this error simultaneously. In fact, in the 3-4 times I have observed it, it has always happened on both Tomahawks at the same time.

As soon as I powered on the laptop, the problem disappeared on both machines. I've been able to reproduce this 3 times now; unplugging the dock seems to fix it too.

Additionally, the issue mentioned above also affects Windows 10, so it seems to be something the dock is doing to my switch (I've tested two switches and replaced all cables). The issue definitely seems to have something to do with the dock.
Comment by Daniel Gray (dngray) - Wednesday, 15 January 2020, 17:56 GMT
I should mention the dock has a Realtek RTL8153 in it.
Comment by Daniel Gray (dngray) - Saturday, 25 January 2020, 08:55 GMT
I finally figured out what was causing this.

It turns out it wasn't a bad switch; it must be a bug in the switch firmware. All of the switches I tried were the same older model, the Cisco SG 100D-08 Gigabit Switch https://www.cisco.com/c/en/us/obsolete/switches/cisco-sg-100d-08-8-port-gigabit-switch.html

When I replaced those switches with a Ubiquiti ES‑8‑150W, the problem went away. It certainly seemed to have something to do with my laptop being plugged into the dock while powered off.

When I enabled port mirroring I didn't see any unusual packets coming from the laptop.
