FS#62329 - [linux] e1000e crash

Attached to Project: Arch Linux
Opened by Bart (bart82) - Saturday, 13 April 2019, 12:34 GMT
Last edited by Andreas Radke (AndyRTR) - Tuesday, 01 March 2022, 21:07 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:
Since about 1.5 weeks the system has ended up with a non-working wan interface. I haven't been able to pinpoint yet, if this is exactly in sync with the DHCP lease. It seems the lease is longer that the crash interval from this information.


Apr 05 22:30:46 lala kernel: ------------[ cut here ]------------
-- Reboot --
Apr 07 14:13:20 lala kernel: Linux version 5.0.5-arch1-1-ARCH (builduser@heftig-17705) (gcc version 8.2.1 20181127 (GCC)) #1 SMP PREEMPT Wed Mar 27 17:53:10 UTC 2019


Apr 11 03:24:46 lala kernel: ------------[ cut here ]------------
-- Reboot --
Apr 13 10:08:11 lala kernel: Linux version 5.0.7-arch1-1-ARCH (builduser@heftig-20167) (gcc version 8.2.1 20181127 (GCC)) #1 SMP PREEMPT Mon Apr 8 10:37:08 UTC 2019


The system runs fine until that certain point the issue happens. Below the information I've gathered thus far; Where google also pointed me to some ASPM topics, so i've added that information as well.


Additional info:
### journalctl
Apr 11 03:24:46 lala kernel: ------------[ cut here ]------------
Apr 11 03:24:46 lala kernel: NETDEV WATCHDOG: wan0 (e1000e): transmit queue 0 timed out
Apr 11 03:24:46 lala kernel: WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x21a/0x220
Apr 11 03:24:46 lala kernel: Modules linked in: veth nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo arc4 8021q garp mrp bridge stp llc ath9k ath9k_common ath9k_hw intel_rapl x86_pkg_temp_thermal intel_powercla
mp ath coretemp mac80211 nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_multiport xt_conntrack iptable_filter ipt_MASQUERADE xt_nat xt_tcpudp iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_
defrag_ipv6 nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio i915 mousedev input_leds joydev btrfs kvmgt vfio_mdev mdev vfio_iommu_type1 vfio kvm libcrc32c hid_logitech_h
idpp irqbypass i2c_algo_bit xor drm_kms_helper snd_hda_intel cfg80211 snd_hda_codec hid_logitech_dj hid_generic drm crct10dif_pclmul iTCO_wdt iTCO_vendor_support crc32_pclmul snd_hda_core snd_hwdep ghash_clmulni_i
ntel raid6_pq snd_pcm cryptd intel_cstate usbhid intel_gtt intel_uncore e1000e snd_timer intel_rapl_perf agpgart hid pcspkr i2c_i801 mei_me snd syscopyarea uas
Apr 11 03:24:46 lala kernel: sysfillrect lpc_ich sysimgblt rfkill mei soundcore fb_sys_fops usb_storage evdev mac_hid pcc_cpufreq ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sd_mod ahci libahci libata crc32c_intel xhci_pci scsi_mod ehci_pci ehci_hcd xhci_hcd
Apr 11 03:24:46 lala kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.6-arch1-1-ARCH #1
Apr 11 03:24:46 lala kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./ChiefRiver, BIOS 4.6.5 09/24/2014
Apr 11 03:24:46 lala kernel: RIP: 0010:dev_watchdog+0x21a/0x220
Apr 11 03:24:46 lala kernel: Code: 49 63 4c 24 e0 eb 8c 4c 89 ef c6 05 92 52 bf 00 01 e8 da 71 fc ff 89 d9 4c 89 ee 48 c7 c7 58 64 73 8c 48 89 c2 e8 10 dc 96 ff <0f> 0b eb be 66 90 0f 1f 44 00 00 48 c7 47 08 00 00 00 00 48 c7 07
Apr 11 03:24:46 lala kernel: RSP: 0018:ffff8cce80103e78 EFLAGS: 00010286
Apr 11 03:24:46 lala kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Apr 11 03:24:46 lala kernel: RDX: 0000000000000103 RSI: 00000000000000f6 RDI: 00000000ffffffff
Apr 11 03:24:46 lala kernel: RBP: ffff8ccdf0ae045c R08: 0000000000000001 R09: 000000000000059a
Apr 11 03:24:46 lala kernel: R10: 0000000000000004 R11: 0000000000000000 R12: ffff8ccdf0ae0480
Apr 11 03:24:46 lala kernel: R13: ffff8ccdf0ae0000 R14: 0000000000000001 R15: ffff8ccdf047b480
Apr 11 03:24:46 lala kernel: FS: 0000000000000000(0000) GS:ffff8cce80100000(0000) knlGS:0000000000000000
Apr 11 03:24:46 lala kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 11 03:24:46 lala kernel: CR2: 00007fcaa29fcc14 CR3: 0000000051a0e001 CR4: 00000000001606e0
Apr 11 03:24:46 lala kernel: Call Trace:
Apr 11 03:24:46 lala kernel: <IRQ>
Apr 11 03:24:46 lala kernel: ? qdisc_put_unlocked+0x30/0x30
Apr 11 03:24:46 lala kernel: ? qdisc_put_unlocked+0x30/0x30
Apr 11 03:24:46 lala kernel: call_timer_fn+0x2b/0x160
Apr 11 03:24:46 lala kernel: ? qdisc_put_unlocked+0x30/0x30
Apr 11 03:24:46 lala kernel: expire_timers+0x99/0x110
Apr 11 03:24:46 lala kernel: run_timer_softirq+0x8a/0x160
Apr 11 03:24:46 lala kernel: ? sched_clock+0x5/0x10
Apr 11 03:24:46 lala kernel: ? sched_clock_cpu+0xe/0xd0
Apr 11 03:24:46 lala kernel: __do_softirq+0x112/0x356
Apr 11 03:24:46 lala kernel: irq_exit+0xd9/0xf0
Apr 11 03:24:46 lala kernel: smp_apic_timer_interrupt+0x87/0x180
Apr 11 03:24:46 lala kernel: apic_timer_interrupt+0xf/0x20
Apr 11 03:24:46 lala kernel: </IRQ>
Apr 11 03:24:46 lala kernel: RIP: 0010:cpuidle_enter_state+0xbc/0x480
Apr 11 03:24:46 lala kernel: Code: e8 39 04 a3 ff 80 7c 24 13 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 99 03 00 00 31 ff e8 db 27 a9 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 c4 02 00 00 49 63 cc 4c 8b 3c 24 4c 2b 7c 24 08 48
Apr 11 03:24:46 lala kernel: RSP: 0018:ffffb361403abe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Apr 11 03:24:46 lala kernel: RAX: ffff8cce80100000 RBX: ffffffff8c8b81c0 RCX: 000000000000001f
Apr 11 03:24:46 lala kernel: RDX: 0001160e5206cc08 RSI: ffffffff8c6aa9f8 RDI: ffffffff8c6b3088
Apr 11 03:24:46 lala kernel: RBP: ffff8cce8012ad00 R08: 0000000000000004 R09: 0000000000021500
Apr 11 03:24:46 lala kernel: R10: 0001f360a4815ec5 R11: ffff8cce80120be4 R12: 0000000000000005
Apr 11 03:24:46 lala kernel: R13: ffffffff8c8b83b8 R14: 0000000000000005 R15: 0000000000000000
Apr 11 03:24:46 lala kernel: ? cpuidle_enter_state+0x97/0x480
Apr 11 03:24:46 lala kernel: do_idle+0x217/0x250
Apr 11 03:24:46 lala kernel: cpu_startup_entry+0x19/0x20
Apr 11 03:24:46 lala kernel: start_secondary+0x1aa/0x200
Apr 11 03:24:46 lala kernel: secondary_startup_64+0xa4/0xb0
Apr 11 03:24:46 lala kernel: ---[ end trace 70d2440d3ae6f9e3 ]---
Apr 11 03:24:46 lala kernel: e1000e 0000:01:00.0 wan0: Reset adapter unexpectedly
Apr 11 03:24:46 lala systemd-networkd[361]: wan0: Lost carrier
Apr 11 03:24:46 lala systemd-networkd[361]: wan0: DHCP lease lost
Apr 11 03:24:46 lala kernel: e1000e 0000:01:00.0 wan0: Timesync Tx Control register not set as expected


# cat /run/systemd/netif/leases/2
# This is private data. Do not parse.
ADDRESS=*.*.*.*
NETMASK=255.255.248.0
ROUTER=*.*.*.*
SERVER_ADDRESS=*.*.*.*
NEXT_SERVER=*.*.*.*
T1=301767
T2=528092
LIFETIME=603534
DNS=*.*.*.* *.*.*.*
NTP=*.*.*.*
CLIENTID=<number>


# dmesg | grep -e e1000e -e wan0
[ 3.219223] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
[ 3.219224] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
[ 3.219319] e1000e 0000:01:00.0: Disabling ASPM L0s L1
[ 3.219321] e1000e 0000:01:00.0: can't disable ASPM; OS doesn't have ASPM control
[ 3.219573] e1000e 0000:01:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 3.282554] e1000e 0000:01:00.0 0000:01:00.0 (uninitialized): registered PHC clock
[ 3.351519] e1000e 0000:01:00.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:e8:4c:68:17:c2
[ 3.352749] e1000e 0000:01:00.0 eth0: Intel(R) PRO/1000 Network Connection
[ 3.352838] e1000e 0000:01:00.0 eth0: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
[ 3.655904] e1000e 0000:01:00.0 wan0: renamed from eth0
[ 5.476139] 8021q: adding VLAN 0 to HW filter on device wan0
[ 5.476171] A link change request failed with some changes committed already. Interface wan0 may have been left with an inconsistent configuration, please check.
[ 8.504015] e1000e: wan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 8.504375] IPv6: ADDRCONF(NETDEV_CHANGE): wan0: link becomes ready


Steps to reproduce:
reboot and wait a few days.
This task depends upon

Closed by  Andreas Radke (AndyRTR)
Tuesday, 01 March 2022, 21:07 GMT
Reason for closing:  Fixed
Additional comments about closing:  No more reprodudible. Assuming fixed upstream.
Comment by Bart (bart82) - Saturday, 13 April 2019, 12:43 GMT
networkd conf

# cat /etc/systemd/network/wan.network
[Match]
Name=wan0

[Network]
Description="WAN interface"
DHCP=ipv4
IPForward=ipv4
IPv6AcceptRA=no


# cat /etc/udev/rules.d/network_persistent.rules
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:e8:4c:68:17:c2", NAME="wan0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:e8:4c:68:17:c3", NAME="lan0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:23:08:f9:d1:ce", NAME="wifi0"
Comment by loqs (loqs) - Saturday, 13 April 2019, 14:42 GMT
[1] contains three changes commits referencing e1000e please determine which commit caused your issue and report upstream
b503ea08fe0dcf03a5e78a558ea0bd7b5dcbd164
7f0a3a436e88a71b96694c029f01a9a8eade3d5d
a782956c2a305551a49962ad6bd784d14af769e7
See also [2]

[1] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.0.7
[2] https://bbs.archlinux.org/viewtopic.php?id=245587
Comment by Bart (bart82) - Sunday, 14 April 2019, 06:12 GMT
refresh - repost - sorry
Comment by Bart (bart82) - Sunday, 14 April 2019, 06:23 GMT
I'm currently remote (vacation). Troubleshooting this might be a challenge since this happens on my router. Besides that is there a howto that explains these kind of steps to help the team to provide with the most accurate/best information?

Perhaps the git clone, revert, reset commands and the necessary building commands for the kerel or whatever is required?

cheers.
Comment by Kevin Flynn (Kevin_Flynn) - Saturday, 27 April 2019, 14:56 GMT
Wanted to add my 2 cents here, might be a related bug, might be a different one.

# cat /proc/version
Linux version 4.20.17-1-MANJARO (builduser@development) (gcc version 8.2.1 20181127 (GCC)) #1 SMP PREEMPT Tue Mar 19 18:07:36 UTC 2019

# lspci | grep 'net'
...
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 05)
...

There is/was a years old bug in the e1000 driver software for this specific Intel gigabit chipset (82579LM) that used to manifest itself as the network card repeatedly resetting itself, spurious interrupts reported in the system logs, and in my case where the system was being used as a router, network connectivity going up and down repeatedly. The symptoms would not appear immediately after a reboot but took hours or days, perhaps depending on traffic levels, to appear at which point it would continue repeatedly until a reboot.

The workaround for this bug was to use ethtool to disable the offload features as follows:

# ethtool -K eno1 gso off gro off tso off

While this reduced performance in theory ( I never tested it ) the system was very stable after inserting this into my network startup scripts ( I do not use network manager ).

Fast forward to last night ( April 26th 2019 ) and my desktop system ( see version info above ) started behaving in a similar manner. Network connectivity was repeatedly connecting / disconnecting on a 15-30 second cycle. This system has a dual Intel ethernet card with a different chipset in it, so as a temporary fix, I switched over to one of the other ports as a temporary fix.

I searched dmesg for the infamous "spurious interrupt" messages I used to see, but was unable to locate any specific output that seemed to indicate the error.

Good luck and thanks to all for making Arch great ✌️
Comment by loqs (loqs) - Saturday, 27 April 2019, 15:51 GMT Comment by Andris Pavenis (andris) - Saturday, 28 September 2019, 16:56 GMT
One more possibly similar reproduced on 2 systems.

Both has Intel network adapter built into motherboard:
1) Asrock Z270 Gaming K4 Fatal1ty, Core i7 7700K, 64GB, ...
2) Asus X79 Sabertooth motherboard, Core i7 3930K, 64Gb, ...

Symptoms:

Repeated network conneting/disconnecting severel times after either system is booted, cable connected or connection stoped and started again from NetworkḾanagaer (in all these cases). It happens several times and stops if I'm lucky or could continue for several minutes. After that network adapter seems to work OK

Output from 'sudo journalctl -f' when I'm trying to enable adapter from NetworkManager (I disabled it before) is in attached file

ethtool trick mentioned before did not help.
Comment by Andris Pavenis (andris) - Saturday, 28 September 2019, 16:57 GMT
Missing attachment from previous message
Comment by mattia (nTia89) - Monday, 28 February 2022, 16:24 GMT
Upstream report has been closed. Is an issue still valid for you?
Comment by Andris Pavenis (andris) - Tuesday, 01 March 2022, 18:12 GMT
Not any more. I then temporarily used USB3 ethernet adapter to workaround the problem instead of motherboards own adapter. Did need it any more after some time as far as I remember.

Loading...