FS#61688 - [networkmanager] 100% CPU consumption

Attached to Project: Arch Linux
Opened by Igor (f2404) - Saturday, 09 February 2019, 03:11 GMT
Last edited by Jan Alexander Steffens (heftig) - Tuesday, 19 February 2019, 15:45 GMT
Task Type: Bug Report
Category: Packages: Extra
Status: Closed
Assigned To: Jan Alexander Steffens (heftig)
Architecture: x86_64
Severity: High
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 22
Private: No

Details

Description:
The NetworkManager process starts consuming 100% CPU shortly after power-up.

Additional info:
* package version(s): 1.14.5dev+17+gba83251bb-2
* config and/or log files etc.

strace'ing the process shows the following calls, which are repeated at a very high rate:

epoll_wait(15, [], 4, 0) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=882, tv_nsec=758164139}) = 0
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=18, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 10, 1242) = 2 ([{fd=22, revents=POLLNVAL}, {fd=24, revents=POLLNVAL}])
epoll_wait(15, [], 4, 0) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=882, tv_nsec=758265211}) = 0
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=18, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 10, 1242) = 2 ([{fd=22, revents=POLLNVAL}, {fd=24, revents=POLLNVAL}])
epoll_wait(15, [], 4, 0) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=882, tv_nsec=758364674}) = 0
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=18, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 10, 1242) = 2 ([{fd=22, revents=POLLNVAL}, {fd=24, revents=POLLNVAL}])
etc.

Closed by Jan Alexander Steffens (heftig)
Tuesday, 19 February 2019, 15:45 GMT
Reason for closing:  Fixed
Additional comments about closing:  curl 7.64.0-9
Comment by Oliver (xxarch) - Saturday, 09 February 2019, 20:17 GMT
Hi Igor

Are you using the plasma desktop?

I have the same problem, but I am not convinced that it is actually a bug in NetworkManager, because a downgrade to networkmanager-1.14.4-1-x86_64.pkg.tar.xz did not solve the problem. However, downgrading networkmanager-qt to version 5.53.0-1 seems to solve the problem.

Best regards
Oliver
Comment by Igor (f2404) - Saturday, 09 February 2019, 21:19 GMT
Hi Oliver

I'm using the Xfce desktop. I've tried downgrading to 1.14.4 as well, and it didn't help. I don't have networkmanager-qt installed.
It could be that something external to NetworkManager is causing this problem, but in the end it's the NM process that goes into an infinite loop and consumes all the CPU time.
Comment by Igor (f2404) - Saturday, 09 February 2019, 21:36 GMT
It looks like the bug might not be reproducible with every WiFi access point.
I'm currently sitting in a cafe using their open WiFi AP, and NM is behaving totally fine, whereas when I'm at home using my password-protected AP, NM starts consuming 100% CPU within about 5 minutes of connecting to the AP.
Comment by sydney (sydney6) - Sunday, 10 February 2019, 00:15 GMT
FWIW, I appear to be affected by this issue as well. After some short research, these are my findings:

- The issue somehow only occurs if I am connected to (the same) network via Ethernet and WiFi simultaneously.
- After said 300s/5min, tcpdump shows connections being made to apollo.archlinux.org and www.archlinux.org.
- This appears to be the "connectivity" feature of NetworkManager, which defaults to a 300s interval and tries to make an HTTP connection to the URI "http://www.archlinux.org/check_network_status.txt".
- Setting the interval option to zero in "/usr/lib/NetworkManager/conf.d/20-connectivity.conf" seems to remedy the issue.
- This started 1-2 days ago, and downgrading any of the packages installed since then doesn't appear to help.
- I didn't investigate this further, but I had issues around HSTS when trying to fetch the file used by the NM connectivity feature.
- I don't really understand what's happening.
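For reference, the override sydney6 describes can be written as a drop-in under /etc. This is only a sketch: the [connectivity] section and its uri/interval keys are documented in NetworkManager.conf(5), a same-named file in /etc/NetworkManager/conf.d/ shadows the packaged one under /usr/lib, and interval=0 disables the periodic check entirely:

```ini
# /etc/NetworkManager/conf.d/20-connectivity.conf
# Shadows the packaged /usr/lib/NetworkManager/conf.d/20-connectivity.conf.
[connectivity]
uri=http://www.archlinux.org/check_network_status.txt
interval=0
```

Restart NetworkManager afterwards so the new configuration is picked up.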
Comment by Igor (f2404) - Sunday, 10 February 2019, 01:24 GMT
sydney, this is a great finding!
It seems that 20-connectivity.conf has been there for a while; at least version 1.14.4 does contain it, which is why downgrading the package doesn't help.
The URL is returning error 404 now. Perhaps the connectivity check feature was working in the past but they have removed the file from the website recently?
In any case, NetworkManager falling into an endless loop looks like a design flaw.
Comment by sydney (sydney6) - Sunday, 10 February 2019, 02:03 GMT
The URI still works when fetched using curl or wget. What I don't understand is that www.archlinux.org has an HTTP Strict Transport Security policy on this domain, yet wget, at least, doesn't appear to honor it, which would be quite odd.
Comment by loqs (loqs) - Sunday, 10 February 2019, 02:05 GMT
libcurl, which the NetworkManager connectivity check uses, does not support HSTS: https://github.com/curl/curl/blob/9cb126792ce225e7c7d6ab5d1bf74f58f1844a29/docs/TODO#L810
Comment by loqs (loqs) - Sunday, 10 February 2019, 02:18 GMT
wget does support HSTS:
wget -d http://www.archlinux.org/check_network_status.txt
DEBUG output created by Wget 1.20.1 on linux-gnu.

Reading HSTS entries from /home/user/.wget-hsts
URI encoding = ‘UTF-8’
URL transformed to HTTPS due to an HSTS policy
Comment by sydney (sydney6) - Sunday, 10 February 2019, 03:15 GMT
Snafu on my part: somehow I had the ownership of the .wget-hsts file all messed up. It wouldn't have made sense either, as this issue appears to occur only when connected via Ethernet and WiFi simultaneously, to me at least.
Comment by loqs (loqs) - Sunday, 10 February 2019, 03:36 GMT
Was curl 7.64.0-1 part of the update that triggered the issue?
Comment by sydney (sydney6) - Sunday, 10 February 2019, 03:52 GMT
[2019-02-07 20:32] [ALPM] upgraded curl (7.63.0-4 -> 7.64.0-1)

yes, possibly. will give it a try.
Comment by sydney (sydney6) - Sunday, 10 February 2019, 04:23 GMT
Downgrading curl to 7.63.0-4 indeed does seem to resolve this issue.
Comment by Oskar (traubenuss) - Sunday, 10 February 2019, 10:30 GMT
For me it happens when connected to my home network via wifi, with ethernet disconnected. I have not been able to test it on other networks yet but downgrading curl to 7.63.0-4 fixed it.
I'll try it on my other machine connected via ethernet later today.
Comment by Oskar (traubenuss) - Sunday, 10 February 2019, 11:23 GMT
Yep, my wired machine shows the same behavior. Downgrading curl to 7.63.0-4 also fixes it on this machine.
Comment by loqs (loqs) - Sunday, 10 February 2019, 15:13 GMT
Does the following produce any more output in the journal when the issue occurs?
dbus-send --system --print-reply --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.SetLogging string:"debug" string:"CONCHECK"
This should revert the change
dbus-send --system --print-reply --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.SetLogging string:"info" string:"CONCHECK"

I would also suggest bisecting curl between 7.63 and 7.64 to see if you can identify which commit causes the issue.
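For anyone unfamiliar with the mechanics, `git bisect run` can automate such a search once you have a script that reports pass/fail for each revision. Below is a toy demonstration in a throwaway repository; the eight commits and the "bad since commit 5" condition are entirely made up. For the real bug you would clone curl, mark the 7.63.0 tag good and the 7.64.0 tag bad, and use a script that rebuilds curl and checks whether NetworkManager starts spinning.

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email demo@example.com
git config user.name demo

# Eight commits; the file "state" records the commit number.
for i in 1 2 3 4 5 6 7 8; do
  echo "$i" > state
  git add state
  git commit -qm "commit $i"
done

# Pretend the regression landed in "commit 5": a commit is good iff state < 5.
# bisect start takes the bad revision first, then the good one.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" > /dev/null
git bisect run sh -c 'test "$(cat state)" -lt 5' > /dev/null

# git records the culprit as refs/bisect/bad; capture it before resetting.
bad_msg=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset > /dev/null
echo "first bad commit: $bad_msg"
```

With ~130 commits between the two curl releases, this converges in about 7-8 build-and-test steps instead of testing every commit.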
Comment by sydney (sydney6) - Sunday, 10 February 2019, 23:41 GMT
I looked at the full NM debug log earlier and didn't notice anything unusual, besides the wireless interface trying and not getting a response.

While getting the log output you requested, I turned off IPv6 on both interfaces because of noise in tcpdump, and I noticed that this also remedies the issue.

### with ipv6
Feb 10 23:57:27 x1 NetworkManager[4952]: <info> [1549839447.0782] dhcp6 (wlp2s0): state changed unknown -> bound
Feb 11 00:02:20 x1 NetworkManager[4952]: <debug> [1549839740.2153] connectivity: (enp0s31f6) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:20 x1 NetworkManager[4952]: <debug> [1549839740.2956] connectivity: (enp0s31f6) check completed: FULL; expected response
Feb 11 00:02:24 x1 NetworkManager[4952]: <debug> [1549839744.5160] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:44 x1 NetworkManager[4952]: <debug> [1549839764.9302] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:02:44 x1 NetworkManager[4952]: <debug> [1549839764.9302] device[0x55ddaea998e0] (wlp2s0): connectivity state changed from FULL to LIMITED
Feb 11 00:02:44 x1 NetworkManager[4952]: <debug> [1549839764.9313] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:45 x1 NetworkManager[4952]: <debug> [1549839765.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:47 x1 NetworkManager[4952]: <debug> [1549839767.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:51 x1 NetworkManager[4952]: <debug> [1549839771.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:59 x1 NetworkManager[4952]: <debug> [1549839779.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:03:04 x1 NetworkManager[4952]: <debug> [1549839784.9337] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:05 x1 NetworkManager[4952]: <debug> [1549839785.9301] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:07 x1 NetworkManager[4952]: <debug> [1549839787.9301] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:11 x1 NetworkManager[4952]: <debug> [1549839791.9301] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:15 x1 NetworkManager[4952]: <debug> [1549839795.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:03:19 x1 NetworkManager[4952]: <debug> [1549839799.9301] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:35 x1 systemd[1]: Stopping Network Manager...

### without ipv6
Feb 10 23:31:31 x1 NetworkManager[4644]: <debug> [1549837891.1660] connectivity: (enp0s31f6) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 10 23:31:31 x1 NetworkManager[4644]: <debug> [1549837891.2334] connectivity: (enp0s31f6) check completed: FULL; expected response
Feb 10 23:31:35 x1 NetworkManager[4644]: <debug> [1549837895.5501] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 10 23:31:35 x1 NetworkManager[4644]: <debug> [1549837895.6167] connectivity: (wlp2s0) check completed: FULL; expected response
Feb 10 23:36:31 x1 NetworkManager[4644]: <debug> [1549838191.1653] connectivity: (enp0s31f6) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 10 23:36:31 x1 NetworkManager[4644]: <debug> [1549838191.2479] connectivity: (enp0s31f6) check completed: FULL; expected response
Feb 10 23:36:35 x1 NetworkManager[4644]: <debug> [1549838195.5491] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 10 23:36:35 x1 NetworkManager[4644]: <debug> [1549838195.6162] connectivity: (wlp2s0) check completed: FULL; expected response

tcpdump shows that NM does indeed try to connect to this URI via IPv6 and fails on the wireless link when Ethernet is also connected; the Ethernet connectivity check, also via IPv6, succeeds.

Also, I am no IPv6 guru, but `ping -6` bound to the wireless interface does not work: tcpdump shows the echo replies from the server, yet ping still reports 100% packet loss.
Comment by sydney (sydney6) - Sunday, 10 February 2019, 23:42 GMT
I have now turned off firewalld and cleared out iptables, and now everything is working as expected again. firewalld has also been updated in this timeframe. I guess I'll have to invest a little more time in this.

edit: downgraded firewalld, no difference. I guess I will start bisecting curl between the two releases in the next few days.
Comment by Stephan S (buzo) - Monday, 11 February 2019, 14:23 GMT
Same here, and downgrading curl to 7.63.0-4 removed the CPU hog.
Comment by Josh T (jat255) - Monday, 11 February 2019, 15:46 GMT
I was also bitten by this. I downgraded curl to 7.63.0-4 and have not observed the high CPU usage in about 10 minutes.
Comment by Tobias (strizi9) - Monday, 11 February 2019, 16:06 GMT
Just disabling IPv6 works.
Comment by Josh T (jat255) - Monday, 11 February 2019, 17:19 GMT
@strizi9 unfortunately that's not really a valid solution for a lot of cases (in an enterprise environment where ipv6 is needed...)
Comment by loqs (loqs) - Monday, 11 February 2019, 17:26 GMT
Downgrading curl to 7.63.0-4 will no longer work after libidn2 2.1.1-2, which has an soname bump, moves from testing to core.
You can rebuild curl 7.63 locally against libidn2 2.1.1-2 when that happens.
Comment by Josh T (jat255) - Monday, 11 February 2019, 18:07 GMT
To help narrow down where the regression was introduced...

Here's the change list for 7.64.0: https://curl.haxx.se/changes.html#7_64_0
And the commit list: https://github.com/curl/curl/commits/master
Comment by Josh T (jat255) - Monday, 11 February 2019, 18:15 GMT
I'm out of my depth on this issue, but here's a commit that changed IPv6 handling (not sure if this could cause an issue): https://github.com/curl/curl/commit/a4482b21bd8bb0ac7cfb3471bad32bf84f829805
Comment by sydney (sydney6) - Monday, 11 February 2019, 19:16 GMT
@loqs have you tried curl 7.64.0-3 with libidn2 2.1.1-2 already?

I was hoping for these two hitting core the next few days, before starting bisecting curl.
Comment by loqs (loqs) - Monday, 11 February 2019, 19:42 GMT
sydney6: no, I am on stable at the moment; I just spotted the libidn2 rebuild in testing.

To test the PKGBUILD included:
Place in an empty directory
pacman -S devtools
extra-x86_64-build
pacman -U curl-git-7.63.0.64.g1a9315378-1-x86_64.pkg.tar.xz

This PKGBUILD is for the bisection midpoint; it includes the IPv6 commit. Git estimates 6 steps to find the cause.
   PKGBUILD (1.1 KiB)
Comment by Josh T (jat255) - Monday, 11 February 2019, 20:26 GMT
A weird result I got while testing this out a bit using @loqs' method:

Restarting NM via systemctl after installing curl 7.64.0 from the official repos caused NM to peg after a short while (not measured precisely, but < 5 minutes)
Running with curl-git built as of https://github.com/curl/curl/commit/1a93153783c0ae626acf0b9041eeb2cb2fe7944a did not lead to NM pegging the CPU (after 5 minutes).
After reinstalling curl 7.64.0 from the official repos (pacman -S curl), NM is no longer pegging the CPU.

Nothing else has changed about the setup, and I'm on the same ethernet + wireless connection that I was before, so I'm uncertain what might be causing the difference. I tried to get debugging output from NM, but I'm only seeing <info> lines through journalctl, even after adding LogLevelMax=6 and --log-level=DEBUG to the ExecStart line in the NetworkManager.service file...
Comment by Oliver (xxarch) - Monday, 11 February 2019, 20:36 GMT
@jat255
This is an interesting finding. I played around a lot with different package versions, but I did not change anything regarding curl. The result is that I cannot reproduce the problem with NetworkManager anymore, even after doing a 'pacman -Syu'.
Comment by Josh T (jat255) - Monday, 11 February 2019, 20:40 GMT
Could there have been a transient issue on http://www.archlinux.org/check_network_status.txt that happened to be resolved while we were poking around?
Comment by Josh T (jat255) - Monday, 11 February 2019, 20:57 GMT
Yeah, I'm not sure what changed, I added a 30s interval to 20-connectivity.conf, and I confirmed with tcpdump that NM is issuing a GET request on /check_network_status.txt every 30 seconds, and the CPU lockup is not happening. I can't get it to happen connected to both ethernet/wifi, just ethernet, or just wifi. Also, when connected to just ethernet, NM is issuing the request over IPv6 using 2a01:4f8:172:1d86::1 as the host, which resolves to apollo.archlinux.org (and still no CPU lockup).

I also still cannot see DEBUG output in the journal even after issuing a DEBUG dbus message for all domains, so I'm not sure what I'm doing wrong there:

dbus-send --system --print-reply --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.GetLogging
method return time=1549918285.289542 sender=:1.4 -> destination=:1.738 serial=5033 reply_serial=2
string "DEBUG"
string "PLATFORM,RFKILL,ETHER,WIFI,BT,MB,DHCP4,DHCP6,PPP,WIFI_SCAN,IP4,IP6,AUTOIP4,DNS,VPN,SHARING,SUPPLICANT,AGENTS,SETTINGS,SUSPEND,CORE,DEVICE,OLPC,INFINIBAND,FIREWALL,ADSL,BOND,VLAN,BRIDGE,DBUS_PROPS,TEAM,CONCHECK,DCB,DISPATCH,AUDIT,SYSTEMD,VPN_PLUGIN:INFO,PROXY"

NB: I also rebooted the system before running this test (after all the messing around with packages before), and that didn't make a difference (i.e. I still can't reproduce the bug any longer).
Comment by Josh T (jat255) - Monday, 11 February 2019, 21:44 GMT
Ugh, after some time the pegged CPU is back on the latest curl, so perhaps I just had to wait longer. I'll go back to testing the curl packages, bisecting as necessary, restarting NM each time, and waiting longer to see if any problems crop up.
Comment by sydney (sydney6) - Monday, 11 February 2019, 23:02 GMT
curl 7.64.0-3 with libidn2 2.1.1-2: the issue persists. The cause does indeed appear to be somewhere in the ~130 commits between curl 7.63 and 7.64.
Comment by Stephan S (buzo) - Tuesday, 12 February 2019, 11:10 GMT
Issue persists for me, too, with curl 7.64.0-3. I downgraded everything by replacing /etc/pacman.d/mirrorlist with
Server = https://archive.archlinux.org/repos/2019/02/07/$repo/os/$arch

… as described in https://wiki.archlinux.org/index.php/Arch_Linux_Archive#How_to_restore_all_packages_to_a_specific_date, because I really need a working system.

But don't we have two bugs here? One in curl, as described above, and one in networkmanager, which does not handle a bad answer from curl gracefully.
Comment by frankie (fhgogogo) - Tuesday, 12 February 2019, 13:38 GMT
I added the following to my hosts file as a workaround. NetworkManager's CPU usage seems to be stable at the moment.
$ grep arch /etc/hosts
138.201.81.199 www.archlinux.org

Comment by Igor (f2404) - Tuesday, 12 February 2019, 13:42 GMT
I wonder if this bug will ever receive any official attention...
Comment by Josh T (jat255) - Tuesday, 12 February 2019, 14:22 GMT
@f2404, please refrain from making such comments; they do not help us figure out the root cause of the bug, and just serve to clutter the inboxes of those that are working on/watching it.

@fhgogogo, would that cause any other problems (such as for pacman or other tools that phone home to the arch servers)?

@buzo, I'd agree that there are likely two issues here, as you state.

Otherwise, checking back, I've been running curl as of commit 1a9315378, and have yet to see the pegged CPU, but I'm not confident it has fixed the issue, since I could not reliably reproduce the issue previously.
Comment by frankie (fhgogogo) - Tuesday, 12 February 2019, 14:51 GMT
@jat255, pacman is working fine for me, as there are mirrors in /etc/pacman.d/mirrorlist.
In my case, I have a working IPv6 connection; however, I am using openconnect for VPN, and openconnect added an IPv6 default route which breaks IPv6 connections. So I think the high CPU issue only happens when there is a broken IPv6 setup.
Comment by Oskar (traubenuss) - Tuesday, 12 February 2019, 14:56 GMT
I just checked, and it seems my IPv6 setup is borked right now. My machines get a global IP address but cannot connect to the internet via v6.
Right now I don't have the time to fix it and check whether it works when v6 is set up properly, but that might be another indicator for this.
Comment by sydney (sydney6) - Tuesday, 12 February 2019, 16:30 GMT
Also, I'm using firewalld, which loads a bunch of nft rules, after which I'm no longer able to ping -6 from the wireless interface when Ethernet and wireless are both active. tcpdump shows the replies, but ping shows packet loss. Clearing out the firewall rules seems to help here, after which the NM connectivity check works again on both interfaces via IPv6.
Comment by sydney (sydney6) - Tuesday, 12 February 2019, 19:25 GMT
Unfortunately, I don't have IPv6 connectivity at the moment, but has someone tried running firewalld with the iptables backend instead of the nftables one and checked whether the issue persists?
Comment by Jan Alexander Steffens (heftig) - Tuesday, 12 February 2019, 20:04 GMT
I can't reproduce it, so the best I can do is ask you to please work with upstream ( https://github.com/curl/curl/issues and/or https://gitlab.freedesktop.org/NetworkManager/NetworkManager/issues ). As soon as you've figured out a fix or workaround we can quickly push out fixed packages; thanks.
Comment by sydney (sydney6) - Tuesday, 12 February 2019, 23:06 GMT
Hi Jan, in my opinion this issue is caused by firewalld and the change to the new default backend, nftables, starting with the release of firewalld 0.6. I have therefore opened an issue with firewalld upstream at the following URL, if someone wants to follow this.

https://github.com/firewalld/firewalld/issues/459
Comment by Igor (f2404) - Wednesday, 13 February 2019, 02:16 GMT
sydney, I don't think firewalld is the root cause of this bug, at least not in my case - it isn't even installed on my system.
Comment by Josh T (jat255) - Wednesday, 13 February 2019, 02:36 GMT
@sydney6, I agree with Igor, I don't have firewalld installed on my system either, so I don't think that it's the root cause of this bug.
Comment by Tom Bishop (bishoptfgmail.comf) - Wednesday, 13 February 2019, 03:19 GMT
I'm not using firewalld; only by rolling back the curl version was I able to resolve the CPU issue. I am running IPv6.
Comment by sydney (sydney6) - Wednesday, 13 February 2019, 04:32 GMT
Anyone using nftables then? I'm asking because when using firewalld with the nftables backend (the default since 0.6), I am no longer able to receive any IPv6 traffic on the wireless interface when Ethernet is also connected. Downgrading to something before 0.6 or flushing the firewall rules solves all of the above, for me at least. Perhaps we should open a forum post?
Comment by Oskar (traubenuss) - Wednesday, 13 February 2019, 12:52 GMT
I've now tested the 'faulty' configuration on a network that has a properly set-up dual-stack connection. I haven't observed the issue after several minutes, by which point both of my setups at home with the broken IPv6 setup had already shown the high CPU usage.
From this I'd conclude that NetworkManager does not gracefully handle IPv6 connectivity errors when checking for general connectivity to the internet.
Comment by Petr Špaček (spacekpe) - Wednesday, 13 February 2019, 14:54 GMT
It seems that the problem is triggered by the connectivity check *only if there are two default routes for IPv6*.

I'm testing on two networks, both networks have full IPv4 + IPv6 connectivity.

Observation:
- If I'm attached to just one of the networks (Ethernet or WiFi) the CPU load is normal.
- Once I connect using Ethernet and Wifi, I get two IPv6 default routes at the same time, and the CPU load spikes up after "connectivity check interval".
- Load spike does *not* happen if I disable IPv6 on one of the interfaces (no matter which one).
- Load spike does *not* happen if I disable the default route on one of the interfaces (no matter which one).
- Shortening connectivity check using `interval=1` in `/usr/lib/NetworkManager/conf.d/20-connectivity.conf` expedites occurrence of the problem.
- This is reliably reproducible on my machine.
- Once the CPU is locked up, removing the default route using `sudo ip -6 route del default via fe80::...` does not bring the load down; it keeps spinning until I kill NetworkManager.
Comment by Josh T (jat255) - Wednesday, 13 February 2019, 15:12 GMT
Thanks for the detailed reproducing steps @spacekpe!

I have two routes now:

$ ip -6 route show

****:**:****:***::b2e dev wlan0 proto kernel metric 600 pref medium
****:**:****:***::ad6 dev eth0 proto kernel metric 100 pref medium
****::/64 dev eth0 proto kernel metric 100 pref medium
****::/64 dev wlan0 proto kernel metric 600 pref medium
default proto ra metric 100
nexthop via ****::1 dev eth0 weight 1
nexthop via ****::*:****:****:675 dev eth0 weight 1 pref medium
default proto ra metric 600
nexthop via ****::1 dev wlan0 weight 1
nexthop via ****::*:****:****:648 dev wlan0 weight 1 pref medium

and I am not observing the lockup, although I did just update all my packages (including curl and libidn2, which were updated on 2/11/19). Have you tried updating to see if the issue persists?
Comment by Petr Špaček (spacekpe) - Wednesday, 13 February 2019, 15:20 GMT
I'm on fully updated system as of now:
$ pacman -Q curl networkmanager
curl 7.64.0-3
networkmanager 1.14.5dev+17+gba83251bb-2

I'm attempting to build a debug version of NetworkManager to see what is going on inside.
Comment by sydney (sydney6) - Wednesday, 13 February 2019, 15:39 GMT
When you're connected with both interfaces via IPv6, do ping and wget over IPv6 work via both interfaces?
Comment by Josh T (jat255) - Wednesday, 13 February 2019, 15:43 GMT
@sydney6, in my case, yes.

Tested via:
$ ping -6 www.google.com -I wlan0
$ ping -6 www.google.com -I eth0
Comment by Petr Špaček (spacekpe) - Wednesday, 13 February 2019, 17:11 GMT
It turns out that my system has working IPv6 on both interfaces separately, but firewalld-generated rules partially break IPv6 when both interfaces are active (I will open another bug about this).

Now I believe that the bug in NM is triggered if connection check over an IPv6 interface fails in some specific way.

Here is a reproducer which seems to work on machines with a single IPv6-enabled interface:
1. Shorten the connection check interval in `/usr/lib/NetworkManager/conf.d/20-connectivity.conf` to 1 second (`interval=1`)
2. Add an ip6tables rule: $ ip6tables -A INPUT -j DROP -s 2a01:4f8:172:1d86::1/128
3. (Make sure NM was restarted to pick up the new configuration)

This reliably reproduces the issue even with a single network interface with (previously) working IPv6. For some reason the same trick does not reproduce the problem over IPv4, which is totally baffling.
Comment by Josh T (jat255) - Wednesday, 13 February 2019, 18:02 GMT
@spacekpe... hooray!

Those instructions also successfully reproduce the bug on my system. Adding the ip6tables rule causes the CPU to peg on a connection check, and NM does not stop its onslaught on the CPU until the rule is deleted and NM is restarted.

I recompiled the networkmanager package (using the ABS) with the `more_logging` flag set to true in the PKGBUILD, and saw the following output in the journal from NM when adding the DROP rule to ip6tables:

Feb 13 12:58:33 hostname NetworkManager[27528]: <debug> [1550080713.1064] connectivity: (eth0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 13 12:58:34 hostname NetworkManager[27528]: <debug> [1550080714.1040] connectivity: (eth0) check completed: LIMITED; timeout
Feb 13 12:58:37 hostname NetworkManager[27528]: <debug> [1550080717.0322] connectivity: (wlan0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 13 12:58:38 hostname NetworkManager[27528]: <debug> [1550080718.1040] connectivity: (wlan0) check completed: LIMITED; timeout

After removing the rule, NM completes its checks, as expected:

Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3714] connectivity: (eth0) check completed: FULL; expected response
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3714] device[0x5612bdcdb570] (eth0): connectivity state changed from LIMITED to FULL
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3723] connectivity: (eth0) check completed: CANCELLED; cancelled
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3724] connectivity: (eth0) check completed: CANCELLED; cancelled
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3724] connectivity: (eth0) check completed: CANCELLED; cancelled
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.8843] connectivity: (wlan0) check completed: FULL; expected response
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.8844] device[0x5612bdcf6a30] (wlan0): connectivity state changed from LIMITED to FULL
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.8854] connectivity: (wlan0) check completed: CANCELLED; cancelled
Feb 13 12:58:57 hostname NetworkManager[27528]: <debug> [1550080737.0324] connectivity: (wlan0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 13 12:58:57 hostname NetworkManager[27528]: <debug> [1550080737.5090] connectivity: (wlan0) check completed: FULL; expected response

But the CPU remains pegged at 100% until a restart of NM.

I guess this is an upstream bug, then?
Comment by loqs (loqs) - Wednesday, 13 February 2019, 18:29 GMT
An upstream bug triggered by what change, though? jat255, if you rebuild curl 7.63, can you still trigger the issue with the firewall rules in place?
Comment by sydney (sydney6) - Wednesday, 13 February 2019, 18:48 GMT
@Petr Can you try with the same firewalld setup/config and "FirewallBackend=iptables" backend in "/etc/firewalld/firewalld.conf"?

Still, this shouldn't make NetworkManager go all crazy.
Comment by Josh T (jat255) - Wednesday, 13 February 2019, 19:23 GMT
I cannot reproduce using curl-git-7.63.0.64.g1a9315378-1 (i.e. issue did not exist on Jan. 08 2019)
I cannot reproduce using curl-git-7.63.0.66.g1f30dc886-1 (i.e. issue did not exist on Jan. 09 2019)
I cannot reproduce using curl-git-7.63.0.67.g5f5b5afcb-1 (i.e. issue did not exist on Jan. 09 2019)
I can reproduce using curl-git-7.63.0.68.g4c35574bb-1 (i.e. issue DID exist on Jan. 09 2019) <----- offending commit
I can reproduce using curl-git-7.63.0.72.gf6bb05ccb-1 (i.e. issue DID exist on Jan. 10 2019)
I can reproduce using curl-git-7.63.0.80.g52e27fe9c-1 (i.e. issue DID exist on Jan. 11 2019)
I can reproduce using curl-git-7.63.0.101.g6bd5bc97f-1 (i.e. issue DID exist on Jan. 19 2019)

So based on the bisection, the offending commit causing the issue is https://github.com/curl/curl/commit/4c35574bb785ce44d72db5483541c9da2d885705
Comment by sydney (sydney6) - Wednesday, 13 February 2019, 22:58 GMT
@jat255 Thank you for your effort. I can confirm that from curl-git-7.63.0.68.g4c35574bb onwards the problem is reproducible on my system also. But then again, as @loqs said: what issue does this commit trigger? FWIW, regardless of the curl version used, the issue with the firewall rules in place remains, i.e. "curl --interface wlan0 -6 http://www.archlinux.org/check_network_status.txt" still fails/hangs as long as there are nftables rules loaded, although the CPU load doesn't spike the way it does with NM. I'm a bit puzzled here.
Comment by loqs (loqs) - Wednesday, 13 February 2019, 23:07 GMT
Open an issue with upstream NetworkManager about its CPU use, and ask whether the NetworkManager developers think curl's behavior with 4c35574bb785ce44d72db5483541c9da2d885705 is expected.
Comment by dannybook (dannybook) - Thursday, 14 February 2019, 01:21 GMT
strace'ing the process gives the following messages, which are repeated at a very high rate:

poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=19, events=POLLIN}, {fd=26, events=POLLOUT}, {fd=27, events=POLLIN}, {fd=28, events=POLLOUT}], 11, 0) = 2 ([{fd=3, revents=POLLIN}, {fd=26, revents=POLLOUT|POLLERR|POLLHUP}])
read(3, "\2\0\0\0\0\0\0\0", 16) = 8
epoll_wait(15, [], 8, 0) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=169674, tv_nsec=123857959}) = 0
write(3, "\1\0\0\0\0\0\0\0", 8) = 8
write(3, "\1\0\0\0\0\0\0\0", 8) = 8
Comment by loqs (loqs) - Thursday, 14 February 2019, 12:33 GMT
Comment by Grzegorz Wierzowiecki (gwpl) - Sunday, 17 February 2019, 14:11 GMT
TL;DR: I followed the "downgrade curl" advice and it resolved the issue, but when I upgraded curl again to the newest version, the issue has not come back yet, which is strange. More verbose report here: https://bbs.archlinux.org/viewtopic.php?pid=1832205#p1832205
Comment by Michiel Roos (Tuurlijk) - Sunday, 17 February 2019, 20:10 GMT
My temporary solution is to set uri=http://nmcheck.gnome.org/check_network_status.txt in /usr/lib/NetworkManager/conf.d/20-connectivity.conf
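Written out as a drop-in, that workaround looks roughly like this (a sketch: it assumes the same [connectivity] section layout as the packaged file, and relies on a same-named file in /etc shadowing the one under /usr/lib, so the interval falls back to NetworkManager's built-in default):

```ini
# /etc/NetworkManager/conf.d/20-connectivity.conf
# Shadows the packaged /usr/lib file of the same name.
[connectivity]
uri=http://nmcheck.gnome.org/check_network_status.txt
```

Restart NetworkManager afterwards; note this points the check at GNOME's endpoint instead of Arch's.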
Comment by Ashwin Vishnu (jadelord) - Monday, 18 February 2019, 14:49 GMT
@Tuurlijk: Thanks for that. I tried your workaround and it seems to work well. It seems nmcheck.gnome.org has not migrated to IPv6 yet, which is good for us right now.

❯ curl 'http://nmcheck.gnome.org/check_network_status.txt'
NetworkManager is online
❯ curl -6 'http://nmcheck.gnome.org/check_network_status.txt'
curl: (6) Could not resolve host: nmcheck.gnome.org
Comment by Stephan S (buzo) - Monday, 18 February 2019, 19:26 GMT
Another good workaround is to add the IPv4 address of www.archlinux.org to /etc/hosts, as mentioned above in https://bugs.archlinux.org/task/61688#comment177144.

This has the advantage that it won't be overwritten by an update.
Comment by Stephan S (buzo) - Monday, 18 February 2019, 19:43 GMT
I have filed a bug for curl here: https://github.com/curl/curl/issues/3585
Comment by Jan Alexander Steffens (heftig) - Tuesday, 19 February 2019, 05:04 GMT
Curl upstream suggests this should be fixed with the patches included in 7.64.0-7.
Comment by Stephan S (buzo) - Tuesday, 19 February 2019, 08:22 GMT
Nope, I just checked: networkmanager still takes 100% CPU after removing the workaround (in /etc/hosts) and restarting it.

curl 7.64.0-7
networkmanager 1.14.5dev+17+gba83251bb-2
Comment by Jan Alexander Steffens (heftig) - Tuesday, 19 February 2019, 15:14 GMT
Should be fixed in curl 7.64.0-9
Comment by Igor (f2404) - Tuesday, 19 February 2019, 15:34 GMT
Thanks for following up with curl, Jan!
Comment by Stephan S (buzo) - Tuesday, 19 February 2019, 15:35 GMT
Confirmed, curl 7.64.0-9 fixes this bug for me. Thanks!
