FS#61688 - [networkmanager] 100% CPU consumption
Attached to Project:
Arch Linux
Opened by Igor (f2404) - Saturday, 09 February 2019, 03:11 GMT
Last edited by Jan Alexander Steffens (heftig) - Tuesday, 19 February 2019, 15:45 GMT
Opened by Igor (f2404) - Saturday, 09 February 2019, 03:11 GMT
Last edited by Jan Alexander Steffens (heftig) - Tuesday, 19 February 2019, 15:45 GMT
|
Details
Description:
The NetworkManager process starts consuming 100% CPU shortly after the power up. Additional info: * package version(s): 1.14.5dev+17+gba83251bb-2 * config and/or log files etc. strace'ing the process gives the following messages which are being repeated at a very high rate: epoll_wait(15, [], 4, 0) = 0 clock_gettime(CLOCK_BOOTTIME, {tv_sec=882, tv_nsec=758164139}) = 0 poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=18, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 10, 1242) = 2 ([{fd=22, revents=POLLNVAL}, {fd=24, revents=POLLNVAL}]) epoll_wait(15, [], 4, 0) = 0 clock_gettime(CLOCK_BOOTTIME, {tv_sec=882, tv_nsec=758265211}) = 0 poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=18, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 10, 1242) = 2 ([{fd=22, revents=POLLNVAL}, {fd=24, revents=POLLNVAL}]) epoll_wait(15, [], 4, 0) = 0 clock_gettime(CLOCK_BOOTTIME, {tv_sec=882, tv_nsec=758364674}) = 0 poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=18, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 10, 1242) = 2 ([{fd=22, revents=POLLNVAL}, {fd=24, revents=POLLNVAL}]) etc. |
This task depends upon
Closed by Jan Alexander Steffens (heftig)
Tuesday, 19 February 2019, 15:45 GMT
Reason for closing: Fixed
Additional comments about closing: curl 7.64.0-9
Tuesday, 19 February 2019, 15:45 GMT
Reason for closing: Fixed
Additional comments about closing: curl 7.64.0-9
Are you using the plasma desktop?
I have the same problem, but I am not convinced that it is actually a bug in NetworkManager, because a downgrade to networkmanager-1.14.4-1-x86_64.pkg.tar.xz did not solve the problem. However, downgrading networkmanager-qt to version 5.53.0-1 seems to solve the problem.
Best regards
Oliver
I'm using the Xfce desktop. I've also tried downgrading to 1.14.4, and it also didn't help. I don't have networkmanager-qt installed.
It could be that something else that's external to NetworkManager is causing this problem but in the end it's the NM process that goes into an infinite loop and consumes all CPU time.
I'm currently sitting in a cafe using their open WiFi AP, and NM is behaving totally fine. While when I'm at home and using my home password protected AP, NM starts consuming 100% CPU within like 5 mins after it has connected to the AP.
- The Issue somehow only occurs if i am connected to (the same) network via ethernet and wifi simultaneously.
- After said 300s/5m tcpdump shows connections being made to apollo.archlinux.org and www.archlinux.org.
- This appears to be the "connectivity" feature of Networkmanager defaulting to 300s and making/trying to make a http connection for the following uri "http://www.archlinux.org/check_network_status.txt".
- Setting the interval option to zero in "/usr/lib/NetworkManager/conf.d/20-connectivity.conf" seems to remedy the issue.
- This has started 1-2 days ago and downgrading any packages installed since then doesn't appear to help.
- I didn't investigate this further, but i had issues around HSTS trying to fetch this file from the nm connectivity feature.
- Do not really understand whats happening.
It seems that 20-connectivity.conf has been there for a while, at least version 1.14.4 does contain it - this is why downgrading the package doesn't help.
The URL is returning error 404 now. Perhaps the connectivity check feature was working in the past but they have removed the file from the website recently?
In any case, NetworkManager falling into an endless loop looks like a design flaw.
From this system http://www.archlinux.org/check_network_status.txt is retuning 200 with content "NetworkManager is online"
While https://www.archlinux.org/check_network_status.txt is returning a 404.
wget does support HSTS
wget -d http://www.archlinux.org/check_network_status.txt
DEBUG output created by Wget 1.20.1 on linux-gnu.
Reading HSTS entries from /home/user/.wget-hsts
URI encoding = ‘UTF-8’
URL transformed to HTTPS due to an HSTS policy
yes, possibly. will give it a try.
I'll try it on my other machine connected via ethernet later today.
dbus-send --system --print-reply --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.SetLogging string:"debug" string:"CONCHECK"
This should revert the change
dbus-send --system --print-reply --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.SetLogging string:"info" string:"CONCHECK"
Would also suggest bisecting curl between 7.63 and 7.64 and seeing if you can identify which commit causes the issue.
As i was getting your requested log output, i turned off ipv6 on both interfaces because of noise in tcpdump and i noticed that this also redeems the issue.
### with ipv6
Feb 10 23:57:27 x1 NetworkManager[4952]: <info> [1549839447.0782] dhcp6 (wlp2s0): state changed unknown -> bound
Feb 11 00:02:20 x1 NetworkManager[4952]: <debug> [1549839740.2153] connectivity: (enp0s31f6) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:20 x1 NetworkManager[4952]: <debug> [1549839740.2956] connectivity: (enp0s31f6) check completed: FULL; expected response
Feb 11 00:02:24 x1 NetworkManager[4952]: <debug> [1549839744.5160] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:44 x1 NetworkManager[4952]: <debug> [1549839764.9302] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:02:44 x1 NetworkManager[4952]: <debug> [1549839764.9302] device[0x55ddaea998e0] (wlp2s0): connectivity state changed from FULL to LIMITED
Feb 11 00:02:44 x1 NetworkManager[4952]: <debug> [1549839764.9313] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:45 x1 NetworkManager[4952]: <debug> [1549839765.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:47 x1 NetworkManager[4952]: <debug> [1549839767.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:51 x1 NetworkManager[4952]: <debug> [1549839771.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:02:59 x1 NetworkManager[4952]: <debug> [1549839779.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:03:04 x1 NetworkManager[4952]: <debug> [1549839784.9337] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:05 x1 NetworkManager[4952]: <debug> [1549839785.9301] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:07 x1 NetworkManager[4952]: <debug> [1549839787.9301] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:11 x1 NetworkManager[4952]: <debug> [1549839791.9301] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:15 x1 NetworkManager[4952]: <debug> [1549839795.9294] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 11 00:03:19 x1 NetworkManager[4952]: <debug> [1549839799.9301] connectivity: (wlp2s0) check completed: LIMITED; timeout
Feb 11 00:03:35 x1 systemd[1]: Stopping Network Manager...
### without ipv6
Feb 10 23:31:31 x1 NetworkManager[4644]: <debug> [1549837891.1660] connectivity: (enp0s31f6) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 10 23:31:31 x1 NetworkManager[4644]: <debug> [1549837891.2334] connectivity: (enp0s31f6) check completed: FULL; expected response
Feb 10 23:31:35 x1 NetworkManager[4644]: <debug> [1549837895.5501] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 10 23:31:35 x1 NetworkManager[4644]: <debug> [1549837895.6167] connectivity: (wlp2s0) check completed: FULL; expected response
Feb 10 23:36:31 x1 NetworkManager[4644]: <debug> [1549838191.1653] connectivity: (enp0s31f6) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 10 23:36:31 x1 NetworkManager[4644]: <debug> [1549838191.2479] connectivity: (enp0s31f6) check completed: FULL; expected response
Feb 10 23:36:35 x1 NetworkManager[4644]: <debug> [1549838195.5491] connectivity: (wlp2s0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 10 23:36:35 x1 NetworkManager[4644]: <debug> [1549838195.6162] connectivity: (wlp2s0) check completed: FULL; expected response
tcpdump shows that nm indeed tries to connect to this uri via ipv6 and fails on the wireless link, when ethernet is also connected, whose connection check, also via ipv6, succeeds.
also, i am not a v6 guru, but ping -6 bound to the wireless interface does not work, but tcpdump shows the echo reply from the server, but ping still reports 100% packet loss.
edit: downgraded firewalld, no difference. I guess, i will start bisecting curl between the two releases in the next few days..
You can rebuild curl 7.63 locally against libidn2 2.1.1-2 when that happens.
Here's the change list for 7.64.0: https://curl.haxx.se/changes.html#7_64_0
And the commit list: https://github.com/curl/curl/commits/master
I was hoping for these two hitting core the next few days, before starting bisecting curl.
To test the PKGBUILD included:
Place in an empty directory
pacman -S devtools
extra-x86_64-build
pacman -U curl-git-7.63.0.64.g1a9315378-1-x86_64.pkg.tar.xz
This is for the bisection midpoint it includes the ipv6 commit estimates 6 steps to finding the cause.
Restarting NM via systemctl after installing curl 7.64.0 from the official repos caused NM to peg after a short while (not measured precisely, but < 5 minutes)
Running with curl-git built as of https://github.com/curl/curl/commit/1a93153783c0ae626acf0b9041eeb2cb2fe7944a did not lead to NM pegging the CPU (after 5 minutes).
After reinstalling curl 7.64.0 from the official repos (pacman -S curl), NM is no longer pegging the CPU.
Nothing else has changed about the setup, and I'm on the same ethernet + wireless connection that I was before, so I'm uncertain what might be causing the difference. I tried to get debugging output from NM, but I'm only seeing <info> lines through journalctl, even after adding LogLevelMax=6 and --log-level=DEBUG to the ExecStart line in the NetworkManager.service file...
This is an interesting finding. I played around a lot with different package versions, but I did not change anything regarding curl. The result is, that I cannot reproduce the problem with NetworkManager anymore, even after doing a 'pacman -Syu'.
I also still cannot see DEBUG output in the journal even after issuing a DEBUG dbus message for all domains, so I'm not sure what I'm doing wrong there:
dbus-send --system --print-reply --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.GetLogging
method return time=1549918285.289542 sender=:1.4 -> destination=:1.738 serial=5033 reply_serial=2
string "DEBUG"
string "PLATFORM,RFKILL,ETHER,WIFI,BT,MB,DHCP4,DHCP6,PPP,WIFI_SCAN,IP4,IP6,AUTOIP4,DNS,VPN,SHARING,SUPPLICANT,AGENTS,SETTINGS,SUSPEND,CORE,DEVICE,OLPC,INFINIBAND,FIREWALL,ADSL,BOND,VLAN,BRIDGE,DBUS_PROPS,TEAM,CONCHECK,DCB,DISPATCH,AUDIT,SYSTEMD,VPN_PLUGIN:INFO,PROXY"
NB: I also rebooted the system before running this test (after all the messing around with packages before), and that didn't make a difference (i.e. I still can't reproduce the bug any longer).
Server = https://archive.archlinux.org/repos/2019/02/07/$repo/os/$arch
… as described in https://wiki.archlinux.org/index.php/Arch_Linux_Archive#How_to_restore_all_packages_to_a_specific_date , because I really need a working system.
But don't we have two bugs here? One in curl, as described above, and one in networkmanager, which does not handle a bad answer from curl gracefully.
○ → grep arch /etc/hosts
138.201.81.199 www.archlinux.org
@fhgogogo, would that cause any other problems (such as for pacman or other tools that phone home to the arch servers)?
@buzo, I'd agree that there are likely two issues here, as you state
Otherwise, checking back, I've been running curl as of commit 1a9315378, and have yet to see the pegged CPU, but I'm not confident it has fixed the issue, since I could not reliably reproduce the issue previously.
For my case, I have a working IPv6 connection, however I am using openconnect for VPN and openconnect added a IPv6 default route which break IPv6 connections. So I think the high CPU issue only happens when there is a broken IPv6 setup.
Right now I don't have the time to fix it and check if it works when v6 is setup properly but that might be another indicator for this.
https://github.com/firewalld/firewalld/issues/459
From this I'd think that the issue is that NetworkManager does not gracefully handle errors in ipv6 connectivity when checking for general connections to the internet.
I'm testing on two networks, both networks have full IPv4 + IPv6 connectivity.
Observation:
- If I'm attached to just one of the networks (Ethernet or WiFi) the CPU load is normal.
- Once I connect using Ethernet and Wifi, I get two IPv6 default routes at the same time, and the CPU load spikes up after "connectivity check interval".
- Load spike does *not* happen if I disable IPv6 on one of the interfaces (no matter which one).
- Load spike does *not* happen if I disable disable default route on one of the interfaces (no matter which one).
- Shortening connectivity check using `interval=1` in `/usr/lib/NetworkManager/conf.d/20-connectivity.conf` expedites occurrence of the problem.
- This is reliably reproducible on my machine.
- Once CPU is locked up, removing default route using `sudo ip -6 route del default via fe80::...` does not get load down - it keeps spinning until I kill NetworkManager.
I have two routes now:
$ ip -6 route show
****:**:****:***::b2e dev wlan0 proto kernel metric 600 pref medium
****:**:****:***::ad6 dev eth0 proto kernel metric 100 pref medium
****::/64 dev eth0 proto kernel metric 100 pref medium
****::/64 dev wlan0 proto kernel metric 600 pref medium
default proto ra metric 100
nexthop via ****::1 dev eth0 weight 1
nexthop via ****::*:****:****:675 dev eth0 weight 1 pref medium
default proto ra metric 600
nexthop via ****::1 dev wlan0 weight 1
nexthop via ****::*:****:****:648 dev wlan0 weight 1 pref medium
and I am not observing the lockup, although I did just update all my packages (including curl and libidn2, which were updated on 2/11/19). Have you tried updating to see if the issue persists?
$ pacman -Q curl networkmanager
curl 7.64.0-3
networkmanager 1.14.5dev+17+gba83251bb-2
I'm attempting to build debug version of NetworkManager to see what is going on inside.
Tested via:
$ ping -6 www.google.com -I wlan0
$ ping -6 www.google.com -I eth0
Now I believe that the bug in NM is triggered if connection check over an IPv6 interface fails in some specific way.
Here is reproducer which seems to work on machines with single IPv6 enabled interface:
1. shorten connection check interval in `/usr/lib/NetworkManager/conf.d/20-connectivity.conf` to 1 second (`interval=1`)
2. add ip6table rule: $ ip6tables -A INPUT -j DROP -s 2a01:4f8:172:1d86::1/128
3. (make sure NM was restarted to pick up new configuration)
This reliably reproduces the issue even with a single network interface with (previously) working IPv6. For some reason the same trick does not reproduce problem over IPv4, which is totally baffling.
Those instructions also successfully reproduce the bug on my system. Adding the ip6tables rule causes the CPU to peg on a connection check, and NM does not stop its onslaught on the CPU until the rule is deleted and NM is restarted.
I recompiled the networkmanager package (using the ABS) with the `more_logging` flag set to true in the PKGBUILD, and saw the following output in the journal from NM when adding the DROP rule to ip6tables:
Feb 13 12:58:33 hostname NetworkManager[27528]: <debug> [1550080713.1064] connectivity: (eth0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 13 12:58:34 hostname NetworkManager[27528]: <debug> [1550080714.1040] connectivity: (eth0) check completed: LIMITED; timeout
Feb 13 12:58:37 hostname NetworkManager[27528]: <debug> [1550080717.0322] connectivity: (wlan0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 13 12:58:38 hostname NetworkManager[27528]: <debug> [1550080718.1040] connectivity: (wlan0) check completed: LIMITED; timeout
After removing the rule, NM completes its checks, as expected:
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3714] connectivity: (eth0) check completed: FULL; expected response
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3714] device[0x5612bdcdb570] (eth0): connectivity state changed from LIMITED to FULL
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3723] connectivity: (eth0) check completed: CANCELLED; cancelled
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3724] connectivity: (eth0) check completed: CANCELLED; cancelled
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.3724] connectivity: (eth0) check completed: CANCELLED; cancelled
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.8843] connectivity: (wlan0) check completed: FULL; expected response
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.8844] device[0x5612bdcf6a30] (wlan0): connectivity state changed from LIMITED to FULL
Feb 13 12:58:56 hostname NetworkManager[27528]: <debug> [1550080736.8854] connectivity: (wlan0) check completed: CANCELLED; cancelled
Feb 13 12:58:57 hostname NetworkManager[27528]: <debug> [1550080737.0324] connectivity: (wlan0) start request to 'http://www.archlinux.org/check_network_status.txt'
Feb 13 12:58:57 hostname NetworkManager[27528]: <debug> [1550080737.5090] connectivity: (wlan0) check completed: FULL; expected response
But the CPU remains pegged at 100% until a restart of NM.
I guess this is an upstream bug, then?
Still, this shouldn't let NetworkManager go all crazy..
I cannot reproduce using curl-git-7.63.0.66.g1f30dc886-1 (i.e. issue did not exist on Jan. 09 2019)
I cannot reproduce using curl-git-7.63.0.67.g5f5b5afcb-1 (i.e. issue did not exist on Jan. 09 2019)
I can reproduce using curl-git-7.63.0.68.g4c35574bb-1 (i.e. issue DID exist on Jan. 09 2019) <----- offending commit
I can reproduce using curl-git-7.63.0.72.gf6bb05ccb-1 (i.e. issue DID exist on Jan. 10 2019)
I can reproduce using curl-git-7.63.0.80.g52e27fe9c-1 (i.e. issue DID exist on Jan. 11 2019)
I can reproduce using curl-git-7.63.0.101.g6bd5bc97f-1 (i.e. issue DID exist on Jan. 19 2019)
So based on the bisection, the offending commit causing the issue is https://github.com/curl/curl/commit/4c35574bb785ce44d72db5483541c9da2d885705
strace'ing the process gives the following messages which are being repeated at a very high rate:
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=19, events=POLLIN}, {fd=26, events=POLLOUT}, {fd=27, events=POLLIN}, {fd=28, events=POLLOUT}], 11, 0) = 2 ([{fd=3, revents=POLLIN}, {fd=26, revents=POLLOUT|POLLERR|POLLHUP}])
read(3, "\2\0\0\0\0\0\0\0", 16) = 8
epoll_wait(15, [], 8, 0) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=169674, tv_nsec=123857959}) = 0
write(3, "\1\0\0\0\0\0\0\0", 8) = 8
write(3, "\1\0\0\0\0\0\0\0", 8) = 8
❯ curl 'http://nmcheck.gnome.org/check_network_status.txt'
NetworkManager is online
❯ curl -6 'http://nmcheck.gnome.org/check_network_status.txt'
curl: (6) Could not resolve host: nmcheck.gnome.org
This has the advantage that it won't be overwritten by an update.
curl 7.64.0-7
networkmanager 1.14.5dev+17+gba83251bb-2