FS#57684 - [linux] Processes hanging in 4.15.6-1-ARCH

Attached to Project: Arch Linux
Opened by Cristian Bradiceanu (cbredi) - Thursday, 01 March 2018, 07:30 GMT
Last edited by Eli Schwartz (eschwartz) - Tuesday, 06 March 2018, 21:18 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa), Jan Alexander Steffens (heftig)
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 3
Private No

Details

After upgrading to 4.15.6-1-ARCH, I found some processes hanging.
These processes cannot be killed (systemd stop, kill -9); they even prevent a system reboot.

On one server, haproxy is hanging with:

[ 1352.767556] INFO: task haproxy:729 blocked for more than 120 seconds.
[ 1352.767574] Not tainted 4.15.6-1-ARCH #1
[ 1352.767579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1352.767587] haproxy D 0 729 1 0x00000004
[ 1352.767592] Call Trace:
[ 1352.767610] ? __schedule+0x24b/0x8c0
[ 1352.767615] schedule+0x32/0x90
[ 1352.767620] __lock_sock+0x79/0xc0
[ 1352.767625] ? wait_woken+0x80/0x80
[ 1352.767629] lock_sock_nested+0x50/0x60
[ 1352.767638] getorigdst+0x5a/0x240 [nf_conntrack_ipv4]
[ 1352.767642] ? preempt_count_add+0x49/0xa0
[ 1352.767648] nf_getsockopt+0x47/0x70
[ 1352.767653] ip_getsockopt+0x7f/0xc0
[ 1352.767658] SyS_getsockopt+0x76/0xd0
[ 1352.767664] do_syscall_64+0x74/0x190
[ 1352.767669] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

On another server, the squid process is hanging:

[ 2949.468966] INFO: task squid:387 blocked for more than 120 seconds.
[ 2949.468989] Not tainted 4.15.6-1-ARCH #1
[ 2949.468997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2949.469010] squid D 0 387 385 0x00000000
[ 2949.469018] Call Trace:
[ 2949.469041] ? __schedule+0x24b/0x8c0
[ 2949.469049] ? preempt_count_add+0x49/0xa0
[ 2949.469077] schedule+0x32/0x90
[ 2949.469081] __lock_sock+0x79/0xc0
[ 2949.469102] ? wait_woken+0x80/0x80
[ 2949.469105] lock_sock_nested+0x50/0x60
[ 2949.469111] getorigdst+0x5a/0x240 [nf_conntrack_ipv4]
[ 2949.469114] ? preempt_count_add+0x49/0xa0
[ 2949.469118] nf_getsockopt+0x47/0x70
[ 2949.469121] ip_getsockopt+0x7f/0xc0
[ 2949.469126] ? set_close_on_exec+0x30/0x70
[ 2949.469128] ipv6_getsockopt+0x4a/0x110
[ 2949.469132] SyS_getsockopt+0x76/0xd0
[ 2949.469137] do_syscall_64+0x74/0x190
[ 2949.469140] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Both issues appear to be related to nf_conntrack_ipv4.
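
To make the comparison between the two traces less error-prone, here is a minimal Python sketch that extracts the stack frames from hung-task messages and lists the frames shared by every blocked task. The function names (`frames_by_task`, `shared_frames`) and the regular expressions are illustrative assumptions based only on the log format shown above; frames prefixed with `?` are skipped because the kernel marks those as unreliable.

```python
import re
from collections import defaultdict

# Header printed by the hung-task detector, e.g.
# "[ 1352.767556] INFO: task haproxy:729 blocked for more than 120 seconds."
TASK_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than")

# Stack frame such as "getorigdst+0x5a/0x240 [nf_conntrack_ipv4]".
# Group 1 is the optional "?" marker, group 2 the symbol, group 3 the module.
FRAME_RE = re.compile(
    r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def frames_by_task(log_text):
    """Map each blocked task name to its list of (symbol, module) frames."""
    frames = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        header = TASK_RE.search(line)
        if header:
            current = header.group(1)
            continue
        frame = FRAME_RE.search(line)
        # Skip frames marked "?" and anything seen before the first header.
        if frame and current and not frame.group(1):
            frames[current].append((frame.group(2), frame.group(3)))
    return dict(frames)


def shared_frames(log_text):
    """Return frames present in every task's trace, in first-task order."""
    per_task = list(frames_by_task(log_text).values())
    if not per_task:
        return []
    common = set(per_task[0]).intersection(*per_task[1:])
    return [frame for frame in per_task[0] if frame in common]
```

Feeding it the two traces above (for example the saved output of `dmesg`) should list `getorigdst` in module `nf_conntrack_ipv4` among the shared frames, with `ipv6_getsockopt` appearing only in the squid trace.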

Closed by  Eli Schwartz (eschwartz)
Tuesday, 06 March 2018, 21:18 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 4.15.7-1
Comment by loqs (loqs) - Thursday, 01 March 2018, 10:22 GMT
From https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.15.6
Can you try reverting 8f2f8993e0f69f4f8d5afe3873158f723daacb31 then ff225999c603f0efed8fdbb791bab039d133eda2 and see if that resolves the issue?
Comment by Jan de Groot (JGC) - Thursday, 01 March 2018, 10:43 GMT
Probably related:  FS#57651 
Comment by Jan de Groot (JGC) - Thursday, 01 March 2018, 10:45 GMT
Probably fixed with commit d7ef969797fdeeb12a3afe069d86d1eaf037ac71 which is in 4.15.7 released upstream yesterday.
Comment by Mario Korte (emkay1) - Monday, 05 March 2018, 07:15 GMT
I got the same problem with this kernel version and had to go back as far as 4.9.78-1-lts to resolve it. It kept killing my PC while scrubbing my SW-RAID; the scrub only ran through with the aforementioned kernel version. I have now gone back to the current kernel and had a complete hang of my server, with similar messages on other threads. Something is really fishy with the latest kernel releases.
Comment by Cristian Bradiceanu (cbredi) - Monday, 05 March 2018, 07:49 GMT
Kernel 4.14.23-1-lts works for me.
Comment by loqs (loqs) - Monday, 05 March 2018, 12:38 GMT
@emkay1 please specify the package and version instead of "current kernel" and "latest release". That removes ambiguity and saves having to cross-reference the post date with package releases once time has passed.
Did you try linux 4.15.7-1 from testing?
Comment by Dark Eye (Dark_eye) - Monday, 05 March 2018, 16:26 GMT
Same with
Linux 60fe586beddc 4.15.6-1-ARCH #1 SMP PREEMPT Sun Feb 25 12:53:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Moreover, with haproxy also:
HA-Proxy version 1.6.3 2015/12/25
Copyright 2000-2015 Willy Tarreau <willy@haproxy.org>

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2g-fips 1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.38 2015-11-23
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
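
For context on why a build with transparent proxy support ends up in `getorigdst`: the traces in this report block under `SyS_getsockopt` → `ip_getsockopt` → `nf_getsockopt` → `getorigdst`, which to my knowledge is the conntrack handler behind the `SO_ORIGINAL_DST` socket option that transparent proxies use to learn a redirected connection's original destination. The Python sketch below shows that call from user space. It is an illustration, not haproxy's or squid's actual code; the helper names are hypothetical, and the constant 80 is taken from `<linux/netfilter_ipv4.h>` because Python's `socket` module does not export it.

```python
import socket
import struct

# Value of SO_ORIGINAL_DST in <linux/netfilter_ipv4.h>; spelled out here
# because the socket module has no constant for it.
SO_ORIGINAL_DST = 80
SOCKADDR_IN_LEN = 16  # sizeof(struct sockaddr_in)


def parse_sockaddr_in(raw):
    """Decode a raw struct sockaddr_in into an (ip, port) tuple."""
    # Layout: sin_family (2 bytes), sin_port (2 bytes, network order),
    # sin_addr (4 bytes), then 8 bytes of padding.
    (port,) = struct.unpack_from("!H", raw, 2)
    ip = socket.inet_ntoa(raw[4:8])
    return ip, port


def original_dst(conn):
    """Query the pre-NAT destination of an accepted, redirected TCP socket.

    This is the getsockopt() call that sits above getorigdst in the traces;
    on an affected kernel it is presumably where the process blocks.
    """
    raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, SOCKADDR_IN_LEN)
    return parse_sockaddr_in(raw)
```

`original_dst` only returns something useful for a connection that conntrack knows about (for example one redirected by an iptables REDIRECT rule); on other sockets the call is expected to fail with an `OSError`.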
Comment by loqs (loqs) - Monday, 05 March 2018, 16:35 GMT
@Dark_eye if the bug was introduced with 4.15.6 and fixed with 4.15.7, what result would you expect with 4.15.6?
Comment by Eli Schwartz (eschwartz) - Monday, 05 March 2018, 17:30 GMT
  • Field changed: Summary (Processes hanging in 4.15.6-1-ARCH → [linux] Processes hanging in 4.15.6-1-ARCH)
  • Field changed: Status (Unconfirmed → Assigned)
  • Field changed: Category (Packages: Core → Kernel)
  • Field changed: Architecture (x86_64 → All)
  • Task assigned to Jan Alexander Steffens (heftig), Tobias Powalowski (tpowa)
Can you confirm this is fixed with the current kernel in [testing]?
Comment by Marti (intgr) - Tuesday, 06 March 2018, 20:01 GMT
I also had this issue and I can confirm that 'linux 4.15.7-1' fixed it. Thanks!
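
For anyone checking a fleet of machines, a small Python sketch like the following can flag hosts still running the version reported here. It assumes, based only on this task, that 4.15.6 is the affected release and that 4.15.7 carries the fix; it says nothing about other versions. The function names are hypothetical.

```python
import platform
import re

# Assumption taken from this task: the hang was reported on 4.15.6 and
# reported fixed in 4.15.7. Other versions are not covered by this check.
AFFECTED = (4, 15, 6)


def kernel_version(release):
    """Parse a release string such as '4.15.6-1-ARCH' into (4, 15, 6)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. '4.15-rc1') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def is_reported_affected(release):
    """True only for the exact kernel version named in this task."""
    return kernel_version(release) == AFFECTED


if __name__ == "__main__":
    running = platform.release()
    state = "reported affected" if is_reported_affected(running) else "not the reported version"
    print(f"{running}: {state}")
```

The check deliberately matches only the exact version, since nothing in this thread establishes which other releases, if any, are affected.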
