FS#43921 - [linux] Kernel tasks can hang

Attached to Project: Arch Linux
Opened by Dan Liew (delcypher) - Monday, 23 February 2015, 13:14 GMT
Last edited by Jan de Groot (JGC) - Tuesday, 03 October 2017, 11:40 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 4
Private No

Details

Description:

I'm not very familiar with kernel internals, but I've observed that with 3.18.6-1 I can reliably cause all consoles on my system to hang. If I downgrade to 3.18.5-1 the issue disappears. It seems to be triggered when I run an application I'm currently developing and a stack overflow occurs. Of course my application shouldn't hit a stack overflow, but even if it does, it shouldn't break all the consoles on my system.

When I say "cause all consoles to hang", what I observe is:
- I can't open a new instance of konsole, and the already open instance does not respond (menus are frozen). It shows as state D (uninterruptible sleep) in top (see top.txt).
- My desktop environment (KDE) seems to continue working.
- If I try to log in on one of the virtual TTYs, it hangs. If I press CTRL+C it drops to a prompt (presumably it got stuck loading some script). If I try to run a tool like htop or pgrep from there, it hangs. The top command seems to work, however.

I also notice after a while I start getting messages in dmesg about hung tasks inside the kernel (see journal-tail.txt)
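
For anyone who wants to pull the hung-task reports out of a saved log, here is a minimal Python sketch. It assumes the usual wording of the kernel's hung-task detector ("INFO: task NAME:PID blocked for more than N seconds"). The sample lines and PIDs are made up for illustration and are not taken from the attached journal, and find_hung_tasks is just a hypothetical helper name.

```python
import re

# Assumed wording of the kernel hung-task detector message.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (comm, pid, seconds) for each hung-task report in log_text."""
    results = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            results.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return results


if __name__ == "__main__":
    # Invented sample lines, only to show the expected input shape.
    sample = (
        "kernel: INFO: task konsole:1234 blocked for more than 120 seconds.\n"
        "kernel: some unrelated message\n"
        "kernel: INFO: task khugepaged:45 blocked for more than 120 seconds.\n"
    )
    for comm, pid, secs in find_hung_tasks(sample):
        print(comm, pid, secs)
```

To use it on a real log, read the file (for example journal-tail.txt) into a string and pass it to find_hung_tasks.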

Additional info:
* Linux kernel 3.18.6-1
* top.txt
* journal.txt


Steps to reproduce:

This is tricky because so far it has only happened when a program of mine crashes. The application is written in Mono, and the problem is triggered when it runs and a stack overflow occurs. The application will be open sourced eventually, but it's not ready yet. I tried writing a simple Mono application in which a stack overflow occurs, but that didn't trigger the bug in the kernel.

I'd happily provide you with binaries in private, though, if you want to try to reproduce the issue.

For now I've reverted to 3.18.5-1. Please let me know if there's more information that I can provide. I did try running strace on my application, but it doesn't look very useful (see strace.log). The large majority of it is just the huge stack trace being written by the write() system call, and after that there is nothing else.

Closed by  Jan de Groot (JGC)
Tuesday, 03 October 2017, 11:40 GMT
Reason for closing:  Fixed
Comment by Alexandre Rosenberg (arekkusu) - Wednesday, 25 February 2015, 19:37 GMT
Note: Edited to add summarised info from the (now closed) duplicate #43902.
----------

My system appears to be in a similar state. I am running the Nvidia binary driver (*see Reproducibility).

= Chain of events

- Gimp crashed with an "error 6" type segfault, with very similar dmesg output (*attached dmesg)
- The Gimp process then stays in state "Disk sleep" and can't be killed
- Running "ps aux" hangs when trying to read some of the process information (*attached strace)
- Same behaviour for "ls -hal /proc/4469/"
- I managed to get top into the same state, although not consistently
- All those ps and ls processes end up in "Disk sleep" state as well (*attached process_in_D_state)
Note: I ran ls/ps many times while troubleshooting, hence the number of processes.
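
Since ps itself hangs here, a rough Python sketch for listing tasks in state D by reading only /proc/<pid>/stat might help others check their own systems. It is an assumption (not confirmed anywhere in this report) that the stat file stays readable while other files under /proc/<pid>/ block. The helper names parse_stat and d_state_tasks are hypothetical.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for tasks in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(pid, comm)
```

The script deliberately avoids cmdline and the other per-process files, so it only reports the short task name.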

= Additional info

I noticed that the list of processes in disk sleep state includes "khugepaged" (*attached khugepaged_process_status). I don't have the kernel knowledge to understand what khugepaged does, but it looks relevant. I don't know if khugepaged is involved in the initial report.

= Symptoms

- Unlike the initial report, I have no problem running konsole or other terminal emulators (note the different tasks involved between the two dmesg outputs: I don't have "konsole" in mine)
- Identical issue with virtual TTYs hanging (tested after reading the report)
- Identical issue with tools like htop, pgrep, ps

Other than this, the system is usable.

= Reproducibility

With the Nvidia binary driver installed, I simply need to open and close gimp to reproduce the issue on my system (after a reboot).
- A few minutes after the crash, only the gimp process is in "disk sleep" state.
- Some time later, the khugepaged process is also stuck in "disk sleep" state.

From  FS#43921 :

- Jan noticed the problem does not occur with kernel 3.20 from testing
- Jan noticed the gimp segfault only occurs with the Nvidia binary driver (no crash with Nouveau)
- I can confirm the same behaviour (no segfault) when switching to Nouveau on my system




Comment by mattia (nTia89) - Monday, 02 October 2017, 17:57 GMT
Is this issue still valid?
Comment by Dan Liew (delcypher) - Tuesday, 03 October 2017, 10:00 GMT
I don't think it's valid anymore.
