FS#61269 - [linux] 4.20 reproducible deadlock as a Hyper-V guest
Attached to Project: Arch Linux
Opened by Dan Bryant (dicta) - Friday, 04 January 2019, 04:28 GMT
Last edited by Jan de Groot (JGC) - Sunday, 24 March 2019, 23:26 GMT
Details
Description:
I was able to trigger a reproducible kernel hang while running as a VM guest under the Hyper-V hypervisor. This bug appears to be specific to the Hyper-V drivers (see reproduction steps below).

Additional info:
Kernel version: observed in linux-4.20.0-arch1-1-ARCH

VM host information:
- Host OS: Windows 10 Professional, Version 1809, OS build 17763.195
- Tested virtual machine is the only virtual machine running on this host
- One network card attached using the "Default Switch" virtual switch in Hyper-V (note: the virtual switch connected on the host side does not appear to matter; I was able to reproduce even with the network adapter not connected to any vswitch)

Steps to reproduce:
- Create a new Hyper-V virtual machine.
- Install Arch.
- Boot into the system.
- From a root console, change the MTU of the ethernet interface that's being virtualized by Hyper-V using the following command:
> ifconfig eth0 mtu 1300

Expected result: the ifconfig command returns immediately and the MTU of the interface is changed.

Actual result: ifconfig hangs indefinitely, and most other commands that interact with the kernel begin to hang as well. Attempting to attach to the process with either gdb or perf to gather further debugging information hangs these tools too.

Any change in the MTU size from the interface's default value of 1500 will trigger this bug; there's nothing special about 1300 here. If a user changes this value from the NetworkManager GUI, the system will be in this state immediately on boot. Depending on boot timings, this can result in the system booting into a state where neither graphical logins nor programs like "sudo" will work.

After two minutes, the kernel will start printing hung task messages for each task that is affected. I was able to collect the following trace, which includes the ifconfig task that triggered the initial bug:

[ 368.489656] INFO: task pool:1513 blocked for more than 120 seconds.
[ 368.489658] Not tainted 4.20.0-arch1-1-ARCH #1
[ 368.489659] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 368.489660] pool D 0 1513 1047 0x00000000
[ 368.489662] Call Trace:
[ 368.489670] ? __schedule+0x29b/0x8b0
[ 368.489672] schedule+0x32/0x90
[ 368.489673] schedule_preempt_disabled+0x14/0x20
[ 368.489675] __mutex_lock.isra.1+0x217/0x530
[ 368.489678] __netlink_dump_start+0x54/0x1e0
[ 368.489681] ? rtnl_fill_ifinfo+0xec0/0xec0
[ 368.489683] rtnetlink_rcv_msg+0x264/0x390
[ 368.489685] ? rtnl_fill_ifinfo+0xec0/0xec0
[ 368.489686] ? rtnl_calcit.isra.11+0x110/0x110
[ 368.489688] netlink_rcv_skb+0x4c/0x120
[ 368.489690] netlink_unicast+0x196/0x240
[ 368.489692] netlink_sendmsg+0x1fd/0x3c0
[ 368.489694] sock_sendmsg+0x33/0x40
[ 368.489696] __sys_sendto+0xee/0x160
[ 368.489700] ? preempt_count_add+0x5a/0xb0
[ 368.489702] ? __fd_install+0x51/0xd0
[ 368.489703] ? __sys_socket+0x93/0xe0
[ 368.489705] __x64_sys_sendto+0x24/0x30
[ 368.489707] do_syscall_64+0x5b/0x170
[ 368.489709] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 368.489711] RIP: 0033:0x7f804933e0da
[ 368.489717] Code: Bad RIP value.
[ 368.489717] RSP: 002b:00007f8042c3f7a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
[ 368.489719] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f804933e0da
[ 368.489720] RDX: 0000000000000014 RSI: 00007f8042c40880 RDI: 000000000000000a
[ 368.489721] RBP: 0000000000000000 R08: 00007f8042c40840 R09: 000000000000000c
[ 368.489721] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8042c40880
[ 368.489722] R13: 0000000000000014 R14: 0000000000000000 R15: 00007f8042c40840
[ 368.489725] INFO: task ifconfig:2021 blocked for more than 120 seconds.
[ 368.489726] Not tainted 4.20.0-arch1-1-ARCH #1
[ 368.489726] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 368.489727] ifconfig D 0 2021 2020 0x00000000
[ 368.489728] Call Trace:
[ 368.489730] ? __schedule+0x29b/0x8b0
[ 368.489731] ? wait_for_common+0x113/0x190
[ 368.489732] ? preempt_count_add+0x79/0xb0
[ 368.489734] schedule+0x32/0x90
[ 368.489738] rndis_set_subchannel+0x105/0x270 [hv_netvsc]
[ 368.489741] ? wait_woken+0x80/0x80
[ 368.489743] netvsc_attach+0x5a/0xa0 [hv_netvsc]
[ 368.489746] netvsc_change_mtu+0x12d/0x180 [hv_netvsc]
[ 368.489749] dev_set_mtu_ext+0xe1/0x1d0
[ 368.489751] dev_set_mtu+0x52/0x90
[ 368.489753] dev_ifsioc+0x215/0x3d0
[ 368.489756] ? cap_inode_getsecurity+0x240/0x240
[ 368.489757] ? dev_get_by_name_rcu+0x73/0x90
[ 368.489759] dev_ioctl+0xac/0x3d0
[ 368.489761] sock_do_ioctl+0xb4/0x160
[ 368.489763] sock_ioctl+0x1a4/0x320
[ 368.489766] do_vfs_ioctl+0xa4/0x630
[ 368.489769] ? handle_mm_fault+0x10a/0x250
[ 368.489781] ? __do_page_fault+0x254/0x510
[ 368.489783] ksys_ioctl+0x60/0x90
[ 368.489785] __x64_sys_ioctl+0x16/0x20
[ 368.489786] do_syscall_64+0x5b/0x170
[ 368.489788] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 368.489789] RIP: 0033:0x7f66b2c7980b
[ 368.489790] Code: Bad RIP value.
[ 368.489791] RSP: 002b:00007ffd3085d308 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 368.489792] RAX: ffffffffffffffda RBX: 00007ffd3085de03 RCX: 00007f66b2c7980b
[ 368.489793] RDX: 00007ffd3085d370 RSI: 0000000000008922 RDI: 0000000000000004
[ 368.489794] RBP: 00007ffd3085de08 R08: 00007ffd3085de10 R09: 0000000000000000
[ 368.489794] R10: 00007f66b2cf5ae0 R11: 0000000000000246 R12: 0000559cd017b750
[ 368.489795] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
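
For anyone triaging a similar hang, here is a minimal sketch of pulling the affected task names out of a saved kernel log. It only relies on the "INFO: task NAME:PID blocked for more than N seconds." line format visible in the trace above; the function name `find_hung_tasks` and the idea of feeding it saved dmesg text are assumptions made for illustration, not part of the original report.

```python
import re

# Matches the hung-task header line as it appears in the trace above, e.g.
#   INFO: task ifconfig:2021 blocked for more than 120 seconds.
# The task name is matched lazily so names that themselves contain ':' still
# leave the final numeric field as the PID.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for every hung-task header in log_text.

    Hypothetical helper: log_text is assumed to be kernel log output that was
    captured to a string beforehand.
    """
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]
```

Run against the trace in this report, it would list the "pool" and "ifconfig" tasks, which makes it easier to see which process entered the driver (ifconfig, via netvsc_change_mtu) and which ones are merely queued behind it.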
There have been code changes around this code in both 4.19 and 4.20. For potentially related issues, see also:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1807757
I haven't yet had a chance to bisect to a specific commit.
I rebased the fix to v5.0-rc1 and linux-next, and pushed it to https://github.com/dcui/linux/commits/decui/linus/v5.0-rc1. With the fix applied, the issue here should be resolved.
BTW, for v4.20, the rebased fix is here: https://github.com/dcui/linux/commits/decui/linus/v4.20
We’ll repost the fix to LKML ASAP.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.20.y&id=601cdaedd2ab8c9f635d1164c0fb52a086b25b8f
Arch has shipped 4.20.6 as of today, which contains this fix.
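
A rough sketch of checking whether a given 4.20.y kernel release string is at or past 4.20.6, the version named above as containing the fix. The function name `fixed_in_4_20_series` is made up for illustration, and the check deliberately returns None for any other series, since this thread only confirms the fix for 4.20.6.

```python
import re


def fixed_in_4_20_series(release):
    """Return True/False for 4.20.y release strings, None for anything else.

    Hypothetical helper. Assumes release looks like the strings in this
    report, e.g. "4.20.0-arch1-1-ARCH". Only the 4.20.y series is judged,
    because 4.20.6 is the only fixed version this thread confirms.
    """
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        return None
    major, minor = int(m.group(1)), int(m.group(2))
    patch = int(m.group(3)) if m.group(3) is not None else 0
    if (major, minor) != (4, 20):
        return None  # outside the series discussed here; no claim made
    return patch >= 6


# Example usage on the running system (platform.release() gives the
# same string as `uname -r`):
#   import platform
#   print(fixed_in_4_20_series(platform.release()))
```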