FS#63730 - [linux] 5.2.11+ (at least through 5.2.14) on host locks up KVM VM's using > nproc/2 virtCPUs

Attached to Project: Arch Linux
Opened by James Harvey (jamespharvey20) - Thursday, 12 September 2019, 03:21 GMT
Last edited by Antonio Rojas (arojas) - Saturday, 21 September 2019, 09:07 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To No-one
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description: With the host running linux 5.2.10 or earlier, VMs boot successfully. With the host running 5.2.11-5.2.14 with hyperthreading, a VM using more than the host's nproc/2 virtual CPUs hangs in the early boot stage. Booting UEFI grub/systemd shows "Loading Initial Ramdisk..." and hangs. Booting a UEFI ISO goes to a black screen. The host shows 100% CPU usage for each of the VM's virtual CPUs. For example, my system with 16 physical cores and hyperthreading reports an nproc of 32. On 5.2.11+, a VM with up to 16 virtual CPUs boots, but one with 18 or more hangs forever. A race condition is probably involved, because about 5% of the boot attempts expected to hang succeed instead.


Additional info:
* linux 5.2.11-5.2.14. This is a KVM bug. Version of QEMU doesn't seem to matter - 4.0.x and 4.1.x have identical behavior.
* See https://www.spinics.net/lists/kvm/msg195171.html
* Unknown if other hypervisors using KVM will run into this


Workarounds:
* Use linux 5.2.10, or a custom build of 5.2.11+ with commit 2ad350fb4c reverted
* Temporarily decrease the number of virtual CPUs given to each VM to be <= nproc/2 on the host
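
As a rough illustration of the second workaround, the Python sketch below computes the largest vCPU count that stays at or below nproc/2 and checks whether a given VM size exceeds it. The function names are hypothetical, and it assumes os.cpu_count() matches what nproc reports on the host (the two can differ if CPU affinity is restricted).

```python
import os


def max_safe_vcpus(host_nproc: int) -> int:
    """Largest vCPU count that is <= nproc/2 (the workaround's limit)."""
    return host_nproc // 2


def may_hang(vcpus: int, host_nproc: int) -> bool:
    """True if the VM uses more than nproc/2 vCPUs, the condition this
    report associates with hangs on affected host kernels."""
    return 2 * vcpus > host_nproc


if __name__ == "__main__":
    # Assumption: os.cpu_count() matches `nproc` on the host.
    nproc = os.cpu_count() or 1
    print(f"host nproc: {nproc}, keep each VM at <= {max_safe_vcpus(nproc)} vCPUs")
```

This only encodes the threshold described above; it does not check the host kernel version or detect the hang itself.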


Arch maintainers:
Probably nothing to do here except wait for upstream to release a fix and then mark this closed. Reverting commit 2ad350fb4c for Arch could be considered, but upstream says that isn't viable because it would come at the expense of a fix for a "regression with device assignment" related to removing memslots. Also, 5.2.11 was released almost 2 weeks ago and I haven't seen others reporting this issue.
Closed by  Antonio Rojas (arojas)
Saturday, 21 September 2019, 09:07 GMT
Reason for closing:  Fixed
Comment by James Harvey (jamespharvey20) - Friday, 20 September 2019, 09:59 GMT
Upstream said this is fixed in 5.3: https://www.spinics.net/lists/kvm/msg195685.html

I confirmed it fixes the problem for me.
