FS#66578 - [qemu] 4.2.0-2 -> 5.0.0-5 // Windows 10 guest BSOD on boot

Attached to Project: Arch Linux
Opened by Managarmr (managarmr) - Thursday, 07 May 2020, 13:05 GMT
Last edited by Anatol Pomozov (anatolik) - Tuesday, 28 July 2020, 18:49 GMT
Task Type: Bug Report
Category: Packages: Extra
Status: Closed
Assigned To: Tobias Powalowski (tpowa), Anatol Pomozov (anatolik)
Architecture: x86_64
Severity: High
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 3
Private: No

Details

Description:

Upgrading the qemu version breaks my VM setup spectacularly: the Windows 10 guest BSODs (KMODE_EXCEPTION_NOT_HANDLED) on boot.
I have verified that qemu works again after downgrading my system to a snapshot from 2020-05-05.
The following packages have been downgraded:

warning: chromium: downgrading from version 81.0.4044.138-1 to version 81.0.4044.129-2
warning: dnsmasq: downgrading from version 2.81-4 to version 2.81-3
warning: edk2-ovmf: downgrading from version 202002-9 to version 202002-7
warning: filesystem: downgrading from version 2020.05.03-1 to version 2019.10-2
warning: gnutls: downgrading from version 3.6.13-2 to version 3.6.13-1
warning: gvfs: downgrading from version 1.44.1-3 to version 1.44.1-1
warning: jansson: downgrading from version 2.12-2 to version 2.12-1
warning: libsm: downgrading from version 1.2.3-2 to version 1.2.3-1
warning: libsoxr: downgrading from version 0.1.3-2 to version 0.1.3-1
warning: libxshmfence: downgrading from version 1.3-2 to version 1.3-1
warning: lmdb: downgrading from version 0.9.25-1 to version 0.9.24-1
warning: luajit: downgrading from version 2.0.5-3 to version 2.0.5-2
warning: nettle: downgrading from version 3.6-1 to version 3.5.1-2
warning: nodejs: downgrading from version 14.2.0-1 to version 14.1.0-2
warning: poppler-data: downgrading from version 0.4.9-2 to version 0.4.9-1
warning: pygobject-devel: downgrading from version 3.36.1-1 to version 3.36.0-2
warning: python-gobject: downgrading from version 3.36.1-1 to version 3.36.0-2
warning: qemu: downgrading from version 5.0.0-5 to version 4.2.0-2
warning: rasqal: downgrading from version 1:0.9.33-3 to version 1:0.9.33-2
warning: rest: downgrading from version 0.8.1-2 to version 0.8.1-1
warning: riot-desktop: downgrading from version 1.6.0-1 to version 1.5.15-1
warning: riot-web: downgrading from version 1.6.0-1 to version 1.5.15-1
warning: shared-mime-info: downgrading from version 2.0+1+g6bf9e4f-1 to version 1.15-2
warning: slang: downgrading from version 2.3.2-2 to version 2.3.2-1
warning: wget: downgrading from version 1.20.3-3 to version 1.20.3-2
warning: xcb-util: downgrading from version 0.4.0-3 to version 0.4.0-2
warning: xcb-util-cursor: downgrading from version 0.1.3-3 to version 0.1.3-2
warning: xcb-util-image: downgrading from version 0.4.0-3 to version 0.4.0-2
warning: xcb-util-keysyms: downgrading from version 0.4.0-3 to version 0.4.0-2
warning: xcb-util-renderutil: downgrading from version 0.3.9-3 to version 0.3.9-2
warning: xcb-util-wm: downgrading from version 0.4.1-3 to version 0.4.1-2
warning: xcb-util-xrm: downgrading from version 1.3-2 to version 1.3-1
warning: xorg-bdftopcf: downgrading from version 1.1-2 to version 1.1-1
warning: xorg-server: downgrading from version 1.20.8-2 to version 1.20.8-1
warning: xorg-server-common: downgrading from version 1.20.8-2 to version 1.20.8-1
warning: xorg-xset: downgrading from version 1.2.4-2 to version 1.2.4-1

Steps to reproduce:
Set up a VM on the old qemu version, upgrade, and watch it break.

My CPU is a Ryzen 9 3900x and I am passing through a NVIDIA RTX 2070 SUPER.
The kernel has been patched with the following patch:

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 35d0d638d..3555ccf1c 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5095,6 +5095,10 @@ static void quirk_intel_no_flr(struct pci_dev *dev)
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_intel_no_flr);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_intel_no_flr);

+/* FLR causes Ryzen 3000s built-in HD Audio & USB Controllers to hang on VFIO passthrough */
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_intel_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_intel_no_flr);
+
static void quirk_no_ext_tags(struct pci_dev *pdev)
{
struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);


My libvirt XML:
<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
<name>win10-gaming</name>
<uuid>xx</uuid>
<title>Windows 10 Gaming</title>
<metadata>
<libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
<libosinfo:os id="http://microsoft.com/win/10"/>
</libosinfo:libosinfo>
</metadata>
<memory unit="KiB">8388608</memory>
<currentMemory unit="KiB">8388608</currentMemory>
<vcpu placement="static">12</vcpu>
<os>
<type arch="x86_64" machine="pc-q35-4.2">hvm</type>
<loader readonly="yes" type="pflash">/usr/share/ovmf/x64/OVMF_CODE.fd</loader>
<nvram>/var/lib/libvirt/qemu/nvram/win10-gaming_VARS.fd</nvram>
</os>
<features>
<acpi/>
<apic/>
<hyperv>
<relaxed state="on"/>
<vapic state="on"/>
<spinlocks state="on" retries="8191"/>
<vendor_id state="on" value="133713371337"/>
</hyperv>
<kvm>
<hidden state="on"/>
</kvm>
<vmport state="off"/>
</features>
<cpu mode="host-passthrough" check="none">
<topology sockets="1" cores="6" threads="2"/>
</cpu>
<clock offset="localtime">
<timer name="rtc" tickpolicy="catchup"/>
<timer name="pit" tickpolicy="delay"/>
<timer name="hpet" present="no"/>
<timer name="hypervclock" present="yes"/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<pm>
<suspend-to-mem enabled="no"/>
<suspend-to-disk enabled="no"/>
</pm>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type="file" device="disk">
<driver name="qemu" type="qcow2"/>
<source file="/home/managarmr/VirtualDisks/windows10-gaming.qcow2"/>
<target dev="sda" bus="sata"/>
<boot order="1"/>
<address type="drive" controller="0" bus="0" target="0" unit="0"/>
</disk>
<controller type="usb" index="0" model="qemu-xhci" ports="15">
<address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0"/>
</controller>
<controller type="sata" index="0">
<address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2"/>
</controller>
<controller type="pci" index="0" model="pcie-root"/>
<controller type="pci" index="1" model="pcie-root-port">
<model name="pcie-root-port"/>
<target chassis="1" port="0x8"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x0" multifunction="on"/>
</controller>
<controller type="pci" index="2" model="pcie-root-port">
<model name="pcie-root-port"/>
<target chassis="2" port="0x9"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x1"/>
</controller>
<controller type="pci" index="3" model="pcie-root-port">
<model name="pcie-root-port"/>
<target chassis="3" port="0xa"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x2"/>
</controller>
<controller type="pci" index="4" model="pcie-root-port">
<model name="pcie-root-port"/>
<target chassis="4" port="0xb"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x3"/>
</controller>
<controller type="pci" index="5" model="pcie-root-port">
<model name="pcie-root-port"/>
<target chassis="5" port="0xc"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x4"/>
</controller>
<controller type="pci" index="6" model="pcie-root-port">
<model name="pcie-root-port"/>
<target chassis="6" port="0xd"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x5"/>
</controller>
<controller type="pci" index="7" model="pcie-root-port">
<model name="pcie-root-port"/>
<target chassis="7" port="0xe"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x6"/>
</controller>
<controller type="pci" index="8" model="pcie-root-port">
<model name="pcie-root-port"/>
<target chassis="8" port="0xf"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x7"/>
</controller>
<interface type="network">
<mac address="xx"/>
<source network="default"/>
<model type="e1000e"/>
<address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
</interface>
<input type="keyboard" bus="ps2"/>
<input type="mouse" bus="ps2"/>
<sound model="ich9">
<address type="pci" domain="0x0000" bus="0x00" slot="0x1b" function="0x0"/>
</sound>
<hostdev mode="subsystem" type="pci" managed="yes">
<source>
<address domain="0x0000" bus="0x0a" slot="0x00" function="0x0"/>
</source>
<address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</hostdev>
<hostdev mode="subsystem" type="pci" managed="yes">
<source>
<address domain="0x0000" bus="0x0a" slot="0x00" function="0x1"/>
</source>
<address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
</hostdev>
<hostdev mode="subsystem" type="pci" managed="yes">
<source>
<address domain="0x0000" bus="0x0a" slot="0x00" function="0x2"/>
</source>
<address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
</hostdev>
<hostdev mode="subsystem" type="pci" managed="yes">
<source>
<address domain="0x0000" bus="0x0a" slot="0x00" function="0x3"/>
</source>
<address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
</hostdev>
<hostdev mode="subsystem" type="pci" managed="yes">
<source>
<address domain="0x0000" bus="0x0d" slot="0x00" function="0x3"/>
</source>
<address type="pci" domain="0x0000" bus="0x08" slot="0x00" function="0x0"/>
</hostdev>
<memballoon model="virtio">
<address type="pci" domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
</memballoon>
</devices>
<qemu:commandline>
<qemu:env name="QEMU_AUDIO_DRV" value="pa"/>
<qemu:env name="QEMU_PA_SERVER" value="/run/user/1000/pulse/native"/>
</qemu:commandline>
</domain>

Closed by  Anatol Pomozov (anatolik)
Tuesday, 28 July 2020, 18:49 GMT
Reason for closing:  Fixed
Comment by Krister Bäckman (ixevix) - Thursday, 07 May 2020, 19:10 GMT
I'm affected by this as well.
Comment by Anatol Pomozov (anatolik) - Thursday, 07 May 2020, 19:32 GMT
I see 2 important package updates: qemu and edk2-ovmf. Any of these packages can potentially cause the problem. Could you please try to downgrade packages one by one and find out which package update caused the breakage?
Comment by Anatol Pomozov (anatolik) - Thursday, 07 May 2020, 19:36 GMT
And another quick check: could you please try replacing in your libvirt XML:

"/usr/share/ovmf/x64/OVMF_CODE.fd" -> "/usr/share/edk2-ovmf/x64/OVMF_CODE.secboot.fd"

Does it make any difference for you?
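
For reference, the suggested substitution can be sketched as a sed one-liner. This is only a standalone demo applied to a sample <loader> line; in practice one would edit the domain with `virsh edit win10-gaming`:

```shell
# Apply the suggested loader-path swap to a sample <loader> line.
# In practice you would edit the domain XML via `virsh edit`;
# this demo only shows the substitution itself.
line='<loader readonly="yes" type="pflash">/usr/share/ovmf/x64/OVMF_CODE.fd</loader>'
echo "$line" | sed 's|/usr/share/ovmf/x64/OVMF_CODE.fd|/usr/share/edk2-ovmf/x64/OVMF_CODE.secboot.fd|'
```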
Comment by Managarmr (managarmr) - Thursday, 07 May 2020, 19:42 GMT
Oh yes, sorry, I forgot to mention that - yes, I've changed it temporarily and it made no difference.
I've now made the change permanent, since the package name changed and switching seems sane.
Ultimately, the bug persists.
Comment by Managarmr (managarmr) - Thursday, 07 May 2020, 19:47 GMT
The issue appears to be qemu-based. The bug does not occur with the following version combinations:
edk2-ovmf 202002-9
qemu 4.2.0-2

and

edk2-ovmf 202002-7
qemu 4.2.0-2
Comment by Anatol Pomozov (anatolik) - Thursday, 07 May 2020, 19:49 GMT
> The issue appears to be qemu based.

Thank you for confirming it. Could you please also check the previous testing versions of qemu: 5.0.0-4, 5.0.0-3, ...?
Comment by Managarmr (managarmr) - Thursday, 07 May 2020, 19:58 GMT
I have double checked and the breakage appears to occur between qemu 4.2.0-2 and 5.0.0-1.
Comment by Toolybird (Toolybird) - Thursday, 07 May 2020, 20:29 GMT
> CPU is a Ryzen 9 3900x

It's been reported on reddit [1]. A cause has been identified and workarounds are available. Only the Zen 2 architecture is affected. Most likely an upstream qemu bug, or possibly a kernel one.

[1]: https://www.reddit.com/r/VFIO/comments/gf53o8/upgrading_to_qemu_5_broke_my_setup_windows_bsods/
Comment by Managarmr (managarmr) - Thursday, 07 May 2020, 20:42 GMT
> It's been reported on reddit [1]. A cause has been identified and workarounds are available. Definitely an upstream qemu bug affecting only Zen 2 architecture.

Oh wow, thank you very much. Switching from host-passthrough to host-model for the CPU worked.
Thanks a bunch :)
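
For anyone hitting the same BSOD: the workaround amounts to changing the <cpu> element in the domain XML. A sketch based on the XML above (the topology line is carried over only because the original domain sets one):

```xml
<!-- before: mode="host-passthrough" triggers the BSOD on Zen 2 with qemu 5 -->
<cpu mode="host-model" check="none">
  <topology sockets="1" cores="6" threads="2"/>
</cpu>
```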
Comment by Christian (Darius) - Friday, 08 May 2020, 20:49 GMT
Since today's update (pacman -Suy) to qemu 5, my Windows VM has become basically unusable. After a few minutes of runtime it gets more and more laggy until it almost freezes completely, e.g. with Skype or when playing a random Full HD video in Firefox. For example, half of this video (https://www.youtube.com/watch?v=mcixldqDIEQ) will freeze the VM. I also have a Ryzen 9 3900X; I tried switching to a different CPU model (qemu64) but it didn't help.
It boots fine (and I have never seen a bluescreen), so I'm not quite sure it is related, but I wanted to ask if someone out there can confirm or deny.

I don't have passthrough, just a basically default QXL libvirt VM created with virt-manager a few weeks ago.

<libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
<libosinfo:os id="http://microsoft.com/win/10"/>
</libosinfo:libosinfo>
...
<cpu mode="host-model" check="none"/>
...
<video>
<model type="qxl" ram="65536" vram="65536" vgamem="16384" heads="1" primary="yes"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x0"/>
</video>

This sounds related:
https://www.reddit.com/r/VFIO/comments/gf9cay/qemukvm_windows_10_guest_got_very_very_laggy_arch/

By the way, dmesg shows nothing special and Linux runs as fast as ever...
Comment by Toolybird (Toolybird) - Saturday, 09 May 2020, 00:08 GMT
> it gets more and more laggy until it allmost freezes completely

This sounds like a different issue from the OP's. It might be better suited to the forums; in fact, there's already a thread:

https://bbs.archlinux.org/viewtopic.php?id=255489

Things to check for Win10 VMs:
- virtio is a must (both network and storage) with appropriate drivers
- if not using VGA passthrough, QXL drivers are a must
- audio is often flaky. As a test, try (temporarily) removing the virtual sound card (ich9) and see if the problem reproduces.
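
To illustrate the first point, a virtio disk and NIC look roughly like this in the domain XML. This is a sketch: the source file path is a placeholder, and the guest needs the virtio-win drivers installed before switching buses, or it will not find its boot disk:

```xml
<disk type="file" device="disk">
  <driver name="qemu" type="qcow2"/>
  <source file="/path/to/win10.qcow2"/>
  <target dev="vda" bus="virtio"/>
</disk>
<interface type="network">
  <source network="default"/>
  <model type="virtio"/>
</interface>
```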
Comment by Bill Payne (Billzargo) - Saturday, 09 May 2020, 00:43 GMT
I do not think virtio was the issue here, though I would recommend using the virtio drivers too... I was using them when I experienced the issue.

DPC Latency Checker in Windows found big issues with the sound, network and storage drivers, so I have also tested without virtio and it shows the same issue.
I tried downgrading qemu, but I think other packages need to be downgraded along with it, because it was complaining about missing libraries.
Comment by Anatol Pomozov (anatolik) - Saturday, 09 May 2020, 01:06 GMT
> because it was complaining about missing libraries.

what errors do you see?
Comment by Bill Payne (Billzargo) - Saturday, 09 May 2020, 02:29 GMT
I did the downgrade properly this time, not just the one package (qemu), and it is working. I posted the steps I used here: https://bbs.archlinux.org/viewtopic.php?id=255489
It seems to be working great now. I will know more on Monday when I use it all day with Visual Studio for work.
Comment by Toolybird (Toolybird) - Saturday, 09 May 2020, 09:21 GMT
Hmm, it appears the addition of io_uring is the cause of the laggy Win10 guests for some folks. See the forum post from @wkchu in the above-referenced thread. Maybe io_uring is not ready for prime time.

I run my Win10 VM with io=threads and don't see the issue.

Options to test:

1. run your VMs with io='threads' in the XML for the disk.

2. build qemu with `--disable-linux-io-uring`

3. remove liburing from the deps and rebuild qemu (same effect as 2)
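
Option 3 boils down to one edit in the PKGBUILD before rebuilding with makepkg. A minimal sketch of just that edit, run here against an illustrative _headlessdeps line (the real array in the qemu PKGBUILD is longer and may differ):

```shell
# Illustrative _headlessdeps line; the real qemu PKGBUILD array differs.
printf "_headlessdeps=(seabios gnutls libpng libjpeg-turbo liburing lzo snappy)\n" > /tmp/pkgbuild-snippet
# Drop liburing so qemu's configure no longer detects io_uring support.
sed -i 's/ liburing//' /tmp/pkgbuild-snippet
cat /tmp/pkgbuild-snippet
```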
Comment by Christian (Darius) - Saturday, 09 May 2020, 10:53 GMT
Hmm, I did what I understood as option 1:

<disk type="file" device="disk">
<driver name="qemu" type="qcow2" io="threads"/>
<source file="/run/media/Daten/VM/win10-buero.qcow2"/>
<backingStore/>
<target dev="sda" bus="sata"/>
<address type="drive" controller="0" bus="0" target="0" unit="0"/>
</disk>

but playing the Full HD video made the VM get stuck after around 2.5 minutes.
Comment by Toolybird (Toolybird) - Saturday, 09 May 2020, 20:59 GMT
> Hmm i did what i understood as Option 1.

Yeah sorry, I now understand that won't work.

It appears io_uring is used in at least 2 places in the latest qemu:

1. the block driver
2. fd monitoring

1 is a tunable and therefore optional. 2 is not, AFAICT, and this is the bit apparently upsetting Win10 VMs.

Seeing as multiple folks are affected, it's probably best if Arch disables io_uring for the time being (until upstream can look at @wkchu's bug report [1] and hopefully fix it).

@Anatol, what do you think? Simply removing liburing from _headlessdeps and rebuilding will do the trick.

[1]: https://bugs.launchpad.net/qemu/+bug/1877716
Comment by Anatol Pomozov (anatolik) - Saturday, 09 May 2020, 21:49 GMT
> it's probably best if Arch disable io_uring for the time being

Sounds good to me. liburing has been disabled and pushed to [testing] as qemu-5.0.0-6. Please take a look and let me know if it fixed the issue for you.
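
A quick way to confirm which build you are running is to check whether the qemu binary links liburing at all. A sketch, assuming the Arch package path (the fallback branch covers machines where qemu is not installed):

```shell
# Report whether the installed qemu-system binary links against liburing.
# /usr/bin/qemu-system-x86_64 is the Arch default; override via QEMU_BIN.
qemu_bin="${QEMU_BIN:-/usr/bin/qemu-system-x86_64}"
if [ -x "$qemu_bin" ] && ldd "$qemu_bin" 2>/dev/null | grep -q liburing; then
    echo "liburing linked: io_uring support present"
else
    echo "liburing not linked (or qemu not installed)"
fi
```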
Comment by Oleksandr Natalenko (post-factum) - Saturday, 09 May 2020, 22:13 GMT
You don't have to use io_uring if it slows things down for you, and it is not the default backend for I/O.

Please keep io_uring support in place. There are people who use it. Thanks.
Comment by Toolybird (Toolybird) - Saturday, 09 May 2020, 23:17 GMT
> You don't have to use io_uring

Please re-read my comments above. The problematic io_uring usage is in the file-descriptor monitoring portion of qemu. This part *cannot* be disabled (at least not that I can see - but maybe it can be patched out?). Yes, it's unfortunate for those who want to utilise it in block I/O, but hopefully this disablement will only be temporary.
Comment by Christian (Darius) - Sunday, 10 May 2020, 06:49 GMT
> ..[testing] as qemu-5.0.0-6. Please take a look and let me know if it fixed the issue for you.

For me it is resolved now! Thanks a bunch to all of you!
Comment by Anatol Pomozov (anatolik) - Sunday, 10 May 2020, 07:11 GMT
> Please keep io_uring support in place.

We are definitely interested in seeing it enabled, but that will be done once the feature has stabilized and no longer causes major issues for our users.

Once https://bugs.launchpad.net/qemu/+bug/1877716 is resolved we will reconsider enabling this feature.
Comment by zkrx (wkchu) - Sunday, 10 May 2020, 16:13 GMT
> ..[testing] as qemu-5.0.0-6. Please take a look and let me know if it fixed the issue for you.

Running qemu-5.0.0-6 from [testing] for nearly two hours now and the bug appears to be gone. Thanks.
Comment by Oleksandr Natalenko (post-factum) - Sunday, 10 May 2020, 19:55 GMT
Could you please check whether it makes any difference if you use io_uring with the guest disk image stored directly on an LVM volume (or a plain partition) instead of in a file?
Comment by zkrx (wkchu) - Monday, 11 May 2020, 08:13 GMT
My guest uses NVMe passthrough, so I guess no file is involved (the whole disk is given to Windows). My logs are attached to the upstream bug report. Is this answer sufficient?
Comment by Oleksandr Natalenko (post-factum) - Monday, 11 May 2020, 08:15 GMT
Yes, thanks.
Comment by Bill Payne (Billzargo) - Monday, 11 May 2020, 22:44 GMT
I have been using qemu-5.0.0-6 for about an hour now and it has not locked up. I have both Visual Studio and Tibco JasperSoft Studio open, so it should have locked up by now if the bug were still present.
Comment by Anatol Pomozov (anatolik) - Wednesday, 20 May 2020, 01:17 GMT
Reenabling io_uring is tracked here https://bugs.archlinux.org/task/66710
