FS#66578 - [qemu] 4.2.0-2 -> 5.0.0-5 // Windows 10 guest BSOD on boot
Attached to Project:
Arch Linux
Opened by Managarmr (managarmr) - Thursday, 07 May 2020, 13:05 GMT
Last edited by Anatol Pomozov (anatolik) - Tuesday, 28 July 2020, 18:49 GMT
Details
Description:
Upgrading the qemu version breaks my VM setup spectacularly with a BSOD (KMODE_EXCEPTION_NOT_HANDLED) on boot. I have verified that qemu works by downgrading my system to a snapshot of 2020-05-05.

The following packages have been downgraded:
warning: chromium: downgrading from version 81.0.4044.138-1 to version 81.0.4044.129-2
warning: dnsmasq: downgrading from version 2.81-4 to version 2.81-3
warning: edk2-ovmf: downgrading from version 202002-9 to version 202002-7
warning: filesystem: downgrading from version 2020.05.03-1 to version 2019.10-2
warning: gnutls: downgrading from version 3.6.13-2 to version 3.6.13-1
warning: gvfs: downgrading from version 1.44.1-3 to version 1.44.1-1
warning: jansson: downgrading from version 2.12-2 to version 2.12-1
warning: libsm: downgrading from version 1.2.3-2 to version 1.2.3-1
warning: libsoxr: downgrading from version 0.1.3-2 to version 0.1.3-1
warning: libxshmfence: downgrading from version 1.3-2 to version 1.3-1
warning: lmdb: downgrading from version 0.9.25-1 to version 0.9.24-1
warning: luajit: downgrading from version 2.0.5-3 to version 2.0.5-2
warning: nettle: downgrading from version 3.6-1 to version 3.5.1-2
warning: nodejs: downgrading from version 14.2.0-1 to version 14.1.0-2
warning: poppler-data: downgrading from version 0.4.9-2 to version 0.4.9-1
warning: pygobject-devel: downgrading from version 3.36.1-1 to version 3.36.0-2
warning: python-gobject: downgrading from version 3.36.1-1 to version 3.36.0-2
warning: qemu: downgrading from version 5.0.0-5 to version 4.2.0-2
warning: rasqal: downgrading from version 1:0.9.33-3 to version 1:0.9.33-2
warning: rest: downgrading from version 0.8.1-2 to version 0.8.1-1
warning: riot-desktop: downgrading from version 1.6.0-1 to version 1.5.15-1
warning: riot-web: downgrading from version 1.6.0-1 to version 1.5.15-1
warning: shared-mime-info: downgrading from version 2.0+1+g6bf9e4f-1 to version 1.15-2
warning: slang: downgrading from version 2.3.2-2 to version 2.3.2-1
warning: wget: downgrading from version 1.20.3-3 to version 1.20.3-2
warning: xcb-util: downgrading from version 0.4.0-3 to version 0.4.0-2
warning: xcb-util-cursor: downgrading from version 0.1.3-3 to version 0.1.3-2
warning: xcb-util-image: downgrading from version 0.4.0-3 to version 0.4.0-2
warning: xcb-util-keysyms: downgrading from version 0.4.0-3 to version 0.4.0-2
warning: xcb-util-renderutil: downgrading from version 0.3.9-3 to version 0.3.9-2
warning: xcb-util-wm: downgrading from version 0.4.1-3 to version 0.4.1-2
warning: xcb-util-xrm: downgrading from version 1.3-2 to version 1.3-1
warning: xorg-bdftopcf: downgrading from version 1.1-2 to version 1.1-1
warning: xorg-server: downgrading from version 1.20.8-2 to version 1.20.8-1
warning: xorg-server-common: downgrading from version 1.20.8-2 to version 1.20.8-1
warning: xorg-xset: downgrading from version 1.2.4-2 to version 1.2.4-1

Steps to reproduce:
Set up a VM on the old qemu version, upgrade, watch it break.

My CPU is a Ryzen 9 3900x and I am passing through an NVIDIA RTX 2070 SUPER.
The kernel has been patched with the following patch:

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 35d0d638d..3555ccf1c 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5095,6 +5095,10 @@ static void quirk_intel_no_flr(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_intel_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_intel_no_flr);
 
+/* FLR causes Ryzen 3000s built-in HD Audio & USB Controllers to hang on VFIO passthrough */
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_intel_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_intel_no_flr);
+
 static void quirk_no_ext_tags(struct pci_dev *pdev)
 {
 	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);

My libvirt XML:

<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
  <name>win10-gaming</name>
  <uuid>xx</uuid>
  <title>Windows 10 Gaming</title>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/10"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit="KiB">8388608</memory>
  <currentMemory unit="KiB">8388608</currentMemory>
  <vcpu placement="static">12</vcpu>
  <os>
    <type arch="x86_64" machine="pc-q35-4.2">hvm</type>
    <loader readonly="yes" type="pflash">/usr/share/ovmf/x64/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win10-gaming_VARS.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
      <vendor_id state="on" value="133713371337"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>
    <vmport state="off"/>
  </features>
  <cpu mode="host-passthrough" check="none">
    <topology sockets="1" cores="6" threads="2"/>
  </cpu>
  <clock offset="localtime">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
    <timer name="hypervclock" present="yes"/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2"/>
      <source file="/home/managarmr/VirtualDisks/windows10-gaming.qcow2"/>
      <target dev="sda" bus="sata"/>
      <boot order="1"/>
      <address type="drive" controller="0" bus="0" target="0" unit="0"/>
    </disk>
    <controller type="usb" index="0" model="qemu-xhci" ports="15">
      <address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0"/>
    </controller>
    <controller type="sata" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2"/>
    </controller>
    <controller type="pci" index="0" model="pcie-root"/>
    <controller type="pci" index="1" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="1" port="0x8"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="2" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="2" port="0x9"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x1"/>
    </controller>
    <controller type="pci" index="3" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="3" port="0xa"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x2"/>
    </controller>
    <controller type="pci" index="4" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="4" port="0xb"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x3"/>
    </controller>
    <controller type="pci" index="5" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="5" port="0xc"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x4"/>
    </controller>
    <controller type="pci" index="6" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="6" port="0xd"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x5"/>
    </controller>
    <controller type="pci" index="7" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="7" port="0xe"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x6"/>
    </controller>
    <controller type="pci" index="8" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="8" port="0xf"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x7"/>
    </controller>
    <interface type="network">
      <mac address="xx"/>
      <source network="default"/>
      <model type="e1000e"/>
      <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
    </interface>
    <input type="keyboard" bus="ps2"/>
    <input type="mouse" bus="ps2"/>
    <sound model="ich9">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1b" function="0x0"/>
    </sound>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x0a" slot="0x00" function="0x0"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x0a" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x0a" slot="0x00" function="0x2"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x0a" slot="0x00" function="0x3"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x0d" slot="0x00" function="0x3"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x08" slot="0x00" function="0x0"/>
    </hostdev>
    <memballoon model="virtio">
      <address type="pci" domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
    </memballoon>
  </devices>
  <qemu:commandline>
    <qemu:env name="QEMU_AUDIO_DRV" value="pa"/>
    <qemu:env name="QEMU_PA_SERVER" value="/run/user/1000/pulse/native"/>
  </qemu:commandline>
</domain>
"/usr/share/ovmf/x64/OVMF_CODE.fd" -> "/usr/share/edk2-ovmf/x64/OVMF_CODE.secboot.fd"
Does it make any difference for you?
I've added it as a permanent change now as the package name changed and switching seems sane.
Ultimately the bug persists.
edk2-ovmf 202002-9
qemu 4.2.0-2
and
edk2-ovmf 202002-7
qemu 4.2.0-2
Thank you for confirming it. Could you please also check it with the previous testing versions of qemu 5.0.0-4, 5.0.0-3 ....?
It's been reported on reddit [1]. A cause has been identified and workarounds are available. It affects only the Zen 2 architecture. Most likely an upstream qemu bug, or possibly a kernel bug.
[1]: https://www.reddit.com/r/VFIO/comments/gf53o8/upgrading_to_qemu_5_broke_my_setup_windows_bsods/
Oh wow, thank you very much. Switching from host-passthrough to host-model for the CPU worked.
Thanks a bunch :)
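The workaround reported above (changing the CPU mode from host-passthrough to host-model) can be sketched as a small script. This is only an illustration, not part of the original report: the function name `switch_cpu_mode` and the trimmed `SAMPLE` domain are hypothetical, and in practice the change would normally be made by hand via `virsh edit`.

```python
import xml.etree.ElementTree as ET


def switch_cpu_mode(domain_xml: str,
                    old: str = "host-passthrough",
                    new: str = "host-model") -> str:
    """Return the domain XML with <cpu mode=old> rewritten to mode=new.

    Hypothetical helper: any other CPU mode is left untouched.
    """
    root = ET.fromstring(domain_xml)
    cpu = root.find("cpu")
    if cpu is not None and cpu.get("mode") == old:
        cpu.set("mode", new)
    return ET.tostring(root, encoding="unicode")


# Trimmed-down stand-in for a libvirt domain definition (assumption:
# only the <cpu> element matters for this sketch).
SAMPLE = """<domain type="kvm">
  <name>win10-gaming</name>
  <cpu mode="host-passthrough" check="none">
    <topology sockets="1" cores="6" threads="2"/>
  </cpu>
</domain>"""

updated = switch_cpu_mode(SAMPLE)
print(updated)
```

Note that a full domain definition with namespaced elements (such as the `qemu:` and `libosinfo:` prefixes) would need `ET.register_namespace` calls to keep its prefixes when serialized this way.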
It boots fine (and I have never seen a bluescreen), so I'm not quite sure it is related, but I wanted to ask if someone out there can confirm or deny.
I don't use pass-through, just a basically default QXL libvirt VM (virt-manager) created just weeks ago.
<libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
<libosinfo:os id="http://microsoft.com/win/10"/>
</libosinfo:libosinfo>
...
<cpu mode="host-model" check="none"/>
...
<video>
<model type="qxl" ram="65536" vram="65536" vgamem="16384" heads="1" primary="yes"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x0"/>
</video>
this sounds related:
https://www.reddit.com/r/VFIO/comments/gf9cay/qemukvm_windows_10_guest_got_very_very_laggy_arch/
btw dmesg shows nothing special and linux runs as fast as ever...
This sounds like a different issue to OP. Might be better suited to the forums, in fact there's already a thread:
https://bbs.archlinux.org/viewtopic.php?id=255489
Things to check for Win10 VMs:
- virtio is a must (both network and storage) with appropriate drivers
- if not using VGA passthrough, QXL drivers are a must
- audio is often flaky. As a test, try (temporarily) removing the virtual sound card (ich9) and see if the problem still reproduces.
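The checklist above could be turned into a rough automated check of a domain definition. The following is a minimal sketch under stated assumptions: `check_domain` is a hypothetical helper, the sample XML is a trimmed stand-in, and treating any PCI `<hostdev>` as "VGA passthrough" is a crude heuristic, since the XML alone does not say which host device is the GPU.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_domain(domain_xml: str) -> List[str]:
    """Return human-readable findings for the checklist items."""
    root = ET.fromstring(domain_xml)
    devices = root.find("devices")
    findings: List[str] = []
    if devices is None:
        return findings

    # Storage: flag every disk whose target bus is not virtio.
    for disk in devices.findall("disk"):
        target = disk.find("target")
        bus = target.get("bus") if target is not None else None
        if bus != "virtio":
            findings.append(f"disk bus is {bus!r}, not virtio")

    # Network: flag every interface whose model is not virtio.
    for nic in devices.findall("interface"):
        model = nic.find("model")
        kind = model.get("type") if model is not None else None
        if kind != "virtio":
            findings.append(f"network model is {kind!r}, not virtio")

    # Video: without a PCI hostdev (heuristic for passthrough), expect QXL.
    has_pci_hostdev = devices.find("hostdev[@type='pci']") is not None
    video_models = [m.get("type") for m in devices.findall("video/model")]
    if not has_pci_hostdev and "qxl" not in video_models:
        findings.append("no PCI passthrough device and no QXL video model")

    # Audio: any virtual sound card is a candidate for temporary removal.
    for sound in devices.findall("sound"):
        findings.append(
            f"sound card {sound.get('model')!r} present; try removing it as a test"
        )
    return findings


SAMPLE = """<domain type="kvm">
  <devices>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2"/>
      <target dev="sda" bus="sata"/>
    </disk>
    <interface type="network">
      <model type="e1000e"/>
    </interface>
    <sound model="ich9"/>
    <hostdev mode="subsystem" type="pci" managed="yes"/>
  </devices>
</domain>"""

for finding in check_domain(SAMPLE):
    print(finding)
```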
DPC Latency Checker in Windows found big issues with the sound, network and storage drivers, so I have also tested without virtio, and it has the same issue.
I tried downgrading qemu but I think there are other packages that need to be downgraded because it was complaining about missing libraries.
What errors do you see?
It seems to be working great now. I will know more when I use it all day with Visual Studio on Monday for work.
I run my Win10 VM with io=threads and don't see the issue.
Options to test:
1. run your VMs with io='threads' in the XML for the disk.
2. build qemu with `--disable-linux-io-uring`
3. remove liburing from deps and rebuild qemu (same effect as 2)
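Option 1 above amounts to adding an `io="threads"` attribute to the `<driver>` element of each disk. A minimal sketch of that edit follows; `set_disk_io` and the trimmed sample domain are hypothetical names for illustration, and the change would usually be made manually with `virsh edit` rather than by script.

```python
import xml.etree.ElementTree as ET


def set_disk_io(domain_xml: str, io_mode: str = "threads") -> str:
    """Set io=<io_mode> on the <driver> of every disk with device="disk"."""
    root = ET.fromstring(domain_xml)
    for driver in root.findall("devices/disk[@device='disk']/driver"):
        driver.set("io", io_mode)
    return ET.tostring(root, encoding="unicode")


# Trimmed stand-in for a domain definition; the file path is a placeholder.
SAMPLE = """<domain type="kvm">
  <devices>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2"/>
      <source file="/path/to/disk.qcow2"/>
      <target dev="sda" bus="sata"/>
    </disk>
  </devices>
</domain>"""

patched = set_disk_io(SAMPLE)
print(patched)
```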
[Code]
<disk type="file" device="disk">
<driver name="qemu" type="qcow2" io="threads"/>
<source file="/run/media/Daten/VM/win10-buero.qcow2"/>
<backingStore/>
<target dev="sda" bus="sata"/>
<address type="drive" controller="0" bus="0" target="0" unit="0"/>
</disk>
[/Code]
but playing a Full HD video made the VM hang after around 2.5 minutes.
Yeah sorry, I now understand that won't work.
It appears io_uring is used in at least 2 places in latest qemu:
1. block driver
2. fd monitoring
1 is a tunable and therefore optional. 2 is not AFAICT, and this is the bit apparently upsetting Win10 VMs.
Seeing as multiple folks are affected, it's probably best if Arch disable io_uring for the time being (until upstream can look at @wkchu's bug report [1] and hopefully fix it).
@Anatol, what do you think? Simply removing liburing from _headlessdeps and rebuilding will do the trick.
[1]: https://bugs.launchpad.net/qemu/+bug/1877716
Sounds good to me. liburing has been disabled and pushed to [testing] as qemu-5.0.0-6. Please take a look and let me know if it fixed the issue for you.
Please keep io_uring support in place. There are people who use it. Thanks.
Please re-read my comments above. The problematic io_uring usage is in the file descriptor monitoring portion of qemu. This part *cannot* be disabled (at least as far as I can see, but maybe it can be patched out?). Yes, it's unfortunate for those who want to utilise it in block I/O, but hopefully this disablement will only be temporary.
For me it is resolved now! Thanks a bunch to all of you!
We are definitely interested in seeing it enabled. But this will only be done once the feature has stabilized and no longer causes major issues for our users.
Once https://bugs.launchpad.net/qemu/+bug/1877716 is resolved we will reconsider enabling this feature.
Running qemu-5.0.0-6 from [testing] for nearly two hours now and the bug appears to be gone. Thanks.