FS#70101 - [nvidia] Forwarding eGPUs into qemu (KVM) has become flaky

Attached to Project: Arch Linux
Opened by Andrej Podzimek (andrej) - Sunday, 21 March 2021, 02:43 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Wednesday, 21 April 2021, 05:34 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro), Felix Yan (felixonmars)
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
After a recent update (I'm not sure which one; the main suspect is nvidia 460.56 -> 460.67), forwarding an eGPU with virt-manager + qemu + KVM works only during a few lucky uptimes and fails during most others. I haven't spotted a precise pattern, but within a given uptime it either always works or always fails. This was rock-solid until a few days ago, with no lucky or unlucky boots/uptimes.

Tried (without any results):
* both Thunderbolt ports for the eGPU; both are affected
* presence / absence of other Thunderbolt devices (doesn't seem to matter)
* all sorts of reboots (~20+) (randomness still prevails)
* module unloads / reloads (as far as possible; not too far) of vfio_.* and nvidia.*

Haven't tried (but perhaps should have):
* building 460.56 and reverting to that version
* heck, this has been an all-nighter and I've just stumbled upon an uptime in which 460.67 + qemu + KVM + eGPU works again, so I don't feel like rebooting it again, tbh

Additional info:
* package version(s)

libvirt 1:7.0.0-3
libvirt-dbus 1.4.0-1
libvirt-glib 3.0.0-2
libvirt-python 1:6.4.0-3
nvidia-dkms 460.67-1
nvidia-settings 460.67-1
nvidia-utils 460.67-1
opencl-nvidia 460.67-1
qemu 5.2.0-3
qemu-arch-extra 5.2.0-3
virt-install 3.2.0-1
virt-manager 3.2.0-1
virt-viewer 9.0-1

* config and/or log files etc.

Some dmesg stuff when it works:
Mar 21 03:10:04 charon kernel: VFIO - User Level meta-driver version: 0.3
Mar 21 03:10:05 charon kernel: vfio-pci 0000:3d:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none
Mar 21 03:10:10 charon kernel: vfio-pci 0000:3d:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Mar 21 03:10:10 charon kernel: vfio-pci 0000:3d:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x0000
Mar 21 03:11:26 charon kernel: vfio-pci 0000:3d:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Mar 21 03:11:26 charon kernel: nvidia 0000:3d:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none

Some dmesg stuff when it doesn't work:
Mar 21 02:52:51 charon kernel: VFIO - User Level meta-driver version: 0.3
Mar 21 02:52:51 charon kernel: vfio-pci 0000:3d:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none
Mar 21 02:52:57 charon kernel: vfio-pci 0000:3d:00.0: can't enable device: BAR 5 [io 0x0000-0x007f] not claimed
Mar 21 02:52:58 charon kernel: vfio-pci 0000:3d:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Mar 21 02:52:58 charon kernel: nvidia 0000:3d:00.0: can't enable device: BAR 5 [io 0x0000-0x007f] not claimed
Mar 21 02:52:58 charon kernel: nvidia: probe of 0000:3d:00.0 failed with error -1
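
To tell a failing uptime from a working one without reading the journal by eye, the "not claimed" lines above can be picked out mechanically. This is only a minimal sketch: the function name `find_unclaimed_bars` and the regular expression are illustrative assumptions, written against the exact message format shown in the two excerpts above.

```python
import re

# Matches lines such as:
#   "... kernel: vfio-pci 0000:3d:00.0: can't enable device: BAR 5 [io 0x0000-0x007f] not claimed"
# The pattern is an assumption based only on the log excerpts in this report.
BAR_ERROR = re.compile(
    r"kernel: (?P<driver>[\w-]+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"can't enable device: BAR (?P<bar>\d+) \[(?P<range>[^\]]+)\] not claimed"
)


def find_unclaimed_bars(log_text):
    """Return one (driver, bdf, bar, range) tuple per 'not claimed' line."""
    return [
        (m["driver"], m["bdf"], int(m["bar"]), m["range"])
        for m in BAR_ERROR.finditer(log_text)
    ]
```

Feeding it the kernel log of the current boot (saved to a file, for example) gives an empty list for a "lucky" uptime and one entry per affected driver otherwise.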

* link to upstream bug report, if any
N/A. The error message in the Linux sources is from ~2014. There are a few seemingly related bugs from 2014-2017, but that's just too distant history.

Steps to reproduce:
Reboot a few times.
Each time, try to forward an NVIDIA eGPU (PCIe device in virt-manager) into a qemu + KVM machine.

Some extra notes:
* Even when the vfio forwarding doesn't work, the NVIDIA card itself works just fine on the host. It can run Folding@Home, for example, no problem at all.
* The machine is an ASRock X570 Creator (BIOS 3.40) with its default built-in Thunderbolt and a Radeon Pro W5700 inside.
* The eGPU is an NVIDIA Quadro P5000 in a Razer Core X Chroma enclosure.
* I have the following magic incantation on the kernel command line: pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=256M,hpmmioprefsize=16G
See this thread for more context on why it's needed: https://bbs.archlinux.org/viewtopic.php?id=261303
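
Since the failure is about resource assignment, it may be worth confirming on each boot that the `pci=` options above are really the ones in effect. A minimal sketch, assuming the command line is read from `/proc/cmdline`; the helper name `parse_pci_options` is hypothetical and only splits the string, it does not interpret the values.

```python
def parse_pci_options(cmdline):
    """Split the pci= parameter of a kernel command line into a dict.

    Flags without a value (e.g. "realloc") map to True; options with a
    value (e.g. "hpbussize=0x33") keep the value as an unparsed string.
    """
    options = {}
    for token in cmdline.split():
        if token.startswith("pci="):
            for item in token[len("pci="):].split(","):
                key, sep, value = item.partition("=")
                options[key] = value if sep else True
    return options


# Possible usage on the affected host:
#   with open("/proc/cmdline") as f:
#       print(parse_pci_options(f.read()))
```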

Closed by Sven-Hendrik Haase (Svenstaro)
Wednesday, 21 April 2021, 05:34 GMT
Reason for closing: Won't fix
Additional comments about closing: 2021-04-05: A task closure has been requested. Reason for request: I can't reproduce this any more.
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 21 March 2021, 10:00 GMT
Well, I cannot test this, and even if I could, I don't think there's anything I can do. Did you try contacting Nvidia? Is this perhaps a timing issue? You could theoretically blacklist some of the affected modules so they don't come up automatically and then load them yourself, or you could make your own initcpio hooks to try to debug this problem during boot somehow.
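
If the timing theory is pursued by loading modules by hand, it helps to record which driver owns the device at each step. The sketch below reads the `driver` symlink that sysfs exposes for a bound PCI device. The function name `bound_driver` and the `sysfs_root` parameter are illustrative assumptions; the address 0000:3d:00.0 is the one from the logs in this report.

```python
import os


def bound_driver(bdf, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None.

    bdf is the PCI address as printed by the kernel, e.g. "0000:3d:00.0".
    sysfs_root is a parameter only so the function can be tried against
    a fake directory tree.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or no such device
    return os.path.basename(os.readlink(link))


# Possible usage between manual modprobe steps:
#   print(bound_driver("0000:3d:00.0"))
```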
Comment by Sven-Hendrik Haase (Svenstaro) - Monday, 05 April 2021, 07:07 GMT
Any further update on this? I don't have an eGPU to test and frankly the information provided so far is too diffuse for me to make a solid guess as to what's up here.
