FS#72991 - [edk2-ovmf] Black screen when doing single GPU passthrough

Attached to Project: Arch Linux
Opened by Shivanshu Goyal (Haxxer64) - Monday, 13 December 2021, 04:33 GMT
Last edited by David Runge (dvzrv) - Saturday, 18 December 2021, 19:25 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Anatol Pomozov (anatolik), David Runge (dvzrv)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 3
Private No

Details

Description: Black screen when doing single GPU passthrough

Problematic package: edk2-ovmf

This problem started when I upgraded from version 202108-1 to 202111-4. The issue goes away when I roll this package back to the old version.
Other people also seem to be running into this issue after updating this package. There are some reports on this Reddit thread:
https://www.reddit.com/r/VFIO/comments/reuozk/something_weird_is_happening_with_the_virtual/

This is the libvirt XML of the VM which breaks after updating this package:
https://pastebin.com/pKWvVD0y

Note that there are no error messages produced by libvirt. It successfully boots up the VM, but I don't see anything on the screen when it boots up.
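
For reference, the rollback mentioned above could look roughly like the sketch below. It assumes the old package file is still in pacman's default cache directory (/var/cache/pacman/pkg). The helper name cached_pkg is hypothetical and only locates the file; it does not install anything.

```shell
# cached_pkg NAME VERSION [CACHE_DIR]
# Print the path of a cached package file for NAME at VERSION (pkgver-pkgrel).
# CACHE_DIR defaults to pacman's standard cache location (an assumption about
# the local setup). Returns non-zero if no matching file is found.
cached_pkg() {
    name=$1
    version=$2
    cache=${3:-/var/cache/pacman/pkg}
    for f in "$cache/$name-$version"-*.pkg.tar.*; do
        # Skip detached signature files that live next to the packages.
        case $f in *.sig) continue ;; esac
        [ -e "$f" ] && { printf '%s\n' "$f"; return 0; }
    done
    return 1
}
```

With the path in hand, something like `sudo pacman -U "$(cached_pkg edk2-ovmf 202108-1)"` should reinstall the old version.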

Closed by David Runge (dvzrv)
Saturday, 18 December 2021, 19:25 GMT
Reason for closing: Fixed
Additional comments about closing: Fixed with edk2-ovmf 202111-5
Comment by David Runge (dvzrv) - Monday, 13 December 2021, 07:05 GMT
@Haxxer64: Thanks for the report.

Note: Please make sure to attach the relevant file(s) to the ticket using the attachment functionality, as we otherwise have to rely upon non-free paste services such as pastebin.com, which are not very privacy friendly.

Regarding the issue: I do not use this type of setup, so this will require some time to investigate.
There have been some changes [1] to the package since 202108-1:

* upgrade and CSM support [2]
* separating CSM images [3]
* add descriptor files for CSM, add TPM_ENABLE where possible [4]
* better naming scheme for descriptor files, S4 support, images for IA32 UEFI on x86_64 [5]

For good measure, are you able to test whether a new machine exhibits the same behavior?
I believe that your issue can probably be solved by updating one of your configuration files (e.g. an updated firmware location).

[1] https://github.com/archlinux/svntogit-packages/commits/packages/edk2/trunk
[2] https://github.com/archlinux/svntogit-packages/commit/c34ae40ea9e1dc243ebfc640a491681ff84af15d#diff-3e341d2d9c67be01819b25b25d5e53ea3cdf3a38d28846cda85a195eb9b7203a
[3] https://github.com/archlinux/svntogit-packages/commit/13a9fbbd85507bc3894277190b9de9f24b8bce47#diff-3e341d2d9c67be01819b25b25d5e53ea3cdf3a38d28846cda85a195eb9b7203a
[4] https://github.com/archlinux/svntogit-packages/commit/1b4a048efcdb771e431444ec6927c8a5d4982ded#diff-3e341d2d9c67be01819b25b25d5e53ea3cdf3a38d28846cda85a195eb9b7203a
[5] https://github.com/archlinux/svntogit-packages/commit/a123cfd55e608a31899b666a627028f968e89afd#diff-3e341d2d9c67be01819b25b25d5e53ea3cdf3a38d28846cda85a195eb9b7203a
Comment by Shivanshu Goyal (Haxxer64) - Monday, 13 December 2021, 07:15 GMT
@dvzrv, I have attached the libvirt XML.

I will try creating a new machine and see if this behavior repros, and get back to you.

Note that the firmware file path in my VM XML (/usr/share/edk2-ovmf/x64/OVMF_CODE.fd) is a valid file path. Is it possible that the nvram file (/var/lib/libvirt/qemu/nvram/win10-work_VARS.fd) is not compatible with the new firmware and needs to be recreated?
Comment by David Runge (dvzrv) - Monday, 13 December 2021, 07:32 GMT
> Is it possible that the nvram file (/var/lib/libvirt/qemu/nvram/win10-work_VARS.fd) is not compatible with the new firmware and needs to be recreated?

That is possible. Do note that changing the nvram file may have adverse effects on your guest OS. Please always make sure to have backups!
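
A minimal sketch of such a backup, to be run while the VM is shut down. The function name backup_nvram and the timestamp suffix are illustrative choices, not something libvirt provides.

```shell
# backup_nvram FILE
# Copy FILE to FILE.bak-<timestamp>, preserving mode and timestamps,
# and print the path of the backup on success.
backup_nvram() {
    src=$1
    dest="$src.bak-$(date +%Y%m%d-%H%M%S)"
    cp -p -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

For the file from this report that would be `backup_nvram /var/lib/libvirt/qemu/nvram/win10-work_VARS.fd`, most likely run as root since the directory is normally not writable by regular users.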
Comment by Sven-Hendrik Haase (Svenstaro) - Monday, 13 December 2021, 09:47 GMT
Just to note: edk2-ovmf 202111-4 appears to work with GPU passthrough for me and another person on the team (although we both have a dedicated GPU for just the VM).
Comment by Alexander Epaneshnikov (alex19EP) - Monday, 13 December 2021, 09:54 GMT
Also, can you try providing OVMF debug output?
To do that, change the opening tag of your libvirt XML to

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

and, just before the closing </domain> tag, add

<qemu:commandline>
  <qemu:arg value='-debugcon'/>
  <qemu:arg value='stdio'/>
  <qemu:arg value='-global'/>
  <qemu:arg value='isa-debugcon.iobase=0x402'/>
</qemu:commandline>

Then attach /var/log/libvirt/qemu/vmname.log
Comment by Shivanshu Goyal (Haxxer64) - Monday, 13 December 2021, 19:39 GMT
Alexander, I added the extra qemu arguments like you suggested, and ran the VM using both old and new packages to see the difference between the logs. And they both look identical. But I'm able to repro the problem consistently. If you have any more tips to get any more logs I'm happy to collect them.

Note that I rebooted my machine after upgrading and downgrading this package to absolutely make sure they get picked up by any running services (I wasn't sure if any services use this package in the background).

@Svenstaro, when you run your VM with GPU passthrough, do you see the TianoCore logo or does it boot right into Windows? One of my VMs with GPU passthrough works with the new package but the TianoCore logo doesn't show up. The VM which doesn't boot up has Bitlocker enabled and is probably using some UEFI function to ask for the Bitlocker recovery key and I'm guessing that's what's causing it to fail.
Comment by Shivanshu Goyal (Haxxer64) - Monday, 13 December 2021, 20:00 GMT
Since I have an easy repro, I'm happy to try to build this package at various commits and kind of do a binary search to try and find the exact commit which causes this break. Can you please give me some tips on what this process would look like? I'm not sure how to build this package from source, and the range of commits I should be looking at.
Comment by Shivanshu Goyal (Haxxer64) - Tuesday, 14 December 2021, 02:38 GMT
I figured out how to build this package from source and was able to narrow down the break to this commit:
https://github.com/archlinux/svntogit-packages/commit/c34ae40ea9e1dc243ebfc640a491681ff84af15d

This is the commit which does the big upgrade of edk2 to 202111. I will continue digging into this and report back. If you have any tips that will be appreciated :)
Comment by Alexander Epaneshnikov (alex19EP) - Tuesday, 14 December 2021, 03:57 GMT
> I figured out how to build this package from source and was able to narrow down the break to this commit:

Yep, that was expected. Can you try building the debug variant and provide logs? Maybe it will show something new.
To do that, change _build_type=RELEASE to _build_type=DEBUG in the ovmf PKGBUILD.
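
As a rough sketch, the switch could be scripted like this. The function name set_debug_build is hypothetical, and it assumes the variable appears at the start of a line exactly as quoted above.

```shell
# set_debug_build [PKGBUILD]
# Rewrite the _build_type assignment from RELEASE to DEBUG in place.
# Writes to a temporary file first so a failed sed leaves the original intact.
set_debug_build() {
    pkgbuild=${1:-PKGBUILD}
    sed 's/^_build_type=RELEASE/_build_type=DEBUG/' "$pkgbuild" > "$pkgbuild.tmp" &&
        mv -- "$pkgbuild.tmp" "$pkgbuild"
}
```

After calling `set_debug_build` in the directory containing the PKGBUILD, the package can be rebuilt as usual, for example with `makepkg -s`.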
Comment by Shivanshu Goyal (Haxxer64) - Tuesday, 14 December 2021, 04:17 GMT
Ah, the debug build produced more log messages. I've attached it, hope it helps.
Comment by Shivanshu Goyal (Haxxer64) - Tuesday, 14 December 2021, 04:33 GMT
I'm also attaching the log output produced by the good package (also built with the debug config) so you can compare the two files and see what stands out.
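
One possible way to do that comparison, sketched under the assumption that both logs are plain text files; the helper name lines_only_in is made up for illustration.

```shell
# lines_only_in A B
# Print every line of file A that does not appear verbatim anywhere in file B.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from B.
lines_only_in() {
    grep -Fxv -f "$2" -- "$1"
}
```

Calling it once in each direction shows the messages unique to each build. Lines that differ only in addresses or sizes will also show up, so some noise is to be expected; a plain `diff` of the two files is an alternative when line order matters.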
Comment by Alexander Epaneshnikov (alex19EP) - Tuesday, 14 December 2021, 05:39 GMT
Indeed, judging by the "-Graphics Console Started, Mode: 0" line,
this doesn't happen on the previous ovmf.

To understand whether it's an upstream regression or ours, you can update the working PKGBUILD to the latest ovmf. I think it should build.
To do that, just change pkgver.
Comment by Shivanshu Goyal (Haxxer64) - Tuesday, 14 December 2021, 07:11 GMT
Found the upstream commit which breaks my VM:
https://github.com/tianocore/edk2/commit/b8675deaa819631db2667df63f89799fe65fc906

All commits on master before this commit work successfully, and this one consistently fails.
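
The kind of search used to get here can be sketched as a plain binary search over an ordered commit list. Everything in the block is illustrative: first_bad_commit is a hypothetical helper, and the test command stands in for "build the firmware at this commit and check whether the VM shows a picture".

```shell
# first_bad_commit TEST_CMD COMMIT...
# COMMITs must be ordered oldest to newest; the first one is assumed to be
# known good and the last one known bad. TEST_CMD is called with a single
# commit and must return 0 if that commit is good, non-zero if it is bad.
# Prints the first bad commit.
first_bad_commit() {
    test_cmd=$1
    shift
    lo=1      # position of the newest commit known to be good
    hi=$#     # position of the oldest commit known to be bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "commit=\${$mid}"
        if "$test_cmd" "$commit"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    eval "printf '%s\n' \"\${$hi}\""
}
```

Inside a git checkout, `git bisect start`, `git bisect good` and `git bisect bad` do the same bookkeeping and pick the next commit to test, which is usually more convenient than maintaining the list by hand.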
Comment by Shivanshu Goyal (Haxxer64) - Tuesday, 14 December 2021, 08:08 GMT
I applied the reverse patch of the commit I mentioned above on top of tianocore/edk2 master branch and the VM boots up fine, which further proves that it's that particular commit which is bad. But it's not obvious why it's causing a failure.

I attached the log file when I ran the VM after applying the reverse patch on top of master
Comment by Alexander Epaneshnikov (alex19EP) - Tuesday, 14 December 2021, 15:01 GMT
> But it's not obvious why it's causing a failure.

For me too. I recommend contacting upstream about it.
The logs you have gathered for us should be enough for them, at least for a start.

I think this ticket can be closed as an upstream issue.
Comment by Shivanshu Goyal (Haxxer64) - Tuesday, 14 December 2021, 18:49 GMT
I am talking to the upstream developers about this issue now and have filed a bug report:
https://bugzilla.tianocore.org/show_bug.cgi?id=3771

Can we please keep this ticket open until we release a patched version with a fix? This new version will break many people's workflows. This way they'll know what's going on and that a fix is pending.
Comment by Sven-Hendrik Haase (Svenstaro) - Wednesday, 15 December 2021, 05:35 GMT
Yeah sounds fair, let's keep it open. We usually close things as upstream if there's nothing we can do about it (like the nvidia packages) but it's fine to track the upstream progress in this case.
Comment by David Runge (dvzrv) - Wednesday, 15 December 2021, 10:09 GMT
If there is a fix (e.g. revert) to apply that is ACK'ed by upstream, I'm happy to apply it.
Comment by Shivanshu Goyal (Haxxer64) - Wednesday, 15 December 2021, 13:06 GMT
This is the patch that the original author of the breaking change wants to submit to fix the bug:

https://github.com/stefanberger/edk2/commit/0207e3bb476b3efe12dee19688d7f035202cbc8d

They are still in the process of getting a code review done for it, so let's wait until this change gets approved.

I'll let you know if they do another patched release. If they do, then we can pick it up. Otherwise we can temporarily apply the patch ourselves until we pick up a newer release.
Comment by Shivanshu Goyal (Haxxer64) - Friday, 17 December 2021, 23:21 GMT
Comment by David Runge (dvzrv) - Saturday, 18 December 2021, 12:35 GMT
@Haxxer64: Please check whether 202111-5 in [testing] fixes your issue.
Comment by Shivanshu Goyal (Haxxer64) - Saturday, 18 December 2021, 19:15 GMT
@dvzrv: Yes, I just tried out that package and it fixes the GPU passthrough issue. Thanks!
Comment by David Runge (dvzrv) - Saturday, 18 December 2021, 19:24 GMT
@Haxxer64 thanks for the confirmation and the upstream communication. Much appreciated!
