FS#73504 - [linux] __dma_async_device_channel_unregister called while 2 clients hold a reference

Attached to Project: Arch Linux
Opened by Leon Möller (jkhsjdhjs) - Tuesday, 25 January 2022, 22:15 GMT
Last edited by Toolybird (Toolybird) - Thursday, 14 September 2023, 03:48 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
When shutting down my system a kernel trace is printed to the framebuffer in __dma_async_device_channel_unregister: https://elixir.bootlin.com/linux/latest/source/drivers/dma/dmaengine.c#L1106
I had to record a 240fps video in order to be able to read it, the relevant section of the video is attached. Start the video with `mpv --pause` and navigate through it frame by frame with , and .
I took a screenshot of the only frame the trace can be seen fully and attached it as well.

The call trace is:
<TASK>
dma_async_device_unregister
ccp_dmaengine_unregister
ccp5_destroy
sp_destroy
pci_device_shutdown
kernel_power_off
__do_sys_reboot
do_syscall
? do_syscall_64
? __s64_sys_rt_sigprocmask
entry_SYSCALL_64_after_hwframe

Since it happens during the poweroff of AMDs cryptography coprocessor, I think that it might be related to the UEFI firmware update I did a week ago: https://www.gigabyte.com/Motherboard/X570-AORUS-ELITE-rev-10/support#support-dl-bios
The traces also started appearing around that time.
I already tried disabling the AMD fTMP that is now enabled by default, but that didn't make a difference.
I also tried to halt the system instead of powering it off, in order to take a better picture of the trace, but with halting the trace still vanishes as quickly as it did while powering off or rebooting.

Additional info:
* package version(s) linux 5.16.2 (also happens with 5.16.1 and 5.15.10)
This task depends upon

Closed by  Toolybird (Toolybird)
Thursday, 14 September 2023, 03:48 GMT
Reason for closing:  Upstream
Additional comments about closing:  Clearly not an Arch packaging issue. If still happening, please continue to liaise with upstream to try and get to the bottom of it.
Comment by Leon Möller (jkhsjdhjs) - Tuesday, 25 January 2022, 23:28 GMT
I downgraded my UEFI to the F35 and the trace still appears. So the UEFI firmware wasn't related to this issue.
Another thing I also changed around a week ago was to use ldmtool to share a raid 1 between windows and linux. To do so, `ldmtool create all` is used to make the filesystems available on linux. I noticed that the trace doesn't appear when not running `ldmtool create all` at all. Even when running `ldmtool remove all` before shutting down, the trace is printed.

Since this turned out to be an issue of ldmtool, please move this bug report to the ldmtool package. I also reported this issue upstream: https://github.com/mdbooth/libldm/issues/21
Comment by Vince (vinicole) - Saturday, 29 January 2022, 14:06 GMT
I noticed exactly the same issue on my computer. My config is:
Asus Prime TRX40-PRO S (latest BIOS version: 1502)
AMD Ryzen Threadripper 3990x
Linux Debian (I tried the following kernel versions: 4.19, 5.10 and 5.15)

I don't use ldmtool, so this is not the cause of the issue in my opinion (at least on my system).
I can reproduce the issue with kernel version 5.10 and 5.15 (when rebooting or powering off my system). However I never experienced it with version 4.19 (or if it occurs, the warning is not printed on my display because for some reason with kernel version 4.19, my display remains blank during the shutdown process)
From what I could read on https://lkml.org/lkml/2020/1/21/1093, it seems that dma_async_device_channel_unregister was added in recent kernel versions, this would explain why the issue doesn't occur with kernel 4.19.

I am actually suspecting my SAS/SATA controller (https://www.intel.com/content/www/us/en/products/sku/97264/intel-raid-controller-rs3uc080j/specifications.html) and its kernel driver (mpt3sas) because I could not reproduce the issue using a virtual machine (which does not rely on the SAS/SATA controller) with the same software configuration.

Would you use any SAS/SATA controller in your system?

Comment by Leon Möller (jkhsjdhjs) - Saturday, 29 January 2022, 21:25 GMT
Thanks for commenting on this issue and confirming it! I don't have any SAS/SATA controller in my system, but since I have this issue with ldmtool, a tool for windows LDM volumes, and you have this issue with your SAS/SATA controller, this issue may be related to RAID on linux or at least some feature that both RAID types use. The current maintainer of libldm also said that it must be a kernel issue, as libldm only creates device-mapper mappings without any kernel component.
Comment by Vince (vinicole) - Sunday, 30 January 2022, 14:17 GMT
Thanks for your comment, this seems indeed to rule out my SAS/SATA Controller as being the root cause of the issue.
I use LVM on top of RAID linux (mdadm), so your assumption might be correct.

But as I stated in my previous comment, if I run a virtual machine with LVM and mdadm, I don't see the issue.
So there should be something specific in the configuration that makes it happen or not.
The virtual machine (not showing the issue) runs the following modules (see below lsmod output filtered on words "raid" and "scsi"):
async_memcpy 20480 2 raid456,async_raid6_recov
async_pq 20480 2 raid456,async_raid6_recov
async_raid6_recov 24576 1 raid456
async_tx 20480 5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
async_xor 20480 3 async_pq,raid456,async_raid6_recov
libcrc32c 16384 1 raid456
md_mod 180224 7 raid1,raid10,raid0,linear,raid456,multipath
mptbase 77824 2 mptspi,mptscsih
mptscsih 32768 1 mptspi
raid0 24576 0
raid10 65536 1
raid1 53248 0
raid456 180224 0
raid6_pq 122880 3 async_pq,raid456,async_raid6_recov
scsi_mod 262144 7 mptspi,sd_mod,scsi_transport_spi,mptscsih,libata,sg,sr_mod
scsi_transport_spi 40960 1 mptspi

While my computer (showing the issue) runs those:
async_memcpy 20480 2 raid456,async_raid6_recov
async_pq 20480 2 raid456,async_raid6_recov
async_raid6_recov 24576 1 raid456
async_tx 20480 5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
async_xor 20480 3 async_pq,raid456,async_raid6_recov
libcrc32c 16384 1 raid456
md_mod 188416 18 raid1,raid10,raid0,linear,raid456,multipath
raid0 24576 0
raid10 65536 1
raid1 53248 2
raid456 180224 0
raid6_pq 122880 3 async_pq,raid456,async_raid6_recov
raid_class 16384 1 mpt3sas
scsi_common 16384 6 scsi_mod,usb_storage,uas,libata,sg,mpt3sas
scsi_mod 270336 8 scsi_transport_sas,sd_mod,raid_class,usb_storage,uas,libata,sg,mpt3sas
scsi_transport_sas 49152 1 mpt3sas

Should this issue be reported to https://bugzilla.kernel.org?
Who will do this? (I am not that familiar with such bug reporting)
Comment by Leon Möller (jkhsjdhjs) - Monday, 31 January 2022, 01:16 GMT
Because this happens during ccp_destroy() (https://elixir.bootlin.com/linux/v5.16.4/source/drivers/crypto/ccp/ccp-dev-v3.c#L517), I suspect it may be related to the CCP (AMD Cryptographic Coprocessor) driver. This would explain why it doesn't occur in your VM, as you're using a virtual CPU there. While this may not actually be an issue with the CCP driver, I think we should at least ask about this issue there, maybe they can give us a pointer. https://bugzilla.kernel.org is often not the correct place to report issues with the kernel, as stated here: https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html The maintainers of the CCP driver and the respective mailing list are listed here: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/MAINTAINERS?h=v5.16.4#n894

So as far as I understood it, the correct way would be to write a mail to both maintainers with the mailing list in the CC field. I also checked the linux-crypto mailing list archives already (https://lore.kernel.org/linux-crypto/), but couldn't find any existing reports for our problem there. To verify that this isn't an issue with any external kernel modules we or our distributions are loading, we have to compile the latest mainline kernel and check whether the issue persists. It's rather unlikely that it has now been suddenly fixed as the issue persisted for at least 6 minor kernel versions, but this would be the correct way. I have time to test this tomorrow, otherwise you can also test it if you want, or you can also just write the mail directly. If you do decide to write a mail, please also put me (jkhsjdhjs@totally.rip) in the CC field.
Comment by Vince (vinicole) - Thursday, 03 February 2022, 21:29 GMT
Thanks for your email, it helped me to make some progress.

Assuming that the problem comes from CCP, I tried to disable this module in kernel version 5.10. (since my VM runs without it, my computer should be able to do the same)
However, for some unknown reasons, I was not able to blacklist this module. The only way I found to get rid of it was to remove the ccp.ko in the kernel files. I restarted the computer and I observed that the module was not loaded. Then I rebooted again and could not see the error messages again about dma_async_device_unregister and ccp_destroy. So CCP seems to be faulty.

Nevertheless, ccp module also runs on my 4.19 kernel and doesn't cause such any issue. Besides, both kernels (4.19 and 5.1) runs similar ccp module version (1.1).
By looking at the kernel code, the issue seems to come from the CCP DMA engine unregistration and when running modinfo, I noticed that the ccp module has a dmaengine parameter in kernel 5.10 which was not present in 4.19.

So I added ccp.dmaengine=0 as a kernel parameter in my /boot/grub/grub.cfg and could not reproduce the issue anymore. The ccp module is still properly loaded.
It also worked with kernel version 5.15.

I don't know exactly the impact of not registering the CCP to the DMA engine but at least, the error message doesn't show up.
I will drop an email to the CCP maintainers to know more regarding this.
Comment by Vince (vinicole) - Friday, 04 February 2022, 07:15 GMT
Comment by Vince (vinicole) - Friday, 04 February 2022, 07:16 GMT
Comment by Leon Möller (jkhsjdhjs) - Sunday, 06 February 2022, 05:10 GMT
Nice workaround! I don't actually think it's the CCP module that is faulty. I think the CCP module just provides DMA (direct memory access) functionality to other kernel modules, and since "many hardware systems use DMA, including disk drive controllers" [1], maybe raid modules like md do too. This is why I think some module fails to unregister their DMA channel correctly, which the CCP module tells us with the warning.

I still think we should message the CCP maintainers, maybe they can tell us which module is at fault or maybe I'm wrong and it's actually a CCP bug.
[1] https://en.wikipedia.org/wiki/Direct_memory_access
Comment by Vince (vinicole) - Sunday, 06 February 2022, 21:52 GMT
I agree. It would also be interesting to know whether this workaround has any impact on the system (performance, etc.)
For info, I found an additional CCP maintainer email address using modinfo command:
sudo modinfo ccp
filename: /lib/modules/5.10.0-11-amd64/kernel/drivers/crypto/ccp/ccp.ko
description: AMD Secure Processor driver
version: 1.1.0
license: GPL
author: Gary R Hook <gary.hook@amd.com>
author: Tom Lendacky <thomas.lendacky@amd.com>

So you may want to add Gary R Hook to the recipient list of your email.
Comment by Leon Möller (jkhsjdhjs) - Monday, 07 February 2022, 13:45 GMT
I now did an `lsmod > file` before and after I activated my LDM raid. Then I did a diff on the files with the following result:

1a2,12
> raid1 61440 1
> dm_raid 53248 1
> raid456 188416 1 dm_raid
> md_mod 204800 3 raid1,dm_raid,raid456
> async_raid6_recov 24576 1 raid456
> async_memcpy 20480 2 raid456,async_raid6_recov
> async_pq 20480 2 raid456,async_raid6_recov
> async_xor 20480 3 async_pq,raid456,async_raid6_recov
> async_tx 20480 5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
> xor 24576 1 async_xor
> raid6_pq 122880 3 async_pq,raid456,async_raid6_recov
34c45
< libcrc32c 16384 3 nf_conntrack,nf_nat,nf_tables
---
> libcrc32c 16384 4 nf_conntrack,nf_nat,nf_tables,raid456
46c57
< amdgpu 8716288 30
---
> amdgpu 8716288 32
111c122
< dm_mod 184320 17 dm_crypt
---
> dm_mod 184320 25 dm_raid,dm_crypt
119c130
< ccp 131072 1 kvm_amd
---
> ccp 131072 7 kvm_amd

The amdgpu line can be ignored. Interesting is, that ccp is loaded anyways. So ccp definitely isn't solely responsible for the warning on shutdown.
The following modules are loaded on raid activation:
raid1
dm_raid
raid456
md_mod
async_raid6_recov
async_memcpy
async_pq
async_xor
async_tx
xor
raid6_pq

So the error is probably somewhere within one of those.
Comment by Leon Möller (jkhsjdhjs) - Monday, 14 February 2022, 19:54 GMT
I asked the AMD CCP maintainers about this issue, let's see what they say: https://marc.info/?l=linux-crypto-vger&m=164468679317868

Loading...