FS#65869 - [linux-hardened] Kernel panic caused by non-zeroed free pages

Attached to Project: Arch Linux
Opened by Filip Brygidyn (fbrygidyn) - Tuesday, 17 March 2020, 20:04 GMT
Last edited by freswa (frederik) - Sunday, 13 September 2020, 15:28 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Levente Polyak (anthraxx)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

The problem:
For some time now I have been experiencing kernel panics on the linux-hardened kernel.
The earliest captured traces I have are from version 5.4.7, and I can still reproduce the panic on the latest 5.5 git branch.

The details:
I built a 5.5 version with symbols; you can see a symbolized trace in symbolized_panic.txt
(built from this tree: https://github.com/anthraxx/linux-hardened/tree/b5d24fe9e7cf98359c2910e7444da3022983c3ed using the linux-hardened-git AUR package)

It essentially boils down to a check at https://github.com/anthraxx/linux-hardened/blob/b5d24fe9e7cf98359c2910e7444da3022983c3ed/mm/page_alloc.c#L2193
... which fails - the pages at this point are not always zeroed.

I added a crude patch to print more verbose information (0001-verbose.patch).
With it applied you can see the output in after_verbose_patch.txt.
Several pages appear to have one or two 64-byte blocks that are not zeroed.
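
To illustrate what "one or two 64-byte blocks that are not zeroed" means, here is a minimal userspace sketch that scans a page-sized binary dump for such blocks. It is not part of 0001-verbose.patch, which prints from inside the kernel. The helper name nonzero_blocks and the idea of a 4096-byte dump file are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper (not part of the attached patch): report which 64-byte
# blocks of a 4096-byte page dump contain at least one non-zero byte.
nonzero_blocks() {
    page_file=$1
    block=0
    while [ "$block" -lt 64 ]; do          # 4096 / 64 = 64 blocks per page
        # Read one block, delete NUL bytes, and count what is left.
        left=$(dd if="$page_file" bs=64 skip="$block" count=1 2>/dev/null \
               | tr -d '\000' | wc -c | tr -d ' ')
        if [ "$left" -gt 0 ]; then
            echo "block $block (offset $((block * 64))): $left non-zero byte(s)"
        fi
        block=$((block + 1))
    done
}
```

Calling nonzero_blocks on a fully zeroed dump prints nothing. A dump with stray data in a single cache-line-sized region prints one line naming that block and its byte offset.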

I also tried applying only this check on a vanilla kernel - same results.
The patch consisted of
- lines 2190-2194 from https://github.com/anthraxx/linux-hardened/blob/b5d24fe9e7cf98359c2910e7444da3022983c3ed/mm/page_alloc.c
- lines 218-223 from https://github.com/anthraxx/linux-hardened/blob/b5d24fe9e7cf98359c2910e7444da3022983c3ed/include/linux/highmem.h


I tested on both an existing Arch installation and a fresh, clean install to rule out any rogue service corrupting memory - if there is one, it is in some base package.



Possible explanation:

I see 3 things that can be happening here:
1. The assertion in the linux-hardened patches at https://github.com/anthraxx/linux-hardened/blob/b5d24fe9e7cf98359c2910e7444da3022983c3ed/mm/page_alloc.c#L2193 is wrong.
The logic seems to be correct. If the init-after-free flag is present then when we reclaim a page from a free list it should be zero-ed.
But maybe the 'free list' does not only contain pages that were previously 'freed' (and initialized) but also some other ones.
2. This is an upstream bug - as explained above, the vanilla kernel with just this check added resulted in the same panic. But this assumes that the check is valid.
3. This is some hardware/firmware-related bug - the 64 Byte blocks may be corrupted cache lines?

@anthraxx or someone: Please check if the assert at https://github.com/anthraxx/linux-hardened/blob/b5d24fe9e7cf98359c2910e7444da3022983c3ed/mm/page_alloc.c#L2193 is valid. I do not have enough knowledge about the mm subsystem to tell.


Test hardware:
I tested it on several configurations; the parts I swapped around were as follows:
CPUs: ryzen 2600, ryzen 2200g
Mobo: x470D4U, B450M steel legend
Ram: 4 sticks of ECC ram: 2x8GB + 2x16GB
SSD: intel 760p (nvme), samsung 860 evo (sata)
No gpu - all tested headless
PSU: supermicro 1200W rack unit, seasonic 360W gold unit
Cooling: wraith prism, wraith stealth

Traces gathered over a serial port

I mixed and matched those components and ruled out any single component. I also went to town in the BIOS: I tried disabling all features I could find, checked lower and higher memory frequencies in addition to stock, flashed old and new BIOSes, and tried different DIMM slots.

The only two common factors I cannot rule out are that all configurations used a 2nd-gen Ryzen on an ASRock board.


Steps to reproduce:
Install the linux-hardened package - I tried several versions and also compiled my own; all had the same problem.
I found two ways to reproduce:
1. Easiest: install the memtester package and, while booted into the linux-hardened kernel, run a shell loop like this (done with 16GB of RAM installed):

# Start memtester, give it 8 seconds to allocate and write, then kill it.
while true
do
    memtester 12G &
    sleep 8
    killall memtester
done

What it essentially does is give memtester enough time to allocate 12G and write something. The panic occurs during the allocation part. Sometimes the panic happens right away, and sometimes it can take a few minutes. Doing something in the background seems to help - for example, launching a kernel compilation in parallel.

2. Just reboot repeatedly - eventually the boot will fail - you can see some of the failed boot traces in the random_panics.txt. Sometimes it happens right away and sometimes I could reboot for 2 hours without any issues.






Closed by  freswa (frederik)
Sunday, 13 September 2020, 15:28 GMT
Reason for closing:  Upstream
Additional comments about closing:  https://bugzilla.kernel.org/show_bug.cgi?id=206963
Comment by Filip Brygidyn (fbrygidyn) - Tuesday, 17 March 2020, 20:37 GMT
When looking at the random_panics.txt you will also find call traces that seem to come from other sources. This would point to an upstream bug. But for now I was unable to get any panics on a non-hardened kernel.

If you have any suggestions about kernel config options or run parameters then please let me know - reproducing a crash on a clean upstream vanilla kernel would rule out hardening patches and I could go to an upstream bug tracker with this.
Comment by Filip Brygidyn (fbrygidyn) - Wednesday, 18 March 2020, 06:49 GMT
The patch with verbose output would not compile - I took it from a wrong tree. Here is a working one.

Also: what do I have to do to be able to edit my own task description/attachments? I would like to fix grammar mistakes and replace the broken patch.
Comment by Levente Polyak (anthraxx) - Wednesday, 18 March 2020, 08:41 GMT
Thanks for the verbose report, details and willingness to debug this :)

The easiest way would be to find the faulty vanilla patch that behaves incorrectly. To do this easily, you seem to be aware of a range of versions where this was good and where it started to behave badly. Could you try a git bisect between both versions to figure out the vanilla linux commit that introduced this regression?
Comment by Filip Brygidyn (fbrygidyn) - Wednesday, 18 March 2020, 08:57 GMT
I only know that all linux-hardened versions I tried reproduced a panic.
And that all non-hardened linux versions worked without a panic.

So I do not have any hardened-linux version that worked.


What I can try for now is to apply a minimal patch:
- lines 2190-2194 from https://github.com/anthraxx/linux-hardened/blob/b5d24fe9e7cf98359c2910e7444da3022983c3ed/mm/page_alloc.c
- lines 218-223 from https://github.com/anthraxx/linux-hardened/blob/b5d24fe9e7cf98359c2910e7444da3022983c3ed/include/linux/highmem.h

on top of a random old vanilla kernel and check whether I can reproduce a panic. If I cannot, I will bisect from there.

Edit: I can start doing it in about 6-7 hours, after I finish work. I am also wondering whether an old kernel, for example 4.19 or 3.14, will boot at all on a Ryzen system.
Comment by Levente Polyak (anthraxx) - Wednesday, 18 March 2020, 09:03 GMT
I see, your description indicated that you were using hardened before it started to panic. Thanks for trying to help find the root cause. Reproducing this with the 'mm: add support for verifying page sanitization' patch and finding a working vanilla variant would be a good first step towards bisecting it.
Comment by Filip Brygidyn (fbrygidyn) - Wednesday, 18 March 2020, 09:09 GMT
In that case I am sorry - I was not clear.
I did use the linux-hardened package for about half a year with few to no crashes. But from the beginning I always had random failed boots and freezes. Back then I didn't know what the cause was. Now, after enabling a serial console, I can see what those problems were.

Only after I started stressing the system was I able to reproduce panics more frequently.
Comment by Filip Brygidyn (fbrygidyn) - Wednesday, 18 March 2020, 17:41 GMT
Instead of patching vanilla I just built an older linux-hardened package. After all I am not really sure if taking just a few lines and applying them on a vanilla kernel would work as expected.

Just now I finished checking the 4.19.17: https://git.archlinux.org/svntogit/packages.git/commit/trunk?h=packages/linux-hardened&id=3cd727170e52501509ac75aaab5d01493ea53a3e
and the panic log is attached.

Will try 4.15.18 now ( https://git.archlinux.org/svntogit/packages.git/commit/trunk?h=packages/linux-hardened&id=8784547f9b40bc1a9dc1a56250a88c6588f7b983 )
Comment by Filip Brygidyn (fbrygidyn) - Wednesday, 18 March 2020, 19:06 GMT
The 4.15, 4.16 and 4.17 linux-hardened packages do not compile - most likely gcc is too recent.

4.18 crashes the same way.

I also see on github that there is a 4.14 version but I cannot find a PKGBUILD/config for it. It was most likely back when linux-hardened was in AUR.
@anthraxx Do you know where I can find those old versions? If I am going to fix the broken compilation, I would like to go as far back as I can.
Comment by Levente Polyak (anthraxx) - Wednesday, 18 March 2020, 19:29 GMT
it was in the community repository before i have moved it to extra, an example pick from 4.13:
https://git.archlinux.org/svntogit/community.git/tree/linux-hardened/trunk?id=a623982bcdcf0cae6e841a90120bb705f7ec1deb

you could clone the tree and see how to reach the objects, it's not a valid ref/branch anymore as it has been deleted in community.

You may indeed need to downgrade gcc as well in some virtual machine or such :/
Comment by Filip Brygidyn (fbrygidyn) - Thursday, 19 March 2020, 22:55 GMT
I took an old linux-lts package, replaced the kernel.org links with a zipped 4.14 hardened tree and copied the config from the old 4.14.17 linux-hardened. After hitting return a few times to set missing config options, it somehow finished building with the current gcc. (config in the attached tar.gz)
I tried building older gcc but god it's infuriating - AUR has __a_lot__ of old gcc versions and pretty much nothing older than gcc8 works.


Anyway: 4.14 panicked the same way.

So at this point I do not think I can go any farther back in kernel versions.
@anthraxx Can you tell me if the check at https://github.com/anthraxx/linux-hardened/blob/b5d24fe9e7cf98359c2910e7444da3022983c3ed/mm/page_alloc.c#L2193 is valid?
Or point to someone who I could ask to verify this? I know it would be more than weird to find out that this check was invalid for years without causing issues but it just seems suspicious to me.

Another theory I have could be a problem with ECC ram on ryzen (no official validation from AMD) - disabling ECC in BIOS didn't help but maybe the support of ECC dimms in both ECC and non-ECC modes is somehow broken/doing something unexpected.
Unfortunately I do not have any non-ECC ddr4 sticks ATM. Will try to get some.


Just to make sure I will check on a second system again ('B450m steel legend + 2200g + 8GB stick' instead of 'x470d4u + 2600 + 16GB stick')
Comment by Levente Polyak (anthraxx) - Thursday, 19 March 2020, 23:11 GMT
The checks are definitely valid, the area is not supposed to contain any undesired bytes at that point. This must be some driver/module or similar doing something faulty like a write-after-free.
Comment by Thibaut Sautereau (thithib) - Friday, 20 March 2020, 14:13 GMT
Hi Filip, have you tried running with KASAN enabled? It might help us catch the potential write-after-free causing this issue. Thanks a lot for your commitment ;)
Comment by Filip Brygidyn (fbrygidyn) - Sunday, 22 March 2020, 11:58 GMT
Hello,
I compiled a KASAN version and did more testing. Now I see 3 separate issues.

To make this all organized:
All the testing I did was on a linux-hardened 5.5.10.a-1:
https://git.archlinux.org/svntogit/packages.git/tree/trunk?id=67b915f73578de3d6874df3cf674404423619db2
with config file modified to enable KASAN and debug info.
You can find a script I used in attached kernel_build_script.tar.gz
This should be a corresponding kernel tree: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/?h=v5.5.10&id=7ee76f1601f39ab3941c8b1c9a19dfc58f7cea47

And also: I noticed my previous logs were not wrapping lines - so longer lines were being cut. I changed the minicom config and logs attached this time seem to be full.



First I booted the mentioned kernel on a second machine (B450M steel legend + 2200G + 8G ecc stick + 860 EVO)
Was able to reproduce the crash as expected but before I even logged in to launch memtester - KASAN logged some errors. It doesn't seem like a source of the original issue but well... It did report use-after-free related to amdgpu module (This board/cpu combo did not allow me to disable IGPU)
You can find the logs (raw and symbolized) in 2nd_machine_panic.tar.gz This may be related to https://bugs.archlinux.org/task/59463

Not all KASAN call traces were symbolized by the 'decode_stacktrace.sh' script. I do not know why. If you know how I can fix this, then let me know.



Anyway, after that I went back to the 1st machine (X470D4U + 2600 + 16G ecc stick + 760p), booted the same kernel and got a panic as well. Logs (raw and symbolized) are in 1st_machine_panic.tar.gz
No KASAN logs of any kind.

But then I think I found a way to stop the panics:
I tried booting with mem_encrypt=off and after _a_lot_ of reboots and memtester launches I was not able to reproduce a panic. Removing mem_encrypt=off option resulted in easy reproduction after at most few reboots/memtester launches.

***Edit: The thing about BIOS option is incorrect - it disables TSME, not SME***
Now: Maybe you noticed that in the bug description I mentioned disabling all the BIOS features I could find. This included SME. But it turns out that this option does not do anything on my X470D4U board (I did not check the B450M yet). Even with SME disabled in the BIOS I could see "AMD Secure Memory Encryption (SME) active" in the logs.
***end edit***
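
The two checks used here (was mem_encrypt passed on the kernel command line, and does the kernel log show the SME banner) can be sketched as small shell helpers. This is only a sketch: the function names are hypothetical, and the banner text is copied from this report and may be worded differently in other kernel versions.

```shell
#!/bin/sh
# Print the last mem_encrypt= value found on a kernel command line,
# or "unset" if the parameter is absent. Defaults to the running kernel.
mem_encrypt_setting() {
    cmdline_file=${1:-/proc/cmdline}
    value=$(tr ' ' '\n' < "$cmdline_file" | sed -n 's/^mem_encrypt=//p' | tail -n 1)
    echo "${value:-unset}"
}

# Succeed if the kernel log on stdin contains the SME banner quoted in this
# report. The exact wording is an assumption based on the report and may
# differ between kernel versions.
sme_banner_present() {
    grep -qF 'AMD Secure Memory Encryption (SME) active'
}
```

For example, running mem_encrypt_setting with no argument reads /proc/cmdline, and "dmesg | sme_banner_present && echo active" checks the current boot (reading dmesg may require root).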

And there is one more thing that kinda points to SME:
When I look at the non-zeroed regions I dumped into after_verbose_patch.txt they do seem random. Maybe a coincidence but maybe they were encrypted/decrypted by a broken SME.



If it is indeed SME, then I should be able to reproduce some panic on a linux-lts package with mem_encrypt=on. The configs of both the linux and linux-lts packages have SME disabled by default.
from linux-lts config:
CONFIG_AMD_MEM_ENCRYPT=y
# CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
Without the additional checks of linux-hardened I do not yet know how hard it could be to reproduce.
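
The distinction between "SME compiled in" and "SME active by default" can be read off a kernel config file. The sketch below uses only the two option names shown in the excerpt above. The function name sme_config_state and its output strings are hypothetical.

```shell
#!/bin/sh
# Classify SME support from a plain-text kernel config file, for example the
# package's "config" file or an unpacked copy of /proc/config.gz.
sme_config_state() {
    config_file=$1
    if ! grep -q '^CONFIG_AMD_MEM_ENCRYPT=y$' "$config_file"; then
        echo "not built in"
    elif grep -q '^CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y$' "$config_file"; then
        echo "built in, active by default"
    else
        echo "built in, inactive unless mem_encrypt=on"
    fi
}
```

With the linux-lts excerpt above as input, this reports that SME is built in but stays inactive unless mem_encrypt=on is passed, which is why that parameter is needed for the linux-lts test.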


TL;DR:
I see 3 problems now:
1. KASAN points to a seemingly unrelated use-after-free in amdgpu

2. ***Edit: this is incorrect, see the remark above*** Asrock BIOS is broken - SME disable switch doesn't disable SME
3. The original one - Those panics look like a broken SME - disabling it with mem_encrypt=off seems to have helped for now
Comment by Filip Brygidyn (fbrygidyn) - Tuesday, 24 March 2020, 20:25 GMT
I was unable to reproduce any issue on an unmodified lts kernel with mem_encrypt=on (looped reboots+memtester for more than 10 hours). It's by no means a definitive test, but I am no longer willing to continue.

On the other hand the lts kernel _with a check for zeroed pages_ added behaves the same way as hardened - with mem_encrypt=on I can reproduce the issue easily and with mem_encrypt=off I can't.
At this point I think this is no longer a linux-hardened issue, so I am planning to open an upstream bug on bugzilla.kernel.org (probably this weekend or when I have time)

For reference I am attaching a few things:
- lts+checks_build_script.tar.gz - contains a script that I used for building linux-lts package. It also includes a minimal zero-check patch from linux-hardened and config flags modifications needed to reproduce.
- lts+checks_log.tar.gz - lts kernel logs with non zeroed pages (raw and symbolized)

If you have anything else you would like me to try/check then please let me know
Comment by Thibaut Sautereau (thithib) - Tuesday, 24 March 2020, 20:56 GMT
Thank you Filip. It's sadly not the first time people complain about the combination of linux-hardened and AMD Secure Memory Encryption. Take a look at this, for instance: https://github.com/anthraxx/linux-hardened/issues/16
Comment by Levente Polyak (anthraxx) - Tuesday, 24 March 2020, 21:28 GMT
Thanks Filip, truly good debugging skills and conclusions, please try to report this somehow upstream.

Either way I think we should give up on amd's mem_encrypt, it's poorly engineered with an incomplete and borked ecosystem around it.
Comment by Filip Brygidyn (fbrygidyn) - Wednesday, 25 March 2020, 19:26 GMT
Comment by loqs (loqs) - Thursday, 26 March 2020, 21:34 GMT
@fbrygidyn if you do not receive a response on the bugzilla I would suggest adding the author of kernel SME support
Tom Lendacky <thomas.lendacky@amd.com>

You could also try the kernel mailing list:
linux-kernel@vger.kernel.org (open list:X86 MM)

Individuals you might want to cc on the list:
Dave Hansen <dave.hansen@linux.intel.com> (maintainer:X86 MM)
Andy Lutomirski <luto@kernel.org> (maintainer:X86 MM)
Peter Zijlstra <peterz@infradead.org> (maintainer:X86 MM)
Thomas Gleixner <tglx@linutronix.de> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Ingo Molnar <mingo@redhat.com> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Borislav Petkov <bp@alien8.de> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
"H. Peter Anvin" <hpa@zytor.com> (reviewer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
