FS#59300 - [linux-hardened] 4.17.x kernel panic on logout

Attached to Project: Arch Linux
Opened by tom (archtom) - Wednesday, 11 July 2018, 15:27 GMT
Last edited by Levente Polyak (anthraxx) - Monday, 06 August 2018, 10:54 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Levente Polyak (anthraxx)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
A kernel panic occurs not on every logout, but at least on every third logout from openbox using openbox --exit. Logging out via loginctl terminate-session $XDG_SESSION_ID produces the same error. I don't know exactly which kernel version this started with, but I tried a few older ones and all show the same error. The regular kernel in the latest version, 4.17.5-1, is fine; the error does not occur there.

I attached a picture, as the log does not contain any information about this after reboot.

Thanks for looking into it in advance.

Kind regards

Closed by Levente Polyak (anthraxx)
Monday, 06 August 2018, 10:54 GMT
Reason for closing: Fixed
Comment by tom (archtom) - Wednesday, 11 July 2018, 15:29 GMT
here comes the picture...
Comment by Levente Polyak (anthraxx) - Wednesday, 11 July 2018, 15:40 GMT
This is the result of CONFIG_DEBUG_LIST=y
To debug this issue, please compile the regular vanilla kernel from git (non-hardened, as upstream only accepts such reports) with CONFIG_DEBUG_LIST=y, try to bisect the issue, and report the root of the problem to the upstream kernel.
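
For readers unfamiliar with bisecting, here is a minimal sketch of the idea in Python. It is an illustration only, not the actual `git bisect` tooling: the commit names, the `first_bad` helper and the `boots_and_panics` predicate are hypothetical stand-ins for "build that kernel, boot it, and see whether the panic shows up".

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first version for which is_bad() is true.

    Assumes versions are ordered oldest to newest, versions[0] is known
    good, versions[-1] is known bad, and that once the bug appears it
    stays in every later version.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid   # bug already present: the culprit is here or earlier
        else:
            good = mid  # still fine: the culprit is later
    return versions[bad]


# Hypothetical history; the names are placeholders, not real commits.
history = [f"commit-{i:02d}" for i in range(16)]
culprit = "commit-11"  # pretend the regression was introduced here

tested = []


def boots_and_panics(version: str) -> bool:
    # Stand-in for: build this kernel, boot it, log out several times.
    tested.append(version)
    return version >= culprit


print(first_bad(history, boots_and_panics))
print(len(tested), "kernels built and tested")
```

Each step halves the remaining range, so the number of kernels to build grows only with the logarithm of the number of candidate commits. Because the panic reported here does not occur on every logout, a kernel should only be marked good after several clean attempts.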
Comment by tom (archtom) - Wednesday, 11 July 2018, 16:37 GMT
Thanks for the fast reply. I'm sorry, I just don't know how to do any of this: neither building the kernel with the changed value, nor debugging the "root" of the problem.

I will gladly help solve this, but I would need a detailed step-by-step manual. Perhaps it's faster to try it yourself than to write the manual. Sorry, and thanks for further help in solving this.
Comment by loqs (loqs) - Wednesday, 11 July 2018, 19:08 GMT
@archtom does something like the attached help?
Comment by tom (archtom) - Thursday, 12 July 2018, 06:18 GMT
Thanks a lot for the input. I don't know exactly with which kernel version the issue actually started. Isn't it easier to downgrade via the downgrade command to the latest 4.16 hardened kernel and see whether it really does not have the issue?

Is this procedure unsafe in any way, and can everything be deleted safely afterwards? Usually I would try this in my VirtualBox VM, but the error does not occur there. I don't feel good doing all this on the production machine. It would mean causing a lot of kernel panics during testing on the system that holds all our data.

I would take the time, and I have already learned a lot from looking over your input, but I don't want to mess around with the production system, especially as my knowledge of kernel internals is not that good. Is there another way?

If not, it would be really nice if you could do the debugging. Thanks a lot.
Comment by loqs (loqs) - Thursday, 12 July 2018, 11:13 GMT
Yes, please test linux-hardened 4.16.16.a-1 first. Yes, everything can be deleted afterwards. I cannot offer any guarantees, so you should always have a full backup.
You would need to at least build an unpatched kernel with CONFIG_DEBUG_LIST=y and reproduce the issue on it before upstream would accept the report.
Comment by tom (archtom) - Thursday, 12 July 2018, 19:39 GMT
I just had a talk with my CEO and he forbade doing the "experimental" build and debugging on the production system. I'm sorry. I hope you can find and debug the cause of the problem and that it can be fixed soon.

As feedback: the kernel panic also occurs sometimes on shutdown and reboot with the hardened kernel.

I will try hardened 4.16.16.a-1 tomorrow morning with the downgrade command and report back, before I have to switch to the "regular" version of the kernel for now. I will certainly try any possibly fixed version of the hardened kernel and gladly report back. Sorry I cannot be of more assistance with this issue, as I cannot test it in VirtualBox.

Thanks for maintaining the kernel and for all the help.

Comment by Levente Polyak (anthraxx) - Thursday, 12 July 2018, 21:38 GMT
The thing is, there is no point in keeping this ticket open, as it's a deferred issue. The only reason to keep this ticket open would be to coordinate and help debug this issue. If nobody who encounters this problem tries to debug it, I'm simply going to close the ticket.

The other thing is, whether you or your CEO like it or not, the BUG aka kernel oops is triggered via DEBUG_LIST and BUG_ON_DATA_CORRUPTION and makes the kernel halt... but the linked list corruption itself is, quite frankly, still there with the regular vanilla kernel as well (which is bad!). You really want to debug and report the cause of the corrupted linked list, as something related to your hardware/driver/environment definitely corrupts it. Just closing your eyes won't magically fix the corruption; it could still potentially eat your kittens.
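
To illustrate the point, here is a toy userspace sketch in Python of the kind of consistency check a debug-enabled list removal performs. It is not the kernel's code (which is written in C and differs in detail); the `Node`, `insert_after` and `checked_remove` names are made up for this example. The idea is only that the corruption exists either way, and the check merely makes it visible.

```python
class Node:
    """Entry in a circular doubly linked list (userspace toy, not kernel code)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.prev = self
        self.next = self


def insert_after(head: Node, new: Node) -> None:
    """Link new into the list directly after head."""
    new.prev, new.next = head, head.next
    head.next.prev = new
    head.next = new


def checked_remove(entry: Node) -> None:
    """Unlink entry, but refuse if its neighbours no longer point back at it."""
    if entry.prev.next is not entry or entry.next.prev is not entry:
        raise RuntimeError(f"list corruption detected at {entry.name}")
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    entry.prev = entry.next = None  # poison the removed entry


head, a, b = Node("head"), Node("a"), Node("b")
insert_after(head, a)
insert_after(a, b)          # list is now head -> a -> b -> head

checked_remove(a)           # consistent list: removal succeeds
insert_after(head, a)       # back to head -> a -> b -> head

a.next = head               # simulate a stray write that skips b
try:
    checked_remove(b)       # b.prev.next is no longer b, so the check fires
except RuntimeError as err:
    print(err)
```

Without the check, the removal would silently proceed on inconsistent pointers, which is the situation described above for a kernel built without these options.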

It's understandable that you don't want to toy with a production system, but maybe you can get a comparable environment with as similar hardware components as possible?
Comment by tom (archtom) - Friday, 13 July 2018, 09:39 GMT
I completely understand what you are saying. Unfortunately, I do not have another Arch Linux system for testing besides the VirtualBox VM.

I took some time (and took over the company production system ;)) for a while and tried the precompiled hardened versions.

It turns out that the bug / bad commit must be somewhere between 4.15.18.a-1 and 4.16.5.a-1. Sorry, this is the best and most detailed result I could come up with, as there are no precompiled versions of the 4.16.x series below .5.

I hope it helps in any way and gives you a chance to go after it.

Thanks in advance
Comment by tom (archtom) - Friday, 13 July 2018, 09:42 GMT
I also tried the latest kernel, linux-hardened 4.17.6.a-1, and the error / kernel message seems a little different. Picture attached.
Comment by Levente Polyak (anthraxx) - Friday, 13 July 2018, 09:51 GMT
It doesn't help, sorry.
So either you can convince your CEO to get an equivalent testing environment so we can track this down, or there is no point.
Using an old kernel leaves the system potentially exposed to security issues, and using the regular kernel will just avoid the panic/BUG; the list will still be corrupted internally.
Comment by tom (archtom) - Sunday, 15 July 2018, 12:52 GMT
I'm sorry, we don't have the means to set up a second, equivalent test environment. I will try every new hardened kernel that comes out and report back here when it is solved... Perhaps you can come up with another solution in the meantime...

Thanks for all the help, and sorry that I cannot contribute more to solving this. Have a great Sunday.
Comment by tom (archtom) - Monday, 06 August 2018, 09:33 GMT
I was out of the office for a few days and did not try all the versions in between, but at least in linux-hardened 4.17.12.a-1 the problem seems to be solved.

Thanks for all the help and for maintaining the kernel, very much appreciated.
Comment by Levente Polyak (anthraxx) - Monday, 06 August 2018, 10:54 GMT
That's good to hear, happy that the problem got fixed in the upstream kernel :)
Thanks a lot for giving feedback.
Cheers