FS#51455 - [linux] kernel NULL pointer dereference on AMD prevents boot

Attached to Project: Arch Linux
Opened by Pablo Lezaeta (Jristz) - Thursday, 20 October 2016, 06:13 GMT
Last edited by Tobias Powalowski (tpowa) - Monday, 24 July 2017, 12:22 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 3
Private No

Details

Description:
More exactly, the NULL pointer dereference prevents the kernel from finishing its initialization process. The bug is the same with and without ACPI and with or without kernel patch sets (ck, zen, grsec), for kernels from 4.7.4 onward (4.8 is still affected).

The bug triggers at boot; no further action is needed. It happens on a clean install, no matter whether it is the fallback or the stock one (or whether it has a patch set, as I mentioned).

I tested this booting in EFI mode through GRUB.

Since it happens so early, I don't know how to avoid losing the log, so I took one screenshot with "ignore_loglevel earlyprintk=efi,vga,keep" on the ck kernel and one on the stock Arch kernel.
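For anyone repeating the test, here is a minimal sketch of how the parameters passed to a kernel could be listed and compared between boot attempts. It assumes Python is available on a system that does boot (for example the LTS kernel). `parse_cmdline` is a hypothetical helper name, and the sample string is only an illustration; on a live system the text would be read from /proc/cmdline.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} mapping.

    Flags without '=' (such as ignore_loglevel) map to None.
    Quoted values containing spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    # Illustrative sample only; on a running system use:
    #   open("/proc/cmdline").read()
    sample = "ignore_loglevel earlyprintk=efi,keep acpi=off"
    for name, value in parse_cmdline(sample).items():
        print(name, value)
```

This only reports what was requested on the command line; it does not show whether the kernel honoured each option.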

The machine is a 64-bit Toshiba Satellite with an AMD E-300 APU with Radeon(tm) HD Graphics.


Additional info:
* linux 4.7.4 and newer (4.8 is still affected, but lts is unaffected)

Steps to reproduce:
Clean install and boot.

Closed by Tobias Powalowski (tpowa)
Monday, 24 July 2017, 12:22 GMT
Reason for closing: Fixed
Comment by Pablo Lezaeta (Jristz) - Thursday, 20 October 2016, 06:16 GMT
Attached, using acpi=off on the stock Arch kernel.

EDIT: Further testing shows that this affects only EFI boot.

In a BIOS boot, the step that should follow the "ACPI: 3 ACPI AML" part is "Security Framework initialized", but in EFI mode the bug triggers at that moment.
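Since the bug depends on the boot mode, it helps to record which mode a working boot actually used. A minimal sketch, relying on the kernel exposing the /sys/firmware/efi directory only when it was started in EFI mode. `booted_via_efi` and its `sysfs_root` parameter are hypothetical names; the parameter exists only so the check can be tried against a fake directory tree.

```python
import os


def booted_via_efi(sysfs_root="/sys"):
    """Return True if <sysfs_root>/firmware/efi exists as a directory.

    On Linux this directory is present when the kernel was booted in
    EFI mode and absent on a legacy BIOS boot.
    """
    return os.path.isdir(os.path.join(sysfs_root, "firmware", "efi"))


if __name__ == "__main__":
    print("EFI boot" if booted_via_efi() else "BIOS (legacy) boot")
```

Including this result next to the kernel version in each report would make it easier to compare the affected and unaffected setups.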
Comment by Sabaku no Kisuke (snkisuke) - Saturday, 22 October 2016, 21:07 GMT
Hello, I've had the same problem on my laptop ever since I updated the kernel to version 4.7.6, and with all newer kernels (I just updated to 4.8.3 and the problem persists).

My setup:
systemd-boot
AMD A10-5757M
Acer Aspire V5-552G-X414
Comment by Sabaku no Kisuke (snkisuke) - Wednesday, 02 November 2016, 00:46 GMT
On 4.8.6-1 the bug is still present.
Comment by Pablo Lezaeta (Jristz) - Wednesday, 02 November 2016, 16:17 GMT
Still present, with the NULL pointer dereference at the same point and with the same messages.
Comment by Sabaku no Kisuke (snkisuke) - Saturday, 12 November 2016, 17:52 GMT
On 4.8.7-1 the bug and the NULL pointer dereference are still present.
And on 4.8.13-1 too.
Comment by Pablo Lezaeta (Jristz) - Thursday, 17 November 2016, 19:05 GMT
Nothing new; the bug remains and happens in the same reproducible way and, as always, only in EFI mode.
I can't find the root cause, only that a change related to EFI and AMD is probably the issue, nothing beyond that.

Let's hope the next LTS is not affected by this.

Edit: 4.8.12 still has the bug, with the exact same output in the same situations.
Edit: 4.8.13 still buggy...
Edit: 4.8.15 still unbootable on EFI on AMD...
Comment by Pablo Lezaeta (Jristz) - Sunday, 01 January 2017, 06:16 GMT
New year, new everything: testing 4.9.0 STILL gave the same error.

@snkisuke what filesystem are you using, and did you manage to fix the problem?
Comment by Sabaku no Kisuke (snkisuke) - Sunday, 01 January 2017, 22:25 GMT
@Jristz for the EFI partition I use vfat and for /, f2fs. As far as I can tell, the problem has to do with the way the efivars are being initialized by the kernel. The only way I can use EFI boot right now is with the LTS kernel.
To boot the newest kernel, I use GRUB with Legacy BIOS enabled.
Comment by Sabaku no Kisuke (snkisuke) - Monday, 02 January 2017, 15:29 GMT
Comment by Pablo Lezaeta (Jristz) - Monday, 02 January 2017, 15:59 GMT
@snkisuke that is a possibility if ACPI is parsed before acpi=off takes effect; as I pointed out, I tested with acpi=off too.
Comment by Sabaku no Kisuke (snkisuke) - Monday, 02 January 2017, 17:01 GMT
Booting with acpi=off produces the same bug.
Comment by Pablo Lezaeta (Jristz) - Wednesday, 11 January 2017, 08:02 GMT
@snkisuke can you take a screenshot of the text before the "BUG: unable to handle NULL pointer dereference" part, inclusive? I want to check something.
Comment by Pablo Lezaeta (Jristz) - Monday, 16 January 2017, 00:23 GMT
I can confirm that the latest available 4.9.1 does NOT fix the issue... still the same at the same point in boot.
Comment by Sabaku no Kisuke (snkisuke) - Tuesday, 31 January 2017, 19:23 GMT
Jristz, the screenshots I sent are all I could get as output. I will try kernel 4.10 to see whether the bug is solved, because with the 4.9.x kernels the problem persists.
Comment by Pablo Lezaeta (Jristz) - Wednesday, 01 March 2017, 06:58 GMT
I tried 4.10.1 and it partially works:
When I boot with no extra command-line options it works, BUT if I add acpi=off it just hangs during the udev creation.

The relevant bits now are:

[code]
ACPI: Added _OSI(Module Device)
mar 01 03:52:08 localhost kernel: ACPI: Added _OSI(Processor Device)
mar 01 03:52:08 localhost kernel: ACPI: Added _OSI(3.0 _SCP Extensions)
mar 01 03:52:08 localhost kernel: ACPI: Added _OSI(Processor Aggregator Device)
mar 01 03:52:08 localhost kernel: ACPI: Executed 1 blocks of module-level executable AML code
mar 01 03:52:08 localhost kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
mar 01 03:52:08 localhost kernel: ACPI : EC: EC started
mar 01 03:52:08 localhost kernel: ACPI Error: [\_PR_.C002._PPC] Namespace lookup failure, AE_NOT_FOUND (20160930/psargs-359)
mar 01 03:52:08 localhost kernel: ACPI Error: Method parse/execution failed [\_SB.PCI0.LPC0.EC0._REG] (Node ffff8801028d3ed8), AE_NOT_FOUND (20160930/psparse-543)
mar 01 03:52:08 localhost kernel: ACPI: \_SB_.PCI0.LPC0.EC0_: Used as first EC
mar 01 03:52:08 localhost kernel: ACPI: \_SB_.PCI0.LPC0.EC0_: GPE=0x3, EC_CMD/EC_SC=0x66, EC_DATA=0x62
mar 01 03:52:08 localhost kernel: ACPI: \_SB_.PCI0.LPC0.EC0_: Used as boot DSDT EC to handle transactions
mar 01 03:52:08 localhost kernel: ACPI: Interpreter enabled
mar 01 03:52:08 localhost kernel: ACPI: (supports S0 S3 S4 S5)
mar 01 03:52:08 localhost kernel: ACPI: Using IOAPIC for interrupt routing
mar 01 03:52:08 localhost kernel: PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
mar 01 03:52:08 localhost kernel: platform wdat_wdt: failed to claim resource 1
mar 01 03:52:08 localhost kernel: ACPI: watchdog: Device creation failed: -16
mar 01 03:52:08 localhost kernel: ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
mar 01 03:52:08 localhost kernel: acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
mar 01 03:52:08 localhost kernel: acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
mar 01 03:52:08 localhost kernel: acpi PNP0A08:00: [Firmware Info]: MMCONFIG for domain 0000 [bus 00-3f] only partially covers this bridge
mar 01 03:52:08 localhost kernel: PCI host bridge to bus 0000:00
mar 01 03:52:08 localhost kernel: pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
[/code]

As you can see, there are problems with how the kernel resolves ACPI.
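To pick the failing lines out of a longer log, here is a minimal sketch that keeps only the "ACPI Error" entries and splits off the trailing source tag such as (20160930/psargs-359). The pattern is an assumption based only on the two error lines quoted above, and `acpi_errors` is a hypothetical helper name; other kernels may format these messages differently.

```python
import re

# Assumed shape, taken from the quoted log:
#   "... kernel: ACPI Error: <message> (<8 digits>/<file>-<line>)"
ACPI_ERROR = re.compile(
    r"ACPI Error: (?P<msg>.*?)\s*\((?P<src>\d{8}/[\w-]+)\)\s*$"
)


def acpi_errors(log_text):
    """Return (message, source_tag) tuples for each 'ACPI Error' line."""
    found = []
    for line in log_text.splitlines():
        match = ACPI_ERROR.search(line)
        if match:
            found.append((match.group("msg"), match.group("src")))
    return found


# Four lines copied from the log in this comment.
sample = r"""mar 01 03:52:08 localhost kernel: ACPI : EC: EC started
mar 01 03:52:08 localhost kernel: ACPI Error: [\_PR_.C002._PPC] Namespace lookup failure, AE_NOT_FOUND (20160930/psargs-359)
mar 01 03:52:08 localhost kernel: ACPI Error: Method parse/execution failed [\_SB.PCI0.LPC0.EC0._REG] (Node ffff8801028d3ed8), AE_NOT_FOUND (20160930/psparse-543)
mar 01 03:52:08 localhost kernel: ACPI: Interpreter enabled
"""

if __name__ == "__main__":
    for msg, src in acpi_errors(sample):
        print(src, "|", msg)
```

On the excerpt above this selects the two AE_NOT_FOUND entries, the _PPC namespace lookup and the EC0._REG method; the "[Firmware Bug]" and watchdog lines are deliberately not matched.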

BUT now linux-lts will be 4.9, which does NOT have the fix so far.

So the tl;dr is: the bug is present in the next LTS (4.9) and partially fixed in 4.10.
Comment by Pablo Lezaeta (Jristz) - Wednesday, 19 July 2017, 22:26 GMT
I no longer suffer from the problem, neither in linux, nor in linux-lts, nor in linux-zen.

It is safe to close this now.
