Arch Linux

Please read this before reporting a bug:
https://wiki.archlinux.org/title/Bug_reporting_guidelines

Do NOT report bugs when a package is just outdated, or it is in the AUR. Use the 'flag out of date' link on the package page, or the Mailing List.

REPEAT: Do NOT report bugs for outdated packages!
Tasklist

FS#50785 - [linux] Testing kernel 4.7.3-2 sometimes reboots during resume from successful hibernation (STD)

Attached to Project: Arch Linux
Opened by Aaron (a3n) - Thursday, 15 September 2016, 06:32 GMT
Last edited by Doug Newgard (Scimmia) - Tuesday, 17 October 2017, 13:28 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Architecture x86_64
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description: Occasionally, after resuming (pressing the power on button) after a successful hibernation (STD):
* the resume starts but does not complete,
* instead the system reboots while resume is in progress,
* and rather than getting my pre-hibernation session, I have to log in from scratch.
* Hibernated data is lost.

This does NOT involve black screens, inactive mice, hanging, unsuccessful poweroff and immediate reboot, or other inoperability, which is what most searches turn up. The system works fine, just occasionally I'll get a reboot during resume and lose hibernated data.

Attached files: (with clumsy analasys in the reproduction section below)
* 473FailureLog.txt, journal of boot session with six successful hibernations, and the failed one that reboots.
* 473RebootAfterFailureLog.txt, journal of boot session after the spontaneous reboot.
* netconsole.txt, the netcat capture file from an external laptop monitoring netconsole output.

In the netconsole.txt capture file I noticed this:
# grep "PM: Wrote" netconsole.txt
[ 1018.058475] PM: Wrote 998548 kbytes in 7.23 seconds (138.11 MB/s)
[ 1678.715363] PM: Wrote 1164040 kbytes in 8.33 seconds (139.74 MB/s)
[ 2338.250132] PM: Wrote 1151388 kbytes in 7.49 seconds (153.72 MB/s)
[ 2999.451110] PM: Wrote 1233736 kbytes in 8.75 seconds (140.99 MB/s)
[ 3659.781817] PM: Wrote 1284228 kbytes in 9.47 seconds (135.61 MB/s)
[ 4319.373342] PM: Wrote 1270104 kbytes in 9.29 seconds (136.71 MB/s)
[ 4981.954589] PM: Wrote 1511712 kbytes in 11.98 seconds (126.18 MB/s)

The size of the image written increases (but not every time), and the time it takes to write increases every time.
While this was happening I was just reading the web.
Don't know if that's a clue or not.

I have seen this behavior in:
* testing kernel 4.7.3-2 (this report)
* core kernel 4.7.2-1
* extra kernel linux-zen 4.7.2-1

It's hard to prove a negative, but I have not seen this behavior in:
* core kernel linux-lts 4.4.20-1

I've searched this bug list, Arch forums, kernel.org bug list, and the WWW.
The Arch forum has one report of similar behavior, but no solved post and no indication of activity.

Most search hits elsewhere are either concluded with "my bad!" or the behavior ends in some kind of inoperability.
The only thing happening in my report is reboot during resume, and loss of hibernated data; the system works fine otherwise.

Additional info:
* Laptop: oldish HP Pavilion G7
* BIOS/MBR
* 6G RAM, 12G swap volume
* Single 750G disk, single partition
* Entire partition is LVM on LUKS (FDE)
* 2 LVM volumes: root and swap
* /boot is inside the root volume, no separate boot partition

# lsblk -f
NAME FSTYPE LABEL UUID MOUNTPOINT
sda
└─sda1 crypto_LUKS 486d0088-c99e-46f7-b4a7-b8b06f68eaff
└─VolGrp LVM2_member zKBHeW-fzai-yVzR-fBuR-vkSg-hR3P-lZ58QI
├─VolGrp-swap swap 5f24cce5-8861-4e22-9d63-5f946dbde79f [SWAP]
└─VolGrp-root ext4 cede4248-86ab-4a8a-9476-7b9c34e65b6e /
sr0

# dmesg
[ 0.000000] microcode: microcode updated early to revision 0x29, date = 2013-06-12
[ 0.000000] Linux version 4.7.3-2-ARCH (builduser@tobias) (gcc version 6.2.1 20160830 (GCC) ) #1 SMP PREEMPT Thu Sep 8 09:44:02 CEST 2016
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=/dev/mapper/VolGrp-root rw cryptdevice=/dev/sda1:VolGrp resume=/dev/mapper/VolGrp-swap ignore_loglevel initcall_debug no_console_suspend

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz
Stepping: 7
CPU MHz: 818.560
CPU max MHz: 2300.0000
CPU min MHz: 800.0000
BogoMIPS: 4591.93
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts

# lsinitcpio -c /boot/initramfs-linux.img |egrep -ve "^#|^$"
MODULES=""
BINARIES=""
FILES="/crypto_keyfile.bin"
HOOKS="base udev autodetect modconf block encrypt lvm2 resume filesystems keyboard fsck"

# egrep -ve "^#|^$" /etc/default/grub
GRUB_ENABLE_CRYPTODISK=y
GRUB_DEFAULT=0
GRUB_TIMEOUT=4
GRUB_DISTRIBUTOR="Arch"
GRUB_CMDLINE_LINUX_DEFAULT="ignore_loglevel initcall_debug no_console_suspend"
GRUB_CMDLINE_LINUX="cryptdevice=/dev/sda1:VolGrp resume=/dev/mapper/VolGrp-swap"
GRUB_PRELOAD_MODULES="lvm part_gpt part_msdos"
GRUB_TERMINAL_INPUT=console
GRUB_GFXMODE=auto
GRUB_GFXPAYLOAD_LINUX=keep
GRUB_DISABLE_RECOVERY=true

Steps to reproduce:

Hibernate over and over until you get a reboot during resume.

netconsole (serial console over ethernet emulation) is enabled as a module, to capture messages on another laptop.
Since it's a module, it won't turn on the network, so it has to be started after the network.
I have not tried compiling it in to get earlier messages.

# cat /etc/systemd/system/netconsole.service
[Unit]
Description=kernel kmsg to network
Description=netconsole
Requires=network.target
After=network.target

[Service]
ExecStart=-/usr/bin/modprobe netconsole

[Install]
WantedBy=multi-user.target

# cat /etc/modprobe.d/netconsole.conf
options netconsole netconsole=6665@192.168.5.100/eno1,6666@192.168.5.104/64:31:50:5b:7d:b2

I set up a systemd user timer to hibernate the system every ten minutes, since I don't know how to otherwise trigger this behavior.
I enable and start the timer, then wait for the bug to occur.
After each hibernate's poweroff, I start the system and input my passphrase.
After the bug occurs and I've logged in, I stop and disable the timer, read the logs and think.
Thinking hasn't paid off yet.

% ls -l ~/.config/systemd/user
total 8
-rw-r--r-- 1 aaron aaron 155 Sep 2 15:49 hiboften.service
-rw-r--r-- 1 aaron aaron 139 Sep 8 23:56 hiboften.timer

% cat ~/.config/systemd/user/hiboften.timer
[Unit]
Description=hiboften.timer

[Timer]
OnActiveSec=10min
OnUnitActiveSec=10min
Unit=hiboften.service

[Install]
WantedBy=timers.target

% cat ~/.config/systemd/user/hiboften.service
[Unit]
Description=hiboften.service

[Service]
Type=simple
ExecStartPre=/usr/bin/logger "Starting hiboften.service"
ExecStart=/usr/bin/systemctl hibernate

Besides doing other searches, this file is grep input to show the hibernation cycles and the failure in the system's journal:

# cat searchHibFail.grep
save_microcode_in_initrd
PM: Hibernation image not present or could not be loaded.
Starting Reboot...
initcall init_netconsole
kernel: Command line: BOOT_IMAGE=
kernel: Linux version
Starting hiboften.service...
-- reboot
PM: Hibernation mode set to 'platform'
PM: Creating hibernation image:
PM: image restored successfully
: link up

Here I search the journal, using the grep input file.
I show from just before the last on-purpose commanded reboot,
all the hibernations including the bug,
and the reboot after the bug.
I see nothing clueful.

There are six successful hibernations, followed by a failure and reboot.
Note that just because you don't see hibernation image creation and restoration in the failed hibernation doesn't prove they didn't happen, they just may not have made it into the log. When I was running the core 4.7.2-1 kernel, these were reported as happening.

# journalctl --no-pager |egrep -if searchHibFail.grep
...
Sep 14 17:20:16 thing1 systemd[1]: Starting Reboot...
-- Reboot --
Sep 14 17:22:16 thing1 kernel: Linux version 4.7.3-2-ARCH (builduser@tobias) (gcc version 6.2.1 20160830 (GCC) ) #1 SMP PREEMPT Thu Sep 8 09:44:02 CEST 2016
Sep 14 17:22:16 thing1 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=/dev/mapper/VolGrp-root rw cryptdevice=/dev/sda1:VolGrp resume=/dev/mapper/VolGrp-swap ignore_loglevel initcall_debug no_console_suspend
Sep 14 17:22:16 thing1 kernel: calling save_microcode_in_initrd+0x0/0x36 @ 1
Sep 14 17:22:16 thing1 kernel: initcall save_microcode_in_initrd+0x0/0x36 returned 0 after 0 usecs
Sep 14 17:22:17 thing1 kernel: PM: Hibernation image not present or could not be loaded.
Sep 14 17:22:17 thing1 kernel: PM: Hibernation image not present or could not be loaded.
Sep 14 17:22:32 thing1 kernel: r8169 0000:02:00.0 eno1: link up
Sep 14 17:22:38 thing1 kernel: initcall init_netconsole+0x0/0x1000 [netconsole] returned 0 after 30472 usecs
Sep 14 17:38:24 thing1 systemd[482]: Starting hiboften.service...
Sep 14 17:38:24 thing1 kernel: PM: Hibernation mode set to 'platform'
Sep 14 17:40:27 thing1 kernel: PM: Creating hibernation image:
Sep 14 17:40:27 thing1 kernel: PM: Image restored successfully.
Sep 14 17:40:28 thing1 kernel: r8169 0000:02:00.0 eno1: link up
Sep 14 17:51:23 thing1 systemd[482]: Starting hiboften.service...
Sep 14 17:51:23 thing1 kernel: PM: Hibernation mode set to 'platform'
Sep 14 17:55:36 thing1 kernel: PM: Creating hibernation image:
Sep 14 17:55:36 thing1 kernel: PM: Image restored successfully.
Sep 14 17:55:37 thing1 kernel: r8169 0000:02:00.0 eno1: link up
Sep 14 18:06:33 thing1 systemd[482]: Starting hiboften.service...
Sep 14 18:06:33 thing1 kernel: PM: Hibernation mode set to 'platform'
Sep 14 18:08:29 thing1 kernel: PM: Creating hibernation image:
Sep 14 18:08:29 thing1 kernel: PM: Image restored successfully.
Sep 14 18:08:30 thing1 kernel: r8169 0000:02:00.0 eno1: link up
Sep 14 18:19:25 thing1 systemd[482]: Starting hiboften.service...
Sep 14 18:19:25 thing1 kernel: PM: Hibernation mode set to 'platform'
Sep 14 18:21:14 thing1 kernel: PM: Creating hibernation image:
Sep 14 18:21:14 thing1 kernel: PM: Image restored successfully.
Sep 14 18:21:14 thing1 kernel: r8169 0000:02:00.0 eno1: link up
Sep 14 18:32:10 thing1 systemd[482]: Starting hiboften.service...
Sep 14 18:32:10 thing1 kernel: PM: Hibernation mode set to 'platform'
Sep 14 18:34:32 thing1 kernel: PM: Creating hibernation image:
Sep 14 18:34:32 thing1 kernel: PM: Image restored successfully.
Sep 14 18:34:33 thing1 kernel: r8169 0000:02:00.0 eno1: link up
Sep 14 18:45:28 thing1 systemd[482]: Starting hiboften.service...
Sep 14 18:45:28 thing1 kernel: PM: Hibernation mode set to 'platform'
Sep 14 18:47:53 thing1 kernel: PM: Creating hibernation image:
Sep 14 18:47:53 thing1 kernel: PM: Image restored successfully.
Sep 14 18:47:54 thing1 kernel: r8169 0000:02:00.0 eno1: link up
Sep 14 18:58:49 thing1 systemd[482]: Starting hiboften.service...
Sep 14 18:58:49 thing1 kernel: PM: Hibernation mode set to 'platform'
-- Reboot --
Sep 14 19:02:23 thing1 kernel: Linux version 4.7.3-2-ARCH (builduser@tobias) (gcc version 6.2.1 20160830 (GCC) ) #1 SMP PREEMPT Thu Sep 8 09:44:02 CEST 2016
Sep 14 19:02:23 thing1 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=/dev/mapper/VolGrp-root rw cryptdevice=/dev/sda1:VolGrp resume=/dev/mapper/VolGrp-swap ignore_loglevel initcall_debug no_console_suspend
Sep 14 19:02:23 thing1 kernel: calling save_microcode_in_initrd+0x0/0x36 @ 1
Sep 14 19:02:23 thing1 kernel: initcall save_microcode_in_initrd+0x0/0x36 returned 0 after 0 usecs
Sep 14 19:02:23 thing1 kernel: PM: Hibernation image not present or could not be loaded.
Sep 14 19:02:23 thing1 kernel: PM: Hibernation image not present or could not be loaded.
Sep 14 19:02:39 thing1 kernel: r8169 0000:02:00.0 eno1: link up
Sep 14 19:02:46 thing1 kernel: initcall init_netconsole+0x0/0x1000 [netconsole] returned 0 after 28778 usecs

On the external laptop I was running the Arch installer .iso, which conveniently has netcat installed.
I captured the file, on the external laptop, like so:

# nc -u -l -p 6666 |tee netconsole.txt

On the external laptop that receives netconsole messages, here is the last successful hibernation, followed by the failed hibernation. Extra debugging messages were filtered out; see the attached netconsole.txt file for full messages.

[ 3660.002152] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 3652.068733] r8169 0000:02:00.0 eno1: link up

[ 4307.109118] PM: Hibernation mode set to 'platform'
[ 4307.319193] PM: Syncing filesystems ... done.
[ 4307.690014] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 4307.694411] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
[ 4307.697563] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
[ 4307.700342] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
[ 4307.702827] PM: Marking nosave pages: [mem 0xace3f000-0xacffefff]
[ 4307.705275] PM: Marking nosave pages: [mem 0xad000000-0xffffffff]
[ 4307.710041] PM: Basic memory bitmaps created
[ 4307.712551] PM: Preallocating image memory... done (allocated 318017 pages)
[ 4308.053875] PM: Allocated 1272068 kbytes in 0.34 seconds (3741.37 MB/s)
[ 4308.054992] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 4311.814841] r8169 0000:02:00.0 eno1: link up
[ 4311.873217] PM: Image saving progress: 60%
[ 4312.159023] PM: Image saving progress: 70%
[ 4312.494732] PM: Image saving progress: 80%
[ 4312.794182] PM: Image saving progress: 90%
[ 4313.004730] PM: Image saving progress: 100%
[ 4319.370904] PM: Image saving done.
[ 4319.373342] PM: Wrote 1270104 kbytes in 9.29 seconds (136.71 MB/s)
[ 4319.376336] PM: S|
[ 4319.579868] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 4319.580036] sd 0:0:0:0: [sda] Stopping disk
[ 4311.808394] r8169 0000:02:00.0 eno1: link up

[ 4967.042073] PM: Hibernation mode set to 'platform'
[ 4967.256155] PM: Syncing filesystems ... done.
[ 4967.525124] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 4967.529389] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
[ 4967.531627] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
[ 4967.533638] PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
[ 4967.535627] PM: Marking nosave pages: [mem 0xace3f000-0xacffefff]
[ 4967.537623] PM: Marking nosave pages: [mem 0xad000000-0xffffffff]
[ 4967.541421] PM: Basic memory bitmaps created
[ 4967.543314] PM: Preallocating image memory... done (allocated 378172 pages)
[ 4967.889844] PM: Allocated 1512688 kbytes in 0.34 seconds (4449.08 MB/s)
[ 4967.890974] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 4971.522655] r8169 0000:02:00.0 eno1: link up
[ 4971.568146] PM: Image saving progress: 40%
[ 4971.845575] PM: Image saving progress: 50%
[ 4972.205031] PM: Image saving progress: 60%
[ 4972.549352] PM: Image saving progress: 70%
[ 4972.952281] PM: Image saving progress: 80%
[ 4973.361944] PM: Image saving progress: 90%
[ 4973.615032] PM: Image saving progress: 100%
[ 4981.952255] PM: Image saving done.
[ 4981.954589] PM: Wrote 1511712 kbytes in 11.98 seconds (126.18 MB/s)
[ 4981.958070] PM: S|
[ 4982.136313] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 4982.136476] sd 0:0:0:0: [sda] Stopping disk

That's all I got.
This task depends upon

Closed by  Doug Newgard (Scimmia)
Tuesday, 17 October 2017, 13:28 GMT
Reason for closing:  No response
Comment by Doug Newgard (Scimmia) - Monday, 19 September 2016, 13:06 GMT
Is this on a thin volume?
Comment by Aaron (a3n) - Monday, 19 September 2016, 16:03 GMT
No, not a thin volume.

I also don't think it's related to lvm or luks. I reinstalled just a bare, non-encrypted, non-lvm system, with no X. Just what you get after the vanilla installation.

I then ran the following script from the root (the only) account. I had to turn on network wake in the bios settings.

for i in $(seq 100); do
rtcwake –m disk –s 10
sleep 5
done

After 30 something hibernations and resumes, I got the failed to resume, reboot instead. Unfortunately that laptop is in the shop (failed fan bearings, it's an old laptop) so I don't have the logs.

I got the script from here: https://01.org/blogs/rzhang/2015/best-practice-debug-linux-suspend/hibernate-issues
Comment by Aaron (a3n) - Monday, 19 September 2016, 16:30 GMT
Also, by the time I ran the script on the bare system, 4.7.3-2 had moved out of testing and into core. In case that matters.
Comment by mattia (nTia89) - Tuesday, 03 October 2017, 20:06 GMT
is this issue still valid?

Loading...