FS#42692 - [linux] random freeze during boot with 3.17

Attached to Project: Arch Linux
Opened by patrick (potomac) - Wednesday, 05 November 2014, 18:26 GMT
Last edited by freswa (frederik) - Saturday, 26 September 2020, 22:56 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:

I get a random freeze roughly once every 5~10 boots with systemd 216-3 ( I also tested systemd 217-5 from testing, it's the same problem ),

the freeze can occur shortly after the kernel is loaded, I can see these messages on screen and then nothing more :

:: running early hook [udev]
:: running hook [udev]
:: Triggering uevents...

sometimes the freeze happens a few seconds after systemd starts ( after the message "mount /home" for example )

5 minutes after the freeze systemd displays this message :

task systemd-udevd:236 blocked for more than 120 seconds

and systemd is still unable to boot, so I have to do a reset of my PC,

I tested also with kernel 3.18rc3 and the bug is still here, random freeze/stop at boot with systemd

my configuration :
archlinux 64 bits
cpu pentium dual core E6800 3.33 Ghz
ati radeon HD4650 PCIe ( radeon open source driver, KMS early start )

I tried to disable "kernel mode setting" but the bug is still here,

I created a bug report on the systemd website but I have had no answer from the developers,

I also created a bug report in the kernel's bugzilla but again I have had no answer,

I took a screenshot when systemd displays the error message after a freeze

Additional info:
* package version(s) systemd 216-3 ( same problem with systemd 217-5 ), kernel 3.17-2 ( tested also with kernel 3.18rc3 )
* config and/or log files etc.


Steps to reproduce:
- start your PC
- sometimes a random freeze can occur at boot ( every 5~10 boots, it's hard to trigger the bug )
- 5 minutes after the freeze systemd displays this error message :
task systemd-udevd:236 blocked for more than 120 seconds

Closed by  freswa (frederik)
Saturday, 26 September 2020, 22:56 GMT
Reason for closing:  No response
Comment by Dave Reisner (falconindy) - Wednesday, 05 November 2014, 18:29 GMT
Definitely not a systemd bug.

 FS#42509 ? Same results with kernel 3.16.x?
Comment by patrick (potomac) - Wednesday, 05 November 2014, 18:36 GMT
I have noticed this bug since the release of kernel 3.17-1,

it's not like the bug FS#42509 because I don't use EFI, my motherboard only supports "bios" ( it's an old motherboard ),

it could be a kernel bug, but it may also be a problem with systemd if systemd 216-3 is not fully compatible with the new features of kernel 3.17.x

Comment by Dave Reisner (falconindy) - Wednesday, 05 November 2014, 18:44 GMT
> it could be a kernel bug, but it may also be a problem with systemd if systemd 216-3 is not fully compatible with the new features of kernel 3.17.x
No, that's not how kernel development works. Newer kernels *must* work with older userspace, or else it's considered a regression in the kernel. The opposite does not hold true, but that's a "feature" of being a userspace developer: newer userspace may not work properly with older kernels.
Comment by patrick (potomac) - Wednesday, 05 November 2014, 18:51 GMT
so what is the cause of my problem ?

the kernel 3.17 ?

Comment by Dave Reisner (falconindy) - Wednesday, 05 November 2014, 18:57 GMT
If upgrading from 3.16 to 3.17 made this problem appear, I'd say that's a good indicator of a problem in 3.17.
Comment by patrick (potomac) - Wednesday, 05 November 2014, 19:06 GMT
ok, so you could edit the title of my bug report if kernel 3.17 is the culprit,

the problem also occurs with kernel 3.18,

I tried to disable intel-ucode in grub --> same problem, so it's not a problem with intel-ucode/microcode

Comment by patrick (potomac) - Thursday, 06 November 2014, 07:05 GMT
when the freeze occurs I can read these error messages :

:: running hook [udev]
:: Triggering uevents...

worker [53] /devices/pci0000:00/0000:00:1f.2/ata8/host7/target7:0:1/7:0:1:0 is taking a long time
worker [60] /devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:0/6:0:0:0/block/sr0 is taking a long time

https://bugzilla.kernel.org/attachment.cgi?id=156871
Comment by patrick (potomac) - Thursday, 06 November 2014, 20:28 GMT
another user has the same problem :

https://bbs.archlinux.org/viewtopic.php?pid=1473209#p1473209

it seems to be a race condition during boot that can trigger a random hang
Comment by patrick (potomac) - Sunday, 09 November 2014, 12:44 GMT
I'm doing a git bisect between kernel 3.16.7 ( which doesn't have the bug ) and kernel 3.17 ( which has the bug ), I hope I will find the commit that introduced the bug
Comment by patrick (potomac) - Monday, 10 November 2014, 22:54 GMT
I found the commit that introduced the bug after doing a git bisect,

the first bad commit is :

first bad commit: [045065d8a300a37218c548e9aa7becd581c6a0e8] [SCSI] fix qemu boot hang problem

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=045065d8a300a37218c548e9aa7becd581c6a0e8

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9c44392..ce62e87 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue *q)
 	blk_requeue_request(q, req);
 	atomic_dec(&sdev->device_busy);
 out_delay:
-	if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
+	if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }
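
To make the one-character change in this commit easier to see, here is a minimal standalone sketch of the two conditions. It is not kernel code: the helper names delay_before_commit and delay_after_commit are made up for illustration, and plain values stand in for atomic_read(&sdev->device_busy) and scsi_device_blocked(sdev).

```c
#include <stdbool.h>

/*
 * Standalone model of the out_delay check in scsi_request_fn().
 * device_busy stands in for atomic_read(&sdev->device_busy),
 * blocked stands in for scsi_device_blocked(sdev).
 * Both helpers are hypothetical and exist only for this illustration.
 */

/* Condition before the commit: true only while commands are in flight. */
bool delay_before_commit(int device_busy, bool blocked)
{
	return device_busy && !blocked;
}

/* Condition after the commit: true only when no command is in flight. */
bool delay_after_commit(int device_busy, bool blocked)
{
	return !device_busy && !blocked;
}
```

For an unblocked device the two conditions are exact opposites: with device_busy == 0 only the post-commit check reaches blk_delay_queue(), with device_busy != 0 only the pre-commit check does. A blocked device never reaches it in either version.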
Comment by patrick (potomac) - Tuesday, 11 November 2014, 02:14 GMT
I created a patch that solves the bug, this patch reverts the faulty commit 045065d8a300a37218c548e9aa7becd581c6a0e8,

I rebuilt kernel 3.17.2 with this patch and now the bug is gone, no problems with my PC
Comment by patrick (potomac) - Sunday, 16 November 2014, 18:14 GMT


there is another way to solve my bug, with this patch :

--- a/drivers/scsi/scsi_lib.c 2014-10-05 21:23:04.000000000 +0200
+++ b/drivers/scsi/scsi_lib.c 2014-11-16 17:39:16.819674725 +0100
@@ -1776,7 +1776,7 @@ static void scsi_request_fn(struct reque
 	atomic_dec(&sdev->device_busy);
 out_delay:
 	if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
-		blk_delay_queue(q, SCSI_QUEUE_DELAY);
+		blk_delay_queue(q, 40);
 }

 static inline int prep_to_mq(int ret)

it gives a little more time ( 40 ms instead of 3 ms, which is the default value for SCSI_QUEUE_DELAY )
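
As a rough sense of scale for the 3 ms versus 40 ms values, the sketch below converts a millisecond delay into whole timer ticks, rounding up. It is an illustration only, not kernel code: ms_to_ticks is a made-up helper, it assumes the delay is rounded up to whole ticks, and the tick rate passed in is an assumed example value because the real one depends on the kernel configuration.

```c
/*
 * Hypothetical helper: convert a delay in milliseconds to whole timer
 * ticks, rounding up, for a tick rate of hz ticks per second.
 * This is a simplified model, not the kernel's own conversion routine.
 */
unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
	return (ms * hz + 999) / 1000;
}
```

Under these assumptions, with a 300 Hz tick a 3 ms delay rounds up to a single tick, while 40 ms spans 12 ticks.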
Comment by patrick (potomac) - Wednesday, 19 November 2014, 20:24 GMT
I solved the mystery,

I found that the bug ( random hang at boot with
kernel 3.17 and 3.18 ) is triggered by the combination of 3 elements :

- the use of a SATA DVD burner ( Liteon iHAS124 C ) on an ICH7 SATA controller
- the use of a Gigabyte motherboard GA-P31-DSL3 ( BIOS F10A, ICH7
controller, Intel P31 chipset )
- commit 74665016086615bbaa3fa6f83af410a0a4e029ee ( scsi: convert
host_busy to atomic_t )

If I connect this SATA DVD burner and a SATA hard disk to the SATA
ports of the motherboard then the bug occurs ( but only on kernels
3.17 and 3.18, there are no problems with older kernels, and no
problems with Windows 7 )

If I disconnect the SATA DVD burner then the bug is gone, no problems
with kernels 3.17 and 3.18,

And if I connect the SATA DVD burner to my JMicron SATA/IDE PCIe card
then there is no problem either, this is a perfect workaround for my
problem, because I can use kernel 3.17/3.18 without problems with this
configuration.

But I don't know which element I should blame : my Gigabyte motherboard
( faulty BIOS ? ) or the use of "atomic_t" in the SCSI source code ( an
inappropriate way to handle SATA devices that breaks compatibility with
some PC configurations ? )
Comment by patrick (potomac) - Friday, 21 November 2014, 16:57 GMT
a test patch made by a Linux developer that solves the problem
Comment by Eli Schwartz (eschwartz) - Monday, 20 August 2018, 19:21 GMT
Does this issue still exist with a modern kernel? Did this mysterious developer have a name, and did they try pushing the patch into mainline? It is definitely not merged...
Comment by patrick (potomac) - Tuesday, 21 August 2018, 03:23 GMT
I don't know if this issue still exists with a recent Linux kernel, because I use a workaround ( my DVD drive is connected to a PCIe SATA controller card ), and I now use a different motherboard,

with kernel 3.17 the bug is triggered if a slow SATA device ( like a DVD drive ) is connected to an ICH7 SATA controller, I don't know if this problem is solved in a recent Linux kernel
Comment by Stefan Schick (pommes_) - Saturday, 03 August 2019, 11:26 GMT
I contacted the other people affected by the issue (according to the forum thread) in the hope that someone is able to test and confirm the status of the ticket. Is anyone else able to confirm or deny that the issue is fixed?
Upstream Ticket:
https://bugzilla.kernel.org/show_bug.cgi?id=87581
Comment by Daniel Harrysson (nadley) - Saturday, 03 August 2019, 19:23 GMT
I was affected by this bug and am still using the same hardware but have not encountered the problem for years. I currently use it with an up-to-date Arch system with kernel 5.2.5 and systemd 242.84-1. I cannot remember when the problem went away, but at least for me it is not reproducible any more.
Comment by patrick (potomac) - Saturday, 03 August 2019, 22:00 GMT
@Stefan : my advice as a workaround is to connect your SATA devices to a different combination of SATA ports,

for me, I solved the problem by using a PCIe SATA controller card and connecting my SATA DVD burner to it, I think the bug is triggered by slow SATA devices like a DVD burner when they are connected to SATA ports located on the motherboard ( for some models, like Gigabyte boards with P31 or P35 chipsets )
