FS#21558 - [kernel26] 2.6.36 BUG: scheduling while atomic

Attached to Project: Arch Linux
Opened by Mathias Burén (fackamato) - Monday, 01 November 2010, 12:47 GMT
Last edited by Allan McRae (Allan) - Saturday, 16 April 2011, 09:18 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Tobias Powalowski (tpowa), Thomas Bächler (brain0)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:
After upgrading to 2.6.36 from testing, I receive a lot of errors during boot (and while running):
BUG: scheduling while atomic: rc.sysinit/1376/0x00000002
Modules linked in: joydev hid_logitech ff_memless snd_hda_codec_nvhdmi usbhid hid snd_hda_codec_realtek snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_hda_intel snd_hda_codec snd_pcm_oss snd_hwdep ohci_hcd snd_pcm ehci_hcd evdev snd_mixer_oss i2c_nforce2 snd_timer pcspkr psmouse usbcore shpchp sg forcedeth wmi i2c_core snd processor button serio_raw thermal pci_hotplug snd_page_alloc soundcore raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx uvesafb cn sd_mod
Pid: 1376, comm: rc.sysinit Not tainted 2.6.36-ARCH #1
Call Trace:
[<ffffffff810397ce>] __schedule_bug+0x5e/0x70
[<ffffffff81403110>] schedule+0x950/0xa70
[<ffffffff81060bad>] ? insert_work+0x7d/0x90
[<ffffffff81060fbd>] ? queue_work_on+0x1d/0x30
[<ffffffff81061127>] ? queue_work+0x37/0x60
[<ffffffff8140377d>] schedule_timeout+0x21d/0x360
[<ffffffff812031c3>] ? generic_make_request+0x2c3/0x540
[<ffffffff81402680>] wait_for_common+0xc0/0x150
[<ffffffff81041490>] ? default_wake_function+0x0/0x10
[<ffffffff812034bc>] ? submit_bio+0x7c/0x100
[<ffffffff810680a0>] ? wake_bit_function+0x0/0x40
[<ffffffff814027b8>] wait_for_completion+0x18/0x20
[<ffffffff8120a969>] blkdev_issue_discard+0x1b9/0x210
[<ffffffff811ba03e>] ext4_free_blocks+0x68e/0xb60
[<ffffffff811b1650>] ? __ext4_handle_dirty_metadata+0x110/0x120
[<ffffffff811b098c>] ext4_ext_truncate+0x8cc/0xa70
[<ffffffff810d713e>] ? pagevec_lookup+0x1e/0x30
[<ffffffff81191618>] ext4_truncate+0x178/0x5d0
[<ffffffff810eacbb>] ? unmap_mapping_range+0xab/0x280
[<ffffffff810d8976>] vmtruncate+0x56/0x70
[<ffffffff811925cb>] ext4_setattr+0x14b/0x460
[<ffffffff811319e4>] notify_change+0x194/0x380
[<ffffffff81117f80>] do_truncate+0x60/0x90
[<ffffffff811e08fa>] ? security_inode_permission+0x1a/0x20
[<ffffffff811eaec1>] ? tomoyo_path_truncate+0x11/0x20
[<ffffffff81127539>] do_last+0x5d9/0x770
[<ffffffff811278bd>] do_filp_open+0x1ed/0x680
[<ffffffff8140644f>] ? page_fault+0x1f/0x30
[<ffffffff81132bfc>] ? alloc_fd+0xec/0x140
[<ffffffff81118db1>] do_sys_open+0x61/0x120
[<ffffffff81118e8b>] sys_open+0x1b/0x20
[<ffffffff81002e6b>] system_call_fastpath+0x16/0x1b

Additional info:
dmesg is here: http://nopaste.info/cf4db3e9f9.html
config is here: http://nopaste.info/3ec80514fe.html
lspci is here: http://nopaste.info/fe289c947d.html
cpuinfo is here: http://nopaste.info/ccf5e02b42.html


Steps to reproduce:
Update to 2.6.36, reboot, observe error.

Closed by  Allan McRae (Allan)
Saturday, 16 April 2011, 09:18 GMT
Reason for closing:  None
Additional comments about closing:  User can no longer reproduce
Comment by Tobias Powalowski (tpowa) - Monday, 01 November 2010, 18:35 GMT
You are running a custom kernel, so how could this be fixed here?
Comment by Mathias Burén (fackamato) - Wednesday, 03 November 2010, 10:55 GMT
  • Field changed: Percent Complete (100% → 0%)
I switched to 2.6.36-ARCH from testing (from Arch repos) and I get the same result.
Comment by Jan de Groot (JGC) - Wednesday, 03 November 2010, 10:58 GMT
On the forum we found out that this is caused by the discard mount option. I can't reproduce it with my Intel Postville SSD, and after data loss on my OCZ Vertex 30GB SSD I don't want to try it on that one anymore. Searching for this issue shows that Gentoo users get the same trace with 2.6.36, so it's an upstream kernel problem. It looks like the kernel is doing something with TRIM that is not executed correctly by the Sandforce controller in this SSD. The Sandforce controller is less forgiving when it comes to TRIM; even the Microsoft AHCI driver has problems with it.
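To check whether a system is exposed to the discard mount option mentioned above, something like the following sketch could help. It assumes the usual /proc/mounts layout (device, mount point, filesystem type, comma-separated options); the function name mounts_with_discard and the sample text are made up for illustration and are not taken from this report.

```python
def mounts_with_discard(mounts_text):
    """Return (device, mountpoint, fstype) for entries mounted with 'discard'.

    mounts_text is expected to look like the contents of /proc/mounts.
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank, truncated, or commented-out lines.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        device, mountpoint, fstype, options = fields[:4]
        # Exact match on the option name, so 'nodiscard' is not counted.
        if "discard" in options.split(","):
            hits.append((device, mountpoint, fstype))
    return hits


# Hypothetical sample input, not taken from the reporter's machine.
sample = (
    "/dev/sda1 / ext4 rw,noatime,discard 0 0\n"
    "/dev/sdb1 /home ext4 rw,nodiscard 0 0\n"
    "proc /proc proc rw 0 0\n"
)
for device, mountpoint, fstype in mounts_with_discard(sample):
    print(f"{device} on {mountpoint} ({fstype}) is mounted with discard")
```

On a real system the text read from /proc/mounts could be passed in instead of the sample string. The same check works for /etc/fstab, which uses the same field order.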
Comment by Mathias Burén (fackamato) - Wednesday, 03 November 2010, 11:08 GMT
Ah, thanks for the information. I've removed the "discard" mount option, and at the moment it seems to work fine.
For the record, I'm using a Corsair F60:

[fackamato@ion ~]$ sudo hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number: Corsair CSSD-F60GB2
Serial Number: 10326505580009990027
Firmware Revision: 1.1
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0028)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 117231408
LBA48 user addressable sectors: 117231408
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 57241 MBytes
device size with M = 1000*1000: 60022 MBytes (60 GB)
cache/buffer size = unknown
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 1
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
SET_MAX security extension
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
* IDLE_IMMEDIATE with UNLOAD
* WRITE_UNCORRECTABLE_EXT command
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Host-initiated interface power management
* Phy event counters
DMA Setup Auto-Activate optimization
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* Data Set Management TRIM supported (limit 1 block)
* Deterministic read data after TRIM
Security:
supported
not enabled
not locked
not frozen
not expired: security count
not supported: enhanced erase
Logical Unit WWN Device Identifier: 5000000009990027
NAA : 5
IEEE OUI : 000000
Unique ID : 009990027
Checksum: correct
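The TRIM line in the listing above can also be picked out programmatically. This is only a rough sketch: it assumes that hdparm -I marks enabled features with a leading asterisk, as the "Enabled Supported:" column in this output suggests, and that the feature text contains "Data Set Management TRIM supported". The function name trim_supported is hypothetical. A True result only says the drive advertises TRIM; it says nothing about whether discard is safe with a given kernel.

```python
def trim_supported(hdparm_text):
    """Return True if the hdparm -I text lists TRIM with a leading '*'."""
    for line in hdparm_text.splitlines():
        stripped = line.strip()
        if "Data Set Management TRIM supported" in stripped:
            # Assumption: a leading '*' marks the feature as enabled/supported.
            return stripped.startswith("*")
    return False


# Lines copied from the listing above.
listing = (
    "* SCT Data Tables (AC5)\n"
    "* Data Set Management TRIM supported (limit 1 block)\n"
    "* Deterministic read data after TRIM\n"
)
print(trim_supported(listing))
```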
Comment by Mathias Burén (fackamato) - Wednesday, 03 November 2010, 11:09 GMT
Perhaps the severity should be set to low or normal, as it works without the discard option.
Comment by Andreas (poison) - Wednesday, 15 December 2010, 08:20 GMT
>Perhaps the severity should be set to low or normal

Rather not, since it reliably overwrites/prepends touched files with binary garbage on my X25-M when mounted with the discard option.
Running linux-2.6.36.2
Comment by Jelle van der Waa (jelly) - Friday, 15 April 2011, 10:30 GMT
Any update?
Comment by Mathias Burén (fackamato) - Friday, 15 April 2011, 10:50 GMT
I believe this is resolved in the latest kernel, but I can't test it as I don't have an SSD anymore.
