FS#32025 - [linux] 3.6.2 - 3.11.x results in laptop running 10 degrees warmer

Attached to Project: Arch Linux
Opened by Bryan (bryan) - Wednesday, 17 October 2012, 06:33 GMT
Last edited by Tobias Powalowski (tpowa) - Thursday, 10 October 2013, 09:51 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture x86_64
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 43
Private No

Details

Description: With nothing else changed (packages, configuration, software in use), upgrading from kernel 3.5.6 to 3.6.2 makes my laptop run 10 degrees warmer (~60-65°C). Reverting to 3.5.6-1-ARCH brings the lower temperatures (~50°C) back.

I'm using the unmodified packaged kernels, with only two modules loaded through modules-load.d (ecryptfs and oss-sound):

- Linux version 3.6.2-1-ARCH (tobias@T-POWA-LX) (gcc version 4.7.2 (GCC) ) #1 SMP PREEMPT Fri Oct 12 23:58:58 CEST 2012

- Linux version 3.5.6-1-ARCH (tobias@T-POWA-LX) (gcc version 4.7.1 20120721 (prerelease) (GCC) ) #1 SMP PREEMPT Sun Oct 7 19:30:49 CEST 2012


Laptop is Lenovo ThinkPad X220, model 429137G


Closed by Tobias Powalowski (tpowa)
Thursday, 10 October 2013, 09:51 GMT
Reason for closing: Upstream
Comment by Gaetan Bisson (vesath) - Wednesday, 17 October 2012, 07:54 GMT
I also have a Lenovo x220 and can confirm that I have been observing the same issue since the upgrade to linux-3.6.2 (although I originally attributed it to the early-Spring heat wave we have been having the last few days).

All I can see in dmesg are quite a few lines going "CPU0: Package temperature above threshold, cpu clock throttled", "CPU0: Package temperature/speed normal", "CPU0: Package power limit notification", and "CPU0: Package power limit normal" (with variants that say "Core" instead of "Package"), which I've certainly never seen before.

Edit: To illustrate the severity, my idle laptop just went from 82°C to 44°C when I downgraded to linux-3.5.6 and rebooted.
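To gauge how often the CPU is being throttled, the messages Gaetan quotes can be counted from the kernel log. A minimal sketch, assuming the message text matches the lines above (it may be worded differently in other kernel versions) and using a hypothetical helper name:

```shell
#!/bin/sh
# count_throttle_events: count CPU thermal-throttling notices in kernel log
# text read from stdin (both the "Package" and "Core" variants match).
count_throttle_events() {
    # grep -c prints the number of matching lines; "|| true" keeps a
    # zero count from making the function fail.
    grep -c 'temperature above threshold, cpu clock throttled' || true
}

# Usage (reading the kernel log may require root):
#   dmesg | count_throttle_events
```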
Comment by Thomas Bächler (brain0) - Wednesday, 17 October 2012, 12:00 GMT
I've seen this on the forums, but cannot reproduce it. You need to check the output of powertop and compare it to 3.5; this will hopefully show which component is the problem.

Personally, my battery times and heat are fine with 3.6, so the bug is specific to certain hardware and/or drivers.
Comment by Gaetan Bisson (vesath) - Wednesday, 17 October 2012, 13:35 GMT
After downgrading to 3.5.6 and rebooting, I upgraded to 3.6.2 and rebooted again (in the hope of running powertop and finding the culprit), but the temperature has so far behaved perfectly normally. I'll report here if it goes wonky again.
Comment by Thomas Bächler (brain0) - Wednesday, 17 October 2012, 13:46 GMT
Please also write down some powertop data for later comparison.
Comment by Gaetan Bisson (vesath) - Wednesday, 17 October 2012, 14:24 GMT
I rebooted under 3.6.2 again and this time it started overheating. Attached are powertop outputs from my first 3.6.2 reboot (when things were fine) and the last one (when things got hot). It looks like the main issue is the GPU, but everything else (processes, etc.) also shows considerably more events per second...
Comment by Gaetan Bisson (vesath) - Wednesday, 17 October 2012, 14:29 GMT
I've rebooted again and it's back to normal; it only seems to overheat on every second reboot...
Comment by Tobias Powalowski (tpowa) - Wednesday, 17 October 2012, 14:35 GMT
You can also try both cold and warm reboots; a cold reboot means powering off before turning the machine on again.
Comment by Gaetan Bisson (vesath) - Wednesday, 17 October 2012, 14:44 GMT
For information, I've only been doing warm reboots. Since they suffice to make the laptop temperature switch from fine to hot and back, I haven't bothered with cold reboots.
Comment by Thomas Bächler (brain0) - Wednesday, 17 October 2012, 16:02 GMT
This looks like something is seriously wrong in the scheduler. I don't claim to understand it at all; this needs to go upstream.

If you can be bothered to build and test 15 kernels, you can bisect it. First, check that it is fine with version 3.5.0 and problematic with 3.6.0; then (preferably building with the same configuration) run:

$ git bisect start
$ git bisect good v3.5
$ git bisect bad v3.6
Bisecting: 5071 revisions left to test after this (roughly 13 steps)
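Each bisect step halves the remaining range, so the number of builds grows with the base-2 logarithm of the revision count: 5071 revisions need about 13 test kernels, which together with the initial 3.5 and 3.6 checks matches the estimate of roughly 15. The sketch below only reproduces that arithmetic; the helper name is hypothetical and it is an approximation, not git's own estimate code. For each kernel git checks out, you would build, boot and test it, mark it with `git bisect good` or `git bisect bad`, and run `git bisect reset` once the offending commit is reported.

```shell
#!/bin/sh
# bisect_steps N: smallest k with 2^k >= N, i.e. roughly how many
# build-and-test rounds a binary search over N revisions needs.
bisect_steps() {
    n=$1
    steps=0
    range=1
    while [ "$range" -lt "$n" ]; do
        range=$((range * 2))     # each round halves the candidates,
        steps=$((steps + 1))     # so count how many doublings cover N
    done
    echo "$steps"
}

# Usage:
#   bisect_steps 5071   # 2^12 = 4096 < 5071 <= 8192 = 2^13, so 13
```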
Comment by Anthony Ruhier (Anthony25) - Wednesday, 17 October 2012, 17:15 GMT
Same thing on my laptop (Asus N55SF, Sandy Bridge): battery life is now half of what I had with linux 3.5.6, and my CPU is hotter.

Downgrading to the 3.5.6 version fixes the problem, so I will wait for a real fix before switching to the 3.6 branch.
Comment by Gaetan Bisson (vesath) - Thursday, 18 October 2012, 01:38 GMT
Anthony: If your laptop is constantly overheating, and if rebooting (still with 3.6.2) does not change that, please do the bisect that Thomas suggested. I would do it myself if it behaved deterministically on my computer but it does not.
Comment by KaiSforza (KaiSforza) - Thursday, 18 October 2012, 10:05 GMT
I am also having this problem. What I and a few others on the forums noticed was a lack of frequency scaling for our processors. See the powertop output attached.

With 3.6.2-1: 2.5GHz across the board (edit: oddly enough, fewer wakeups and GPU ops/sec)
With 3.5.6-1: 800MHz for all processors

Both of these powertop snapshots were taken at boot time.
Comment by Greg (dolby) - Thursday, 18 October 2012, 10:08 GMT
Comment by Jan Alexander Steffens (heftig) - Thursday, 18 October 2012, 13:26 GMT
I get hit with this after resuming from suspend (Thinkpad X220t i7 SNB).
Comment by Jonas Jelten (TheJJ) - Friday, 19 October 2012, 07:58 GMT
The explanation for this:

RC6 is no longer active. Look at this:

| GPU |
| |
| Active 100.0% |
| RC6 0.0% |
| RC6p 0.0% |
| RC6pp 0.0% |
| |
| |


| GPU |
| |
| Active 0.0% |
| RC6 100.0% |
| RC6p 0.0% |
| RC6pp 0.0% |
| |
| |

See this thread:
http://lists.freedesktop.org/archives/intel-gfx/2012-October/021327.html
Comment by Jonas Jelten (TheJJ) - Friday, 19 October 2012, 08:35 GMT
Comment by Federico Colnago (curson) - Wednesday, 24 October 2012, 10:44 GMT
I am having a very similar problem on my Dell XPS 14z. Everything is just fine with 3.6.2, but with the 3.6.3 kernel the idle temperature (straight after boot) is at least 10~15°C higher (it is usually around 48-50°C, while it goes up to close to 70°C with 3.6.3). The fans seem to spin hard to compensate, while usually they are clearly at minimum and barely audible.

Downgrading to 3.6.2 fixed the issue without fault.
Comment by Sudhir Khanger (donniezazen) - Wednesday, 24 October 2012, 17:15 GMT
Wow, I didn't realise it would get that bad: 30W on 3.6.3.
Comment by Jonas Jelten (TheJJ) - Wednesday, 24 October 2012, 17:23 GMT
https://bugzilla.kernel.org/attachment.cgi?id=84431
this patch fixed it for me. Before, it was at 35W and 96 degrees at idle... now 8W and 40 degrees.
Comment by Sudhir Khanger (donniezazen) - Wednesday, 24 October 2012, 18:27 GMT
How do I patch my current kernel? Do I have to rebuild it?
Comment by Bryan (bryan) - Thursday, 25 October 2012, 08:06 GMT
That patch fixes the issue for me too.
Comment by Bryan (bryan) - Thursday, 25 October 2012, 16:17 GMT
Hmm, no fix after a cold reboot: the GPU is no longer in RC6 state (Active 100%), fan high, CPU at full frequency. My test was with the patch applied to the v3.6 tag of a linux git clone.
Comment by David Rosenstrauch (darose) - Monday, 29 October 2012, 13:46 GMT
Is there a kernel package update that includes the above patch?
Comment by David Rosenstrauch (darose) - Monday, 05 November 2012, 02:56 GMT
Still no update on this bug? Is this by any chance fixed by upgrading to a later kernel version? (I see Arch is up to 3.6.5 now.) I've been holding my system at kernel version 3.5.6 pending this issue getting resolved.
Comment by phanisvara das (phani00) - Tuesday, 06 November 2012, 04:17 GMT
i haven't experienced the bug for the last two kernel updates, i think. it's never been reliably reproducible, and once with the previous kernel (3.6.4-1) i thought it was happening again, but it could have been something else as well, like nepomuk/strigi getting out of line (didn't check at the time, just rebooted). since upgrading to 3.6.5-1 i'm sure it hasn't happened again, on either cold or warm boots.
Comment by Bryan (bryan) - Tuesday, 06 November 2012, 07:53 GMT
My last test (3.6.4-1) on Oct 29 showed the issue is still there, after a cold or warm boot, and I reverted to 3.5.6-1.
Comment by Bryan (bryan) - Tuesday, 06 November 2012, 07:56 GMT
tpowa, are you working on a 3.7.x package already? If so, I'd gladly test a package, or rebuild it myself if you can send me the PKGBUILD and additional files.
Comment by phanisvara das (phani00) - Tuesday, 06 November 2012, 08:00 GMT
what you wrote, bryan, seems to confirm what i observed: still there in 3.6.4-1, but gone since. today i got 3.6.6-1, and this bug still hasn't raised its ugly head. perhaps you want to try this one?
Comment by Bryan (bryan) - Tuesday, 06 November 2012, 09:27 GMT
With the 3.6.6-1 kernel from the package, my GPU is still Active 100% of the time (while the laptop is idle) according to powertop. Temperature is lower (and thus so is the fan level), so there is some progress, but obviously there is still an issue. Power consumption is high, as the battery drains really quickly. 3.6.6 doesn't fix the issue for me.
Comment by David Rosenstrauch (darose) - Wednesday, 07 November 2012, 02:37 GMT
I just tried out 3.6.6 - still broken. sensors shows temperatures in the 80C range, and the fan whirrs like crazy (vs. ~70C when I downgrade back to 3.5.6).
Comment by Eric Donkersloot (lordchaos) - Thursday, 08 November 2012, 22:19 GMT
For me, the bug is reproducible. I only hit the bug when resuming the session from suspend. The CPU temperature rises to 67C when the laptop is idling (as opposed to ~50C). Rebooting always seems to fix the issue. Kernel 3.6.6-1.
Comment by phanisvara das (phani00) - Thursday, 08 November 2012, 22:25 GMT
really weird: after reporting earlier that it didn't happen to me anymore, it did happen again (once). but in my case, suspend & resume fixes the high temperature issue, instead of causing it...
Comment by Eric Donkersloot (lordchaos) - Thursday, 08 November 2012, 22:51 GMT
I may be hitting another bug, I don't know. With all kernels I've tested I see the same behaviour:

1. Boot your laptop
2. Powertop reports a discharge rate of 9 ~ 11W. CPU Temp is around ~50C.
3. Suspend the laptop
4. Resume the laptop
5. Powertop reports a discharge rate of 15 ~ 18W. CPU Temp rises up to ~67C.

Only a reboot seems to fix this. Tested with kernels 3.5.6-1, 3.6.2-1 & 3.6.6-1.
Comment by Jan Alexander Steffens (heftig) - Thursday, 08 November 2012, 23:22 GMT
This seems to be caused by multiple issues. I've been seeing:
- The GPU not using deep sleep states anymore after suspend; RC6 does not get used, it is permanently Active. Cleared on reboot.
- The CPU not using deep sleep states on occasion, even without suspend; a mostly idle core spends all of its time in either C0 or (mostly) C1, C7 being ignored. Cleared by waiting, suspend, or reboot.
- The CPU being stuck at a high frequency. Seems to be caused by suspend. Fiddling with cpupower does nothing. Cleared on reboot.
Comment by Eric Donkersloot (lordchaos) - Sunday, 11 November 2012, 14:38 GMT
I've tested with both the Ubuntu 12.04-1 livecd and the newer 12.10, which both use older kernels, and I'm experiencing the same behavior on Ubuntu as well. That surprised me, as I've read that this bug should not be present in the 3.5.x series. I can't confirm that, I'm afraid: after resuming from suspend my laptop uses on average 5W more, regardless of which kernel I've tested so far.
Comment by Adrian Goll (goll) - Sunday, 11 November 2012, 20:44 GMT
Unfortunately, I too can confirm this bug, running 3.6.6-1 on an HP EliteBook 8560p system with an i7-2620M processor and an ATI 6470M.

Here is my powerstat output http://pastebin.com/raw.php?i=8aEtZgxC which shows usage from 21-27W. Powertop shows 23W usage when idle.

My cpu temperature is constantly around 70C, my gpu temp is also in the 70C range.

I noticed really high values in powertop for my snd_hda_intel card, always at 100%, which seems to be this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/877560. The following improves things *slightly*:

In file /etc/modprobe.d/modprobe.conf:

options snd_hda_intel power_save=1 power_save_controller=Y

But as far as I can tell, I don't suffer from the CPU-stuck-at-one-frequency bug: cpupower seems to be doing its job and keeping the frequency at 800MHz, which is the minimum frequency for my CPU (http://pastebin.com/raw.php?i=2Tvb1hbB). /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq shows 800000, as does /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq.
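Reading `scaling_cur_freq` for one CPU, as above, extends naturally to all of them. A minimal sketch, assuming the usual cpufreq sysfs layout where the value is reported in kHz; the helper name and the optional base-directory argument (handy for testing) are hypothetical. On a machine hit by the frequency problem KaiSforza describes, an idle system would show something near its maximum clock on every line instead of the minimum.

```shell
#!/bin/sh
# cpu_freqs_mhz [BASE]: print the current frequency of every CPU in MHz,
# read from cpufreq's scaling_cur_freq (reported in kHz).
cpu_freqs_mhz() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue      # glob did not match, or no cpufreq driver
        cpu=${f#"$base"/}            # strip the base directory ...
        cpu=${cpu%%/*}               # ... leaving e.g. "cpu0"
        echo "$cpu $(( $(cat "$f") / 1000 )) MHz"
    done
}

# Usage:
#   cpu_freqs_mhz
```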
Comment by Anthony Ruhier (Anthony25) - Saturday, 17 November 2012, 23:15 GMT
Like Adrian Goll, I have not been affected by the CPU part of this bug for a few kernel versions, but I still have the Intel GPU problem:

I have to set the CPU governor to powersave before suspending if I want the GPU power saving to stay active.
Comment by Bryan (bryan) - Thursday, 29 November 2012, 20:19 GMT
With the 3.6.8-1 packaged kernel, the GPU is still in Active state 100% of the time.
Comment by alex (kabolt) - Friday, 30 November 2012, 17:15 GMT
thx guys, I thought it was my battery and wondered why my laptop is so hot in standby.
Comment by Anthony Ruhier (Anthony25) - Friday, 07 December 2012, 14:46 GMT
It seems the bug has been fixed for me since kernel version 3.6.9.
Comment by David Rosenstrauch (darose) - Friday, 07 December 2012, 16:41 GMT
Can anyone else confirm?
Comment by Anthony Ruhier (Anthony25) - Friday, 07 December 2012, 18:12 GMT
I was just lucky for a day: I woke my laptop from suspend a few minutes ago and got the GPU 100% Active state bug...
Comment by phanisvara das (phani00) - Sunday, 09 December 2012, 03:49 GMT
this is the weirdest bug i've encountered on arch so far. it happens rarely to me, perhaps once every two weeks, but just now it happened again (3.6.9-1). unless it's something else, it certainly looks the same: ~10 degrees above normal, and no processes stuck :(
Comment by phanisvara das (phani00) - Monday, 10 December 2012, 11:32 GMT
for some reason, i'm getting hit by this more frequently now :(

no recent kernel update and nothing else that i could understand to be related to this. but right now, as my CPU temperature is 10 degrees higher again, i find this in my dmesg.log:

[ 6.499362] ACPI Warning: 0x0000000000000428-0x000000000000042f SystemIO conflicts with Region \GPIS 1 (20120320/utaddress-251)
[ 6.499367] ACPI Warning: 0x0000000000000428-0x000000000000042f SystemIO conflicts with Region \PMIO 2 (20120320/utaddress-251)
[ 6.499372] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 6.499375] ACPI Warning: 0x0000000000000500-0x000000000000057f SystemIO conflicts with Region \GPIO 1 (20120320/utaddress-251)
[ 6.499379] ACPI Warning: 0x0000000000000500-0x000000000000057f SystemIO conflicts with Region \GP01 2 (20120320/utaddress-251)
[ 6.499383] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 6.499384] lpc_ich: Resource conflict(s) found affecting gpio_ich
[ 6.499418] ACPI Warning: 0x000000000000e040-0x000000000000e05f SystemIO conflicts with Region \SMB0 1 (20120320/utaddress-251)
[ 6.499423] ACPI Warning: 0x000000000000e040-0x000000000000e05f SystemIO conflicts with Region \_SB_.PCI0.SBUS.SMBI 2 (20120320/utaddress-251)
[ 6.499426] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver

haven't seen this earlier but assume it's happening whenever it gets hot under the hood. don't really know what to make of it though.
Comment by David Wilkins (dwilkins) - Monday, 10 December 2012, 18:40 GMT
I just wanted to cross-post that a bug similar to this is being tracked in Fedora as well:

https://bugzilla.redhat.com/show_bug.cgi?id=859597

The workaround I (Lenovo T420) found was to use Fedora kernel package 3.5.5-2, but I think I probably need to go back and get that version of the i915 driver instead.

You can tell if it's happening by doing a

cat /sys/kernel/debug/dri/0/i915_drpc_info

a few times and seeing if RC6+ or RC6++ change at all. If those values stay 0, then your box will probably heat up. I wrote a script to check that (attached to the Red Hat bugzilla ticket).

dhw
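Reading the file "a few times" can be scripted. A minimal sketch, assuming root access, a mounted debugfs, and the `i915_drpc_info` line format shown later in this thread (for example `RC6 residency since boot: N`); the helper names are hypothetical and this is not the script attached to the Red Hat ticket. It succeeds if any of the RC6, RC6+ or RC6++ residency counters moves between two samples.

```shell
#!/bin/sh
# read_counter FILE LABEL: print the first token after "LABEL:" in FILE
read_counter() {
    awk -F': *' -v label="$2" '$1 == label { split($2, v, " "); print v[1]; exit }' "$1"
}

# rc6_snapshot FILE: print the RC6, RC6+ and RC6++ residency counters, one per line
rc6_snapshot() {
    for label in "RC6 residency since boot" \
                 "RC6+ residency since boot" \
                 "RC6++ residency since boot"; do
        read_counter "$1" "$label"
    done
}

# rc6_moving FILE SECONDS: succeed if any counter changed between two samples
rc6_moving() {
    first=$(rc6_snapshot "$1")
    sleep "$2"
    second=$(rc6_snapshot "$1")
    [ "$first" != "$second" ]
}

# Usage (as root, with debugfs mounted):
#   rc6_moving /sys/kernel/debug/dri/0/i915_drpc_info 5 && echo "RC6 active" || echo "RC6 counters frozen"
```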
Comment by Olivier Toupin (oliviertoupin) - Tuesday, 11 December 2012, 01:03 GMT
A quick and supported fix for me was to use the LTS kernel.
Comment by Bryan (bryan) - Tuesday, 11 December 2012, 20:25 GMT
I installed 3.7.0-1 from testing tonight. All 3.6.x kernels showed (in powertop's idle stats) the GPU in Active state 99-100% of the time. With 3.7.0, after a cold boot, I see Active 46% and RC6 54% in the idle stats. The fan doesn't start at full speed and the temperature stays around 50°C. It's not as perfect as 3.5.6 was, but 3.7.0 is usable for more than 30 minutes. But... after resuming from sleep (S3), the GPU is in Active state 100% of the time (still according to powertop's idle stats), the temperature rises to 60°C, CPU throttling is dead and the fan is of course at high speed.

Back to kernel 3.5.6.
Comment by phanisvara das (phani00) - Tuesday, 11 December 2012, 21:50 GMT
for me it seems to be the opposite: after switching to 3.7 the laptop starts up hot (+10 deg.), and even a couple reboots don't bring it back to normal. instead of going back to the previous kernel, i'll look into the LTS kernel now, since i don't have much hope for this going to get fixed any time soon...
Comment by Gaetan Bisson (vesath) - Monday, 07 January 2013, 07:00 GMT
This bug is fixed in linux-3.8-rc1 (which by the way works perfectly on my X220).
Comment by KaiSforza (KaiSforza) - Monday, 07 January 2013, 07:08 GMT
I can confirm. 3.8-rc1 and rc2 both do not have this issue. Hopefully this all gets backported to 3.7.x.
Comment by Elvis Stansvik (estan) - Monday, 07 January 2013, 10:16 GMT
That's great news. Any idea which commit fixed it? So we can keep an eye out for a backport to 3.7.x.
Comment by Gaetan Bisson (vesath) - Monday, 07 January 2013, 11:54 GMT
I forget which one, but you won't miss it: if I recall correctly, it mentions rc6 and either i915 or drm in the log message.
Comment by Elvis Stansvik (estan) - Monday, 07 January 2013, 11:56 GMT
Alright.
Comment by Eric Donkersloot (lordchaos) - Monday, 11 February 2013, 09:30 GMT
I haven't come across this bug since 3.7.6-1-ARCH. Can anyone else confirm?
May be related to this commit: https://bbs.archlinux.org/viewtopic.php?pid=1227526#p1227526
Comment by Bryan (bryan) - Monday, 11 February 2013, 09:57 GMT
I tried 3.7.6-1-ARCH this weekend. A cold boot shows proper behaviour: RC6 state for the GPU most of the time, CPUs idle, fan silent. After suspend/resume things are no longer so nice: the GPU is again in Active state (according to powertop) most of the time (> 90%) and the fan is triggered very regularly. The rest is OK (CPU throttling, temperature). So it's usable if you don't suspend.

Reverting to 3.5.6.
Comment by David Rosenstrauch (darose) - Monday, 11 February 2013, 15:23 GMT
I haven't tried it yet. (I've been sticking with LTS kernel on my laptop until this gets fixed.) But comments in the upstream bug (https://bugzilla.kernel.org/show_bug.cgi?id=48721) seemed to indicate this wasn't fixed until 3.8.
Comment by Kai Hendry (hendry) - Thursday, 21 February 2013, 02:41 GMT
3.7.9 is a disaster. Just seems to sit in Active state. :( http://stats.webconverger.org/x220/temp/052.png
Comment by Tobias Powalowski (tpowa) - Thursday, 21 February 2013, 05:49 GMT
You can try 3.8.0 from testing repository.
Comment by Mario Kozjak (archman-cro) - Friday, 22 February 2013, 21:42 GMT
3.8.0 from testing doesn't help here. Compaq nx6310, i945 GPU, Celeron M 1.73 GHz.
Comment by phanisvara das (phani00) - Saturday, 23 February 2013, 08:05 GMT
for me the heating issue hasn't occurred since i've been using 3.8. but even with the later 3.7 kernels it happened only rarely. this whole thing seems to be very hardware-specific, affecting different machines differently. specs:

Machine: Mobo: ASUSTeK model: K53SK version: 1.0 Bios: American Megatrends version: K53SK.203 date: 10/11/2011
CPU: Quad core Intel Core i7-2670QM CPU (-HT-MCP-) clocked at 800.00 MHz
Graphics: Card-1: Intel 2nd Generation Core Processor Family Integrated Graphics Controller
Card-2: Advanced Micro Devices [AMD] nee ATI Whistler LE [AMD Radeon HD 6625M Graphics]
X.org: 1.13.2.901 driver: intel Resolution: 118x37
Comment by Kai Hendry (hendry) - Saturday, 23 February 2013, 08:07 GMT
3.8.0 runs cool for me, except sometimes out of suspend. :/
Comment by phanisvara das (phani00) - Saturday, 23 February 2013, 08:23 GMT
oh my, now that you mentioned it i tried suspending (haven't done that in a while), and it's 10 deg. hotter again. arg@#$@#!
Comment by Eric Donkersloot (lordchaos) - Wednesday, 27 February 2013, 12:20 GMT
Be aware, AFAIK there are two possible scenarios why your laptop temperature may be ~10 degrees higher after a suspend/resume cycle:

1. You have a laptop with two GPUs. On boot, you have disabled the discrete card with vgaswitcheroo. After the laptop resumes, the discrete card is turned on again, which explains the rise in temperature.
2. After the laptop resumes, the integrated Intel GPU is 100% active and does not return to RC6 mode.

I use a workaround for scenario 1 and I personally haven't come across scenario 2 since 3.7.6.
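For scenario 1, one possible approach (not necessarily Eric's workaround) is to switch the discrete card off again after every resume. A sketch, assuming the vga_switcheroo debugfs file at `/sys/kernel/debug/vgaswitcheroo/switch` and a systemd that runs executables in `/usr/lib/systemd/system-sleep/` with `pre` or `post` as the first argument; the hook file name and the `SWITCH` override are hypothetical.

```shell
#!/bin/sh
# Hypothetical /usr/lib/systemd/system-sleep/10-dgpu-off hook (must be executable).
# systemd calls it with $1 = "pre" before suspend and $1 = "post" after resume.
SWITCH=${SWITCH:-/sys/kernel/debug/vgaswitcheroo/switch}

dgpu_off_hook() {
    case "$1" in
        post)
            # Writing OFF asks vga_switcheroo to power down the GPU not in use.
            [ -w "$SWITCH" ] && echo OFF > "$SWITCH"
            ;;
    esac
    return 0
}

dgpu_off_hook "$@"
```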
Comment by phanisvara das (phani00) - Wednesday, 27 February 2013, 12:24 GMT
thanks, i'll check that. so far i've checked which driver is being used, and it was only the one for the integrated card (intel), but that doesn't mean the other card (radeon) isn't activated.
Comment by Bryan (bryan) - Friday, 15 March 2013, 19:34 GMT
I have rather good news (at least for me) with kernel 3.8.3-1 from testing. It hasn't shown any bad behaviour for the whole day. The split between RC6 and Active GPU state is reasonable, and it never gets stuck at 100% Active (except when the GPU is doing a lot of work, of course). Temperature stays around 50°C during standard operation and use. After a few suspend/resume cycles, nothing bad happens. Battery life is above 4h, with a power rate of 14W when the fan is triggered. It's the first time since kernel 3.5.6 I can use my laptop (Lenovo x220) with a recent kernel.

I hope you'll all have similar successes with 3.8.3.
Comment by David Rosenstrauch (darose) - Friday, 15 March 2013, 19:45 GMT
Cool! I'm looking forward to 3.8 getting released to core so this issue will (hopefully) be fixed. (Sorry, I can't live on the cutting edge with testing.) Anyone know when that's expected to happen?

I've been pinned to LTS on my laptop until this bug gets resolved, but that's starting to get problematic, as I'm starting to get plagued by other issues as a result of running LTS. (See: https://bugs.archlinux.org/task/34209)
Comment by David Rosenstrauch (darose) - Tuesday, 19 March 2013, 14:53 GMT
Upgraded to 3.8 last night, now that it's been released, and this temperature issue appears solved.
Comment by KaiSforza (KaiSforza) - Wednesday, 20 March 2013, 17:02 GMT
  • Field changed: Percent Complete (100% → 0%)
While I really don't want to burst people's bubbles, this is still an issue with 3.9-rcN. For each of the three release candidates so far it has happened once (only once) and has been fixed either by a suspend/resume cycle or a quick reboot. To be honest, I don't know what to make of it, as I can never reproduce it reliably.
Comment by Anthony Ruhier (Anthony25) - Wednesday, 20 March 2013, 17:09 GMT
Same thing for me, it happened one time and I had to suspend and resume my laptop again to reactivate the RC6 state.

It happened only once on my laptop, so we can say it's almost fixed, but not entirely in my case.
Comment by Federico Colnago (curson) - Wednesday, 20 March 2013, 17:15 GMT
Running on 3.8.3-2 right now, seems to be stable, but I don't go through suspend/resume cycles so I can't confirm that aspect of the issue. The three boots I've done so far today, all went smooth and without the issue presenting itself though.
Comment by Anthony Ruhier (Anthony25) - Wednesday, 20 March 2013, 17:26 GMT
It is stable on my laptop too, and the suspend issue happened once in 10 or 15 suspends in total, so it's not a big deal, but it's still there.
Comment by Bryan (bryan) - Thursday, 21 March 2013, 08:27 GMT
Unfortunately I have the bad behaviour again after half a dozen suspend/resume cycles: temperature closer to 60 than 50, fan engaged all the time, GPU in Active state at 100%, battery life divided by two. A reboot fixes the problem, but it reappears after N resumes.

Does someone know why this was broken at all upstream?
Comment by David Rosenstrauch (darose) - Thursday, 21 March 2013, 11:37 GMT
Perhaps this should be split off to a separate FS issue. The core problem seems to be resolved. And the problem with overheating on suspend/resume seems to be a separate (possibly unrelated) issue.
Comment by Anthony Ruhier (Anthony25) - Thursday, 21 March 2013, 14:44 GMT
@Bryan: I think it's due to a big power-management rework for PCI-E devices that started in Linux 3.6. Before that there was already a power-management issue, but that was because RC6 was disabled by default to avoid stability problems.
Comment by alex (kabolt) - Saturday, 23 March 2013, 20:35 GMT
A question: who of you use gdm? I noticed recently that gdm's dbus-daemon sometimes malfunctions and eats 100% of a CPU. Maybe this is part of the issue.
I mean: could it be that those of you who still sometimes have a hot laptop are hit by that bug?
Comment by Anton S (lfxgroove) - Sunday, 05 May 2013, 06:03 GMT
@alex: I don't use it and still see the problem
Comment by KaiSforza (KaiSforza) - Sunday, 05 May 2013, 06:20 GMT
People who are getting this issue because something is taking up 100% CPU, that is not what this bug report is about. That is not an issue with the Linux kernel, that is an issue with an application.

If you are having this issue, though your CPU usage and load averages are down at idle levels, then this is a CPU/GPU scaling issue and is part of this bug report.

Thank you.

On a side note, I haven't had this issue with 3.9.0.
Comment by Markus (fixje) - Thursday, 09 May 2013, 09:20 GMT
It still happens to me on my X220 and seems to be even worse:
Some people already reported that GPU is 100% in running state even if they don't suspend/resume. I didn't experience that before updating to kernel 3.8.11
When I start up the notebook on AC it is fine (20-50% RC6 depending on what I'm doing). After a while the GPU goes crazy and the device heats up. I did not find out what triggers this behavior. Rebooting fixes it again, suspend/resume doesn't. I also made sure to turn off things like laptop-mode.

kernel options: pcie_aspm=force i915.i915_enable_rc6=1 i915.i915_enable_fbc=1 i915.lvds_downclock=1
Comment by Federico Colnago (curson) - Thursday, 09 May 2013, 09:26 GMT
Since the 3.8.x series came around, this bug seems resolved for me on my Dell XPS 14z. I'm currently on 3.8.11 and reporting no problem, but I never go through suspend/resume cycles as I don't use that feature, so my "testing" is limited to the problem presenting itself (or not, in this case) at boot.
Comment by Kai Hendry (hendry) - Thursday, 09 May 2013, 09:30 GMT
I can confirm coming out of suspend, this X220 machine heats up 65+C (fan whirring) instead of ~40C on 3.8.11-1-ARCH.

Generally nowadays suspend just doesn't work, so I have not been using it very often. :(
Comment by Elvis Stansvik (estan) - Saturday, 11 May 2013, 13:00 GMT
Since some responses seem to indicate the problem is gone in 3.9.x, is anyone on 3.9.x still having the problem, or is it finally fixed?
Comment by Philip Munksgaard (Munksgaard) - Tuesday, 14 May 2013, 08:13 GMT
I am still experiencing this on a Lenovo X220.

$ uname -a
Linux hertz 3.9.2-1-ARCH #1 SMP PREEMPT Sat May 11 20:31:08 CEST 2013 x86_64 GNU/Linux

It runs hot after suspend.
Comment by Elvis Stansvik (estan) - Thursday, 16 May 2013, 14:36 GMT
I can confirm :( (also on X220 and 3.9.2-1-ARCH). Quite annoying that the problem has been present in 4 minor kernel releases now, and still no fix in sight AFAIK.
Comment by Philip Munksgaard (Munksgaard) - Wednesday, 22 May 2013, 13:42 GMT
I noticed something weird today.

First of all, it doesn't happen every time I continue from suspend, but it does happen once in a while.

Secondly, right now I've just unsuspended, and the computer isn't running crazy hot, but the fan is still running (which is weird, considering I'm just working on some LaTeX in Emacs). I opened powertop and noticed that the "Powered on" and "RC6" counters under GPU fluctuate wildly. One moment it'll look like this: http://i.imgur.com/zTsNJvY.jpg and if I press 'r' to refresh, not even a second later, it can look like this: http://i.imgur.com/78XVVe8.jpg, and everything in between. I'm guessing it should be pretty stable at 100% RC6 if I'm just idling?
Comment by Jonas Jelten (TheJJ) - Friday, 24 May 2013, 06:58 GMT
I got hit by the bug again after resuming on 3.10.0-rc2. The GPU never enters RC6; after a suspend-resume cycle everything is normal again. Thinkpad X220t, Core i5-2520M.
Comment by Andrew Cowie (afcowie) - Monday, 03 June 2013, 07:54 GMT
You're lucky a suspend-resume cycle fixes it for you. My system is normally 80-90% in RC6, but once this bug hits it's 0% RC6 / 100% Powered On. Kernel is stock 3.9.4-1-ARCH, cmdline is the usual i915_enable_rc6=7 i915_enable_fbc=1 lvds_downclock=1. I do wonder if some of that should go away...
Comment by Jonas Jelten (TheJJ) - Monday, 03 June 2013, 10:02 GMT
andrew, which CPU do you have? Even if you are on Ivy Bridge, RC6pp is not stable, and on Sandy Bridge you should only use RC6, so you can try changing i915_enable_rc6=7 to i915_enable_rc6=1 (assuming you have SNB).
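For reference, `i915_enable_rc6` is a bitmask: 1 enables RC6, 2 deep RC6 (RC6p) and 4 deepest RC6 (RC6pp), so 7 requests all three, 1 requests plain RC6 only, 0 disables it and -1 leaves the choice to the driver's per-chip default. A small decoding sketch (the helper name is hypothetical):

```shell
#!/bin/sh
# decode_rc6 VALUE: list the RC6 stages requested by an i915_enable_rc6 bitmask
decode_rc6() {
    v=$1
    if [ "$v" -lt 0 ]; then echo "per-chip default"; return; fi
    if [ "$v" -eq 0 ]; then echo "disabled"; return; fi
    stages=""
    [ $((v & 1)) -ne 0 ] && stages="$stages RC6"      # bit 0: RC6
    [ $((v & 2)) -ne 0 ] && stages="$stages RC6p"     # bit 1: deep RC6
    [ $((v & 4)) -ne 0 ] && stages="$stages RC6pp"    # bit 2: deepest RC6
    echo "${stages# }"
}

# Usage:
#   decode_rc6 "$(cat /sys/module/i915/parameters/i915_enable_rc6)"
```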
Comment by Andrew Cowie (afcowie) - Monday, 03 June 2013, 12:37 GMT
@Thejj done. Have to wait and see if it strikes again; I'll report back here either way. (For what it's worth, I never saw it drop to RC6p or RC6pp, but hey)
Comment by Jan Alexander Steffens (heftig) - Tuesday, 04 June 2013, 03:12 GMT
That's because the proper kernel parameters would be i915.i915_enable_rc6, i915.i915_enable_fbc and i915.lvds_downclock.
Comment by Andrew Cowie (afcowie) - Tuesday, 04 June 2013, 04:30 GMT
@heftig Thanks. I should have picked that up. Anyway, after fixing that (and verifying via /sys/module/i915/parameters), neither i915_enable_rc6=7 nor 1 had any effect; it's still running at 100% Powered On as I write this.
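The values the driver actually loaded can be listed in one go. A minimal sketch, assuming the standard `/sys/module/<name>/parameters/` layout; the helper name and the optional base-directory argument are hypothetical, and some parameter files may be readable only by root:

```shell
#!/bin/sh
# module_params MODULE [BASE]: print "name=value" for each readable parameter
module_params() {
    dir=${2:-/sys/module}/$1/parameters
    for p in "$dir"/*; do
        [ -r "$p" ] || continue    # unreadable file, or the glob matched nothing
        echo "$(basename "$p")=$(cat "$p")"
    done
}

# Usage:
#   module_params i915 | grep -E 'enable_rc6|enable_fbc|lvds_downclock'
#   module_params snd_hda_intel
```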
Comment by Tobias Powalowski (tpowa) - Wednesday, 07 August 2013, 12:42 GMT
Status on 3.10.x?
Comment by Jan Alexander Steffens (heftig) - Wednesday, 07 August 2013, 12:43 GMT
Seems to be fine after boot, but has a reproducible, dramatic temperature rise after resuming from suspend.
Comment by KaiSforza (KaiSforza) - Wednesday, 07 August 2013, 16:56 GMT
> Seems to be fine after boot, but has a reproducible, dramatic temperature rise after resuming from suspend.

Could you please specify what exactly needs to be done to reproduce this?
Is it just resuming from suspend, or is there more you need to do?
Comment by Jan Alexander Steffens (heftig) - Wednesday, 07 August 2013, 17:49 GMT
Correction, it's not very reproducible. At first I had two suspends where it hit (using 3.10.3 and 3.10.4), and now three where it didn't (using 3.10.5).

I am running [testing] and gnome-shell, and pressing the power button to suspend.

Ohh, looking at the 3.10.5 changelog, there's f4332be drm/i915: fix long-standing SNB regression in power consumption after resume v2.

Could be fixed indeed. :)
Comment by David Wilkins (dwilkins) - Wednesday, 07 August 2013, 17:59 GMT
I'm a Fedora user, so this may or may not be applicable, but I'll describe what happens to me...

Generally, when resuming after a suspend, the graphics card "goes rogue" with respect to RC6+ or RC6++, as I mentioned above. It may be due to resuming with a power status *different* from when the suspend happened (on AC when suspending, on battery when resuming). I wrote a script to tell me whether it's the graphics card or not. Generally a keyboard suspend/resume cycle will bring things back into order.

the script (debugi915.sh) that I've attached is run like:

sudo ./debugi915.sh

This script prints the same information (cat /sys/kernel/debug/dri/0/i915_drpc_info) on 1 second intervals and the output looks something like this:

# ------------------------- Wed Aug 7 12:03:12 CDT 2013
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: RC6
Core Power Down: no
RC6 "Locked to RPn" residency since boot: 0
RC6 residency since boot: 2152244865 <<------ The number to watch....
RC6+ residency since boot: 0
RC6++ residency since boot: 0
RC6 voltage: 450mV
RC6+ voltage: 245mV
RC6++ voltage: 245mV
Temp: 63000


If the RC6 residency (the number to watch above) since boot number is less than 100,000 and doesn't change, then the graphics card will heat up your box like mad.

Can someone verify that this script works on Arch? I've submitted the bug report to Fedora (I'm running 19 and it happens frequently)


Edit:
I just did the testing I should have done before I posted this:
uname -a: Linux t420w 3.10.4-300.fc19.x86_64 #1 SMP Tue Jul 30 11:29:05 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Lenovo Thinkpad T420, 16gb, Type 4177-CTO

Only happens when resuming via opening the lid on battery. Doesn't happen via Fn-F4 or on AC
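The loop David describes can be approximated in a few lines. This is a hypothetical re-creation, not the attached debugi915.sh: it prints a timestamped header and the contents of `i915_drpc_info` once per second, then applies his rule of thumb (RC6 residency below 100,000 and unchanged since the previous sample) and prints `STUCK` or `ok`. It assumes root access and the file format shown above; the function names are hypothetical.

```shell
#!/bin/sh
# rc6_value FILE: print the "RC6 residency since boot" counter
rc6_value() {
    awk -F': *' '$1 == "RC6 residency since boot" { split($2, v, " "); print v[1]; exit }' "$1"
}

# rc6_flag PREV CUR: print STUCK if CUR is below 100000 and equal to PREV
rc6_flag() {
    if [ -n "$1" ] && [ "$2" = "$1" ] && [ "$2" -lt 100000 ]; then
        echo STUCK
    else
        echo ok
    fi
}

# watch_drpc FILE SAMPLES: dump FILE once per second, SAMPLES times
watch_drpc() {
    prev=""
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "# ------------------------- $(date)"
        cat "$1"
        cur=$(rc6_value "$1")
        echo "RC6 check: $(rc6_flag "$prev" "$cur")"
        prev=$cur
        i=$((i + 1))
        sleep 1
    done
}

# Usage (as root):
#   watch_drpc /sys/kernel/debug/dri/0/i915_drpc_info 60
```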
Comment by Andrew Cowie (afcowie) - Thursday, 08 August 2013, 00:49 GMT
As an aside, the pm_async=0 workaround mentioned in the kernel bug thread did *not* work here.

With 3.10.5 now on board, so far so good. One stuck-at-100% episode would kill that, of course. Will report back.

Incidentally, I've boosted to i915.i915_enable_rc6=7 and powertop is indeed reporting the chip in RC6pp.

AfC
Comment by Bryan (bryan) - Friday, 16 August 2013, 07:29 GMT
David, your debugi915.sh script works on Arch too. I've been suspending/resuming with the 3.10.x series, with or without AC, tens of times and haven't hit this bug once yet. I'm now using kernel 3.10.6.
Comment by David Wilkins (dwilkins) - Friday, 06 September 2013, 13:45 GMT
I've still had problems, mainly occasionally the temp would rise and require a reboot to fix it. I suspected that the problem arose when I used GoogleTalk to make video and audio phone calls (GoogleTalk Plugin). Also, WebGL never seemed to work right for me. I did this yesterday and haven't had a problem since, and WebGL in Chrome is happy. Firefox was always cool with WebGL.

Disclaimer: I'm on Fedora
Props to the Arch Wiki for having the answer that works for all of us.

Link: https://wiki.archlinux.org/index.php/Intel_Graphics#Choose_acceleration_method

I added the attached file at /etc/X11/xorg.conf.d/20-intel.conf
Comment by Tobias Powalowski (tpowa) - Thursday, 10 October 2013, 09:51 GMT
Please take this issue to upstream developers, I cannot do anything.
