FS#4192 - latest kernel26s cause random lockups with Athlon64 X2 and nforce4 chipset

Attached to Project: Arch Linux
Opened by Paul Mattal (paul) - Saturday, 18 March 2006, 17:11 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To No-one
Architecture not specified
Severity Critical
Priority Normal
Reported Version 0.7.1 Noodle
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

I've a dual-core Athlon 64 X2 box (my dev box) which is experiencing random lockups with kernels later than kernel26 2.6.15.2-2. (tpowa, this includes the test 2.6.16-pre6 kernel you built)

While generally random, the lockups seem to occur when some significant I/O (disk or network) is occurring.

Could it be some patch we started including? I just wanted to get this bug into the collective consciousness so we can all chip away at it.

Closed by  Tobias Powalowski (tpowa)
Friday, 31 March 2006, 15:15 GMT
Reason for closing:  Not a bug
Additional comments about closing:  bad memory caused the trouble
Comment by Tobias Powalowski (tpowa) - Saturday, 18 March 2006, 18:51 GMT
Well, rc6 doesn't include so many patches because 2.6.15 has them as backports.
Seems to be an upstream issue in the .16 kernels, then; hope they fix it soon.
Comment by Tobias Powalowski (tpowa) - Saturday, 18 March 2006, 20:12 GMT
Comment by Paul Mattal (paul) - Saturday, 18 March 2006, 20:52 GMT
As a first test, I'm modifying the 2.6.15.6-2 PKGBUILD to use only the patches that were already applied to 2.6.12.2-2.

This will allow us to narrow it down as between patches added to the 2.6.15.x series and actual upstream stuff in 2.6.15.6.

If this build is stable, I'll ask you to tell me which of the patches I removed came from upstream and which not. Then I can try removing any of the non-upstream ones.

If this build is NOT stable, then it will have had to be something introduced in 2.6.15.6, which would be weird because those guys are pretty conservative.. but you never know.
Comment by Tobias Powalowski (tpowa) - Saturday, 18 March 2006, 20:56 GMT
our patches are only taken from upstream and the acpi patches, but the .16 kernel has no acpi patch anymore.
Comment by Paul Mattal (paul) - Sunday, 19 March 2006, 17:08 GMT
So here's the crazy news: My test kernel (2.6.15.6-2 but with only patches applied from 2.6.12.2-2 package) is NOT stable.

This seems to indicate that it must have something to do with something introduced in 2.6.15.{4,5,6}.

I've retreated to 2.6.15.2-2 to re-verify that I have no problems using it.
Comment by Hylke Witjens (moto-moi) - Monday, 20 March 2006, 18:50 GMT
You are aware there are some problems with the closed-source nvidia driver and x.org/XFree86?
It will sometimes freeze your X, but the mouse cursor remains movable. As far as I know, the reason behind this bug hasn't been found yet.
Comment by Dale Blount (dale) - Tuesday, 21 March 2006, 20:35 GMT
Paul,

I've seen several notes of people with problems using certain drives with nforce4 chipsets. Bios/Firmware flashes might be a good place to start.
Comment by Paul Mattal (paul) - Tuesday, 21 March 2006, 21:26 GMT
Thanks to both of you for your feedback.

Hylke, I wasn't aware of the problems with the closed source nvidia driver, which I AM using. I don't need it anymore to use DVI output, do I? I think xorg got smarter about that. If they did, I don't really need the increased performance of the closed source driver and should probably just switch to the standard xorg "nv" one. That said, my problem actually freezes the mouse cursor too, so it sounds like it isn't this particular bug.

Dale, this hadn't occurred to me at all -- don't know why it didn't! I'll try flashing BIOS tonight.

- P
Comment by Paul Mattal (paul) - Tuesday, 21 March 2006, 21:29 GMT
Incidentally, 2.6.15.2-2 has been running flawlessly, with the latest nvidia driver, since Sunday at noon. So it does really appear to be something introduced on the stable branch that is interfering. The good part about that is that it limits the possibilities drastically.

Assuming no joy with the BIOS solution, I'm going to compile my own 2.6.15.5 tonight and work backward from there until I get something stable.

- P
Comment by Paul Mattal (paul) - Wednesday, 22 March 2006, 05:13 GMT
In the ideal world, I'd have time to thoroughly pinpoint this bug. But I'm happy to sidestep it and not know exactly what it was, as long as it goes away. ;)

So I've installed the testing kernel 2.6.16-3 and also done a BIOS upgrade to the most current.

Things seem happy. We'll see what happens.

- P
Comment by Tobias Powalowski (tpowa) - Wednesday, 22 March 2006, 17:49 GMT
is this fixed now?
Comment by Paul Mattal (paul) - Wednesday, 22 March 2006, 18:12 GMT
I'd like to wait another day or two on this to make sure, and to do one more test to try to determine if it was my BIOS update or 2.6.16 that fixed my problem.

But so far, so good.

Thanks all for the help!
Comment by Börje Holmberg (linfan) - Thursday, 23 March 2006, 20:49 GMT
I ran into great problems with the 2.6.16 kernel. I read above that the acpi is disabled. My puter switches itself off randomly. I have now gone back to old kernel.

On my old puter, a P2 450 MHz, I have had to add the line acpi=force for over a year now.

But I do not feel like experimenting with this one.

Regards,

linfan
Comment by Börje Holmberg (linfan) - Saturday, 25 March 2006, 07:04 GMT
Just ignore my comment above. Sorry I posted it. Seems my puter is showing some signs of aging. I will get a new puter in a couple of days.

linfan
Comment by Paul Mattal (paul) - Sunday, 26 March 2006, 16:21 GMT
So far I've had exactly ONE crash with the 2.6.16 kernel, of the same type I had with the 2.6.15.6 -- seemingly less frequent.

Let's leave this open longer; I suspect it isn't solved quite yet.
Comment by Paul Mattal (paul) - Friday, 31 March 2006, 04:12 GMT
I'm definitely getting more crashes with 2.6.16.1-1. Same symptom: machine totally locks up.. mouse cursor freezes on the screen. It usually happens right after I've clicked something in my browser or mailer, causing me to wonder if it's something having to do with network or disk IO, or else something in X.

I've tried switching graphics drivers, but it doesn't seem to help (using vesa instead of nvidia driver). I guess I'll fool around with switching off the forcedeth driver next.

I'm running on an Athlon dual-core X2 3800. Maybe SMP is to blame somehow in combination with some driver?

Anyone know if there's a quick way to turn off one core on boot? I'd like to find out if it's a unique problem with SMP or a general symptom.
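
For reference, the Linux kernel accepts `maxcpus=N` and `nosmp` on its boot command line; booting with `maxcpus=1` (or `nosmp`) brings up only one core. The sketch below is just an illustrative helper for checking which limit a given command line requests. The function name and the sample command line are hypothetical, not taken from this report.

```python
def boot_cpu_limit(cmdline):
    """Return the CPU limit requested on a kernel command line, or None if none is set."""
    limit = None
    for token in cmdline.split():
        if token == "nosmp":
            # nosmp disables SMP entirely, leaving the boot CPU only.
            limit = 1
        elif token.startswith("maxcpus="):
            value = int(token.split("=", 1)[1])
            # maxcpus=0 behaves like nosmp, so the effective limit is still one CPU.
            limit = max(1, value)
    return limit


# Hypothetical command line; on a running system the real one is in /proc/cmdline.
print(boot_cpu_limit("root=/dev/sda3 ro maxcpus=1"))
```

If the lockups persist with a single core online, SMP is unlikely to be the cause.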
Comment by Tobias Powalowski (tpowa) - Friday, 31 March 2006, 05:08 GMT
If you use the nvidia drivers, try disabling RenderAccel; that causes such freezes as far as I know.
Option "RenderAccel" "false"
in nvidia section
Comment by Tobias Powalowski (tpowa) - Friday, 31 March 2006, 08:39 GMT
OK, I updated ACPI in 2.6.16.1-2; perhaps that helps on your system.
Comment by Paul Mattal (paul) - Friday, 31 March 2006, 14:05 GMT
New helpful info.

I tried running without X for awhile (a throwback to the old terminal days). Then I thought, let me try to do something vicious to crash the thing -- so I tried building the eclipse package. Bingo, that did it, and seems to do it reliably.

Now I can see what happens when it crashes, and it's not pretty:

CPU 0: Machine Check Exception: 0000000000000004
Bank 2: d000400000000863
Bank 4: f603200100000813 at 000000006dcde860
Kernel panic - not syncing: CPU context corrupt

Obviously, I had to copy that painstakingly by hand.
I'm going to try using the utility that's floating around to find out what this error corresponds to.

Does make me wonder if this could have something to do with using ECC RAM. I've got 2 1GB sticks of unbuffered ECC RAM in this puppy, and it's recognized and supported by the motherboard in ECC mode. I may try turning that off, too.

So it looks like we can rule out X or any X driver as the direct culprit.

Also, I tried booting noapic, same result. Also tried booting ide=nodma, due to some other posts I came across; same result.
Comment by Tobias Powalowski (tpowa) - Friday, 31 March 2006, 14:10 GMT
A broken power supply can also cause such strange effects.
Comment by Dale Blount (dale) - Friday, 31 March 2006, 14:13 GMT
I had a similar MCE on my amd64, turned out it was caused by a bad dvd-burner. 2.6.9 worked OK, but anything more wouldn't boot because of an MCE. Not saying yours is a dvd-burner problem, but I'd try booting without all the hardware you can live without hooked up.
Comment by Paul Mattal (paul) - Friday, 31 March 2006, 14:24 GMT
Here are my errors. I'm beginning to wonder if this isn't a hardware failure, RAM or CPU, manifesting itself in a strange way.

I copied the errors I'm getting into the "mcedump.txt" file on my other box (mythic) to do this analysis.

As I tried to create a set of successive error reports to think about, the two attempts to build eclipse after these two simply resulted in a reboot with no error message thrown. CPU overheating? Why now all of a sudden, when it was fine before and nothing apparent has changed?

[pjmattal@mythic ~]$ more mcedump.txt
CPU 0: Machine Check Exception: 0000000000000004
Bank 2: d000400000000863
Bank 4: f603200100000813 at 000000006dcde860
Kernel panic - not syncing: CPU context corrupt

CPU:1 Machine Check Exception: 0000000000000004
Bank 2: f200200000000863
Kernel panic - not syncing: CPU context corrupt
[pjmattal@mythic ~]$ ./parcemce -f mcedump.txt
CPU 0
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(4): f603200100000813 @ 6dcde860
External tag parity error
Uncorrectable ECC error
CPU state corrupt. Restart not possible
Address in addr register valid
Error enabled in control register
Error not corrected.
Error overflow
Bus and interconnect error
Participation: Local processor originated request
Timeout: Request did not timeout
Request: Generic error
Transaction type : Instruction
Memory/IO : Other
CPU 1
Status: (4) Machine Check in progress.
Restart IP invalid.
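
As a rough cross-check of the decoder output above, the top bits of each bank status value can be read by hand. The sketch below decodes only the seven architectural flag bits (63 down to 57) of an x86 machine-check status register; the function names are hypothetical, and the model-specific fields and the low 16-bit error code are deliberately left undecoded.

```python
# Architectural flag bits in the upper byte of an x86 MCi_STATUS value.
MCI_STATUS_FLAGS = [
    (63, "VAL", "register holds a valid error"),
    (62, "OVER", "error overflow"),
    (61, "UC", "error not corrected"),
    (60, "EN", "error enabled in control register"),
    (59, "MISCV", "MISC register valid"),
    (58, "ADDRV", "address register valid"),
    (57, "PCC", "processor context corrupt"),
]


def decode_status_flags(status):
    """Return the names of the flag bits set in a 64-bit bank status value."""
    return [name for bit, name, _ in MCI_STATUS_FLAGS if (status >> bit) & 1]


def parse_bank_line(line):
    """Parse 'Bank N: <hex status> [at <hex address>]' as printed in the panic text."""
    head, _, rest = line.partition(":")
    fields = rest.split()
    bank = int(head.split()[1])
    status = int(fields[0], 16)
    has_address = len(fields) >= 3 and fields[1] == "at"
    address = int(fields[2], 16) if has_address else None
    return bank, status, address


for line in ("Bank 2: d000400000000863",
             "Bank 4: f603200100000813 at 000000006dcde860"):
    bank, status, address = parse_bank_line(line)
    print(bank, decode_status_flags(status), hex(address) if address else None)
```

For the Bank 4 value this yields VAL, OVER, UC, EN, ADDRV and PCC, which lines up with the "Error overflow", "Error not corrected", "Address in addr register valid" and "CPU state corrupt" lines in the output above.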
Comment by Tobias Powalowski (tpowa) - Friday, 31 March 2006, 14:26 GMT
Hey Paul, haven't you run a memtest on your machine? pacman -Ss memtest
Comment by Paul Mattal (paul) - Friday, 31 March 2006, 14:37 GMT
And the prize goes to: tpowa! I've never had memtest give me a definitive answer on anything like it's doing for this. Great call.

By 10% through the first pass, I'm seeing 300 errors. All seem to be around 1756.7 MB. Do we think this means only my second stick of RAM is bad?

I will try pulling the second stick tonight and see what happens.. must go to work now. Thank you ALL for being the best help anyone's ever had debugging anything.. ever.

I could imagine life without Arch (if trying very hard) but not without the Arch community.

- P
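
On the second-stick question, a quick bit of arithmetic may help. The sketch below converts the address from the Bank 4 line of the earlier panic message into megabytes and maps an offset to a stick, ASSUMING the two 1 GB sticks are mapped linearly, one after the other, with no interleaving and no memory hole. That assumption may well not hold (dual-channel setups typically interleave), so treat the result as a hint rather than a diagnosis; the function names are hypothetical.

```python
MIB = 1024 * 1024
STICK_SIZE_MIB = 1024  # two 1 GB sticks, per the report
STICK_COUNT = 2


def address_to_mib(address):
    """Convert a physical byte address to an offset in MiB."""
    return address / MIB


def stick_for_offset(offset_mib, stick_size_mib=STICK_SIZE_MIB, stick_count=STICK_COUNT):
    """Return the 0-based stick index, ASSUMING linear, non-interleaved mapping."""
    index = int(offset_mib // stick_size_mib)
    if not 0 <= index < stick_count:
        raise ValueError("offset lies outside the installed memory")
    return index


mce_mib = address_to_mib(0x6DCDE860)  # address from the Bank 4 line
print(round(mce_mib, 1), stick_for_offset(mce_mib))
print(stick_for_offset(1756.7))       # region reported by memtest
```

The machine-check address works out to roughly 1756.9 MB, right next to the region memtest is flagging, and under the linear-mapping assumption both fall in the second gigabyte. Pulling one stick and rerunning memtest, as planned, is the real test.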
Comment by Tobias Powalowski (tpowa) - Friday, 31 March 2006, 15:14 GMT
That was exactly the reason for adding memtest to the repos.
I have it installed on all machines to be able to do this quick check :)
