FS#4192 - latest kernel26s cause random lockups with Athlon64 X2 and nforce4 chipset
Details
I've a dual-core Athlon 64 X2 box (my dev box) which is experiencing random lockups with kernels later than kernel26 2.6.15.2-2. (tpowa, this includes the test 2.6.16-pre6 kernel you built.)
While generally random, the lockups seem to occur when some significant I/O (disk or network) is occurring. Could it be some patch we started including? I just wanted to get this bug into the collective consciousness so we can all chip away at it.
This task depends upon
Closed by Tobias Powalowski (tpowa)
Friday, 31 March 2006, 15:15 GMT
Reason for closing: Not a bug
Additional comments about closing: bad memory caused the trouble
seems an upstream issue then in .16 kernels hope they fix it soon.
but this is already in rc6 so perhaps it was not enough
This will allow us to narrow it down as between patches added to the 2.6.15.x series and actual upstream stuff in 2.6.15.6.
If this build is stable, I'll ask you to tell me which of the patches I removed came from upstream and which not. Then I can try removing any of the non-upstream ones.
If this build is NOT stable, then it will have had to be something introduced in 2.6.15.6, which would be weird because those guys are pretty conservative... but you never know.
This seems to indicate that it must have something to do with something introduced in 2.6.15.{4,5,6}.
I've retreated to 2.6.15.2-2 to re-verify that I have no problems using it.
It will sometimes freeze your X, but the mouse cursor remains movable. As far as I know, the reason behind this bug hasn't been found yet.
I've seen several notes of people with problems using certain drives with nforce4 chipsets. Bios/Firmware flashes might be a good place to start.
Hylke, I wasn't aware of the problems with the closed source nvidia driver, which I AM using. I don't need it anymore to use DVI output, do I? I think xorg got smarter about that. If they did, I don't really need the increased performance of the closed source driver and should probably just switch to the standard xorg "nv" one. That said, my problem actually freezes the mouse cursor too, so it sounds like it isn't this particular bug.
Dale, this hadn't occurred to me at all -- don't know why it didn't! I'll try flashing BIOS tonight.
- P
Assuming no joy with the BIOS solution, I'm going to compile my own 2.6.15.5 tonight and work backward from there until I get something stable.
- P
So I've installed the testing kernel 2.6.16-3 and also done a BIOS upgrade to the most current.
Things seem happy. We'll see what happens.
- P
But so far, so good.
Thanks all for the help!
On my old computer, a P2 450 MHz, I have had to add the boot parameter acpi=force for over a year now.
But I do not feel like experimenting with this one.
Regards,
linfan
Let's leave this open longer; I suspect it isn't solved quite yet.
I've tried switching graphics drivers, but it doesn't seem to help (using vesa instead of nvidia driver). I guess I'll fool around with switching off the forcedeth driver next.
I'm running on an Athlon dual-core X2 3800. Maybe SMP is to blame somehow in combination with some driver?
Anyone know if there's a quick way to turn off one core on boot? I'd like to find out if it's a unique problem with SMP or a general symptom.
Option "RenderAccel" "false"
in nvidia section
I tried running without X for a while (a throwback to the old terminal days). Then I thought, let me try to do something vicious to crash the thing -- so I tried building the eclipse package. Bingo, that did it, and seems to do it reliably.
Now I can see what happens when it crashes, and it's not pretty:
CPU 0: Machine Check Exception: 0000000000000004
Bank 2: d000400000000863
Bank 4: f603200100000813 at 000000006dcde860
Kernel panic - not syncing: CPU context corrupt
Obviously, I had to copy that painstakingly by hand.
I'm going to try using the utility that's floating around to find out what this error corresponds to.
Does make me wonder if this could have something to do with using ECC RAM. I've got 2 1GB sticks of unbuffered ECC RAM in this puppy, and it's recognized and supported by the motherboard in ECC mode. I may try turning that off, too.
So it looks like we can rule out X or any X driver as the direct culprit.
Also, I tried booting noapic, same result. Also tried booting ide=nodma, due to some other posts I came across; same result.
I copied the errors I'm getting into the "mcedump.txt" file on my other box (mythic) to do this analysis.
As I tried to create a set of successive error reports to think about, the two attempts to build eclipse after these two simply resulted in a reboot with no error message thrown. CPU overheating? Why now all of a sudden, when it was fine before and nothing apparent has changed?
[pjmattal@mythic ~]$ more mcedump.txt
CPU 0: Machine Check Exception: 0000000000000004
Bank 2: d000400000000863
Bank 4: f603200100000813 at 000000006dcde860
Kernel panic - not syncing: CPU context corrupt
CPU:1 Machine Check Exception: 0000000000000004
Bank 2: f200200000000863
Kernel panic - not syncing: CPU context corrupt
[pjmattal@mythic ~]$ ./parsemce -f mcedump.txt
CPU 0
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(4): f603200100000813 @ 6dcde860
External tag parity error
Uncorrectable ECC error
CPU state corrupt. Restart not possible
Address in addr register valid
Error enabled in control register
Error not corrected.
Error overflow
Bus and interconnect error
Participation: Local processor originated request
Timeout: Request did not timeout
Request: Generic error
Transaction type : Instruction
Memory/IO : Other
CPU 1
Status: (4) Machine Check in progress.
Restart IP invalid.
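For anyone who wants to cross-check the bank values above by hand, here is a rough Python sketch that decodes only the top flag bits of each status word. The flag names for bits 63 to 57 follow the usual x86 machine-check status layout; treating bits 46 and 45 as corrected/uncorrected ECC is an assumption based on the AMD K8 layout, and the function name `decode_status` is made up for this example. It does not try to interpret the low 16-bit error code.

```python
# Flag bits at the top of an x86 machine-check bank status word.
STATUS_BITS = [
    (63, "VAL"),    # status word is valid
    (62, "OVER"),   # error overflow (an earlier error was overwritten)
    (61, "UC"),     # error was not corrected
    (60, "EN"),     # error reporting enabled in the control register
    (59, "MISCV"),  # misc register holds valid data
    (58, "ADDRV"),  # address register holds a valid address
    (57, "PCC"),    # processor context corrupt
]

# Assumption: AMD K8-specific ECC bits, not taken from the thread.
K8_ECC_BITS = [
    (46, "CECC"),   # correctable ECC error
    (45, "UECC"),   # uncorrectable ECC error
]


def decode_status(hex_status):
    """Return the names of the flag bits set in a bank status hex string."""
    value = int(hex_status, 16)
    return [name for bit, name in STATUS_BITS + K8_ECC_BITS
            if (value >> bit) & 1]


# The three bank values copied by hand in this thread.
for bank in ("d000400000000863", "f603200100000813", "f200200000000863"):
    print(bank, decode_status(bank))
```

Under those assumptions, the Bank 4 value has UC, PCC, ADDRV and UECC set, which lines up with the "Uncorrectable ECC error" and "CPU state corrupt" lines in the tool output above.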
By 10% through the first pass, I'm seeing 300 errors. All seem to be around 1756.7 MB. Do we think this means only my second stick of RAM is bad?
I will try pulling the second stick tonight and see what happens.. must go to work now. Thank you ALL for being the best help anyone's ever had debugging anything.. ever.
I could imagine life without Arch (if trying very hard) but not without the Arch community.
- P
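On the question of which stick the errors around 1756.7 MB belong to, here is a back-of-the-envelope sketch. It assumes the two 1 GB sticks are mapped one after the other in the physical address space, with no dual-channel interleaving; that is only an assumption, and the helper name `stick_for` is made up for this example.

```python
STICK_MB = 1024  # two 1 GB sticks, as described in the thread


def stick_for(address_mb, stick_mb=STICK_MB):
    """Return the 1-based stick number for an address given in MB.

    Assumes a plain linear layout: stick 1 covers 0..1023 MB,
    stick 2 covers 1024..2047 MB. Interleaved setups break this.
    """
    return int(address_mb // stick_mb) + 1


print(stick_for(1756.7))
```

With a linear layout, 1756.7 MB falls in the second stick. If the board interleaves the two channels, the address alone does not say which stick is at fault, so pulling one stick and re-running the test is still the reliable check.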
i have it installed on all machines to be able to do this quick check :)