hardware joy
Dec. 20th, 2004 11:50 pm
Thus far I've been unsuccessful in getting the new machine to talk to the digital camera. I'm awaiting a response from tech support for the camera. Aside from that, the new machine is behaving splendidly so far.
My old machine (called, for the nonce, Bouncy) is now failing in the exact same way its predecessor (Doornail) did: after increasingly shorter periods of uptime, it reboots and, more often than not, produces a blue screen. Attempts to reboot at that point always fail; turning the machine off for a couple hours and then trying again gets a short-lived boot. This says "overheating" to me, but it's not appreciably quieter than normal, so I'm guessing the fan is still running. All the usual precautions have been in place all along -- UPS, antivirus, automatic updates (OS and virus), safe computing practices... I don't get it. If I knew what I was looking for I'd pop the cases and look around. But I'm pretty clueless about hardware. (And we just had Bouncy open a couple months ago to poke a graphics card, so I know it's not full of dustbunnies. I don't think Doornail was the last time I powered it up, either.)
The questions in my mind right now are: what happened to Doornail and Bouncy, can it be reversed, and what do I do to prevent it from happening to my new machine?
Could I have a faulty UPS? Could a faulty UPS do damage consistent with these symptoms?
(Oh, and just to clarify: this failure pattern is not the only reason I replaced Bouncy; it's just the final step in a series of annoying failures. The CD burner hasn't worked in months... stuff like that. If it were just a hard drive, that'd be different.)
Questions
Date: 2004-12-21 06:48 am (UTC)
When the system is cold (literally), how long until it fails, and then how long until it fails the second and third times?
There may be more than one fan, especially if the processor runs faster than 1 GHz. You'd have to open the machine to know for sure. The loud fan is more likely to be the one in the power supply, and that can mask the sound of a CPU or case fan. If any of the fans goes bad, you'll get temperature problems. Even if there are no dustbunnies, dust can accumulate on fan blades and cause problems.
Under normal conditions, it would be unusual for two successive machines to fail in the same way, so I think we need to look for "abnormalities" especially of the environmental kind.
What is the normal temp/humidity of the room with the computer? The room could be too dry, causing static discharge problems internally with the fan(s). When this problem gets bad, you can smell the ozone or a vague burning smell, usually from the dust, even if there aren't bunnies.
How much does the room vibrate? Do passing trucks make the room or table shake? If the components get just barely unseated they can exhibit odd symptoms, but these are usually permanent failures.
Re: Questions
Date: 2004-12-21 06:02 pm (UTC)
I'll transcribe the relevant blue-screen text tonight when I do the timing experiments. I suspect the resolution on the digital picture I took won't be good enough -- but since I can't get the new machine to talk to the camera, that doesn't help anyway. :-)
OS is Win2k Professional (with updates).
I hadn't considered the possibility of multiple fans -- thanks!
Environment: normal temperature (except in summer) is between about 58 and 70 degrees. In the summer I often use a window AC when working in the office; otherwise it's the ambient temperature. I'm not sure how to measure humidity; I'm doing nothing special there, so it's possible that in the winter it's too low and your speculations about static are correct. (I've never noticed a funny smell.) We have radiator heating (not forced air) and the room has an area rug. The computer sits on a desk, not on the rug. I've never noticed vibration problems; it's a reasonably sturdy IKEA desk. (So it's a top plus legs, not a full-blown desk, but it's against an inner wall and it's screwed together pretty tightly.) I leave the computer on all the time, so I rarely actually touch it except for access to the CD/DVD drives. I keep the blinds in the room down all the time, so direct sunlight isn't relevant.
More data tonight!
Data
Date: 2004-12-22 03:53 am (UTC)
9:30 turn on power switch; power button unresponsive
9:35 power button unresponsive; cycle switch; hear very faint sound
      vaguely reminiscent of disk spin-down
9:36 boot; fail to catch it before it starts disk integrity check
9:42 switches to boot screen ("applying security policy");
      awaiting login prompt (mouse unresponsive)
9:47 login prompt appears; keyboard unresponsive
9:49 after triple-checking all connections, push reset button
9:51 normal boot; everything responsive
      begin copying batch of files to USB hard drive
9:55 machine spontaneously reboots
9:56 blue screen:
      "*** STOP (hex numbers here)
      KMODE_EXCEPTION_NOT_HANDLED
      Address (blah blah) - ntoskrnl.exe
      If this is the first time... restart your computer. If this screen
      appears again, follow these steps:
      [adequate disk space?]
      [if driver named in stop message, disable it]
      Try changing video adapters.
      Check with your hardware vendor for BIOS updates. Disable BIOS
      memory options such as caching or shadowing. If you need to use
      Safe Mode [instructions to get there]
      Refer to Getting Started manual."
      Commentary: the 40G hard drive has >10G free. No driver named
      in stop message. Haven't touched BIOS since getting the machine;
      not gonna start now without better clues.
10:02 unplug USB drive; reset button
10:03 normal boot; log in; browse files I was attempting to copy (triage)
10:06 plug USB drive into new machine; explore (no disk issues)
10:14 conclude that I've got everything I cared about from the attempts
      to copy to USB drive; allow Bouncy to sit idle
10:30 still sitting there
10:41 notice blue screen (did not hear a reboot first); reset button
10:43 blue screen during boot; reset
10:45 blue screen during boot; power button no effect; power off (switch)
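For the "how long until it fails" question, the two successful boots in the log above can be turned into uptimes with a little arithmetic. This is only a rough sketch: the times are copied from the log, the pairing of each boot with its failure is my reading of it, and the second interval is an upper bound because the blue screen was noticed rather than heard. The names `RUNS` and `minutes_between` are made up for the illustration.

```python
from datetime import datetime

# (boot time, failure time) pairs read off the log above (evening hours).
# Pairing boots with failures is an interpretation of the log, not new data.
RUNS = [
    ("9:51", "9:55"),    # normal boot -> spontaneous reboot while copying to USB
    ("10:03", "10:41"),  # normal boot -> blue screen noticed (upper bound)
]

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day H:MM clock times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

uptimes = [minutes_between(start, end) for start, end in RUNS]
for (start, end), minutes in zip(RUNS, uptimes):
    print(f"boot {start} -> failure {end}: {minutes} min")
```

With only two runs this says little on its own; the point is to make later timing experiments easy to compare.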
Re: Data
Date: 2004-12-22 05:19 am (UTC)
That string of hex numbers really means something. You can actually search the MS database for them, and frequently get useful info. Each blue screen is likely to have similar numbers if not actually the same. If they're radically different each time it usually points to a driver issue. Was this more noticeable after an update or program installation?
Oh, and what brand/model is it?
This still sounds like an overheating issue, but we need to narrow it down.
At first glance it might be a memory problem, but I need more proof. The memory chips are rather sensitive to the environment and static, and you don't always get a failure message during the boot process - sometimes you get spontaneous reboots and blue screens.
Re: Data
Date: 2004-12-22 01:39 pm (UTC)
Addendum: at 11:40 I restarted the machine, expecting it to fail quickly. I did nothing other than boot it, wondering if the heavy banging on the disk before had been related. Half an hour later I shut it down normally and went to bed.
Re: Data
Date: 2004-12-22 03:41 pm (UTC)
In Win2K the hex numbers are an error (STOP) code (the name in capital letters is the error class) plus various related information. Sometimes the extra numbers are memory locations, pointers, instructions, etc.; which ones depends on the actual STOP code. Once you search on the STOP code, you can get more info about the other numbers. 99 times out of 100, the secondary numbers don't provide additional useful info for troubleshooting.
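To make that split concrete, here is a minimal sketch that pulls the STOP code and the numbers in parentheses out of a blue-screen line. The sample line is one of the blue screens quoted elsewhere in this thread; the regular expressions, the `parse_stop_line` helper, and the two-entry name table are assumptions for illustration, not any Microsoft tool or a complete list of codes.

```python
import re

# Names taken from the two blue screens quoted in this thread; the table is
# illustrative only, not a complete list of STOP codes.
KNOWN_STOP_NAMES = {
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
}

# "STOP", optional colon, the code, then the parenthesised parameter list.
STOP_LINE = re.compile(r"STOP:?\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")
HEX_WORD = re.compile(r"0x[0-9A-Fa-f]+")

def parse_stop_line(line):
    """Return (stop_code, [parameters]), or None if the line does not match."""
    match = STOP_LINE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(word, 16) for word in HEX_WORD.findall(match.group(2))]
    return code, params

sample = "*** STOP: 0x0000001E (0xC0000005, 0x804698C8, 0x00000000, 0x10818D70)"
code, params = parse_stop_line(sample)
print(f"0x{code:08X}", KNOWN_STOP_NAMES.get(code, "unknown"))
print([f"0x{p:08X}" for p in params])
```

The first value is the one worth searching on; the parameter list only becomes meaningful once you know which STOP code it belongs to.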
Re: Data
Date: 2004-12-24 03:27 am (UTC)
Tonight, I turned the machine on before leaving the house for three hours, and I came back to a blue screen. The hex numbers are not the same as before; I would have remembered these ones. Tonight I got:
*** STOP: 0x0000007F (0x00000000 [four of those])
UNEXPECTED_KERNEL_MODE_TRAP
And then straight into the "if this is the first time..." boilerplate. Not much in the way of useful diagnostics there.
The machine is an "AOPEN mid-tower KF45A 300W" (no, I'd never heard of Aopen either). The case also has an AMD sticker on it (CPU???). My paperwork says it was purchased in June 2002. (I was misremembering it as being newer.) I bought it at CompUSA but not quite off the shelf; I think the delay was due to my desire for Windows 2000 Professional instead of XP or ME or whatever they wanted to put on it by default. I think it's otherwise a normal machine of its era. Oh, 256meg of memory.
I tried rebooting from the blue screen and just got a blank screen after the normal boot text that scrolls by quickly (so it never got to the Windows splash screen). Ten minutes later it booted normally, and I'm now waiting for another blue screen.
Here we go!
Date: 2004-12-24 03:52 am (UTC)
*** STOP: 0x0000001E (0xC0000005, 0x804698C8, 0x00000000, 0x10818D70)
KMODE_EXCEPTION_NOT_HANDLED
*** Address 804698C8 base at 80400000, DateStamp 41773335 - ntoskrnl.exe
Mean anything to you?
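A couple of things can be read off those numbers with simple arithmetic. The sketch below uses only the values quoted above. The interpretation is based on my understanding of Microsoft's description of STOP 0x1E (first parameter is the exception code, second is the address where the exception occurred), so treat it as an assumption; 0xC0000005 is the standard Windows access-violation status. The helper name `offset_in_image` is made up for the example.

```python
# Values copied from the blue screen quoted above.
STOP_CODE = 0x0000001E
PARAMS = (0xC0000005, 0x804698C8, 0x00000000, 0x10818D70)
FAULT_ADDRESS = 0x804698C8   # from the "Address ... base at ..." line
IMAGE_BASE = 0x80400000      # load address reported for ntoskrnl.exe

# Standard Windows status code for an access violation.
STATUS_ACCESS_VIOLATION = 0xC0000005

def offset_in_image(address: int, base: int) -> int:
    """Distance of an address from the start of the image it falls in."""
    if address < base:
        raise ValueError("address lies below the image base")
    return address - base

offset = offset_in_image(FAULT_ADDRESS, IMAGE_BASE)
print(f"exception is access violation: {PARAMS[0] == STATUS_ACCESS_VIOLATION}")
print(f"parameter 2 matches Address line: {PARAMS[1] == FAULT_ADDRESS}")
print(f"offset into ntoskrnl.exe: {hex(offset)}")
```

So the fault is reported inside the kernel image itself rather than in a named third-party driver, which fits the earlier note that no driver was named, but it does not by itself distinguish bad memory from overheating.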
Re: Here we go!
Date: 2004-12-24 07:02 am (UTC)
Re: Here we go!
Date: 2004-12-24 02:23 pm (UTC)
Re: Here we go!
Date: 2004-12-25 03:51 am (UTC)
I asked this only because a number of Microsoft's internal documents mention the STOP error can occur if a certain patch is not applied. But, like always, Microsoft doesn't claim this is a 100% bandaid.