PE 600SC crashing/rebooting after power problems

(OP)
We have a Dell PowerEdge 600SC running Small Business Server 2000 SP4 with a CERC ATA100/4ch RAID controller and 4 attached drives (RAID 5 + hot spare).

After almost a week of spontaneous lockups, shutdowns, and pulling out of hair, we determined that we had a defective UPS that apparently wasn't delivering sufficient voltage to the server. It eventually got to the point that I couldn't get a problem-free startup!

I won't go into detail about the troubleshooting process that led us there (unless someone thinks it relevant), but I had lots of duplicate hardware, and now the only things that have not been replaced are the system RAM (I'm rotating out sticks each time it crashes to see if that has any effect) and the drives in the RAID array.

After plugging into a different UPS, the server will start as normal, and run for hours (3 - 12 hours), but will eventually and abruptly crash... of course when no one is there to read any blue-screen messages. According to the DrWatson log, there is always an Access Violation (c0000005), but NOT always in the same process.

Does anyone have any insight on this, or suggestions for my next step?  

RE: PE 600SC crashing/rebooting after power problems

(OP)
OK, more info...
After wading through the Dr. Watson log, it turns out that I was overly optimistic. Only in one instance does the Access Violation error coincide with the time of the "unexpected shutdown" listed in the system log! So, I don't even have that little bit of information.
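For anyone doing the same cross-check by hand, here is a minimal sketch of lining up Dr. Watson fault times against the "unexpected shutdown" times from the system log. The timestamps, the `correlate` helper, and the five-minute window are all assumptions made for illustration; none of them come from the real logs.

```python
from datetime import datetime, timedelta

def correlate(crash_times, fault_times, window=timedelta(minutes=5)):
    """Map each crash time to the faults logged within `window` before it."""
    matches = {}
    for crash in crash_times:
        matches[crash] = [
            fault for fault in fault_times
            if timedelta(0) <= crash - fault <= window
        ]
    return matches

# Hypothetical timestamps, typed in by hand from the two logs.
fmt = "%Y-%m-%d %H:%M"
crashes = [datetime.strptime(s, fmt) for s in ("2005-01-10 02:14", "2005-01-10 15:40")]
faults = [datetime.strptime(s, fmt) for s in ("2005-01-10 02:12", "2005-01-10 09:30")]

for crash, hits in correlate(crashes, faults).items():
    print(crash, "->", len(hits), "fault(s) in the preceding window")
```

A crash with no fault in its window would point away from the application-level Access Violation as the trigger.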

It seems to me that this must be a hardware problem, because there isn't anything relevant in the event logs prior to the crash.

My fear is that the bad UPS ruined my spare hardware too, since I did all of the hardware replacements before we thought of the UPS :(  First instinct was that even though the UPS was replaced, it's still a power problem, so I put a brand new PSU in, but no change.

RE: PE 600SC crashing/rebooting after power problems

(OP)
Further...

Fortunately, I was at last able to get BSOD info from a couple of crashes. In each case, it was:

STOP: 0x00000077 (0xc0000185, 0xc0000185, 0x00000001, 0x0009d000)
KERNEL_STACK_INPAGE_ERROR

Although the BSOD said it was starting a dump of physical memory, there is no .dmp file to be found :(
The paging file IS large enough to handle the physical memory, but still no dump.

According to Microsoft KB article 228753, the second parameter in the STOP message is the "I/O Status Code", which translates to:

"0xC0000185, or STATUS_IO_DEVICE_ERROR: improper termination or defective cabling of SCSI-based devices, or two devices attempting to use the same IRQ."

Remember, this is an IDE RAID controller, which just APPEARS to Windows to be SCSI, so there is no cable termination to check, but I replaced all of the drive cables anyway.
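As a rough aid for reading the STOP line, the sketch below splits it into labeled fields. The thread only confirms that the second parameter is the I/O status code. The other three labels (status code, page file number, offset into the page file) are an assumption based on how Microsoft's bug check reference describes 0x77 when the first parameter is not 0, 1, or 2. The `decode_stop_77` helper is hypothetical, and the status table is deliberately partial.

```python
# Partial table of standard NTSTATUS codes; extend as needed.
NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",
    0xC000009C: "STATUS_DEVICE_DATA_ERROR",
    0xC000009D: "STATUS_DEVICE_NOT_CONNECTED",
    0xC000016A: "STATUS_DISK_OPERATION_FAILED",
    0xC0000185: "STATUS_IO_DEVICE_ERROR",
}

def decode_stop_77(line):
    """Split a 'STOP: 0x77 (p1, p2, p3, p4)' line into labeled fields."""
    code_text, params_text = line.split("(", 1)
    code = int(code_text.replace("STOP:", "").strip(), 16)
    params = [int(p.strip(), 16) for p in params_text.rstrip(") \n").split(",")]
    if code != 0x77 or len(params) != 4:
        raise ValueError("not a 0x77 STOP line with four parameters")
    status, io_status, page_file, offset = params
    return {
        # Assumed layout; only io_status is confirmed by the thread.
        "status": NTSTATUS.get(status, hex(status)),
        "io_status": NTSTATUS.get(io_status, hex(io_status)),
        "page_file_number": page_file,
        "page_file_offset": offset,
    }

info = decode_stop_77(
    "STOP: 0x00000077 (0xc0000185, 0xc0000185, 0x00000001, 0x0009d000)"
)
for field, value in info.items():
    print(field, "=", value)
```

If that layout holds, the third and fourth fields say which paging file the failed read came from and where in it, which may help decide which volume to check first.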

Going through the troubleshooting steps outlined in the MS document:
I scanned for boot sector virus with up-to-date McAfee
I reseated all components (except the processor)
I've run chkdsk /f /r on both virtual drives
I had one paging file on each virtual drive. I've removed the one on D:\. Tonight I will run another chkdsk on D:\, and then I'll reverse things and do the same on C:\. This should resolve any bad blocks in a paging file, shouldn't it?
I've got one more memory stick to swap out - so far no improvement.

I have yet to try disabling system caching in BIOS, but I don't hold out much hope, because why would a BIOS setting suddenly be causing errors if nothing else is defective?

So, according to the MS document, that brings us to a bad Mobo or drive controller in BOTH machines (assuming the memory and paging file stuff works out).

Does this seem likely? Does anyone have a best guess on which component to start hunting down?

Thanks...
 

RE: PE 600SC crashing/rebooting after power problems

(OP)
Well isn't this nifty!

Just started the chkdsk /f /r of the D:\ volume as described above, to see if there are any bad blocks where the pagefile was located... and the RAID controller starts beeping away - failed drive!

Fortunately I have a hot spare in the array, and it's rebuilding, but that doesn't mean there's no stress with chkdsk AND a drive recovery running at the same time!
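To show why a rebuild is heavy on the surviving drives, here is a minimal sketch of the XOR parity idea behind RAID 5. It assumes a single stripe with two data blocks and one parity block. How the CERC controller actually lays out and rotates its stripes is not known here, and the block contents are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One hypothetical stripe: two data blocks plus their parity block.
data_a = b"\x10\x20\x30\x40"
data_b = b"\x0f\x0e\x0d\x0c"
parity = xor_blocks([data_a, data_b])

# The drive holding data_b fails; rebuild its block from the survivors.
rebuilt = xor_blocks([data_a, parity])
print(rebuilt == data_b)
```

Every stripe has to be read from all surviving drives to regenerate the missing block onto the spare, so a full-surface chkdsk running at the same time competes for the same disks.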

I'll post again with results...

RE: PE 600SC crashing/rebooting after power problems

(OP)
It's Wednesday morning now, and so far, so good! For the first time in a week the server has been up for more than 24 hours without crashing.

The chkdsk results for the D:\ drive said:

"Correcting errors in the Volume Bitmap. Windows has made corrections to the file system", although it showed 0 kb in bad sectors.

The RAID controller has finished rebuilding the failed drive and shows the array as healthy. We'll see what happens.

What is puzzling me now is why the controller didn't fail the drive before this, and why it finally did so during the chkdsk scan. Does chkdsk skip the paging files like disk defragmenter does?

RE: PE 600SC crashing/rebooting after power problems

When a drive's electronics start to fail, the RAID adapter may not fail the drive at all, or may fail it only after a period of time, unlike disk surface errors, for which the adapter does fail the drive. A RAID adapter will fail a drive with electronic faults under certain conditions, but with intermittent failures or components drifting out of spec, it may not.
The worst-case scenario with failing drive electronics is that the offending drive can cause false failures of other drives in the array.

 

........................................
Chernobyl disaster..a must see pictorial
http://www.kiddofspeed.com/default.htm
