
failed controller w/ raid manager

Status
Not open for further replies.

ponetguy2 (MIS) · Joined Aug 28, 2002 · 442 messages · US
Hello Gurus,

I have two A1000s connected to the same controller. I know this is
not a good setup, but I had no choice. That controller has gone bad.
We were able to install another controller, and we are going to move the
A1000s to the new dual-port controller. Am I going to lose my LUN
configuration if I move the A1000s to the new controller? Is there a
procedure for this? Or am I out of luck, and will I lose all my config
and data if I move the disk arrays to the new controller?

Solaris 9 w/ RAID Manager 6.2.

Please help.
 
ponetguy2;

Is the new controller the same part # as the old one and did you put it in the same slot that the original controller was in?

CA
 
Ponetguy2;

Let me start off by saying that I have never tried moving a configured A1000 from one controller to another, so I can't tell you exactly what will happen.

What I can say is that the other controller will give the A1000 another controller number, so if you had c1tXdX before, you will end up with, say, c2tXdX.

What I would do is move the new controller to the slot the original controller was in (unless you have a reason for not wanting to do this). By putting the new controller in the original slot you will maintain the device paths and controller number seen in format, so by your statement above that the controllers are the same model, nothing will change. If you changed the type of controller and tried to put it in the original slot, then the controller numbers would change.

On a test box if you have one, you could try attaching an A1000 set up a lun, this will give you a cXtXdX. Then edit /dev/dsk and /dev/rdsk and remove the cXtXdX entries which point to your A1000. Move the controller to another slot and boot -r and see if it maintains your luns.
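The cleanup step in that experiment can be sketched as a small script. This is a hedged sketch: the `clean_links` function, its arguments, and the glob pattern are assumptions; on a real system you would call it as `clean_links /dev c1` (as root) with the old controller number taken from format(1M).

```shell
#!/bin/sh
# Sketch of the test-box experiment above. The function removes the stale
# /dev links for one controller; the name and arguments are illustrative
# (on a real Solaris host: clean_links /dev c1, run as root).
clean_links() {
    devroot=$1; ctrl=$2
    # delete the dsk/rdsk entries still pointing at the old controller
    rm -f "$devroot"/dsk/"$ctrl"t*d*s* "$devroot"/rdsk/"$ctrl"t*d*s*
}

# After cleaning, move the card to the other slot, do a reconfiguration
# boot (ok> boot -r), and check in format whether the LUNs reappear.
```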

Maybe some of the other Techs/Admins have some input on this suggestion!!

Thanks

CA
 
hello all, i switched one of the a1000s to the new controller and raid manager still can't see it.

since this a1000 is mirrored on another a1000 on a different controller port (dual port scsi controller), can't i just wipe this one out and re-create? if so, how can i wipe out (# raidutil -c c4t5d0 -X)?

see output:

# healthck -a

Health Check Summary Information

Test_001: Failed Module

# raidutil -c c4t5d0 -i
Reading Physical page (2A) Failed:** Check Condition **
SENSE Data:
7000040000000098 0000000040810000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
Sense Key: 04
ASC: 40
ASCQ: 81

raidutil failed!

# drivutil -i c4t5d0

Drive Information for Test_001


Location Capacity Status Vendor Product Firmware Serial
(MB) ID Version Number

drivutil succeeded!
 
ponetguy2;

When you moved the A1000 to the new controller, did you do a reconfiguration reboot (reboot -- -r, or touch /reconfigure followed by init 6), or did you run devfsadm -c disk with the OS still running?

Run lad and see what LUNs you have configured.

The A1000 you moved will need to have the LUN recreated under RM6.
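The rescan options above, sketched as commands. RUN=echo makes this a harmless preview; nothing executes until you clear RUN on a live Solaris/RM6 host.

```shell
#!/bin/sh
# Rescan options as commands. RUN=echo previews them; set RUN= (empty)
# on a real Solaris/RM6 host to actually execute.
RUN=${RUN:-echo}

$RUN devfsadm -c disk    # rebuild disk device links on a running system
$RUN lad                 # RM6: list the arrays/LUNs the host can see

# Or force a full reconfiguration boot instead:
#   reboot -- -r
#   touch /reconfigure && init 6
```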

Thanks CA
 
first, i did a probe-scsi-all from ok prompt and it was able to find both arrays from different controllers. then i did a boot -r.
 
ponetguy2;

Did RM6 show the controller on bootup?
Check /var/adm/messages or dmesg to see if RM6 sees the controller for the A1000 when it runs nvutil at boot time.

Or do a man on nvutil for the options, but from what I remember you can run nvutil -vf to see if both A1000 controllers are seen.

or

Does lad show either of the LUNs you originally had? This will also show the A1000 controller id, I believe.

Sometimes the A1000 can get all messed up and you need to follow a certain procedure to get it back; the procedure will completely destroy your original configuration on the A1000. BOOO!

Note that this destroys the data on the array, but that's probably already happened if you're at this point.

Power down the array
Power down the host
Remove the battery for at least 60 sec
Remove all the disks except for the far left (2,0)
Plug battery back in
Power up array
Power up host
This is the state in which the A1000 will rebuild its default configuration, which is a single LUN 0 (10MB, I think). I was also told to do this procedure with more than one disk, but not to use any of the 3 leftmost disks, as they are where the LUN info is stored. Grab, say, disk 1,4 and use it the second time around. Then I'd say go for a controller replacement. Make sure that they bring one out with the latest firmware onboard; it's a different part number than the 0205 and other earlier versions.

I pulled the above procedure off of a link;


When I worked for SunREMAN we had to do this in order to rework the A1000 for resale.

Thanks

CA
 
thank you, i'll try your suggestion. I really appreciate your help.
 

Here is the output from the nvutil command. What does this mean?:
"An internal processing error occurred on controller 1T84630200"



# nvutil -vf
The NVSRAM settings of controller c3t5d0(1T94311589) are correct.

An internal processing error occurred on controller 1T84630200.

nvutil command failed.
 
Does "An internal processing error occurred on controller 1T84630200" error mean that the controller is still good and the a1000 array has gone bad?
 
ponetguy2;

I could not find too much on that specific error, but this is the reason you are unable to see the device. If you had the GUI (which from past posts I know you don't) you would most likely see the controller marked as dead under the status or configuration button, from what I remember. I can't think of a command line option to show that status.

One doc I found talked about people getting this error with older versions of RM6.

Do you know what version of RM6 you have?


Thanks

CA
 
RM 6.2.

Do you think the procedure you posted will fix this problem?
 
ponetguy2;

1) I am 99.9% sure that you will be able to get the A1000 back, but it will be set to the minimum config and all info will be gone. Sometimes these units can be a real pain in the butt. The procedure is the same one I used to use. There is always a possibility you have bad hardware, a drive or the controller, but I doubt it.

2) If you have a contract with Sun you may want to give them a call to see if they have any information on that error. I will tell you this much: Sun has many internal docs that they do not release on the website, even with a contract.

3) You had moved this unit from the other controller. What you could do is try to reattach this unit to the original controller and see if it comes back online. This suggestion is based on another doc I looked at on SunSolve: a customer had a dual boot configuration of Solaris 6 and Solaris 7 where, when they booted Solaris 6, the A1000s came up fine; if they booted Solaris 7 they would get the error you are seeing. If they then booted Solaris 6 again, the units came up fine.

So based on that doc, I believe the problem lies with the A1000 having an original LUN configuration (still on the drives) based on, say, c3t6d0; since you moved the unit to another controller, Solaris is trying to target it as, say, c4tXdX, and this totally confuses RM6.

4) If you run pkginfo -l and look for RM6, and it does not show 6.22.1, you are not running the latest version of RM6. There is always a possibility that an upgrade to the latest software would fix this issue, but in my personal opinion I doubt it.
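A small sketch of that version check. The RM6 package name SUNWosau is an assumption (a plain `pkginfo | grep -i raid` would show the real name on your system); the version string 6.22.1 is from the post above.

```shell
#!/bin/sh
# Hedged version check for the pkginfo suggestion above. SUNWosau is an
# assumed package name; adjust to whatever `pkginfo` shows for RM6.
rm6_version() {
    # pull the version string out of `pkginfo -l`-style output (file arg)
    awk '/VERSION:/ { print $2 }' "$1"
}

ver=$(pkginfo -l SUNWosau 2>/dev/null | awk '/VERSION:/ { print $2 }')
if [ "$ver" != "6.22.1" ]; then
    echo "RM6 version is '$ver', latest is 6.22.1 -- consider upgrading"
fi
```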



Thanks

CA
 
# raidutil -c c2t5d0 -i
Reading Physical page (2A) Failed:** Check Condition **
SENSE Data:
7000040000000098 0000000040810000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
Sense Key: 04
ASC: 40
ASCQ: 81

raidutil failed!



EXPLAINED:

ASC  ASCQ  Sense Key
40   NN    4 (or 6)
Diagnostic Failure On Component NN (0x80 - 0xFF)
The controller has detected the failure of an internal controller component. This failure may have
been detected during operation as well as during an on-board diagnostic routine. The values of NN
supported in this release are listed as follows:

> 80 - Processor RAM
> 81 - RAID buffer
> 82 - NVSRAM
> 83 - RAID Parity Assist (RPA) chip
> 84 - Battery-backed NVSRAM or clock failure
> 91 - Diagnostic self test failed non-data transfer components test (most likely controller cache holdup battery discharge)
> 92 - Diagnostic self test failed data transfer components test
> 93 - Diagnostic self test failed drive Read/Write Buffer data turnaround test
> 94 - Diagnostic self test failed drive Inquiry access test
> 95 - Diagnostic self test failed drive Read/Write data turnaround test
> 96 - Diagnostic self test failed drive self test

In a dual controller environment, the user should place this controller offline (hold in reset)
(unless the error indicates controller battery failure, in which case the user should wait
for the batteries to recharge). In single controller environments, the user should not use this
subsystem until the controller has been replaced.
 
No dice. I did as instructed and tried all of the disks on the a1000.

i'll try this procedure:

You can save yourself the time and energy of going through each of the 12 disks by doing the following:

1. Remove all disks from A1000.
2. Free up c0t1d0 on the server.
3. Place a disk from the A1000 into c0t1d0 on the server.
4. Run 'dd if=/dev/zero of=/dev/rdsk/c0t1d0s2 bs=10240'
5. Place disk in c0t1d0 back in A1000 in the leftmost bay.
6. reboot -- -r
7. Start rm6 and make sure you have a default config.
8. Plug drives in one at a time, waiting about 10 seconds between each.

This will clean up any corrupted RDAC and since there is no RDAC on the A1000 leftmost disk (the only disk available), one will be created.
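The destructive step (4) above can be sketched with a preview guard. RUN=echo keeps it harmless; set RUN empty only on the real host, with the A1000 drive actually sitting in c0t1d0. The drive path is the one from the procedure above.

```shell
#!/bin/sh
# Preview sketch of step 4 above. RUN=echo only prints the command; set
# RUN= (empty) to execute for real. Running the dd for real destroys
# everything on that disk.
RUN=${RUN:-echo}
DISK=${DISK:-/dev/rdsk/c0t1d0s2}

# Zero the slice so the old RDAC/A1000 label is gone and the drive can take
# a plain Sun label on the next format/label pass.
$RUN dd if=/dev/zero of="$DISK" bs=10240
```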

 
ponetguy2;

I found this on Sun website for future reference;

To decode sense codes that are reported from Raid Manager controlled devices, such as the a1000 and the a3x00 series, users should reference the file /usr/lib/osa/raidcode.txt on their system
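That lookup is easy to script. A minimal sketch, assuming nothing about the layout of raidcode.txt beyond it being plain text; the ASC value 40 is the one from the raidutil failure earlier in the thread.

```shell
#!/bin/sh
# Sketch of decoding a sense code against /usr/lib/osa/raidcode.txt (path
# from the post above). The file layout is not assumed; this just prints
# any lines mentioning the given ASC, with line numbers.
decode_asc() {
    # $1 = path to raidcode.txt, $2 = ASC value from the sense data
    grep -n "$2" "$1" 2>/dev/null || echo "no match for ASC $2 in $1"
}

# On a real RM6 host: decode_asc /usr/lib/osa/raidcode.txt 40
```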

Also the procedure you have sounds good, I have also done that.

Basically you are writing a new label to the drive, so instead of being seen as an A1000 drive it will have a normal Sun label on it. Once you have relabeled the drive, shut down the system, put the drive back into the A1000's leftmost slot, and boot the system back up; you will see a corrupt label message. Run format, choose the drive, and answer yes to label the drive. You should then be able to add the other drives back into the A1000 one at a time, waiting enough time for them to spin up. After I had added the drives in, I ran the raidutil option that resets the configuration on that A1000.

So good luck, let me know how things go.

Thanks

CA
 
no dice. a1000 is still dead.

looks like we will replace the a1000s with two d1000s. i'm not sure how to replicate the a1000s with the d1000s since i can not create luns with the d1000s. oh well.

thank you cndcadams for your help once again.
 
Use Solaris Volume Manager or Veritas Volume Manager to perform RAID/mirroring at the software layer and you should be able to imitate the A1000s quite easily.
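A hypothetical SVM sketch of that idea: mirror one disk in each D1000 in software. The device names (c1t0d0/c2t0d0), slice layout, and metadevice names are all assumptions; RUN=echo previews the commands so nothing runs until you clear it on the real host.

```shell
#!/bin/sh
# Hypothetical SVM mirror across one disk in each D1000. Device and
# metadevice names are assumptions. RUN=echo previews; set RUN= on the
# real host to execute.
RUN=${RUN:-echo}

# State database replicas (required once before any metadevices exist):
$RUN metadb -a -f -c 2 c1t0d0s7 c2t0d0s7

# Build two one-disk submirrors, mirror them, and put a filesystem on top:
$RUN metainit d11 1 1 c1t0d0s0
$RUN metainit d12 1 1 c2t0d0s0
$RUN metainit d10 -m d11        # one-way mirror on the first submirror
$RUN metattach d10 d12          # attach the second half; resync starts
$RUN newfs /dev/md/rdsk/d10
```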

Annihilannic.
 
ponetguy2;

Couple questions;

You followed the procedure that had you relabel a drive correct?

You removed the battery from the unit? Powered the unit on, waited thirty seconds to a minute, and then powered it off again? (Just leave the battery out of the unit until you get the A1000 back.)

When you inserted the drive back into the A1000 and booted, did you see the corrupt label message?

If you did not see it, check your /var/adm/messages and messages.0 to make sure you did not miss it.

What did nvutil -vf show? Again check messages file to see if it ever came up as correct.

If corrupt label never appeared and nvutil did not show the controller as correct, you did not fix the issue. I will again say these things can be a real pain in the butt. Sometimes I had to try a couple of different drives to get the A1000 back.

Last thing to try: attach the D1000 and put all the drives from the A1000 into it (remove the A1000 from the system). Boot -r, run format and relabel all the drives, then repartition each drive to, say, slice 0 and newfs each drive. Shut down the system and remove the D1000.
Install a drive into slot 0 and slot 8 in the A1000, then boot -r. You should see corrupt label messages when booting. Watch to see if nvutil reports the controllers as correct. Run format and answer yes to labeling the drives.
Then run the raidutil option to reset the configuration, and run lad to confirm that things look good. Add 2 more drives, in slot 1 and slot 9. I can't remember exactly what I would do from here, but I think I would run devfsadm -c disk, then reset the A1000 with the raidutil option. Run nvutil -vf and confirm all is ok. Add 2 more drives (slot 2 and 10), then run devfsadm -c disk and nvutil -vf, until you get all the drives back in. Once all the drives are in, run the raidutil option to reset the configuration, then install the battery. If all is good, reboot.
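The pairwise re-add loop above, as a preview sketch. The controller path c2t5d0 and the -X reset flag are taken from earlier in the thread but should be treated as assumptions for your setup; RUN=echo previews every command.

```shell
#!/bin/sh
# Sketch of the pairwise drive re-add loop above. RUN=echo previews the
# commands; CTRL_DEV and the -X flag are assumptions from earlier posts.
RUN=${RUN:-echo}
CTRL_DEV=${CTRL_DEV:-c2t5d0}

for pair in "1 9" "2 10" "3 11"; do
    echo "-- insert drives in slots $pair, wait for spin-up, then rescan --"
    $RUN devfsadm -c disk
    $RUN nvutil -vf
done

$RUN raidutil -c "$CTRL_DEV" -X    # reset the configuration
$RUN lad                           # confirm the LUNs look sane
```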

If this does not work then maybe you do have a bad controller.

I understand that you may say why do this all over again but it can be worth it.

Thanks

CA

 