B6267342 DISK OPERATION ERROR

(OP)
Hello everyone,

I have an issue with an error being logged in the errpt. The error is:

IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
B6267342 0327235514 P H hdisk1 DISK OPERATION ERROR

I have contacted IBM SAN Support and they've looked at the support logs and have assured me that there is nothing wrong with the SAN.

I'm confused because this error is generated every night at 23:55. I discovered a root cron job that runs nightly and executes the /var/perf/pm/bin/pmcfg command, which ends up logging the error in the errpt.
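For readers matching the summary line to the cron schedule: the TIMESTAMP column in the short errpt listing is in MMDDhhmmyy form, so 0327235514 can be decoded mechanically. The helper below is only a sketch; the function name is made up for illustration and is not part of AIX.

```python
from datetime import datetime


def parse_errpt_timestamp(stamp: str) -> datetime:
    """Parse the 10-digit TIMESTAMP column of the errpt summary (MMDDhhmmyy)."""
    return datetime.strptime(stamp, "%m%d%H%M%y")


# Timestamp copied from the errpt summary line in this post.
when = parse_errpt_timestamp("0327235514")
print(when.strftime("%Y-%m-%d %H:%M"))  # prints 2014-03-27 23:55
```

That agrees with the 23:55 nightly run described above.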

The errpt -a | more output looks like this:
# errpt -a | more
ROS Level and ID............31303730
Serial Number...............
Device Specific.(Z0)........0000053245005032
Device Specific.(Z1)........

Description
DISK OPERATION ERROR

Probable Causes
DASD DEVICE

Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
0
SENSE DATA
0600 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0118 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 2000 0000 0000 0000 0000 0000 0000 0000 0093 0000
0000 003D 001A

I followed this thread http://www.tek-tips.com/viewthread.cfm?qid=1255515 to obtain the SRN (Service Request Number) from the Sense Data. When I looked up the SRN (60000) in this document ftp://document.ihg.uni-duisburg.de/IBM/Hardware/St... it told me that the SSA Adapter Card is missing from the configuration.

I'm lost here and not sure which steps I should take next. Can anyone offer any help?

RE: B6267342 DISK OPERATION ERROR

It would have been nice if you had posted the whole error report entry to save us looking it up.
Details such as the host machine type and model, as well as the AIX version, technology level and service pack, might also have helped.

"my computer is broke, help" isn't really going to inspire most people to help you.

B6267342 is an SC_DISK_ERR2, which means it is a disk error at the SCSI transport layer.

IBM provide all you need to decode this error in the "RS/6000 Eserver pSeries Fibre Channel Planning and Integration: User's Guide and Service Information"
Available here:
http://publib.boulder.ibm.com/systems/hardware_doc...

On page 89 it shows you how to interpret this error.

VVSS = 0118
Indicates that the SCSI device is reserved by another host or initiator.
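The VVSS reading above can be reproduced from the posted sense data with a few lines of Python. This is a rough sketch under stated assumptions: the word position (13th group, index 12) is taken from this particular dump, the split of VVSS into a validity byte and a status byte is my reading of the field name, and the function and table names are invented for illustration. The status value 0x18 is the standard SCSI RESERVATION CONFLICT code, which matches the interpretation given here.

```python
# Standard SCSI status codes (subset).
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}


def decode_vvss(sense_data: str, word_index: int = 12) -> dict:
    """Pull the VVSS word out of an errpt SENSE DATA dump.

    word_index=12 is an assumption based on where 0118 sits in this dump.
    """
    words = sense_data.split()
    vvss = words[word_index]
    status = int(vvss[2:], 16)  # SS: assumed to be the SCSI status byte
    return {
        "vvss": vvss,
        "validity": int(vvss[:2], 16),  # VV: assumed to be a validity flag
        "status": status,
        "status_name": SCSI_STATUS.get(status, "UNKNOWN"),
    }


# First row of the SENSE DATA from the error report in this thread.
first_row = (
    "0600 0000 0000 0000 0000 0000 0000 0000 "
    "0000 0000 0000 0000 0118 0000 0000 0000"
)
result = decode_vvss(first_row)
print(result["vvss"], hex(result["status"]), result["status_name"])
```

Only the first row of the dump is needed, since the VVSS word falls within it.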

So it looks like your cron job is placing a SCSI-2 reserve on the disk, thereby preventing any other initiator from accessing the disk.

If you do not understand "initiator" then think HBA or fibre adapter.

It's not a problem with your storage, it's a problem with your AIX configuration.

The part of the error report you didn't post will identify the disk. Maybe there are several, you didn't say. Please check the output from:
lsattr -El hdiskx

Check the reserve policy is set to no_reserve.
If it is not, then you have an AIX configuration problem.
If it is, then you have probably hit an AIX defect with the cron job program.
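If there are several disks to check, the reserve policy test above can be scripted. The sketch below only parses text in the attribute / value / description layout that lsattr -El prints; it does not run the command itself. The function name and the sample lines are illustrative, not taken from a real system.

```python
from typing import Optional


def reserve_policy(lsattr_output: str) -> Optional[str]:
    """Return the reserve_policy value from `lsattr -El hdiskN` output, if present."""
    for line in lsattr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "reserve_policy":
            return fields[1]
    return None


# Illustrative sample in lsattr's column layout (hypothetical values).
sample = """\
queue_depth 10 Queue DEPTH True
reserve_policy single_path Reserve Policy True
rw_timeout 30 READ/WRITE time out value True
"""

policy = reserve_policy(sample)
if policy == "no_reserve":
    print("reserve_policy is no_reserve")
else:
    print(f"reserve_policy is {policy}, not no_reserve")
```

On a live host the text would come from running lsattr -El against each hdisk and passing the captured output to the function.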

I'm sorry, but with so little information there is quite a lot of guesswork in my analysis.

Hope this helps!
DukeSSD.

RE: B6267342 DISK OPERATION ERROR

(OP)
Hi DukeSSD,

First of all, please accept my apology for not posting the entire errpt. I thought I did, but now that I look at it I see I posted incomplete information.

Here is the complete error report:

LABEL: SC_DISK_ERR2
IDENTIFIER: B6267342

Date/Time: Fri Mar 28 12:57:42 2014
Sequence Number: 942
Machine Id: 00F697F24C00
Node Id: LP2DOMDB
Class: H
Type: PERM
WPAR: Global
Resource Name: hdisk1
Resource Class:
Resource Type:
Location:
VPD:
Manufacturer................IBM
Machine Type and Model......1746 FAStT
ROS Level and ID............31303730
Serial Number...............
Device Specific.(Z0)........0000053245005032
Device Specific.(Z1)........

Description
DISK OPERATION ERROR

Probable Causes
DASD DEVICE

Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
0
SENSE DATA
0600 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0118 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 2000 0000 0000 0000 0000 0000 0000 0000 0093 0000
0000 003D 001A

I ran the lsattr -El hdisk1 command. Here is the output:

$ lsattr -El hdisk1
PCM PCM/friend/otherapdisk Path Control Module False
PR_key_value none Persistant Reserve Key Value True
algorithm fail_over Algorithm True
autorecovery no Path/Ownership Autorecovery True
clr_q no Device CLEARS its Queue on error True
cntl_delay_time 0 Controller Delay Time True
cntl_hcheck_int 0 Controller Health Check Interval True
dist_err_pcnt 0 Distributed Error Percentage True
dist_tw_width 50 Distributed Error Sample Time True
hcheck_cmd inquiry Health Check Command True
hcheck_interval 60 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
location Location Label True
lun_id 0x1000000000000 Logical Unit Number ID False
lun_reset_spt yes LUN Reset Supported True
max_retry_delay 60 Maximum Quiesce Time True
max_transfer 0x40000 Maximum TRANSFER Size True
node_name 0x20040080e51c054c FC Node Name False
pvid none Physical volume identifier False
q_err yes Use QERR bit True
q_type simple Queuing TYPE True
queue_depth 10 Queue DEPTH True
reassign_to 120 REASSIGN time out value True
reserve_policy single_path Reserve Policy True
rw_timeout 30 READ/WRITE time out value True
scsi_id 0x10f00 SCSI ID False
start_timeout 60 START unit time out value True
unique_id 3E21360080E50001C1240000003224DDBB5330F1746 FAStT03IBMfcp Unique device identifier False
ww_name 0x20550080e51c054c FC World Wide Name False


It looks like we have an AIX configuration issue (which is what I expected) because the reserve_policy is set to single_path. Would setting the reserve_policy to no_reserve resolve the issue?

RE: B6267342 DISK OPERATION ERROR

(OP)
Sorry for the output above. Here it is again:

CODE -->

PCM             PCM/friend/otherapdisk                                         Path Control Module              False
PR_key_value    none                                                           Persistant Reserve Key Value     True
algorithm       fail_over                                                      Algorithm                        True
autorecovery    no                                                             Path/Ownership Autorecovery      True
clr_q           no                                                             Device CLEARS its Queue on error True
cntl_delay_time 0                                                              Controller Delay Time            True
cntl_hcheck_int 0                                                              Controller Health Check Interval True
dist_err_pcnt   0                                                              Distributed Error Percentage     True
dist_tw_width   50                                                             Distributed Error Sample Time    True
hcheck_cmd      inquiry                                                        Health Check Command             True
hcheck_interval 60                                                             Health Check Interval            True
hcheck_mode     nonactive                                                      Health Check Mode                True
location                                                                       Location Label                   True
lun_id          0x1000000000000                                                Logical Unit Number ID           False
lun_reset_spt   yes                                                            LUN Reset Supported              True
max_retry_delay 60                                                             Maximum Quiesce Time             True
max_transfer    0x40000                                                        Maximum TRANSFER Size            True
node_name       0x20040080e51c054c                                             FC Node Name                     False
pvid            none                                                           Physical volume identifier       False
q_err           yes                                                            Use QERR bit                     True
q_type          simple                                                         Queuing TYPE                     True
queue_depth     10                                                             Queue DEPTH                      True
reassign_to     120                                                            REASSIGN time out value          True
reserve_policy  single_path                                                    Reserve Policy                   True
rw_timeout      30                                                             READ/WRITE time out value        True
scsi_id         0x10f00                                                        SCSI ID                          False
start_timeout   60                                                             START unit time out value        True
unique_id       3E21360080E50001C1240000003224DDBB5330F1746      FAStT03IBMfcp Unique device identifier         False
ww_name         0x20550080e51c054c                                             FC World Wide Name               False 
