Nice post, Lenski. Only 18 failed tapes in 3 years is impressive; sounds like you invest the same effort that I do. I analyze each failure, on a mission to eradicate non-genuine backup failures. In the 2 years since we went LTO we have had numerous failures, but thanks to stringent (OK, obsessive!) testing and logging of each and every failure, we seem to have eradicated the non-genuine ones.
For those who don't know, NetBackup appends to its own 'errors' file for all drive/tape failures, so I've appended comments to each failure in this file. That way I have a history of the environment in one place. (NetBackup's own entries below are dated MM/DD/YY; my #rrb comments are dated DD/MM/YY.)
11/10/02 23:44:01 5770L1 4 WRITE_ERROR
#rrb very first problem since this new equipment installed in Sept 2002
<snip> (quite a few WRITE errors up until early July 2003)
#rrb 10/07/03 turned freq-based cleaning OFF for the LTO drives
#rrb 10/07/03 upgraded firmware for L60 Robot & all 6 drives
#rrb 10/07/03 all as part of fault call
<snip> (few WRITE errors late July 2003)
#rrb 29/07/03 drives 0 & 5 swapped out for fault call
<snip> (quite a few WRITE errors Aug 2003 thru Oct 2003)
#rrb 06/10/03 manually cleaned ALL 6 drives due to drives not requesting cleaning & above write errors
10/25/03 07:40:03 5913L1 0 WRITE_ERROR
10/27/03 07:39:56 5734L1 2 WRITE_ERROR
#rrb 27/10/03 to stop overrunning backups failing at 07:30 due to scsi bus resets, stopped running daily 'sgscan' which used to run at 07:30 each morning (this scan causes scsi bus resets)
<snip> (quite a few WRITE/POSITION errors Nov thru Dec 2003)
#rrb 09/12/03 manually cleaned drives 0,1,2,4,5 due to above write errors
#rrb 12/12/03 manually cleaned drive 3 due to above write errors
<snip> (absolutely loads of WRITE/POSITION/OPEN errors Dec 2003 thru Feb 2004)
#rrb - the above 3 write errors approx midnight caused by Sun Explorer 3.6.2
<snip> (loads of OPEN/WRITE errors Feb thru Mar 2004)
#rrb 07/03/04 the above 4 write errors approx midnight caused by Sun Explorer 4.2. Have since amended crontab to run Explorer at quiet time of day
<snip> (loads of POSITION/WRITE/OPEN errors Mar thru Apr 2004)
#rrb 19/04/04 manually cleaned all drives due to above write errors
#rrb 19/04/04 - all drives but 1 had mount time of 900+hrs, the other had 1000+
<snip> (loads of POSITION/WRITE/OPEN errors May thru Jun 2004)
#rrb 18/06/04 - the above errors from 10/06/04 thru 17/06/04 began to be addressed in new fault call
06/18/04 19:35:45 5828L1 1 WRITE_ERROR
#rrb 21/06/04 manually cleaned all drives due to above write errors 10/06/04 thru 18/06/04. Fault call still outstanding
#rrb 21/06/04 - all drives had average mount time of 465hrs before cleaning
#rrb 22/06/04 - as part of fault call 201585, added HP Ultrium-specific info to /kernel/drv/st.conf, after adding patch 108725-16 (no revision of this was previously installed)
#rrb 22/06/04 - patch 108725 addresses position errors
07/01/04 04:39:38 5867L1 1 WRITE_ERROR
#rrb 02/07/04 - drive 1 swapped out for fault call; raised call as previous call didn't seem to cure the write errors
07/08/04 03:11:58 5864L1 3 WRITE_ERROR
#rrb 09/07/04 drive 3 swapped out for fault call
07/18/04 13:15:50 5930L1 5 WRITE_ERROR
#rrb 18/07/04 the above write error was caused by Sun Explorer. Have now removed Explorer from crontab altogether.
07/29/04 09:18:26 5848L1 3 WRITE_ERROR
#rrb 29/07/04 the above write error was caused by a (manually run) Sun Explorer. Goddamit!
08/02/04 12:17:22 5863L1 4 POSITION_ERROR
08/12/04 18:28:18 5818L1 2 WRITE_ERROR
#rrb 13/08/04 think this was a genuine write error as 5818L1 has had write errors before.
09/11/04 00:21:33 5788L1 3 WRITE_ERROR
#rrb 13/09/04 this was a genuine write error as no further problems (manually 'froze' tape at time of failure)
10/13/04 18:27:21 5920L1 2 POSITION_ERROR
#rrb ok can't explain this one-off position error. Backup subsequently reran itself ok to same tape. Weird.
#rrb as @ 20/10/04 no more non-genuine write failures since July 2004. Impressive.
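If anyone wants to do the same kind of analysis, here is a rough Python sketch of how the entries above could be tallied per drive and per tape. It is not anything NetBackup ships. It assumes each error line has the layout seen in the excerpt (date, time, media ID, drive index, error type) and that anything else (#rrb comments, <snip> markers, blank lines) can be skipped. The names ERROR_LINE and tally_errors are made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the excerpt above (not from NetBackup docs):
#   MM/DD/YY HH:MM:SS <media id> <drive index> <ERROR_TYPE>
ERROR_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) (\d+) ([A-Z_]+)$"
)


def tally_errors(lines):
    """Count error entries per drive index and per media ID."""
    by_drive = Counter()
    by_media = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match is None:
            continue  # comment line, <snip> marker, or blank line
        _date, _time, media, drive, _kind = match.groups()
        by_drive[int(drive)] += 1
        by_media[media] += 1
    return by_drive, by_media


# A few entries copied from the excerpt, plus one placeholder comment line.
sample = """\
10/25/03 07:40:03 5913L1 0 WRITE_ERROR
10/27/03 07:39:56 5734L1 2 WRITE_ERROR
#rrb comment lines like this one are skipped
08/12/04 18:28:18 5818L1 2 WRITE_ERROR
10/13/04 18:27:21 5920L1 2 POSITION_ERROR
""".splitlines()

by_drive, by_media = tally_errors(sample)
print(by_drive.most_common())  # drives ordered by number of errors
print(by_media.most_common())  # tapes ordered by number of errors
```

A tape that keeps turning up in the per-media count is a candidate for a genuine media fault, while errors spread across many tapes on one drive point at the drive (or at something like a SCSI bus reset).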
Sorry for the long post, but tape backups still have their place; they just need a bit of effort to keep 'em in check.
Rich
