
Frustrating SAN Part 2


stonetown

System:
2 x Compaq Proliant 3000
Compaq Modular Data Router
1 x Compaq TL891 Tape Library (10 slots, 2 groups)
Win NT4 SP5
ARCserve 2000 SP4 (AE, TLO, SANO, OFA, Exchange Agent, DRO)
3 Patches applied: QO28915 (Device Patch), QO20824 (crash patch), QO27937 (OFA patch).
cpqfcalm version: 4.19.0.63

Problem: Two servers are configured to run backups to the tape library via the SAN option. The primary server runs OK, but the distributed server's backup fails after one or two days, and an event frequently appears in the NT system log, only on the distributed server, as follows:

Source: cpqfcalm
Event: 15
Description: The device, \Device\ScsiPort7, is not ready for access yet.
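For reference, here is a rough way to pull every occurrence of that event out of the System log so the timings can be compared. It's only a sketch in Python using pywin32 (which I'm assuming can be installed on the box); the source/ID filtering is mine, not anything from CA or Compaq.

import win32evtlog  # pywin32

def find_cpqfcalm_events(server=None, source="cpqfcalm", event_id=15):
    # Walk the System log newest-first and collect matching records.
    log = win32evtlog.OpenEventLog(server, "System")
    flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ
    hits = []
    while True:
        records = win32evtlog.ReadEventLog(log, flags, 0)
        if not records:
            break
        for rec in records:
            # Mask off the severity/facility bits to compare the raw event ID.
            if rec.SourceName == source and (rec.EventID & 0xFFFF) == event_id:
                hits.append((rec.TimeGenerated, rec.StringInserts))
    win32evtlog.CloseEventLog(log)
    return hits

for when, inserts in find_cpqfcalm_events():
    print(when, inserts)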

When the distributed server backup fails, the following errors are flagged in the ARCserve log (no errors appear on the primary server):

20021028 094549 42 Start Backup Operation. (QUEUE=1, JOB=2)
20021028 094550 42 W3831 Unable to find any media.
20021028 094552 42 Mount media request cancelled by user or by default timeout.
20021028 094552 42 W3832 Unable to find a blank media.

It looks as though the distributed server is unable to connect to the device groups on the primary server.

The tape log on the distributed server looks like this:

[883] TAPE ENGINE MESSAGE LOG
[883] OEM: CA Inc.
[883] Name: ARCserve 2000
[883] Major Version: 7
[883] Minor Version: 0
[883] Date: Oct 14 2002
[883] Time: 11:58:27
[883] Description: 7.0 Release of NT Tape Engine
[883] Debug Level: 2
[883] Total Jobs ACTIVE 3
[883] CREATEJOBHANDLE [019F0688]: JobType - 8, ServerType - Local, Client - ASMGR@ANDOMS2
[863] Total Jobs ACTIVE 4
[863] CREATEJOBHANDLE [019F09AC]: JobType - 2, ServerType - Local, Client - SYSTEM
[883] ClientGetGroupStatus() is called.
[883] ClientGetGroupStatus:gbPrimaryServer == FALSE. call _ClientGetGroupStatus()
[883] _ClientGetGroupStatus() is called.
[883] ClientGetGroupStatus: ownerName = SYSTEM
[863] ClientReserveGroup ANDOMS2
[883] Calling DisConnectFromTape()
[883] DestroyHandle [019F09AC]: Active Jobs 3
[863] Total Jobs ACTIVE 4
[863] CREATEJOBHANDLE [019F09AC]: JobType - 257, ServerType - Local, Client - Administrator
[883] CheckIfRAIDGroup: RAIDSupport = FALSE
[883] Calling ConnectToGroup():ANDOMS2
[883] Successfully attached to primary Server: ANDOMS1!
[863] Entering EnumTapeName() State:1.
[863] Entering EnumTapeName() State:1.
[863] Entering EnumTapeName() State:1.
[863] Entering EnumTapeName() State:1.
[863] Entering EnumTapeName() State:1.
[863] Entering EnumTapeName() State:2.
[863] Entering EnumTapeName() State:2.
[863] Entering EnumTapeName() State:2.
[863] Entering EnumTapeName() State:3.
[863] Entering EnumTapeName() State:1.
[863] Entering EnumTapeName() State:1.
[883] Calling DisConnectFromTape()
[883] DestroyHandle [019F09AC]: Active Jobs 3


All hardware conforms to the CA CDL. Prior to QO28915, the distributed server had a really hard time connecting to the device groups on the primary server; at least now the backups will run for a few days before failing on the distributed server.

Anybody got any ideas???

See previous thread entitled Frustrating SAN for more background.
 
Do you have the Removable Storage Service disabled? Also, install the latest DeviceSP (QO28915) and QO29368. There are several fixes that apply to your configuration.

Is the Destination Group on the dist node backup set to '*', or is it assigned to one of the groups? Is the media destination set to '*'? Were there any blank media available in both groups (gotta ask that one 8-D)? Was the timeout for first media set higher than 5 minutes (15 minutes is a good starting point)?

Also try setting the ARCserveIT\Base\TapeEngine\Debug value to 9 on both the primary and distributed servers. Add the values ARCserveIT\Base\Database\Debug and \DebugTapeAPI, and set both to DWORD:9. These will provide additional details as to what is going on.
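If it helps, something along these lines will set all three values in one go. It's only a sketch in Python using the standard winreg module, and it assumes the keys sit under HKLM\SOFTWARE\ComputerAssociates\ARCserveIT - double-check the exact root and subkeys on your install before running it, and you'll probably need to stop and restart the ARCserve engines afterwards so they pick the values up.

import winreg  # standard library on Windows (was _winreg on older Pythons)

# Assumed root -- verify the exact path on your installation first.
ROOT = r"SOFTWARE\ComputerAssociates\ARCserveIT\Base"

# Debug values suggested above; all three are REG_DWORD = 9.
values = {
    "TapeEngine": [("Debug", 9)],
    "Database":   [("Debug", 9), ("DebugTapeAPI", 9)],
}

for subkey, pairs in values.items():
    key = winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE, ROOT + "\\" + subkey)
    for name, data in pairs:
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, data)
    winreg.CloseKey(key)

Run it on both the primary and the distributed server.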
 
QO28915 is already applied. Before this patch the same problem occurred, but at that time the distributed server was never able to complete a single backup; since the patch was applied the problem has been reduced. The specific problem that QO28915 fixes, according to CA, is as follows:

"t16c211 - Jobs are unable to run from Distributed Servers in a SAN due to a connection failure to the device group."

I can see your reasoning for suggesting QO29368, but I don't have a problem with restores from the distributed server, and I don't experience either of the two error messages mentioned by CA.

I cannot find an installed Removable Storage Service, so I can rule that one out.
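For the record, this is roughly how I checked it against the service control manager. It's a Python/pywin32 sketch; "NtmsSvc" is my guess at the internal name of the Removable Storage service, so treat the name (and the script) as an assumption rather than anything official.

import pywintypes
import win32service
import win32serviceutil  # pywin32

SERVICE = "NtmsSvc"  # assumed internal name for Removable Storage

try:
    status = win32serviceutil.QueryServiceStatus(SERVICE)
except pywintypes.error:
    print(SERVICE + " is not installed on this box.")
else:
    # The second field of the status tuple is the current state.
    if status[1] == win32service.SERVICE_STOPPED:
        print(SERVICE + " is installed but not running (check its start type).")
    else:
        print(SERVICE + " is installed and running.")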

The same backup is scheduled to run on both servers once a day; the primary uses one group and the distributed server uses a different group (different slots). The primary server backup never fails; it is only the distributed server that fails, and it may fail after one, two or three days.

The destination group is set to the group name and not '*'; the media destination is set to '*'.

I have noticed that the cpqfcalm event occurs on the distributed server at the same time that the backup finishes on the primary server (it also possibly occurs when a restore or merge job runs/finishes on the primary server, although I need to analyse this more thoroughly). I guess the primary server must be dropping the connection to the distributed server, thus causing the problem.
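To pin that down I plan to line the timestamps up with something like the sketch below (Python). It assumes the primary server's activity log uses the same "YYYYMMDD HHMMSS" stamps as the excerpts above, and that the cpqfcalm event times have been exported to a list by hand - both of those are assumptions about my own setup, not anything ARCserve provides.

from datetime import datetime, timedelta

def parse_arcserve_time(line):
    # e.g. "20021028 094549 42 Start Backup Operation..." -> datetime
    stamp = " ".join(line.split()[:2])
    return datetime.strptime(stamp, "%Y%m%d %H%M%S")

def close_matches(event_times, job_times, window_minutes=5):
    # Pair up cpqfcalm events with primary-server job times that fall
    # within a few minutes of each other.
    window = timedelta(minutes=window_minutes)
    return [(e, j) for e in event_times for j in job_times if abs(e - j) <= window]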

I have this configuration installed at four sites. All of the distributed servers at the four sites are experiencing the cpqfcalm event, but only two sites are experiencing the connection problems (so far).

Thanks for the debugging info; I shall try it, although the tape log consumes masses of HD space.
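To stop the debug output eating the partition, I'll probably archive the tape log from a scheduled job with something like this (a Python sketch; the TAPE.LOG path is a guess for this install, and I'd stop the Tape Engine first since it probably holds the file open):

import gzip, os, shutil, time

TAPE_LOG = r"C:\Program Files\ComputerAssociates\ARCserveIT\LOG\TAPE.LOG"  # assumed path
MAX_BYTES = 100 * 1024 * 1024  # archive once the log passes roughly 100 MB

if os.path.exists(TAPE_LOG) and os.path.getsize(TAPE_LOG) > MAX_BYTES:
    archive = TAPE_LOG + time.strftime(".%Y%m%d%H%M%S") + ".gz"
    with open(TAPE_LOG, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)   # keep a compressed copy
    open(TAPE_LOG, "wb").close()       # then truncate the live log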

OT - ARCserve is set up on a 3.5 GB partition; 1.5 GB alone is consumed by the databases, and the tape log fills the rest up pretty quickly. I've read all the discussion regarding defragging the databases; this worked under 6.61 and I cannot understand why CA have overlooked it in 2000. I also cannot understand why it appears that CA released a service pack, SP4, without thoroughly testing it. There is nearly a whole page of post-SP4 patches, with more to come no doubt. They are almost as bad as Micros*ft - do they think we are beta testers?!? - OT

I have a hunch that the DeviceSP patch needs to be updated further by CA.
 