savegroups hanging from gui and schedules

ravashaak · Dec 11, 2004

I recently upgraded my backup server (Solaris 8) from NetWorker 6.14 to 7.12 build 325. At first, things seemed pretty normal. Within a day or so, however, I realized that a few groups, although displayed as "running" in the group control window, were in fact doing nothing. There were no saves running on the clients for the hung groups. I could stop the groups, and I'd soon get savegroup reports indicating that no savesets had started for the clients in the group. I've even seen a couple of hung groups in which some clients ran their backups successfully while others hung. Usually, though, the entire group just sits there until it eventually times out, after a seemingly long time. This seems to happen only when the server is busy with multiple groups. It's almost as if either client connections or parallelism are being throttled. I've confirmed that my parallelism setting is 32, which is correct for my NetWorker version.

I've entered the update enabler code, but due to a clerical error on Legato's end, I won't be able to get an authorization code from them for a few more days.

Although I haven't yet identified the exact number of concurrent save sessions running on the server when the problems appear, I'd guess it's around 15-20 sessions (as displayed from nwadmin and nsrwatch). I've also noticed that sometimes I can stop a hung group, and clients in a different hung group might start saving to tape.

If you've read this far, please keep going. Here's where I think it gets interesting...

These same groups, which refuse to start any save sessions from the GUI or on schedule, can be started from the command line with the savegrp command, and they run just fine! In fact, to bypass the problem and maintain operations, I've disabled autostart for all savegroups and created an appropriately scheduled cron job for each savegroup. This works perfectly, allowing me to run at 32 concurrent sessions with no problems whatsoever. Individual save commands run directly from clients also work perfectly.
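
For anyone wanting to try the same workaround, here is a minimal sketch of how the per-group cron entries could be generated. It is only an illustration: the group names and start times are hypothetical, the savegrp path /usr/sbin/nsr/savegrp is an assumption about a typical Solaris install (check your own), and it assumes savegrp is invoked with the group name as its only argument.

```python
import shlex

# Assumed install path; verify on your own server before using.
SAVEGRP = "/usr/sbin/nsr/savegrp"


def crontab_line(group, hour, minute, savegrp=SAVEGRP):
    """Return one crontab entry that starts `group` every day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # crontab fields: minute hour day-of-month month day-of-week command
    command = f"{savegrp} {shlex.quote(group)}"
    return f"{minute} {hour} * * * {command} >/dev/null 2>&1"


# Hypothetical schedule: group name -> (hour, minute) of the daily start.
schedule = {"testgroup": (20, 0), "examplegroup2": (22, 30)}

for group, (hour, minute) in schedule.items():
    print(crontab_line(group, hour, minute))
```

The printed lines can be pasted into root's crontab once autostart has been disabled for the corresponding groups. Output is discarded here on the assumption that the savegroup completion report still arrives through the usual notification.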

Doing a post-mortem on a hung group, I examined daemon.log and found nothing helpful. You see the savegroup start, and then there's nothing until the group either times out or you stop it manually. Once the hung group is stopped, daemon.log shows an entry like the following:

12/08/04 16:37 nsrd: runq: NSR group testgroup exited with return code 1.
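
To track how often this happens, entries like the one above can be pulled out of daemon.log with a short script. This is a rough sketch that assumes every group exit is logged in exactly the format shown above. The pattern tolerates the message wrapping across two lines, and the function name is made up for illustration.

```python
import re

# Matches the exit message quoted above, whether or not it wraps across lines.
EXIT_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}) nsrd: runq: NSR group (.+?) exited"
    r"\s+with return code (\d+)"
)


def failed_groups(log_text):
    """Return (timestamp, group, return_code) for each non-zero group exit."""
    results = []
    for timestamp, group, code in EXIT_RE.findall(log_text):
        if int(code) != 0:
            results.append((timestamp, group, int(code)))
    return results


# Sample taken from the log entry quoted in this post.
sample = "12/08/04 16:37 nsrd: runq: NSR group testgroup exited\nwith return code 1.\n"
print(failed_groups(sample))
```

To run it against a real log, read the file contents into a string and pass that to failed_groups; the location of daemon.log depends on the installation.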

I've cleared /nsr/tmp (except for the /sec directory). I've cycled the software several times. I've even rebooted the server twice. The problem still reappears.

I have an open support case with Legato on this one, but so far we've had no luck, though not for lack of effort. So, while I work this problem with Legato, I thought I'd toss it out here for everyone to chew on. Any help is sincerely appreciated.

Ravashaak

605 · Dec 11, 2004

I have heard that there is a problem honoring the savegroup parallelism in NW 7. With that in mind, you had better set it to 0 and start the groups separately via cron jobs (as you did).

ravashaak · Dec 13, 2004

Well, at least someone else has heard of this problem. Thanks. Here's hoping Legato can provide a fix.

- Ravashaak
