I recently upgraded my backup server (Solaris 8) from NetWorker 6.14 to 7.12 build 325. At first, things seemed pretty normal. However, within a day or so, I realized that a few groups, although displayed as "running" in the group control window, were in fact doing nothing. There were no saves running on the clients for the hung groups. I could stop the groups, and I'd soon get the savegroup reports indicating that no savesets had started for clients in the group. I've even noticed a couple of hung groups in which some clients ran their backups successfully while others hung. Usually, though, the entire group just sits there until it eventually times out, after what seems like a long time. This seems to happen only when the server is busy, for example with multiple groups running. It's almost as if either client connections or parallelism were being throttled. I've confirmed that my parallelism setting is 32, which is correct for my NetWorker version.
I've entered the update enabler code, but due to a clerical error on Legato's end, I won't be able to get an authorization code from them for a few more days.
Although I haven't yet identified the exact number of concurrent save sessions running on the server when the problems appear, I'd guess it's around 15-20 sessions (as displayed in nwadmin and nsrwatch). I've also noticed that sometimes, when I stop one hung group, clients in a different hung group start saving to tape.
If you've read this far, please keep going. Here's where I think it gets interesting...
These same groups that refuse to start any save sessions can be started from the command line with the savegrp command, and they run just fine! In fact, to bypass the problem and maintain operations, I've disabled autostart for all savegroups and created an appropriately scheduled cron job for each one. This works perfectly, allowing me to run at 32 concurrent sessions with no problems whatsoever. Individual save commands run directly from clients also work perfectly.
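For anyone who wants to try the same workaround, here is a rough sketch of how the per-group cron entries could be generated. The group name "othergroup", the start times, and the savegrp path are placeholders I made up for illustration, and I'm assuming savegrp is given the group name as its argument; check the savegrp man page for your version before relying on it.

```python
# Hypothetical group names and daily start times (hour, minute).
# Replace these with your own savegroups and schedule.
SCHEDULE = {
    "testgroup": (22, 0),
    "othergroup": (23, 30),
}

# Assumed location of savegrp on the backup server; verify locally.
SAVEGRP = "/usr/sbin/nsr/savegrp"


def crontab_line(group, hour, minute, savegrp=SAVEGRP):
    """Return one crontab entry that starts `group` every day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("bad start time for %s: %r:%r" % (group, hour, minute))
    # crontab fields: minute hour day-of-month month day-of-week command
    return "%d %d * * * %s %s" % (minute, hour, savegrp, group)


def build_crontab(schedule):
    """Return crontab text with one entry per group, sorted by group name."""
    return "\n".join(
        crontab_line(group, hour, minute)
        for group, (hour, minute) in sorted(schedule.items())
    )


if __name__ == "__main__":
    print(build_crontab(SCHEDULE))
```

The output can be reviewed and pasted into root's crontab with `crontab -e`. Autostart still has to be disabled on each group so the server doesn't also try to start it.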
Doing a post-mortem on a hung group, I examined daemon.log and found nothing helpful. You see the savegroup start, and then there's nothing until the group either times out or is stopped manually. Once the hung group is stopped, daemon.log shows an entry like the following:
12/08/04 16:37 nsrd: runq: NSR group testgroup exited
with return code 1.
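If you want to check your own daemon.log for the same symptom, something like the following sketch would pull out every group that exited with a nonzero return code. The pattern is based only on the single entry quoted above, so treat it as an assumption about the log format; it also assumes group names contain no spaces.

```python
import re

# Matches entries such as:
#   12/08/04 16:37 nsrd: runq: NSR group testgroup exited with return code 1.
# The \s+ after "exited" lets the match span a line wrap in the log.
EXIT_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}) nsrd: runq: NSR group (\S+) exited\s+"
    r"with return code (\d+)"
)


def failed_group_exits(log_text):
    """Return (timestamp, group, return_code) for each nonzero group exit."""
    return [
        (timestamp, group, int(code))
        for timestamp, group, code in EXIT_RE.findall(log_text)
        if int(code) != 0
    ]


# The entry quoted above, wrapped across two lines as it appears in the log.
sample = (
    "12/08/04 16:37 nsrd: runq: NSR group testgroup exited\n"
    "with return code 1.\n"
)
print(failed_group_exits(sample))
```

To run it against a real log, read the file into a string and pass that to `failed_group_exits`.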
I've cleared /nsr/tmp (except for the /sec directory). I've cycled the software several times. I've even rebooted the server twice. The problem still reappears.
I have an open support case with Legato on this one, but so far we've had no luck, though not for lack of effort. So, while I work this problem with Legato, I thought I'd toss it out here for everyone to chew on. Any help is sincerely appreciated.
Ravashaak