NSRD won't start - using Solstice Networker 6.0

angelo23 · Jan 20, 2004

Hello All,

We are using Sun Solaris 2.7 and running Solstice Networker Backup 6.0. Everything was running great until a few days ago, when I noticed that my /nsr filesystem had grown pretty big. I decided to delete some of the oldest index instances through the NWADMIN tool, and then did a reclaim on the space. After that I started getting the following message when trying to perform my backups for each file system: "error, media index problem...can't open save session ...will retry 2 more times". Like I said, I get this for every file system, and then they all eventually fail... I tried shutting down nsr (nsr_shutdown), ran the various nsrck commands, and then restarted nsr, but had no luck. I also tried to just wipe out the indexes by moving "/nsr/index/client name" to a separate backup directory and then trying to do a backup. I expected this to give me a new full backup of every file system, since there wouldn't have been any earlier full backups. But this was another failed attempt...

Now my NWADMIN tool won't even open and the nsrd daemon will not start. I do see the nsrexecd running but no nsrd... Also here are some lines from my /nsr/logs/daemons.log file:

01/19/04 10:35:25 nsrd: nsrmmdbd has exited with status 1
01/19/04 10:35:25 nsrd: shutting down
01/19/04 10:35:25 nsrd: successful shutdown
01/19/04 11:47:15 nsrd: server notice: started
01/19/04 11:47:16 nsrmmdbd: error adding btrees to ss (I/O error)
01/19/04 11:47:16 nsrmmdbd: WISS error: I/O error
01/19/04 11:47:16 nsrmmdbd: Error on close of volume index file #0 (invalid file number)
01/19/04 11:47:16 nsrmmdbd: Error on close of volume index file #1 (invalid file number)
01/19/04 11:47:16 nsrmmdbd: Error on close of volume index file #2 (invalid file number)
01/19/04 11:47:16 nsrmmdbd: Error on close of volume index file #3 (invalid file number)
01/19/04 11:47:16 nsrmmdbd: Error on close of volume index file #4 (invalid file number)
01/19/04 11:47:16 nsrmmdbd: Error on close of client id map index file #0 (invalid file number)
01/19/04 11:47:16 nsrmmdbd: Error on close of client id map index file #1 (invalid file number)
01/19/04 11:47:16 nsrd: unable to start nsrmmdbd
01/19/04 11:47:16 nsrd: shutting down
01/19/04 11:47:16 nsrd: successful shutdown
01/19/04 11:56:23 nsrd: server notice: started
01/19/04 11:56:24 nsrmmdbd: error adding btrees to ss (I/O error)
01/19/04 11:56:24 nsrmmdbd: WISS error: I/O error
01/19/04 11:56:24 nsrmmdbd: Error on close of volume index file #0 (invalid file number)
01/19/04 11:56:24 nsrmmdbd: Error on close of volume index file #1 (invalid file number)
01/19/04 11:56:24 nsrmmdbd: Error on close of volume index file #2 (invalid file number)
01/19/04 11:56:24 nsrmmdbd: Error on close of volume index file #3 (invalid file number)
01/19/04 11:56:24 nsrmmdbd: Error on close of volume index file #4 (invalid file number)
01/19/04 11:56:24 nsrmmdbd: Error on close of client id map index file #0 (invalid file number)
01/19/04 11:56:24 nsrmmdbd: Error on close of client id map index file #1 (invalid file number)
01/19/04 11:56:24 nsrd: unable to start nsrmmdbd
01/19/04 11:56:24 nsrd: shutting down
01/19/04 11:56:24 nsrd: successful shutdown
01/19/04 13:58:57 nsrd: server notice: started
01/19/04 13:58:58 nsrmmdbd: error adding btrees to ss (I/O error)
01/19/04 13:58:58 nsrmmdbd: WISS error: I/O error
01/19/04 13:58:58 nsrmmdbd: Error on close of volume index file #0 (invalid file number)
01/19/04 13:58:58 nsrmmdbd: Error on close of volume index file #1 (invalid file number)
01/19/04 13:58:58 nsrmmdbd: Error on close of volume index file #2 (invalid file number)
01/19/04 13:58:58 nsrmmdbd: Error on close of volume index file #3 (invalid file number)
01/19/04 13:58:58 nsrmmdbd: Error on close of volume index file #4 (invalid file number)
01/19/04 13:58:58 nsrmmdbd: Error on close of client id map index file #0 (invalid file number)
01/19/04 13:58:58 nsrmmdbd: Error on close of client id map index file #1 (invalid file number)
01/19/04 13:58:58 nsrd: unable to start nsrmmdbd
01/19/04 13:58:58 nsrd: shutting down
01/19/04 13:58:58 nsrd: successful shutdown

Any help or comments would be appreciated... I'm getting to the point that I might just uninstall Solstice Networker and reinstall it...

Thanks
Brian (angelo23)

TDun · Jan 20, 2004

Your media index is u/s (unserviceable).
Try using mmrecov to restore your media index, then try doing your tidy again. You might want to consider doing so from the command line this time.

TDun · Jan 20, 2004

I forgot to say: move/rename/delete your old mm structure before restarting Networker and then doing the mmrecov.

605 · Jan 20, 2004

As it is much faster, you may try to scavenge the media db before you run mmrecov.

- Stop all NW daemons
- In /nsr/mm/mmvolume6, delete all files matching
    clients_i*.*
    ss_i*.*
    vol_*.*
- Restart NW
- Check the daemon.log file to see whether there have been problems.

If not, retry the restore operation. Good luck.

DavinaTreiber · Jan 20, 2004

It looks like you have media database corruption. This was very common with 6.0. It was a very short lived release since 6.0.1 was released hot on its heels because of the media database bug. I urge you to upgrade as soon as possible. I know that doesn't help you with the corruption problem but there are some valid pieces of advice already posted here.

angelo23 · Jan 20, 2004

Hello,

I appreciate everyone's input on my problem!!! Each reply so far is telling me to shut down and restart Networker (NWADMIN) before trying to do a recover on the media index.... Like I mentioned, I can't even get Networker to start... NSRD will not start at all... At the command prompt, if I type in nsrd, I see a message in my console that "Solstice Backup Server (notice) started", but there is no nsrd actually running when I do a ps -ef | grep nsr. Now when I type nsrexecd, I see /usr/sbin/nsr/nsrexecd as a running process... I have also tried the /etc/init2.d/S**networker script to start Networker, but I get the same response... So is restoring the media index going to correct the problem of nsrd not starting??... Because before I tried to clean out my indexes (as mentioned in my original post above), nwadmin would open and nsrd was running.... I will try to tinker with the scavenge of the media index to see what happens... Thanks

Brian(angelo23)

605 · Jan 20, 2004

OK, I guess you have never used the disaster recovery procedure in practice. Before you really do this on your production server, try it on a test machine. Just install the software (you do not even need a license). As you will see, it is easy and straightforward.

However, we assumed that you knew how to use the routine. So we thought you knew that you have to delete/move /nsr/mm, /nsr/res and /nsr/index before you restart nsrd.

DavinaTreiber · Jan 20, 2004

If the problem is in fact media database corruption (sounds likely) you'll see a message in the daemon.log that should give you a clue. Have you checked the daemon.log?
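A minimal sketch of how the daemon log could be scanned for the tell-tale nsrmmdbd messages. It assumes the line layout seen in the excerpt from the first post ("MM/DD/YY HH:MM:SS daemon: message"); the function name `summarize` is hypothetical and not part of NetWorker.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpt in the first post:
# "MM/DD/YY HH:MM:SS daemon: message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+): (.*)$")

def summarize(lines):
    """Count messages per daemon and collect nsrmmdbd messages mentioning 'error'."""
    per_daemon = Counter()
    mmdb_errors = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not fit the assumed layout
        stamp, daemon, message = m.groups()
        per_daemon[daemon] += 1
        if daemon == "nsrmmdbd" and "error" in message.lower():
            mmdb_errors.append((stamp, message))
    return per_daemon, mmdb_errors

# Four lines copied from the log posted above.
sample = [
    "01/19/04 11:47:15 nsrd: server notice: started",
    "01/19/04 11:47:16 nsrmmdbd: error adding btrees to ss (I/O error)",
    "01/19/04 11:47:16 nsrmmdbd: WISS error: I/O error",
    "01/19/04 11:47:16 nsrd: unable to start nsrmmdbd",
]
counts, errors = summarize(sample)
```

To run it against the real file, pass an open file handle for the log path mentioned in the first post (/nsr/logs/daemons.log) instead of `sample`.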

wallace88 · Jan 20, 2004

Incredible...

Anyway, angelo23, your original post shows that the media database is corrupted, and this is why networker will not start up. It also indicates that it has a problem accessing the index files for the media database. This is why the nsrmmdbd daemon cannot start, and therefore why nsrd cannot start.

Chances are, networker became corrupted because the file system that holds /nsr ran out of space. Please look at LEGATO TECHNICAL BULLETIN 001: Managing the NetWorker Index Size, Release 5.0 and Later (UNIX and NT).

http://portal2.legato.com/resources/bulletins/001.html

605's suggestion of attempting the media scavenging procedure is correct. This will recreate the indexes for the media database. This does not address whether the data in the media database is accurate or corrupted.

The correct method for performing media scavenging is as follows:

First, check the file system to make sure that it is 'sane' before performing the media scavenging procedure. For example, in UNIX use fsck.

If the file system that has /nsr is out of space, then move some of the indexes to another file system as described in the tech bulletin, and also described later in this post.

Then, in your case, perform the following to reindex the media database:

1) In /nsr/mm remove: cmprssd
2) In /nsr/mm/mmvolume6 delete all files except the following: vol.0, ss.0, clients.0, vol.1, ss.1, clients.1, vol.2, ss.2, clients.2, vol.3, ss.3, clients.3, etc.
3) Delete the /nsr/tmp directory
4) Start NetWorker
5) Run ps -ef | grep -i nsr, and see if nsrd, nsrmmdbd, nsrindexd, nsrmmd (one per device), and nsrexecd have all started.
6) Look at the daemon.log to see if NetWorker has started without problems.
7) If it started correctly, then run: nsrck -m and nsrim -X
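Before deleting anything in step 2, it may help to do a dry run that only lists what would go. The sketch below assumes the data segments are named exactly as in the list above (vol.N, ss.N, clients.N) and treats every other regular file as a rebuildable index file; `files_to_delete` is a hypothetical helper, and the index file names in the demonstration are made up.

```python
import os
import re
import tempfile

# Data segments to keep, per the list above: vol.N, ss.N, clients.N
KEEP_RE = re.compile(r"^(vol|ss|clients)\.\d+$")

def files_to_delete(directory):
    """Return sorted names of regular files in `directory` that are not data segments."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name)) and not KEEP_RE.match(name)
    )

# Demonstration on a scratch directory; nothing under /nsr is touched.
with tempfile.TemporaryDirectory() as scratch:
    for name in ["vol.0", "ss.0", "clients.0", "vol.1",
                 "ss_i1.0", "vol_i2.0", "clients_i1.0"]:
        open(os.path.join(scratch, name), "w").close()
    doomed = files_to_delete(scratch)
```

The function only reports names. Copying the whole /nsr/mm directory somewhere safe before removing the listed files keeps the operation reversible.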

If NetWorker fails to start, and you are still getting WISS errors, or any other errors during the startup, then you will have to recover the bootstrap, which will recover the res and media database from tape.

After NetWorker restarts (hopefully) then look at the media database, either using nwadmin to look at volume information, or with:

mminfo -B
mminfo -mv

If it doesn't look right, then you may still have to recover the bootstrap to recover a non-corrupted database.

The nsrexecd process running is the NetWorker client process. This can start even if the nsrd process fails to start.
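The process check can also be scripted. This is a rough sketch that takes the text printed by ps -ef and reports which of the expected daemons are absent; the default list and the sample ps output are assumptions for illustration, and `missing_daemons` is a hypothetical name.

```python
# Daemons to look for; extend the list to suit your server.
EXPECTED = ["nsrd", "nsrmmdbd", "nsrindexd", "nsrexecd"]

def missing_daemons(ps_output, expected=EXPECTED):
    """Return the expected daemon names that do not appear in the ps output."""
    running = set()
    for line in ps_output.splitlines():
        for field in line.split():
            # Compare the last path component, so /usr/sbin/nsr/nsrexecd counts
            # as nsrexecd while "grep nsr" does not count as anything.
            base = field.rsplit("/", 1)[-1]
            if base in expected:
                running.add(base)
    return [name for name in expected if name not in running]

# Made-up ps -ef output resembling the situation described in this thread.
sample_ps = """\
    root   312     1  0   Jan 19 ?        0:02 /usr/sbin/nsr/nsrexecd
    root  4711  4500  0 14:02:11 pts/1    0:00 grep nsr
"""
absent = missing_daemons(sample_ps)
```

Because names are matched exactly, nsrmmd and nsrmmdbd are not confused with each other. Any argument on a command line that happens to equal a daemon name would be counted as well, so treat the result as a hint rather than proof.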

If you need to recover the bootstrap, then perform the following:

rename the media database with: mv /nsr/mm /nsr/mm-corrupt

start networker

Assuming it starts, you then need to load the tape with the last bootstrap written before the problem started. You can either go through the daemon.logs or through the savegroup completion reports to find out when the last bootstrap was done, and what volume it was on. If you set up your bootstrap notifications for email, then check the last one sent out. The point is, this notification will have the saveset id, file number, and record number of the bootstrap. You will need this.

Put the tape that has the bootstrap into a drive. Easiest way is to physically load it manually.

If you don't know the ssid, file and record number, then you'll have to run the following to scan the tape for the bootstraps: scanner -B (device name)

mt -f (device name) rewind

mmrecov

Then, when it asks for them, enter the device name, saveset id number, file number, and record number.

After mmrecov has completed, stop and restart networker. Then run nsrck -m and nsrim -X

You still need to address the issue of /nsr running out of space. There are a number of methods for addressing this problem. One is to move some of the index directories to other file systems, and then either create a soft link from /nsr/index to the new location, or you can specify the location in the client resource's "index path". Then move the indexes with a move command (mv) and restart networker.
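The move-and-link approach can be sketched as follows. This is an illustration only, meant to be run with NetWorker stopped; `relocate_index` is a hypothetical helper and the directory names in the demonstration are invented.

```python
import os
import shutil
import tempfile

def relocate_index(index_dir, new_parent):
    """Move index_dir under new_parent and leave a symlink at the old path."""
    index_dir = os.path.normpath(index_dir)
    target = os.path.join(new_parent, os.path.basename(index_dir))
    if os.path.exists(target):
        raise FileExistsError(target)  # never overwrite an existing directory
    shutil.move(index_dir, target)     # copies across file systems if needed
    os.symlink(target, index_dir)      # old path keeps working via the link
    return target

# Demonstration in a scratch area; the real paths would be under /nsr/index.
with tempfile.TemporaryDirectory() as scratch:
    old = os.path.join(scratch, "nsr", "index", "clientA")
    os.makedirs(old)
    open(os.path.join(old, "db"), "w").close()
    roomy = os.path.join(scratch, "bigfs")
    os.makedirs(roomy)
    new = relocate_index(old, roomy)
    is_link = os.path.islink(old)
    still_reachable = os.path.exists(os.path.join(old, "db"))
```

The alternative mentioned above, setting the client resource's "index path", avoids the symlink and records the new location in NetWorker itself.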

You can use nsrls to determine the client's index size and identify which clients have the largest indexes.
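If nsrls is not usable because the server will not start, the on-disk size of each client's index directory gives a rough ranking. The sketch below simply adds up file sizes per subdirectory; it is not a substitute for nsrls, and the function names are hypothetical.

```python
import os
import tempfile

def dir_size(path):
    """Sum the sizes in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def largest_indexes(index_root, top=5):
    """Return up to `top` (size, client) pairs, largest first."""
    sizes = [
        (dir_size(os.path.join(index_root, name)), name)
        for name in os.listdir(index_root)
        if os.path.isdir(os.path.join(index_root, name))
    ]
    return sorted(sizes, reverse=True)[:top]

# Demonstration with two made-up clients in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    for client, nbytes in [("clientA", 300), ("clientB", 100)]:
        os.makedirs(os.path.join(scratch, client))
        with open(os.path.join(scratch, client, "db"), "wb") as fh:
            fh.write(b"x" * nbytes)
    ranked = largest_indexes(scratch)
```

On the real server the call would be `largest_indexes("/nsr/index")`, using the index location named in this thread.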

Once you have freed up space, you may want to run nsrck -L6 (client name) to check the client file indexes. If any of these are corrupted, then run nsrck -L7 (client name) to recover it.

If NetWorker fails to restart after renaming the /nsr/mm directory, then you probably have corruption in your res directory. In that case you need to rename /nsr/res as well, then start networker. This will basically put you back into a new installation state. Then it's a matter of defining a tape drive, loading the tape with the bootstrap, and running mmrecov as explained above.

With all this said... shouldn't you call SUN for help on this? Also look into upgrading... Even Sun's version should have a more recent 6.x version. They even have a v7.x that is almost the same as Legato's version.

Good luck...

angelo23 · Jan 23, 2004

Hello Again,

Thank you Wallace88 for the greatly detailed help you posted... And thanks to everyone who posted their input on my problem...

Well Wallace88 - I followed your instructions line by line... After reindexing the media database (your steps #1-7), I was able to successfully start up Networker (nwadmin)... NSRD was running, as were all of the other processes!! I then proceeded to recover the bootstrap to get a non-corrupted database... It recovered my resource files into the /nsr/res.R directory and prompted me to mv them into the res directory.. However, when I go into nwadmin and look at my volumes or indexes, I have nothing... I was just going to reload our jukebox with some tapes, start labeling them, and start all over.... In other words, the scanner -B worked and gave me the SSID.... I did run the mmrecov, and all I got out of it was the resource files (res.R)... I'm going to try to run a backup now to see what kind of messages I get in the daemon.log file..... Thanks again for the help!!!

Brian

wallace88 · Jan 23, 2004

After you first reindexed the media database and successfully restarted NetWorker, you should have looked in nwadmin to see if you could see any volumes or indexes, or used the mminfo command to query the media database (for example, mminfo -B and mminfo -mv). Only if it didn't return any data, or if you suspected that the data was incorrect, would you need to recover the bootstrap.

mmrecov recovers the bootstrap, which recovers the mm and the res directories. As you discovered, the mm is put back in the same location, but the res is restored to the /nsr/res.R directory. Then, you needed to stop networker, rename /nsr/res to /nsr/res-corrupt, and then rename /nsr/res.R to /nsr/res. Then restart NetWorker.
If, after restarting NetWorker, you still do not see any information regarding the volumes or index, then it may be that the bootstrap you recovered was backed up when NetWorker was already corrupted. In this case, you will need to recover from an earlier bootstrap.
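The res swap described above (park the current res, promote res.R) can be sketched as below. It is meant to run only while NetWorker is stopped; `swap_res` is a hypothetical helper, and the parked directory name is simply a suffix chosen for the example.

```python
import os
import tempfile

def swap_res(nsr_dir, parked_name="res-corrupt"):
    """Rename res to parked_name, then res.R to res, inside nsr_dir."""
    res = os.path.join(nsr_dir, "res")
    recovered = os.path.join(nsr_dir, "res.R")
    parked = os.path.join(nsr_dir, parked_name)
    if not os.path.isdir(recovered):
        raise FileNotFoundError(recovered)  # nothing recovered, so do not touch res
    if os.path.exists(parked):
        raise FileExistsError(parked)       # keep any earlier parked copy intact
    os.rename(res, parked)
    os.rename(recovered, res)
    return parked

# Demonstration in a scratch directory standing in for /nsr.
with tempfile.TemporaryDirectory() as scratch:
    for sub, marker in [("res", "old.txt"), ("res.R", "new.txt")]:
        os.makedirs(os.path.join(scratch, sub))
        open(os.path.join(scratch, sub, marker), "w").close()
    swap_res(scratch)
    live = os.listdir(os.path.join(scratch, "res"))
    parked = os.listdir(os.path.join(scratch, "res-corrupt"))
    leftover = os.path.exists(os.path.join(scratch, "res.R"))
```

Checking for res.R before renaming anything means a failed or incomplete mmrecov leaves the existing res directory where it was.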

Without valid NetWorker index and media databases, it is still possible to back up. However, since there is a question of corruption, you may not be able to recover data. The backup data itself is still on tape, and if worst comes to worst, you can always use the scanner command to scan the tapes and then recover the data.

So if you cannot see any volume or index information, and you want to recover it, try recovering the last bootstrap again (the last one from before the crash), then move the res directory and restart NetWorker. If you don't see any info after this, try recovering an earlier bootstrap.
