
cluster issue


terry712 (Technical User)
OK, this is a few bits rolled into one.

We have a NetWare 6.5 SP2 cluster - all the latest patches, excluding the SLP patch.
The SBDs were mirrored originally, and it was a NetWare 6 cluster that was upgraded. It's two DL580s with Compaq 7 agents on them - local SYS, and then SAN storage via fibre HSG80s with Secure Path.

If I do a MIRROR STATUS, they aren't synced anymore.

One of the servers abended today with a "detected by timer" on a server process. What are the preferred settings for this? I know Marv has commented on this before, but I can't find it.

The server was out of NetWare, i.e. it was sitting on a blank screen with a flashing white cursor in the top left, but the four-finger salute into the debugger did nothing - it just hadn't rebooted. The cluster volume was comatose.
I'm suspecting this was down to the SBDs - any other thoughts?
 
16 seconds is a good starting setting; 8 is a little tight for most clusters. Sounds like either a comms issue or a hardware issue. What kind of fibre switches? Is the storage server on the same switch?
Which HBAs? If QLogic, make sure the execution throttle setting is 128 or below (the default of 256 is too high) and max LUNs = 32.
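
For reference, those timers live on the cluster object's Protocol page in ConsoleOne. This is a sketch from memory of the NCS defaults - verify the names and values against your own cluster object before changing anything:

    Cluster object --> Properties --> Protocol (ConsoleOne)
      Heartbeat:        1 sec   (how often each node says "I'm alive")
      Tolerance:        8 sec   (how long the master waits before casting
                                 a quiet node off - this is the one to
                                 raise to 16)
      Master Watchdog:  1 sec
      Slave Watchdog:   8 sec   (usually raised in step with Tolerance)
      Max Retransmits:  30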

Not familiar with CPQ hardware, so not sure how old/new the server box is.
Here are some good basic settings that may help:

In the CPU settings: disable hyperthreading. This is key on NetWare.

These are some settings we used on our IBM x345 2.6GHz cluster servers (YMMV):

Run MONITOR !H
Server --> Parameters --> Communications
Find "TCP IP Maximum Small ECBs" and change it from 1024 to 32000.
(This is a poor man's QoS for the heartbeat packets.)

find "max rcv buffers" and double it from 10,000 to 20,000
find min rcv buffers and double it from 2000 to 4000

Back out of the Communications menu and enter the Miscellaneous menu.

find "minimum service processes", up this from 100 to 500
find "max service proc" up this from 500 to 950

Check your settings by escaping out to the console prompt and typing
"display modified environment" - this will show you all the non-default SET parameters.

In startup.ncf, make sure acpidrv.cdm is remmed out on any single-CPU box (it's only for multi-CPU servers). With hyperthreading enabled it will load anyway, since the NetWare OS sees N+1 processors.
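
A quick sanity check that hyperthreading really is off: at the console prompt, the processor count NetWare reports should match the physical CPUs:

    DISPLAY PROCESSORS
    (a dual-CPU box should list 2 processors; seeing 4 means
     hyperthreading has crept back on)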

Are you using any multi-NIC configs (AFT, ALB, etc.)? If so, disable them for now and run on one NIC - it's good to remove extra variables. Speaking of NICs, how are speed and duplex? Test transfer speed to make sure it's happy (it should be at least 6-8 MB/sec).

You can also run

cluster_start displaystats

This will show the time values going member to member in seconds (along with other stuff). If those values are > 8.xxx, that could be your problem. Of course, I'm trying to run it here now and it's not working..? But I have used it before (having a senior moment?!). TID 10079737 has some command-line fu for checking out clusters.

Not sure what is going on with your mirror. What are the Compaq 7 agents? Are they "yes"-approved for NW6.5 and/or SP2? You should not (IMHO) be mirroring in software; mirroring belongs on the RAID controller card. I'm sure you know that, but.. ;)

If you could not get into the debugger, the thing really got whacked.. makes me think hardware.

jm2c, ymmv and the usual disclaimers apply. ;)
 
Still checking on some of the settings.

The DL580s are newish.
They do have two NICs, but I'm not running them teamed - just running on one.
They are dual-processor, so ACPI is loaded.

They have two fibre cards (CPQFCs).
All components are on the "yes" list.

The servers have hardware RAID 1 for the OS;
the storage on the SAN is two RAID 5 partitions.

The software mirroring is for the SBD partitions during the SBD install - as far as I know this is the only way to mirror them.

Hyperthreading is off.

I think it's SBD-related.

On the first server, MIRROR STATUS shows:
mirror object 0x15 is not active on this node
mirror object 0x1B is not active on this node

On the second server, MIRROR STATUS shows:
mirror object 0x2F is fully synchronized
mirror object 0x32 is not fully synchronized

I'm assuming I want all four lines to say fully synchronized.
The first server just doesn't know that it's mirroring;
on the second server, the first partition is fine and the second it's just not sure about.

What do I need to do - or what's the best track?
An SBD view on both shows:

cluster sbd partition mirrored on
disk [0]: compaq ra8000 id 0 lun 12 (5000-1fe1-0017-cde0)
disk [1]: compaq ra8000 id 0 lun 13 (5000-1fe1-0017-cde0)

Same epoch, SBD+, and it's live.

Thanks for any guidance.
The cluster loads OK, etc.

 
OT: whose MPIO driver are you using?

We have nearly the same setup, except I did not mirror my SBD since it's on part of the RAID-5 array. I figure (perhaps wrongly) that it's as safe as the file system. Did mirrored HDs on the servers for SYS.. everything else on the SAN. Would have been nice to add some trays and include multiple LUNs within the RAID, but.. our storage is IBM, so we use IBMSAN.CDM for multipath. I've heard various reports that the NetWare MPIO driver is not fully baked. We had inconclusive tests here, and IBM would only support their driver, so that was that. ;)

On your SBD partition.. never heard of anyone mirroring it in the OS. Might be a cosmetic issue; might be an unsupported feature? Probably your best bet is to check with the sysops on the Novell support forums (novell.support.cluster-services).. actually I think you (or your twin) already posted.. surprised no one has taken a shot at it. Not a good sign. :( Tim Heywood and the other sysops are a pretty knowledgeable bunch. Might try reposting with OS version, patches, server hardware and firmware versions - sometimes that gets folks over the hump. It would also help them to know you have a multipath fibre channel SAN setup, which MPIO you're using, etc.
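
If you want to poke at the mirror from the console while you wait, this is roughly where I'd start. MIRROR STATUS and REMIRROR PARTITION are standard console commands, but treat the sequence as a sketch and match the partition number against your own MIRROR STATUS output first:

    MIRROR STATUS
    (note the logical partition number of the out-of-sync group)
    REMIRROR PARTITION <partition_number>
    (ABORT REMIRROR <partition_number> backs it out if needed)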

Meanwhile check your other comm settings and see if that doesn't help.

On the comatose volume issue, did you run the NSS repair utilities to make sure it is in good shape?
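
In case it helps, on NW6.5 that's done per pool at the console. Verify is the read-only check; rebuild rewrites metadata, so have a backup first. Roughly (substitute your own pool name, and note the pool has to be deactivated for these to run):

    nss /pooldeactivate=POOLNAME
    nss /poolverify=POOLNAME     (read-only consistency check)
    nss /poolrebuild=POOLNAME    (only if verify reports problems)
    nss /poolactivate=POOLNAME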

 
I've had a further look.

I don't think the SBDs are correct, but I don't think it's a major issue - if it was, the cluster wouldn't load. I think the way to fix the SBDs is to leave the cluster, delete the SBDs and then recreate them - but that's maybe more hassle than it's worth. I never bothered mirroring the other clusters' SBDs; this was the first one, so we mirrored it, even though, as you say, it is already on a RAID.
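For the record, if I ever did go that route, the rough shape of it would be something like the below. This is from memory, so treat it as a sketch and check the NCS docs/TIDs first - in particular, the idea that giving sbdutil two -d devices creates a mirrored SBD is my recollection, not gospel:

    (all nodes out of the cluster first - cluster leave / uldncs per the docs)
    sbdutil -c -n CLUSTERNAME -d DEVICE1 -d DEVICE2
    sbdutil -v   (view the new SBD and check both devices show up)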

On further investigation, it looks like one server had a CPU hog (the tolerance setting on this one was 8, now 16), which meant it missed the heartbeat, so the master issued a poison pill. It therefore comatosed the volume rather than migrating it, in case it abended the server it thought was OK.

Obviously I need to find the hog (the server was being backed up remotely at the time by a Backup Exec host).

So I'm looking closely at the stats and may up a few figures, but I need to check the cluster cast-off times and the heartbeat time (any suggestions on this setting?) - and I'm checking the logical memory etc.

I'm using Secure Path for the multipathing - a minefield of an area - but it seems OK and is at the latest versions etc.

The volume mounted OK again when brought back online.
I'm always apprehensive about the NSS tools - probably because I remember all the NetWare 5 NSS horror TIDs - maybe I should reassess.
 