Core/ESS Failover

trilogy8 · Sep 13, 2017

My call servers run in a different location from my G650 gateways, separated by a WAN. There's an ESS in the location with the G650s. My LSP sites are all set to register to the PROCR of the call servers. My question: if the ESS site loses its connection to the call servers and takes over the PN (the G650s), but the LSP sites can still reach the main call servers, should that be a fully functional scenario in all locations?

trilogy8 · Feb 7, 2018

Similar scenario upcoming. My core CMs run in one location, while my ESS and only PN are in another location where there's a scheduled power-down this weekend. The core CMs will be up, but the PN and ESS will be unreachable. There are LSP sites that will be unaffected and will still have access to the core CMs. Am I going to have this issue with the core CMs going crackers because they see no IPSIs?

kyle555 · Feb 7, 2018

I suspect so, yeah. If your core is a duplex, and in "disp ips 1" you see that "ignore connectivity in server arbitration" is set to "n", then your duplex will flip flop.

Maybe busy out the IPSI gracefully first from the core CM? I don't know whether having the IPSI busied out and not pingable is good enough to keep your duplex from flip-flopping.

trilogy8 · Feb 7, 2018

What if I toggled that arbitration setting to 'y' before the power-down? Or what if I also shut down the standby core call server?

I guess the worst case is I'm going to have to disable the NRs for the LSP sites and force them local, which wouldn't be ideal.

kyle555 · Feb 7, 2018

At least 1 IPSI per system must be enabled for server arbitration.

trilogy8 · Feb 9, 2018

Would busying out or shutting down the standby call server prevent the interchanges?

kyle555 · Feb 9, 2018

trilogy8 · Feb 12, 2018

This went semi-according to plan. Busying out the standby call server did prevent attempted interchanges, and all the remote branch offices remained connected. My H.323 and SIP trunks stayed normal since they use the PE of the call servers. Shortly before power was restored to the site with the PN and ESS, one of the branch offices reported they could not make internal WAN calls. I did not have the cycles to troubleshoot it, but the only thing I can surmise is that the core CM PE is in NR1 and the branch locations only have direct WAN to NR250. The only thing in 250 was the ESS, which was not online.

kyle555 · Feb 12, 2018

Ha! That'll do it! Best practice is to put the procr in NR 250 by itself with no DSPs and make every region direct to 250. Everything else reaches everything else through intervening region 250.
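To make that region design concrete, here is a toy model of a region-to-region connectivity table in Python. It is only a sketch: the table layout, the `can_connect` helper, and branch region numbers 10 and 20 are made up for illustration, and this is not how CM stores or evaluates its network region forms.

```python
from typing import Dict, FrozenSet, List, Union

DIRECT = "direct"
# A table entry is either DIRECT or a list of intervening region numbers.
Entry = Union[str, List[int]]
Table = Dict[FrozenSet[int], Entry]


def pair(a: int, b: int) -> FrozenSet[int]:
    """Unordered key for a region pair."""
    return frozenset((a, b))


def can_connect(a: int, b: int, table: Table) -> bool:
    """True if a and b are direct, or every hop through the listed
    intervening regions is direct. An undefined pair counts as blocked."""
    if a == b:
        return True
    entry = table.get(pair(a, b))
    if entry is None:
        return False
    if entry == DIRECT:
        return True
    hops = [a, *entry, b]
    return all(table.get(pair(x, y)) == DIRECT for x, y in zip(hops, hops[1:]))


# Roughly the setup described in the thread: hypothetical branch NR 10 only
# has direct WAN to 250, and nothing is defined between NR 10 and NR 1.
as_built: Table = {pair(10, 250): DIRECT}

# The suggested layout: every region direct to 250, all other pairs
# routed through intervening region 250.
hub_250: Table = {
    pair(1, 250): DIRECT,
    pair(10, 250): DIRECT,
    pair(20, 250): DIRECT,
    pair(1, 10): [250],
    pair(1, 20): [250],
    pair(10, 20): [250],
}

print(can_connect(10, 1, as_built))
print(can_connect(10, 1, hub_250))
```

In this model the first call prints `False` and the second prints `True`, which mirrors the idea that a branch with direct WAN only to 250 has no defined path to anything sitting in NR1.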

trilogy8 · Feb 12, 2018

Had I given that remote site's NR direct WAN to NR1, that should also have fixed it, I gather. Not the proper design, of course.

trilogy8 · Feb 13, 2018

You mentioned in this thread adding an entry in the local host name resolution with the IPs for the core and ESS and assigning priority. I did try that and used that FQDN for the CM entity link. It sat in the partially up state for quite a while and never fully came up. Changing the entity link back to the single IP was fine.

kyle555 · Feb 13, 2018

Yeah, because the ESS/LSP should always be responding to SIP OPTIONS messages with 500 Service Unavailable (ESS inactive) when it's in normal mode.
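A rough way to picture why the FQDN-based entity link stayed partially up: if the monitor sends OPTIONS to every address behind the name and the standby ESS answers 500, only some of the addresses ever look healthy. The Python sketch below is a simplification under that assumption, not the SIP monitor's actual logic. The `link_state` helper, the "up" and "down" labels, and the 192.0.2.x addresses are hypothetical.

```python
from typing import Dict


def link_state(responses: Dict[str, int]) -> str:
    """Aggregate one OPTIONS status code per resolved address."""
    healthy = [ip for ip, code in responses.items() if 200 <= code < 300]
    if responses and len(healthy) == len(responses):
        return "up"
    if healthy:
        return "partially up"
    return "down"


# Hypothetical FQDN resolving to the core (answers 200) and the ESS (answers 500).
fqdn_targets = {"192.0.2.10": 200, "192.0.2.20": 500}
# Entity link pointed at the core's single IP only.
single_ip = {"192.0.2.10": 200}

print(link_state(fqdn_targets))
print(link_state(single_ip))
```

Under these assumptions the two-address link reports `partially up` for as long as the ESS stays in normal mode, while the single-IP link reports `up`.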

trilogy8 · May 14, 2018

I had a network event this past weekend and am trying to make heads or tails of what to look for as there were definite voice issues.

CM 6.x. My off-prem core procr is assigned NR1, and in location 1 are the MGs (both 650s/450s), which are also in NR1. The phones at this location are all mapped to NR1 as well. The ESS procr is set up in NR 250 with direct WAN to NR1. The site got isolated from the core, and the MGs registered to the ESS, as did the phones. During this event the phones were unable to dial basically anything: internal desk to desk, WAN calls, or PSTN calls. Since it was out of hours I was not online doing live troubleshooting and am going off documented notes.

My first suspicion was the NR setup, but even with the ESS in NR250 there was still direct WAN set up to NR1 and the associated IPs were all reachable, which should have given the phones DSP resources. Short of doing another failover test, where's a good place to poke around for a setup issue?

kyle555 · May 14, 2018

What you're basically saying is that your sets were on a CM that didn't have DSPs that could serve them.

Follow the alarm logs to see what failed, where, and when. If data center A has a core CM in NR1 and data center B has an ESS in NR 250, and a site happens to be NR1 and location 1, then if gateways and sets from that location all ended up on the ESS like you said, with direct WAN from 250-->1, there's no reason the ESS wouldn't set up calls the same way the core CM would.

Something else happened to make those phones do what they did. Maybe someone goofed a routing change and it made your gateways in some core VLAN unable to reach the main CM and hit the ESS and perhaps that change didn't affect voice vlans that your phones are on. That could have your phones still on the core CM with no DSPs which would explain being unable to make calls.
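One way to check for that kind of split after the fact is to compare where the gateways and the phones were registered. The Python sketch below assumes the registration data has already been pulled into simple name-to-server mappings by whatever means you have. The `stranded_phones` helper and every name in the sample data are hypothetical, and it ignores DSPs that might be reachable in other regions.

```python
from typing import Dict, List


def stranded_phones(gateways: Dict[str, str], phones: Dict[str, str]) -> List[str]:
    """Return phones registered to a server that has no gateways
    (and therefore no local DSPs) registered to it."""
    servers_with_dsp = set(gateways.values())
    return sorted(p for p, server in phones.items() if server not in servers_with_dsp)


# Hypothetical snapshot: gateways failed over to the ESS,
# but some phones stayed on the core.
gateways = {"g650-pn1": "ess", "g450-a": "ess"}
phones = {"ext-1001": "core", "ext-1002": "ess", "ext-1003": "core"}

print(stranded_phones(gateways, phones))
```

With this sample data the phones still on the core (`ext-1001` and `ext-1003`) are flagged, which is the split-registration pattern described above.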

In system-parameters ip-options, I believe there is an option to 'force phones and gateways to active ESS/LSP' so that if some parts of your gear can hit the core+ESS but other parts only can hit the ESS, the main CM can tell whatever's left on it that those things should go to the ESS because it's 'healthier' than the core. That would be based on the idea that some of your stuff could still actually hit the core CM to make it worth telling them to go to the ESS instead.

You've got yourself a fine mess to deal with. A right proper audit is in order to explain exactly what you've got and I'm sure it would identify some pitfalls which it sounds like you unfortunately fell into over the weekend.

trilogy8 · May 15, 2018

The PN and MG in location/NR 1 did fail over to the ESS. The sets did re-register to the ESS, and they were able to make calls among themselves. They were also able to make outbound PSTN calls, so I guess that's a positive. They were unable to make 4-digit calls to the other LSP sites using DTP, which would probably need to be troubleshot live. Same thing with 4-digit calls to other regions via AAR, where the first trunk is an H.323 trunk followed by PSTN with inserted digits. I'm wondering if for some reason the ESS still saw the H.323 trunk as in service when it really wasn't.
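The stale-trunk theory can be pictured as a route pattern that takes the first preference the server believes is in service. This Python sketch only illustrates that idea under that assumption and is not CM's real routing logic. `Preference`, `pick_trunk`, and the trunk names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Preference:
    name: str
    believed_in_service: bool  # what the server thinks
    actually_reachable: bool   # what the network really allows


def pick_trunk(pattern: List[Preference]) -> Optional[Preference]:
    """Choose the first preference the server believes is in service."""
    for pref in pattern:
        if pref.believed_in_service:
            return pref
    return None


# Hypothetical AAR pattern: H.323 trunk first, PSTN with inserted digits second.
pattern = [
    Preference("h323-to-core", believed_in_service=True, actually_reachable=False),
    Preference("pstn-insert-digits", believed_in_service=True, actually_reachable=True),
]

chosen = pick_trunk(pattern)
print(chosen.name, "reachable" if chosen.actually_reachable else "unreachable")
```

If the first preference is wrongly believed to be in service, this model never reaches the PSTN fallback, which would match 4-digit calls failing while direct PSTN calls still worked.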

One very weird report was that five 9611G phones were unable to do anything during the outage. They pull from the same web server/DHCP scopes as all the other phone models. What I have noticed is that in the system they are set up as 9630/9650 and not 9611. Since this is CM 6.3, the 9611 option is there. Also, the 46xx.settings file being used looks too old for their current firmware. Not sure why they'd work in normal conditions and not during a failover. It doesn't make much sense, but I will make changes to both of those items.

I should also put the core PROCR into NR 250, as the ESS is.
