Core/ESS Failover

trilogy8 · Sep 13, 2017

My call servers run in a different location from my G650 GWs, separated by a WAN. There's an ESS in the location with the G650s. My LSP sites are all set to register to the PROCR of the call servers. My scenario and question: if the ESS site loses connection to the call servers and takes over the PNs (G650s), but the LSP sites can still reach the main call servers, should that be a fully functional scenario in all locations?

kyle555 · Sep 13, 2017

Depends on your dial plan and setup

If no site ever calls over the WAN and each one goes in and out on local trunks, sure.

If you're central SIP trunking at the main site with the call servers, the LSP sites would probably be fine too.

If you're central PRI trunking from the G650s, you're in bad shape!

Do your H323 phones register to CLANs at the ESS site? Procr? If you're using CLANs at the ESS site for all your H323 phones, are they in the same network region as procr?
If you are using CLANs and they're not in the same NR as procr, when the ESS site dies, the phones don't know to go to procr, they'll go to the alternate gatekeeper list in their network region, which includes their LSP. The problem with that is that an LSP CM server only goes live when a gateway registers to it - so, if your gateways are on procr and phones on CLANs, the gateway never hits the LSP so the phones can't either. ESS down = total network outage.

Conversely, if your gateways have gateway list "CLAN,CLAN,LSP" and CLANs are in the same NR as procr, then when the G650s die, the gateways go into LSP mode but the phones register back to procr on the main so you've got 1 PBX with sets and many little PBXs with trunks.

Whatever you do, be consistent - procr for phones+gateways, or CLANs for both - don't mix and match. Where's Session Manager/SIP trunking in the mix? SIP failover is a different beast altogether and has to layer atop CM's H323/H248 failover.
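To make the dependency in this post concrete, here is a minimal Python sketch of the rule described: an LSP only starts serving once a gateway registers to it, so phones that fall back to the LSP find nothing there if their gateway is still homed somewhere else. The class, function, and site names are made up for illustration and are not anything CM exposes.

```python
from dataclasses import dataclass, field


@dataclass
class Lsp:
    """Toy model of an LSP: idle until a media gateway registers to it."""
    name: str
    gateways: set = field(default_factory=set)

    @property
    def active(self) -> bool:
        # Per the post above: the LSP only goes live when a gateway registers.
        return bool(self.gateways)

    def register_gateway(self, gw: str) -> None:
        self.gateways.add(gw)


def phone_outcome(lsp: Lsp) -> str:
    """Phones can only land on an LSP that is already in service."""
    return f"registered to {lsp.name}" if lsp.active else "no service"


# Mixed design: gateway still homed on procr, phones fall back to the LSP.
mixed = Lsp("lsp-site-c")
print(phone_outcome(mixed))

# Consistent design: the gateway fails over to the LSP first, phones follow.
consistent = Lsp("lsp-site-c")
consistent.register_gateway("gw-site-c")
print(phone_outcome(consistent))
```

The mixed case is the "total network outage" situation: the phones have an LSP in their list, but nothing ever woke it up.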

trilogy8 · Sep 13, 2017

The sites call over the WAN to other offices (4-digit dialing), and for outbound PSTN they have local circuits. All phones in all sites are set to register to the PROCR of the call servers. My PROCR NR is 250 (I've not fully understood that one technically). The phones are in NR 1, as are the CLANs.

kyle555 · Sep 13, 2017

That's good! 250 is supposed to be direct WAN to everything and every other region is supposed to be indirect to one another thru intervening region 250. That's how you set up a hub and spoke with CM.
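As a rough illustration of that hub-and-spoke idea, the sketch below treats region 250 as the hub and every other region as a spoke. It is only a toy model of the connectivity rule as stated here, not of CM's actual network-region form, and the function name is hypothetical.

```python
HUB = 250  # region holding procr, per the thread


def connectivity(src: int, dst: int, hub: int = HUB) -> str:
    """Return how two network regions reach each other in a hub-and-spoke layout."""
    if src == dst:
        return "same region"
    if hub in (src, dst):
        return "direct"
    # Spoke-to-spoke traffic goes through the intervening hub region.
    return f"indirect via {hub}"


print(connectivity(1, 250))  # direct
print(connectivity(1, 8))    # indirect via 250
```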

If the phones go to procr, that's fine. I'd suggest maybe putting your CLANs in another region. I know CM will load balance all registrations on CLANs across CLANs in the same network region.
I don't know if phones in NR1 to procr in 250 would have procr send the phones the CLANs in NR1 as available gatekeepers as well. It might speed up failover.

Only other thing you can add is dial plan transparency across the sites. Basically, when in failover, each region has a DID you program, so when site A dials site B by 4 digits, CM dials site B's DID instead. It's like a little auto attendant in the background that CM uses to preserve short dialing across sites whose WAN has failed.
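A small sketch of the dial plan transparency idea, under loud assumptions: the extension prefixes, DIDs, and function names are all invented for illustration, and real CM does this through its own translations rather than anything resembling this code. It only shows the decision being described: short dial normally, fall back to the far site's DID when the WAN path is down.

```python
# Hypothetical data: an extension prefix per site, plus a per-site DID that is
# used only while the WAN path between regions is down.
SITES = {
    "siteA": {"ext_prefix": "41", "dpt_did": "+12025550141"},
    "siteB": {"ext_prefix": "42", "dpt_did": "+12025550142"},
}


def site_for_extension(ext: str) -> str:
    """Find which site owns a 4-digit extension."""
    for name, site in SITES.items():
        if ext.startswith(site["ext_prefix"]):
            return name
    raise ValueError(f"no site owns extension {ext}")


def route_call(src_site: str, ext: str, wan_up: bool) -> str:
    """Keep 4-digit dialing working by rerouting over the PSTN in failover."""
    dst_site = site_for_extension(ext)
    if dst_site == src_site or wan_up:
        return f"dial {ext} directly"
    did = SITES[dst_site]["dpt_did"]
    return f"dial {did} over PSTN, then extend to {ext}"


print(route_call("siteA", "4205", wan_up=True))
print(route_call("siteA", "4205", wan_up=False))
```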

trilogy8 · Sep 18, 2017

In this scenario, should the PROCR of both the core servers and the ESS be in 250? Dealing with the core site seems a bit more clear-cut. The remote sites seem a bit trickier. In this specific example my core servers can get isolated from the ESS site, where the main users and trunking are, while the remote sites still have connectivity to the core servers. In that case the core servers are still reachable by the remote office phones/GWs, but have no control of the PNs. If I'm thinking about it correctly:

remote office A / NR 8. goes through 250 to get to any region. mgc list is that of the core PROCR, ESS and LSP. Phone registration: core, ess, lsp.

core gets isolated from ESS and ESS takes over the PN. ESS site with majority of users and G650's recover. Remote office phones remain unchanged and registered to core PROCR, along with MG since that site still has access to the core. When that remote site tries to call another site it sounds like it will follow the PROCR of the core servers to the destination region x.

kyle555 · Sep 18, 2017

You can specify the survivable procr's network region, but when it has no DSPs, it's not terribly relevant.

So, suppose ESS/PN site loses WAN, everything else for the other sites is the same - they should set up audio direct between one another's sites - nothing's changed.

Procr's network region doesn't matter so long as it has no DSPs because it'll bias to DSPs in its own NR first. Now, if NRs 1-10 are all direct to 249 and 250 and indirect to each other 1-10, it doesn't matter which procr is running the network - the calls will set up the same.

So, if the site with the servers loses WAN, everything goes to the ESS. If the ESS site loses WAN, it's all by its lonesome and the rest of the network isn't changed.

Now, if the ESS NR is not 250 and the branch sites only connect indirect thru 250 to other branches, then you'd have a problem. Generally speaking, that's why you'd want a single network region for all procrs - they're just the hub of the hub and spoke topology. The most important thing is having consistency in where your sets and gateways register to, and in which order to make it smooth.

Got any SIP in there with SM? It's a whole other ballgame that needs to layer atop CM's H323/H248 failover.

trilogy8 · Sep 18, 2017

So, the core servers can be the source of control for the remote MG's/phones while the ESS can go active, take over the PN's and maintain control of that main site? I guess I'm confusing terms like split brain where both the core and ESS are active at the same time, while it sounds like they can be.

I had a scenario happen last week where the WAN between the core and the site with the ESS had extreme packet loss. The circuit didn't go down nor flap, and was just experiencing massive packet loss. The conditions for a failover to ESS didn't happen because there was still connectivity, but it was enough to disrupt all the service. Whatever was happening on that link caused the core servers to interchange. Not sure why that happened, but it was all related to that network incident. Had the circuit gone down BGP would have kicked in and there'd have been no disruption since there are backup circuits. While that issue was in the midst of being troubleshot by networking I forced the takeover of the PN's to ESS. That normalized everything for the most part, while the remote branches remained connected to the core servers since the WAN links to/from there weren't affected. One of the remote sites complained of dialing issues, but before any troubleshooting the networking team took the troubled circuit out of the mix and I forced back to normal since there were good backup links. Of course the post mortem was to re-test failover so I'm just making sure my setup is the way it should be.

kyle555 · Sep 18, 2017

Yup. If site A with just servers loses WAN, ESS takes over everything. If ESS loses WAN and siteA is still up, split brain and everything's on A but the ESS manages the PNs and sets there.
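The two cases above can be sketched as a simple "first reachable server wins" rule per site. This is a simplification under stated assumptions: the site names, the core/ESS/LSP priority order, and the reachability sets are illustrative, and real CM failover also involves timers that are ignored here.

```python
def controller(priority, reachable):
    """Return the first server in priority order that the site can reach."""
    for server in priority:
        if server in reachable:
            return server
    return None


def who_runs_what(reach_by_site, priority=("core", "ess", "lsp")):
    """Map each site to the server that ends up controlling it."""
    return {site: controller(priority, reach) for site, reach in reach_by_site.items()}


def is_split_brain(result):
    """More than one server actively controlling sites means separate PBXs."""
    return len({srv for srv in result.values() if srv is not None}) > 1


# Site A (servers only) loses WAN: nobody else can reach the core.
site_a_down = who_runs_what({"siteB": {"ess"}, "branchC": {"ess", "lsp"}})

# ESS site (site B) loses WAN: branches still reach the core.
site_b_down = who_runs_what({"siteB": {"ess"}, "branchC": {"core", "lsp"}})

print(site_a_down, is_split_brain(site_a_down))
print(site_b_down, is_split_brain(site_b_down))
```

In the second case site B runs on the ESS while the branches stay on the core, which is the split-brain picture described in the post.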

IPSIs are sensitive to timing. They abstract the old TDM control messages into IP for the CM server to manage. Needless to say, they're not tolerant of a bad WAN. CM's server arbitration (the thing that decides which server in a duplex pair is live) must consider at least 1 IPSI as a condition to flip. If you "disp ips 1" you'll see an "ignore connectivity in server arbitration" field - setting it to "yes" is a good idea for sites remote to the main server, but you must absolutely have one IPSI available for that.

Bad WAN is worse than no WAN at all. Actually, come to think of it, your CM servers at site A may perpetually flip back and forth if siteB's WAN goes down because it won't see any IPSIs. I'm not sure you'd ever get around siteA's 2 servers flipping back and forth if the ESS site loses WAN and is the only one with IPSIs.

I'd say make the site with PNs your primary, ESS solo.

trilogy8 · Sep 18, 2017

I'd do as you suggested, but we'd have to use physical servers in the site with the PN's. No virtual capabilities there. The ESS is physical, but it's just 1 server.

Because of that server interchange issue it sounds like it's almost best to shut 1 of the 2 servers down, or both of them to force all the remote sites to the ESS.

trilogy8 · Sep 19, 2017

Also, should the phones in the site with the ESS be in NR 250 or 1, or doesn't matter? The CP's in the G650's are all in NR 1, as are the phones currently. These G650's are being replaced with G450's in a few weeks, so not sure how much that changes anything.

kyle555 · Sep 19, 2017

Phones should never be in 250.
Good thing you're going to G450s - I don't think the duplex CM can have IPSIs AND not interchange repeatedly when none of them are connected to the system. That'd be something to test, as the servers interchange when one's state of health is better than the other's.

You'll be in better shape with the 450s

trilogy8 · Sep 20, 2017

In the case of the G450's you're saying the duplex servers wouldn't interchange and in that case those can be active at the same time as the ESS?

trilogy8 · Sep 20, 2017

The only other interim step that sounds logical is to change the PROCR ip-interface to prevent both h.323 and h.248 GW's from being able to register to it. If that's the case and the ESS goes active in the main location, the CLANs would still be accessible to the remote offices. Once the G650's are replaced then I'd have to alter that setup again.

kyle555 · Sep 20, 2017

I wouldn't go back to CLANs if you're already procr.

I know CM load balances registrations against all active GKs in the same region, but I don't know whether, when you point your NR1 phones to procr in DHCP, CM sends them the alternate gatekeeper list with the CLANs in NR1.

Go all CLAN or all procr. If you've only got to live with it a short time longer anyway, just wait it out.

trilogy8 · Sep 21, 2017

We include the PROCR, CLANs, ESS and LSP (where applicable) in the DHCP string, but I was also told that once a phone registers it learns the registration addresses.
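For readers who haven't built one of these strings, here is a hedged sketch. It assumes the usual MCIPADD/MCPORT format that Avaya H.323 sets read from a site-specific DHCP option. The addresses are placeholders from the RFC 5737 documentation ranges, not anything from this network, and the helper name is hypothetical.

```python
def build_option_string(gatekeepers, port=1719):
    """Join gatekeeper IPs, in registration order, into an MCIPADD-style string."""
    if not gatekeepers:
        raise ValueError("at least one gatekeeper address is required")
    return f"MCIPADD={','.join(gatekeepers)},MCPORT={port}"


# Placeholder addresses, ordered PROCR, two CLANs, ESS, LSP.
gks = ["192.0.2.10", "192.0.2.21", "192.0.2.22", "198.51.100.10", "203.0.113.10"]
print(build_option_string(gks))
```

The order of the list is only the phone's starting point. Whatever it registers to first can hand it a different list.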

With the G450 design, does that allow both the core and ESS servers to be active at the same time, since the IPSI issue would be eliminated?

I'll have a look, but is it overly complicated to convert/eliminate PN's to G450's? I'm versed in configuring and getting G450's online, and know how to remove PN translations. Is that all that is involved or are there other translation areas that need to be touched?

Thank you for your input on this, cheers.

kyle555 · Sep 21, 2017

Once the phone hits the first thing that'll let it register, that gatekeeper sends the phone the GK list it'll use and the order/priority. So, if CM only sends as "primary" GKs those that are in the same network region as the gatekeeper, then procr 1st in 250 will make sure phones never hit CLANs if the CLANs aren't in the same region as procr.

Run "status sockets" to see if you've got H323 phones on procr only, or "list registered" to check.

But yeah, both can always be live and will be live if an IPSI or gateway asks for service. So, if site B's WAN dies, its new 450s will kick its simplex ESS into service.
If you know how to remove port networks you should be fine.
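The "if" in this post can be written out as a tiny filter. This models only the hypothesis stated above (primary gatekeepers are those in the same network region as the one the phone first registered to). It is not confirmed CM behavior, and the names and region numbers are just the ones used in this thread.

```python
from typing import NamedTuple


class Gatekeeper(NamedTuple):
    name: str
    region: int


def primary_gatekeepers(first_hit: Gatekeeper, all_gks: list) -> list:
    """Gatekeepers sharing the region of the one the phone registered to first."""
    return [gk.name for gk in all_gks if gk.region == first_hit.region]


procr = Gatekeeper("procr", 250)
clan1 = Gatekeeper("clan1", 1)
clan2 = Gatekeeper("clan2", 1)
gks = [procr, clan1, clan2]

# Phone pointed at procr first: under this assumption it never learns the CLANs.
print(primary_gatekeepers(procr, gks))

# Phone pointed at a CLAN first: it would be balanced across the CLANs only.
print(primary_gatekeepers(clan1, gks))
```

Comparing this against what "list registered" actually shows is the quickest way to see whether the assumption holds on a given system.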

trilogy8 · Sep 21, 2017

I do have phones registered to the PROCR (250) and CLAN (1). 80% are registered to PROCR and the other 20 is split even between the CLANs.

trilogy8 · Nov 7, 2017

If I can pick your brain on this again.. I have the new G450's registered and am almost done with moving all of the services off of the G650's on to it. I'll be looking to remove the PN's in the next week or so and want to just confirm some of the information you provided above, if you don't mind.

During the change window I want to register the new G450's to the ESS, which is local in that office (site B). Which I'm assuming will make the ESS go live. At the same time I'd have the duplex servers (site A) as the 2nd choice in the MGC list.

Alter DHCP for the site B phones to use the ESS PE as the registration point and the duplex PE as the second choice. It didn't sound like it mattered which one a phone picks to register to in a normal scenario.

Continue to have the remote LSP MG's / phones register to the duplex servers, with the ESS as the backup and itself as the tertiary.

I guess my thought was since there would be no more IPSI's and the core servers wouldn't interchange I can have site B run everything local in its office and the remote branch offices hang off the duplex servers.

kyle555 · Nov 7, 2017

oh god no.

Core primary CM is core primary CM. You can't have 2. Having a core CM and an ESS live is bad news - split brains - 2 separate PBXs.

Also, you can't save translations on the ESS ever - so you can't save changes.

If you really wanted to, I suppose you could, in the server roles, make siteA an "ess" and siteB a "coreCM". Not sure if your licensing at 6.3 would permit flipping like that...it might/should.

Then you'd load a backup of just the XLN (call processing database) on the server at siteB.

Or, just reinstall'em from scratch to accomplish that and reload the XLN at the site you want to be primary. Unless you've stood up CMs and configured ESSs and are comfortable with it, don't. If you thought you could have a core and ESS live as a sunny-day thing - and maybe I misunderstood you - but if you thought you can do that, you shouldn't be messing with that.

And, DHCP isn't the be-all-end-all of where sets register. DHCP points the set to a gatekeeper. Upon registering, the gatekeeper provides the priority list of how/where phones should register. I'd suspect an ESS would still tell your phones "go to core, then to ESS", so the phones at siteB, even if they first register to the ESS there, would get kicked back to the core at siteA because that's seen as "normal and best" and to try that first.
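A minimal sketch of that two-step behavior, assuming (as suspected above, not confirmed) that every gatekeeper hands back the same "core first, then ESS" priority list. The function and server names are hypothetical.

```python
def final_registration(dhcp_list, priority_from_gk, reachable):
    """Pick where a phone ends up after its first contact hands back a priority list."""
    # Step 1: DHCP only decides which gatekeeper the phone talks to first.
    first_contact = next((gk for gk in dhcp_list if gk in reachable), None)
    if first_contact is None:
        return None
    # Step 2: the gatekeeper's own priority list decides where the phone settles.
    for gk in priority_from_gk:
        if gk in reachable:
            return gk
    return first_contact


# DHCP says "ESS first", but the list handed back says "core first".
print(final_registration(["ess", "core"], ["core", "ess"], {"core", "ess"}))

# Same DHCP scope with the core unreachable: the phone stays on the ESS.
print(final_registration(["ess", "core"], ["core", "ess"], {"ess"}))
```

So reordering the DHCP string alone would not pin site B phones to the ESS while the core is reachable.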

trilogy8 · Nov 8, 2017

Well, glad I asked. I thought an ESS could be asked for service even while it's still registered to the core. And with no more G650s/IPSIs in the mix I thought there'd be a bit more flexibility. Thinking about the various scenarios: remote site C loses connection to site A, but still has connection to site B. That would kick the ESS active for that site C branch, while the ESS is still registered to the core.
