Tek-Tips is the largest IT community on the Internet today!

IPDA goes down every day

Status
Not open for further replies.

lhiraman

Technical User
Aug 31, 2006
191
US
I have a ticket open with Siemens, but users are calling me to say that every day around 4pm all the telephones stop working. Is there anything that I can check while I wait for Siemens to respond?
 
It was a network problem before?
Has that been resolved?
When it goes down does the APESU take over?
 
The network never went down. The APE did take over. 15 minutes later it came back up. Siemens was to change out the NCUI card.
 
Also try restarting the Access Point and STMI4 cards simultaneously...
 
OK, they changed out the cards yesterday. Today at exactly 4:31 PM the switch went down and came back up 15 minutes later like it should.
 
Here is the APE Setup:
DIS-APESU:,,;
H500: AMO APESU STARTED

+------------------------------------------------------------------------------+
| CURRENT SYSTEM TIME : 08-13-2013 19:07:29 |
+------------------------------------------------------------------------------+

+------------------------------------------------------------------------------+
| CC-AP: 17 IP ADDRESS: 192.XXX.XX.1X |
| SPEED/WORKING MODE(IPDA): 100MBFD |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| AP EMERGENCY GROUP: 1 CC-AP: 17 NAME: CARL |
| THRSHLD: 100 SBMODE: AUTO |
| STABLE: 5 MIN SBBEGIN: 0 H SBENDE: 0 H SBOFFSET: 15 MIN |
+------------------------------------------------------------------------------+
| AP: 17 AP EMERGENCY GROUP: 1 CC-AP: 17 WEIGHT: 100 SWMODE: GROUP |
| CONTROL-UNIT: HOST-CC SIGNAL-PATH: LAN |
| LAST RECORDED CONNECTION STATUS CHANGE: |
| |
| HOST-CC: CONNECTED: YES CONNECTED SINCE : 2013-08-13 16:42 |
| CC-AP: CONNECTED: YES CONNECTED SINCE : 2013-08-13 16:42 |
+------------------------------------------------------------------------------+

AMO-APESU-111 CONFIGURATION OF AP EMERGENCY DATA
 
Look at your backup schedule to see when the APE units are backing up. It should be set for the early morning hours, like 2AM or something. If it's set for 1630 that may be part of the problem. The APE will perform a reload after it backs itself up, but it should not be taking the NCUI card down with it.
 
Hello.

Thanks Kevin...it turned out the APE was getting confused because all of our backups were going into the same directories. I have cleaned it up and so far no reboots during the day.
 
I changed the schedule, but still can't figure out why it reboots at 12:30 AM every night now. What am I missing??

Type          Unit    Status   Frequency  Time   Archive           S  V  O  I
mo-rmx        ALL     Enabled  Saturdays  04:00  MO-RMX            Y  N  N  N
Data          ALL     Enabled  Sundays    06:00  Hard Disk         Y  N  N  N
System        ALL     Enabled  Saturdays  06:00  Hard Disk         Y  N  N  N
System        ALL     Enabled  Daily      19:00  Backup Server     Y  N  N  N
AP Emergency  ALL     Enabled  Saturdays  00:00  AP Backup Server  Y  N  N  N
mo-rmx        AREA_E  Enabled  Sundays    20:00  MO-RMX            Y  N  N  N
hd-scr        UNIX    Enabled  Sundays    03:00  HD-SCR            Y  N  N  N
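
One way to cross-check a schedule like the one above against an observed reboot time is to list every job that starts shortly before the event. This is only a rough Python sketch run on a PC, not anything from the HiPath tooling; the function name `jobs_before` and the 60-minute window are assumptions made for illustration. The rows are copied from the table in this post.

```python
from datetime import datetime

# Schedule rows copied from the table above: (type, frequency, time, archive).
SCHEDULE = [
    ("mo-rmx", "Saturdays", "04:00", "MO-RMX"),
    ("Data", "Sundays", "06:00", "Hard Disk"),
    ("System", "Saturdays", "06:00", "Hard Disk"),
    ("System", "Daily", "19:00", "Backup Server"),
    ("AP Emergency", "Saturdays", "00:00", "AP Backup Server"),
    ("mo-rmx", "Sundays", "20:00", "MO-RMX"),
    ("hd-scr", "Sundays", "03:00", "HD-SCR"),
]


def jobs_before(event_hhmm, schedule, window_minutes=60):
    """Return jobs that start within window_minutes before the event.

    Only the time of day is compared (24-hour clock, wrapping at midnight),
    so the Frequency column still has to be checked by eye.
    """
    event = datetime.strptime(event_hhmm, "%H:%M")
    hits = []
    for job_type, frequency, start_hhmm, _archive in schedule:
        start = datetime.strptime(start_hhmm, "%H:%M")
        # Minutes from job start to the event, wrapped into 0..1439.
        minutes_before = int((event - start).total_seconds() // 60) % (24 * 60)
        if minutes_before <= window_minutes:
            hits.append((job_type, frequency, start_hhmm, minutes_before))
    return hits


for hit in jobs_before("00:30", SCHEDULE):
    print(hit)
# ('AP Emergency', 'Saturdays', '00:00', 30)
```

The only match by time of day is the AP Emergency job, but it is scheduled for Saturdays only, so on its own it would not account for a reboot every night.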

 
There is not sufficient info available on this forum to diagnose the problem 100%, but here is what I THINK I am hearing:

Shelf 17's NCUI is crashing within 15-30 minutes of the APE Backup running. Plus, this comment: "...it turned out the APE was getting confused because all of our backups were going into the same directories..." - VERY WEIRD. I think there is an IP Address conflict somewhere in the AP Backup Server or AMO SIPCO/APESU config. If you can post the following information, perhaps someone here can see the problem:

On the HOST:
Provide the IP Address of the Host's Assistant.
DIS-SIPCO:LSNET;
DIS-SIPCO:TIMING;
DIS-APESU;
DIS-UCSU:AP,1,17;

There is an AP Backup Server configured under Assistant -> Software Management -> HiPath Backup & Restore -> Administration. Take a screen capture of that entire "AP Backup Server Configuration" window. The IP Address of the AP Backup Server should be the same address as the Host's Assistant.
Post all of this info on this site so that we know that it is Clearly from the HOST system.
-----------------------------------------------------------------------------------------------------------------

On the APE
Provide the IP Address of the APE's Assistant.
DIS-SIPCO:LSNET;
DIS-SIPCO:TIMING;
DIS-APESU;
DIS-APESM;


Access the APE's Assistant, and take a screen capture of the APE's AP Backup Server. The IP Address of the APE's AP Backup Server should match the IP Address of the HOST's Assistant.
Post all of this info on this site so that we know that it is Clearly from the AP-E system.
------------------------------------------------------------------------------------------------------------

Please do not block out the IP Addresses, as that may be part of the problem.
 
FROM APE:
DIS-SIPCO:LSNET;
H500: AMO SIPCO STARTED
CENTRAL IP ADDRESSES ( LSNET ) :
-------------------------------------------------------------------
NETADDR ( NET ADDRESS OF THE HIPATH LAN SEGMENT ) : 192.6.35.0
NETMASK ( NET MASK OF THE HIPATH LAN SEGMENT ) : 255.255.255.0
DEFRT ( ADDRESS OF THE DEFAULT ROUTER ) : 192.6.35.131
CCAADDR ( ADDRESS OF THE CC-A PROCESSOR ) : 192.6.35.106
CCBADDR ( ADDRESS OF THE CC-B PROCESSOR ) : N/A
SURVNET ( NET ADDRESS OF THE SURVIVABILITY NET ) : 0.0.0.0

AMO-SIPCO-111 SYSTEM IPDA CONFIGURATION

DIS-SIPCO:TIMING;
H500: AMO SIPCO STARTED
TIMING :
-------------------------------------------------------------------
PINGTIME ( TIME FOR CHECK PAYLOAD PATH QUALITY ) : 60 SEC
RESTIME ( SELFRESET TIME AFTER SIG. CONN. LOSS ) : 300 SEC
SUPVTIME ( KEEP ALIVE TIME SUPERVISORY ) : 10 SEC
APESWDLY ( APE SWITCH OVER DELAY TIME. ) : 0 MIN
ALVTIME ( KEEP ALIVE TIME SIGNALLING ) : 60 SEC

AMO-SIPCO-111 SYSTEM IPDA CONFIGURATION
DISPLAY COMPLETED;
< DIS-APESU;
H500: AMO APESU STARTED

+------------------------------------------------------------------------------+
| CURRENT SYSTEM TIME : 08-24-2013 09:00:30 |
+------------------------------------------------------------------------------+

+------------------------------------------------------------------------------+
| CC-AP: 17 IP ADDRESS: 192.168.52 .16 |
| SPEED/WORKING MODE(IPDA): 100MBFD |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| AP EMERGENCY GROUP: 1 CC-AP: 17 NAME: CARLSTADT |
| THRSHLD: 100 SBMODE: AUTO |
| STABLE: 0 MIN SBBEGIN: 0 H SBENDE: 0 H SBOFFSET: 15 MIN |
+------------------------------------------------------------------------------+
| AP: 17 AP EMERGENCY GROUP: 1 CC-AP: 17 WEIGHT: 100 SWMODE: GROUP |
| CONTROL-UNIT: HOST-CC SIGNAL-PATH: LAN |
| LAST RECORDED CONNECTION STATUS CHANGE: |
| |
| HOST-CC: CONNECTED: YES CONNECTED SINCE : 2013-08-23 22:56 |
| CC-AP: CONNECTED: YES CONNECTED SINCE : 2013-08-23 22:56 |
+------------------------------------------------------------------------------+

AMO-APESU-111 CONFIGURATION OF AP EMERGENCY DATA
<DIS-APESM;
DIS-APESM;
H500: AMO APESM STARTED
THE CONFIGURED CC-AP NUMBER IS 17.

AMO-APESM-111 CONFIG. OF THE AP SHELF NO. OF THE CC-AP HOSTING SHELF
DISPLAY COMPLETED;

FROM HOST:

DIS-SIPCO:LSNET;
H500: AMO SIPCO STARTED
CENTRAL IP ADDRESSES ( LSNET ) :
-------------------------------------------------------------------
NETADDR ( NET ADDRESS OF THE HIPATH LAN SEGMENT ) : 192.6.35.0
NETMASK ( NET MASK OF THE HIPATH LAN SEGMENT ) : 255.255.255.0
DEFRT ( ADDRESS OF THE DEFAULT ROUTER ) : 192.6.35.131
CCAADDR ( ADDRESS OF THE CC-A PROCESSOR ) : 192.6.35.106
CCBADDR ( ADDRESS OF THE CC-B PROCESSOR ) : N/A
SURVNET ( NET ADDRESS OF THE SURVIVABILITY NET ) : 0.0.0.0

AMO-SIPCO-111 SYSTEM IPDA CONFIGURATION
DISPLAY COMPLETED;

DIS-SIPCO:TIMING;
H500: AMO SIPCO STARTED
TIMING :
-------------------------------------------------------------------
PINGTIME ( TIME FOR CHECK PAYLOAD PATH QUALITY ) : 60 SEC
RESTIME ( SELFRESET TIME AFTER SIG. CONN. LOSS ) : 300 SEC
SUPVTIME ( KEEP ALIVE TIME SUPERVISORY ) : 10 SEC
APESWDLY ( APE SWITCH OVER DELAY TIME. ) : 0 MIN
ALVTIME ( KEEP ALIVE TIME SIGNALLING ) : 60 SEC

AMO-SIPCO-111 SYSTEM IPDA CONFIGURATION
DISPLAY COMPLETED;

DIS-APESU;
H500: AMO APESU STARTED

+------------------------------------------------------------------------------+
| CURRENT SYSTEM TIME : 08-24-2013 09:03:06 |
+------------------------------------------------------------------------------+

+------------------------------------------------------------------------------+
| CC-AP: 17 IP ADDRESS: 192.168.52 .16 |
| SPEED/WORKING MODE(IPDA): 100MBFD |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| AP EMERGENCY GROUP: 1 CC-AP: 17 NAME: CARLSTADT |
| THRSHLD: 100 SBMODE: AUTO |
| STABLE: 0 MIN SBBEGIN: 0 H SBENDE: 0 H SBOFFSET: 15 MIN |
+------------------------------------------------------------------------------+
| AP: 17 AP EMERGENCY GROUP: 1 CC-AP: 17 WEIGHT: 100 SWMODE: GROUP |
| CONTROL-UNIT: HOST-CC SIGNAL-PATH: LAN |
| LAST RECORDED CONNECTION STATUS CHANGE: |
| |
| HOST-CC: CONNECTED: YES CONNECTED SINCE : 2013-08-23 22:56 |
| CC-AP: CONNECTED: YES CONNECTED SINCE : 2013-08-23 22:56 |
+------------------------------------------------------------------------------+

AMO-APESU-111 CONFIGURATION OF AP EMERGENCY DATA
DISPLAY COMPLETED;
DIS-UCSU:AP,1,17;
H500: AMO UCSU STARTED
+---------+-------------------+---------+--------------+----------+--------+
|ADDRESS |EXPECTED CONFIGUR. |STATE |LTUC MODULE |LTU-TYPE |FRMTYPE |
| +-------------------+---------+--------------+----------+--------+
| |CONNTYPE LOCATIONID LOCATION SRCGRP |
| |PHONE: FAX: |
| |LSRTADDR APRTADDR BCHL BCHLCNT PLCHECK SIGNAL. |
| |CONVLAW CONTROL UNIT TCLASS ALARMNO |
| +----------------------------------------------------------------+
| |SIGNAL. ENCRYPTION ACTIVE [ ] PAYLOAD ENCRYPTION ACTIVE [ ] |
| |SECURE STATE : SEE DESCRIPTION AT BOTTOM |
+---------+-------------------+---------+--------------+----------+--------+
|AP 1.17|ADDED |READY |Q2324-X |AP37009 |AP37009 |
| +-------------------+---------+--------------+----------+--------+
| |APNW 017 NEW 17 17 |
| |PHONE: FAX: |
| |192.006.035.131 192.168.052.001 60 60 YES LAN |
| | NO HOST-CC 0 0 |
| +----------------------------------------------------------------+
| |SIGNAL. ENCRYPTION ACTIVE [ ] PAYLOAD ENCRYPTION ACTIVE [ ] |
| |STATE : 1:[ ] 2:[ ] 3:[ ] 4:[ ] 5:[ ] 6:[ ] 7:[ ] 8:[ ] |
| | 9:[ ] 10:[ ] 11:[ ] 12:[ ] 13:[ ] 14:[ ] 15:[ ] 16:[ ] |
+---------+-------------------+---------+--------------+----------+--------+

The software configuration for backup is the same on both host and ape.
 
Without knowing that both the HOST's Assistant and the APE's Assistant have unique IP Addresses, there is no way to remotely determine if there is an IP conflict. Without having the info from the AP Backup Server, there is no way to validate that the NFS or FTP config, directory structure, etc., are set up correctly at the HOST site and the APE site. It takes only one misconfigured parameter to halt the entire AP Backup Server process. If you cannot post all of that info, I certainly understand, but that also limits our ability to help resolve the problem.
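
Once all of the addresses are collected, a duplicate check is easy to script. Here is a minimal Python sketch, assuming you keep a simple device-to-address list by hand; `normalise` and `find_conflicts` are illustrative names, not part of any Siemens tool. The addresses shown are the ones already posted in this thread, and the Assistant and AP Backup Server entries are deliberately left out because they have not been posted.

```python
import ipaddress
from collections import defaultdict


def normalise(address):
    """Strip zero padding such as 192.006.035.131, as printed by DIS-UCSU."""
    return ".".join(str(int(part)) for part in address.strip().split("."))


def find_conflicts(inventory):
    """Return {ip: [devices]} for every address assigned to more than one device."""
    by_ip = defaultdict(list)
    for device, address in inventory.items():
        # ip_address() rejects malformed entries after normalisation.
        ip = ipaddress.ip_address(normalise(address))
        by_ip[str(ip)].append(device)
    return {ip: sorted(devs) for ip, devs in by_ip.items() if len(devs) > 1}


# Addresses taken from the DIS-SIPCO / DIS-APESU / DIS-UCSU output in this thread.
inventory = {
    "default router (DEFRT / LSRTADDR)": "192.006.035.131",
    "CC-A processor (CCAADDR)": "192.6.35.106",
    "CC-AP 17": "192.168.52.16",
    "AP 17 router (APRTADDR)": "192.168.052.001",
    # Add the HOST Assistant, APE Assistant and AP Backup Server here
    # once their real addresses are known.
}

print(find_conflicts(inventory))  # {} means no duplicates among the listed devices
```

An empty result only says the listed devices do not collide with each other; it cannot say anything about devices that are missing from the list.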

Regarding APESU parameter "STABLE": someone apparently has changed this value since your post from August 13th, as then it was "5", now it is "0". I recommend that no more changes be made to the Timing parameters: they very much matter!!!!

From your SIPCO information, I know that your HiPath 4000 has only one Switching Unit; therefore SWU redundancy is out of the picture. As I explain things below, I have NOT factored in a redundant SWU.

When the HiPath 4000 enters "APE" mode for your shelf 17, control will be returned to the HOST automatically because of your APESU config. Your "time window" (parameters SBBEGIN=0 and SBENDE=0) is configured as "around the clock". Thus any time the APE takes control of IPDA 17, the APE will begin looking to return control to the HOST at 15 minutes after the top of the next hour (because of SBOFFSET=15).
With parameter "STABLE=0", there will be no testing of the HOST's IPDA port. If there is a bad connection at the Host's Switching Unit "IPDA" port, the APE will not care, and the APE will attempt to hand control back to the Host. If that IPDA port is truly down, then the HOST will not be able to control IPDA 17, and the entire cycle will repeat itself. Thus your users will bounce from Host to APE until someone fixes the problem. This situation can be eliminated by setting the parameter "STABLE" with a value "5", such that the HOST's Switching Unit's IPDA port must return a positive PING result continuously for 5 minutes before the APE attempts to return control to the HOST.
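
The idea behind "STABLE" can be pictured with a few lines of Python. This is only a toy model of the behaviour described above, not the actual HiPath logic; the function name `stable_for` and the one-result-per-60-seconds spacing are assumptions made for the example.

```python
def stable_for(ping_results, stable_minutes, interval_seconds=60):
    """True if the newest ping results were all positive for stable_minutes.

    ping_results: list of booleans, oldest first, one per interval.
    stable_minutes=0 models STABLE=0, where no test is performed at all.
    """
    needed = (stable_minutes * 60) // interval_seconds
    if needed == 0:
        return True  # no stability test: hand-back is attempted regardless
    if len(ping_results) < needed:
        return False  # not enough history yet
    return all(ping_results[-needed:])


history = [True, True, False, True, True]

print(stable_for(history, stable_minutes=0))  # True
print(stable_for(history, stable_minutes=5))  # False
```

With STABLE=0 the failed ping in the history is ignored, which is exactly the situation where users can bounce between Host and APE. With STABLE=5 the same history blocks the hand-back until five consecutive positive results are seen.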

The APE's "time window" was originally engineered so that if the APE has taken control of IPDA 17, the AUTO MODE would allow the customer to return control of IPDA 17 to the HOST during an "off-hours window", so that users will not be "double" impacted. By setting SBBEGIN and SBENDE to "0", your users are subject to being disrupted twice whenever your network hiccups: once when the NCUI asks the APE to take control, and again when the APE returns control to the HOST at the next XX:15 o'clock. Obviously you as the customer can set up AMO APESU however you desire, but from my perspective the APE's "around the clock" setup is not efficient.
EXAMPLE of EFFICIENTLY CONTROLLING APE SWITCHOVER: If you set SBBEGIN=4 and SBENDE=6 with SBOFFSET=15, then between 4:15am and 6:15am the APE will return control to the HOST, hopefully not impacting the users.
Is there a reason that you need to abort out of APE mode so quickly? APE was designed as an alternative controller when the Host is not available. The APE is not an exact replacement for the Host, but it should suffice for the remainder of the day, thus limiting the number of daily IPDA reboots to "1".
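
To see what a given SBBEGIN/SBENDE/SBOFFSET combination means for the users, the hand-back time can be estimated with a short Python sketch. This is a simplified model of the behaviour described in this post, not the real APE scheduler: it assumes the APE tries at HH:SBOFFSET of each hour inside the window, that SBBEGIN=SBENDE=0 means around the clock, and that the window does not wrap past midnight. The function name `next_switchback` is made up for illustration.

```python
from datetime import datetime, timedelta


def next_switchback(now, sbbegin, sbende, sboffset):
    """Earliest HH:sboffset after `now` that falls inside the switch-back window."""
    around_the_clock = sbbegin == 0 and sbende == 0
    candidate = now.replace(minute=sboffset, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)
    # Walk forward hour by hour; 48 steps always covers the next full day.
    for _ in range(48):
        if around_the_clock or sbbegin <= candidate.hour <= sbende:
            return candidate
        candidate += timedelta(hours=1)
    return None  # window never matched (e.g. SBBEGIN greater than SBENDE)


took_over = datetime(2013, 8, 13, 16, 31)

print(next_switchback(took_over, 0, 0, 15))  # 2013-08-13 17:15:00
print(next_switchback(took_over, 4, 6, 15))  # 2013-08-14 04:15:00
```

Under these assumptions the "around the clock" setting hands control back during business hours, while the 4-to-6 window defers the second disruption to the early morning.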

Regarding WHY your IPDA is going crazy - it is important to know how the APE takes control. The HOST's active Switching Unit sends TCP-based keep alive messages to all IPDAs at a rate determined by this formula:
SIPCO Timing parameter "SUPVTIME"/8. At your site, SUPVTIME=10 seconds; therefore, the Keep-Alive messages are sent every 10/8 seconds, which equates to every 1.25 seconds.
If there is any disruption or network traffic issue between the HOST's Switching Unit's IPDA port and your IPDA 17 NCUI that results in non-delivery of a Keep-Alive message, that message is retransmitted. If IPDA 17's NCUI cannot acknowledge the Keep-Alive, due to network congestion or some other issue, for a time period equal to "SUPVTIME" (10 seconds at your site), then the HOST enters Signaling Survivability mode for that IPDA, regardless of whether Signaling Survivability has been purchased. It is merely the internal name of the mode.
When Signaling Survivability mode is triggered, IMMEDIATELY another Time parameter (SIPCO->Timing->ALVTIME) begins a countdown, which at your site is "60 seconds". During that 60 seconds, if your network problem(s) clear, then IPDA 17 will come back online immediately. Also during that 60 second countdown, calls already UP will not be interrupted, but no features can be used, e.g. HOLD, TRANSFER, etc, because the Switching Unit is not available to process that Feature Request. Also, no NEW calls are allowed. If that ALVTIME 60 second timer expires, then a third parameter (RESTIME=300 seconds at your site) begins. The big change here is: there is no recovery at this point, even if the actual problem is repaired. When this RESTIME parameter expires, ALL CALLS WILL BE DROPPED -> then either (1) the NCUI will reboot (if there is no Signaling Survivability or APE), or (2) the Signaling Survivability process will take control of the IPDA (if this feature is purchased, AND if all the required AMO configuration has been performed, AND the Signaling Survivability router is properly configured and ACCESSIBLE), or (3) the IPDA will ask the APE to take control.
What is the purpose of extending the "RESTIME" time parameter to 300 seconds? When the problem has reached this point, the IPDA will NOT come back up, even if the problem is resolved, until one of the three above-mentioned actions happens! Therefore extending this time parameter actually extends the period of time where no users can place new phone calls at the IPDA.
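
To put numbers on the timer chain described above, here is a small Python sketch. It does nothing more than add up the SIPCO timing values in the order given in this post; the function name and dictionary keys are made up for illustration and do not come from the HiPath.

```python
def failover_timeline(supvtime, alvtime, restime):
    """Seconds after the last acknowledged Keep-Alive at which each stage begins."""
    return {
        "keepalive_interval": supvtime / 8,           # Keep-Alive sent every SUPVTIME/8
        "survivability_mode_at": supvtime,            # no acknowledgement for SUPVTIME
        "recovery_window_ends_at": supvtime + alvtime,  # ALVTIME countdown expires
        "calls_dropped_at": supvtime + alvtime + restime,  # RESTIME expires
    }


# Values from DIS-SIPCO:TIMING at this site.
for stage, seconds in failover_timeline(supvtime=10, alvtime=60, restime=300).items():
    print(f"{stage}: {seconds} s")
# keepalive_interval: 1.25 s
# survivability_mode_at: 10 s
# recovery_window_ends_at: 70 s
# calls_dropped_at: 370 s
```

So with RESTIME=300, a disruption that outlasts the 70-second mark leaves the IPDA unable to place new calls for a further five minutes before anything takes over.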

I assume that your company did not purchase Signaling Survivability. You can quickly see by typing: DIS-CODEW; If "Signaling Survivability"=0, then it was not purchased. Plus, there is additional AMO APRT configuration required, an additional router with modem must be installed, configured, and a matching modem must be connected to the IPDA NCUI board's "MODEM" port.

Signaling Survivability was designed to handle problems in the CUSTOMER's network. The AP-E was designed to handle problems within the HiPath LAN Segment, such as the failure of the Host's Switching Unit's "IPDA" port. If this port fails, then the Switching Unit cannot reach the Signaling Survivability router. Thus, the BEST solution that covers both the HiPath LAN segment AND the customer's network is: Signaling Survivability AND AP-E.

To summarize, if the network between your HOST/Switching Unit and the IPDA 17 is disrupted for 70 seconds (SIPCO Timing parameters SUPVTIME + ALVTIME), then your IPDA will switch to control by the APE 5 minutes later because of the parameter "RESTIME=300 seconds". If this switchover occurs at 4pm, then because of the APESU "around the clock" configuration plus a 15-minute offset, the APE will begin to attempt to return control to the HOST at 4:15pm. So your users experience a double-disconnection if the switchover occurs during 8am - 4pm hours, assuming your business ends at 5pm.

Your IPDA is a Networked Shelf (see AMO UCSU). This means that the IPDA is in a different Network than the HOST, and routers are needed to route the Signaling and Voice/Payload to/from the IPDA. Often network engineers will use a WAN between the HOST and a remote location. A WAN typically has a bandwidth limitation to conserve money. Is it possible that the bandwidth between your main site and this remote site is completely exhausted at certain periods during the day, which could be triggering the NCUI to switchover to APE control??

In a recent post you mentioned that the AP Backup had stopped working. What was the cause of this failure? During the APE installation, the vendor SHOULD have tested this AP Backup thoroughly, and also switched between HOST mode and APE mode using EXE-APESU before leaving the site. When your system was installed, the AMO SIPCO -> Timing parameters should have been discussed with the Communications Manager for maximum efficiency. In my opinion, these parameter settings at your site are no longer efficient. I believe the problem is being caused by the network, most likely network congestion triggering the APE mode. There could also be IP Address conflict(s), as I see that you are using a Private Network with STATIC IP Addresses, which can lead to duplication/conflicts if the IP Addresses are not properly managed.

Here is my final theory: is it possible that someone, perhaps on a cleaning crew, is unplugging the AC plug for equipment at the Host location or remote location, which is triggering the switchover to APE? A new cleaning employee/contractor may need access to an AC outlet for the vacuum cleaner -> unplugs something at the Host which kills your Switching Unit's Layer 2 switch, which kills the Private LAN to your IPDA, which causes the switchover to APE. Or, something at the remote location is being unplugged that kills the IPDA 17's network connection to the Host.
You should be able to use STA-HISTA or Assistant -> Diagnostics -> HiPath System Diagnosis to search for daily failure(s) of the Switching Unit's IPDA port.

Note: if your company had a Service Contract with an approved vendor, this problem MOST LIKELY would have been resolved soon after the symptoms were first detected.

The Access Point-Emergency server is very effective IF properly designed and implemented. I hope this BASIC information provides you with a new level of understanding as to how these many parts & pieces must intermingle to form a perfect solution. If mis-managed, this solution can easily fail!
 