Without knowing that both the HOST's Assistant and the APE's Assistant have unique IP Addresses, there is no way to remotely determine if there is an IP conflict. Without the info from the AP Backup Server, there is no way to validate that the NFS or FTP config, directory structure, etc., are set up correctly at the HOST site and the APE site. It takes only one mis-configured parameter to halt the entire AP Backup Server process. If you cannot post all of that info, I certainly understand, but that also limits our ability to help resolve the problem.
Regarding APESU parameter "STABLE": someone has apparently changed this value since your post from August 13th. Then it was "5"; now it is "0". I recommend that no more changes be made to the Timing parameters: they very much matter!
From your SIPCO information, I know that your HiPath 4000 has only one Switching Unit; therefore SWU redundancy is out of the picture. As I explain things below, I have NOT factored in a redundant SWU.
When the HiPath 4000 enters "APE" mode for your shelf 17, control will be returned to the HOST automatically because of your APESU config. Your "time window" (parameters SBBEGIN=0 & SBENDE=0) is configured as "around the clock". Thus any time the APE takes control of IPDA 17, the APE will begin looking to return control to the HOST at the next 15 minutes past the hour, i.e. the next XX:15 (because of the SBOFFSET=15).
With parameter "STABLE=0", there will be no testing of the HOST's IPDA port. If there is a bad connection at the Host's Switching Unit "IPDA" port, the APE will not care, and the APE will attempt to hand control back to the Host. If that IPDA port is truly down, then the HOST will not be able to control IPDA 17, and the entire cycle will repeat itself. Thus your users will bounce from Host to APE until someone fixes the problem. This situation can be eliminated by setting the parameter "STABLE" with a value "5", such that the HOST's Switching Unit's IPDA port must return a positive PING result continuously for 5 minutes before the APE attempts to return control to the HOST.
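To make the STABLE rule above concrete, here is a minimal sketch of the check as I have described it. This is NOT HiPath code: the function name and the idea of one PING sample per minute are my own assumptions for illustration only.

```python
def host_port_is_stable(ping_results, stable_minutes):
    """Simplified model of the APESU STABLE check (illustrative only).

    ping_results: list of booleans, oldest first, one sample per minute
                  (the per-minute sampling is an assumption).
    stable_minutes: value of the STABLE parameter.
    """
    if stable_minutes == 0:
        # STABLE=0: the HOST's IPDA port is not tested at all.
        return True
    if len(ping_results) < stable_minutes:
        # Not enough history yet to prove continuous reachability.
        return False
    # Every sample in the last STABLE minutes must be a positive PING.
    return all(ping_results[-stable_minutes:])
```

With STABLE=5, a single failed PING inside the last five minutes keeps the APE in control; with STABLE=0, the hand-back is attempted no matter what the port looks like.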
The APE's "time window" was originally engineered so that if the APE has taken control of IPDA 17, the AUTO MODE would allow the customer to return control of IPDA 17 to the HOST during an "off-hours window", so that users will not be "double" impacted. By setting SBBEGIN and SBENDE to "0", your users are subject to being disrupted twice whenever your network hiccups: once when the NCUI asks the APE to take control, and again when the APE returns control to the HOST at the next XX:15. Obviously you as the customer can set up AMO APESU however you desire, but from my perspective the APE's "around the clock" setup is not efficient. EXAMPLE of EFFICIENTLY CONTROLLING APE SWITCHOVER: If you set SBBEGIN=4 and SBENDE=6 with SBOFFSET=15, then between 4:15am and 6:15am the APE will return control to the HOST, hopefully not impacting the users. Is there a reason that you need to abort out of APE mode so quickly? APE was designed as an alternative controller when the Host is not available. The APE is not an exact replacement for the Host, but it should suffice for the remainder of the day, thus limiting the number of daily IPDA reboots to "1".
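The time-window behavior described above can be sketched roughly as follows. This is only my simplified model of it, not the actual APE algorithm: the function name is hypothetical, and I am assuming that an attempt is made at HH:SBOFFSET for every hour HH from SBBEGIN through SBENDE inclusive, and that SBBEGIN=0 with SBENDE=0 means "around the clock".

```python
from datetime import datetime, timedelta

def next_switchback(now, sb_begin, sb_end, sb_offset):
    """Return the next time the APE would try to hand control back.

    Simplified model (assumptions, not HiPath internals):
    - attempts happen at HH:sb_offset
    - sb_begin == sb_end == 0 means "around the clock"
    - otherwise HH must satisfy sb_begin <= HH <= sb_end
    """
    if sb_begin > sb_end:
        raise ValueError("windows crossing midnight are not modeled here")
    around_the_clock = (sb_begin == 0 and sb_end == 0)
    # First candidate: this hour's HH:offset, or next hour's if already past.
    candidate = now.replace(minute=0, second=0, microsecond=0)
    candidate += timedelta(minutes=sb_offset)
    if candidate <= now:
        candidate += timedelta(hours=1)
    # Step hour by hour until a candidate falls inside the window.
    for _ in range(48):
        if around_the_clock or sb_begin <= candidate.hour <= sb_end:
            return candidate
        candidate += timedelta(hours=1)
    return None

# Switchover at 4:00pm with the "around the clock" setup vs. a 4-6am window.
print(next_switchback(datetime(2024, 1, 1, 16, 0), 0, 0, 15))
print(next_switchback(datetime(2024, 1, 1, 16, 0), 4, 6, 15))
```

Under these assumptions the first call lands on the same afternoon (a second disruption during business hours), while the second call lands in the early morning of the following day.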
Regarding WHY your IPDA is going crazy: it is important to know how the APE takes control. The HOST's active Switching Unit sends TCP-based Keep-Alive messages to all IPDAs at an interval determined by this formula:
SIPCO Timing parameter "SUPVTIME"/8. At your site, SUPVTIME=10 seconds; therefore, the Keep-Alive messages are sent every 10/8 seconds, which equates to every 1.25 seconds.
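The arithmetic is trivial, but here it is as a small sketch in case you want to try other SUPVTIME values. The function name is mine, purely for illustration.

```python
def keepalive_interval_seconds(supvtime_seconds):
    """Keep-Alive interval per the formula above: SUPVTIME / 8."""
    return supvtime_seconds / 8

# SUPVTIME=10 at this site.
print(keepalive_interval_seconds(10))  # 1.25
```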
If there is any disruption or network traffic issue between the HOST's Switching Unit's IPDA port and your IPDA 17 NCUI that results in non-delivery of a Keep-Alive message, that message is re-transmitted. If IPDA 17's NCUI cannot acknowledge the Keep-Alive, due to network congestion or another issue, for a time period equal to "SUPVTIME" (which is "10 seconds" at your site), then the HOST enters Signaling Survivability mode for that IPDA, regardless of whether Signaling Survivability has been purchased. It is merely the internal name of the mode.
When Signaling Survivability mode is triggered, another Time parameter (SIPCO->Timing->ALVTIME) IMMEDIATELY begins a countdown, which at your site is "60 seconds". During those 60 seconds, if your network problem(s) clear, then IPDA 17 will come back online immediately. Also during that 60-second countdown, calls already UP will not be interrupted, but no features can be used (e.g. HOLD, TRANSFER, etc.), because the Switching Unit is not available to process the Feature Request. Also, no NEW calls are allowed. If that 60-second ALVTIME timer expires, then a third parameter (RESTIME=300 seconds at your site) begins. The big change here is: there is no recovery at this point, even if the actual problem is repaired. When this RESTIME parameter expires, ALL CALLS WILL BE DROPPED -> then either (1) the NCUI will reboot (if there is no Signaling Survivability or APE), or (2) the Signaling Survivability process will take control of the IPDA (if this feature is purchased, AND if all the required AMO configuration has been performed, AND the Signaling Survivability router is properly configured and ACCESSIBLE), or (3) the IPDA will ask the APE to take control.
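The sequence of the three timers can be laid out as a simple timeline. This is a rough sketch of the behavior I described, using your site's values as defaults. The function name, the phase labels, and the exact handling of the boundary seconds are my own assumptions, not something taken from the system.

```python
def ipda_phase(outage_seconds, supvtime=10, alvtime=60, restime=300):
    """Map the length of a continuous outage to the phase described above.

    Defaults are the SIPCO Timing values at this site. Boundary handling
    (>= vs >) is an assumption made for illustration.
    """
    if outage_seconds < supvtime:
        # Keep-Alives are being retransmitted; nothing visible yet.
        return "normal"
    if outage_seconds < supvtime + alvtime:
        # Existing calls stay up, no features, no new calls; recovery possible.
        return "alvtime"
    if outage_seconds < supvtime + alvtime + restime:
        # No recovery even if the network comes back.
        return "restime"
    # All calls dropped; reboot, Signaling Survivability, or APE takes over.
    return "switchover"

for t in (5, 30, 70, 200, 370):
    print(t, ipda_phase(t))
```

Note how raising RESTIME only widens the "restime" band, which is the period in which nobody at the IPDA can place a new call.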
What is the purpose of extending the "RESTIME" time parameter to 300 seconds? When the problem has reached this point, the IPDA will NOT come back up, even if the problem is resolved, until one of the three above-mentioned actions happens! Therefore extending this time parameter actually extends the period of time where no users can place new phone calls at the IPDA.
I assume that your company did not purchase Signaling Survivability. You can quickly check by typing: DIS-CODEW; If "Signaling Survivability"=0, then it was not purchased. Beyond the license, additional AMO APRT configuration is required, an additional router with modem must be installed and configured, and a matching modem must be connected to the IPDA NCUI board's "MODEM" port.
Signaling Survivability was designed to handle problems in the CUSTOMER's network. The AP-E was designed to handle problems within the HiPath LAN Segment, such as the failure of the Host's Switching Unit's "IPDA" port. If this port fails, then the Switching Unit cannot reach the Signaling Survivability router. Thus, the BEST solution that covers both the HiPath LAN segment AND the customer's network is: Signaling Survivability AND AP-E.
To summarize, if the network between your HOST/Switching Unit and the IPDA 17 is disrupted for 70 seconds (SIPCO Timing parameters SUPVTIME + ALVTIME), then your IPDA will switch to control by the APE 5 minutes later because of the parameter "RESTIME=300 seconds". If this switchover occurs at 4pm, then because of the APESU "around the clock" configuration plus a 15-minute offset, the APE will begin to attempt to return control to the HOST at 4:15pm. So your users experience a double-disconnection if the switchover occurs during 8am - 4pm hours, assuming your business ends at 5pm.
Your IPDA is a Networked Shelf (see AMO UCSU). This means that the IPDA is in a different Network than the HOST, and routers are needed to route the Signaling and Voice/Payload to/from the IPDA. Often network engineers will use a WAN between the HOST and a remote location, and WAN links are typically bandwidth-limited to save money. Is it possible that the bandwidth between your main site and this remote site is completely exhausted at certain periods during the day, which could be triggering the NCUI to switch over to APE control?
In a recent post you mentioned that the AP Backup had stopped working. What was the cause of this failure? During the APE installation, the vendor SHOULD have tested this AP Backup thoroughly, and also switched between HOST mode and APE mode using EXE-APESU before leaving the site. When your system was installed, the AMO SIPCO -> Timing parameters should have been discussed with the Communications Manager for maximum efficiency. In my opinion, these parameter settings at your site are no longer efficient. I believe the problem is being caused by the network, most likely network congestion triggering the APE mode. There could also be IP Address conflict(s), as I see that you are using a Private Network with STATIC IP Addresses, which can lead to duplication/conflicts if the IP Addresses are not properly managed.
Here is my final theory: is it possible that someone, perhaps on a cleaning crew, is unplugging the AC plug for equipment at the Host location or remote location, which is triggering the switchover to APE? A new cleaning employee/contractor may need access to an AC outlet for the vacuum cleaner -> unplugs something at the Host which kills your Switching Unit's Layer 2 switch, which kills the Private LAN to your IPDA, which causes the switchover to APE. Or, something at the remote location is being unplugged that kills the IPDA 17's network connection to the Host.
You should be able to use STA-HISTA or Assistant -> Diagnostics -> HiPath System Diagnosis to search for daily failure(s) of the Switching Unit's IPDA port.
Note: if your company had a Service Contract with an approved vendor, this problem MOST LIKELY would have been resolved soon after the symptoms were first detected.
The Access Point-Emergency server is very effective IF properly designed and implemented. I hope this BASIC information provides you with a new level of understanding as to how these many parts & pieces must intermingle to form a perfect solution. If mis-managed, this solution can easily fail!