Symptoms
Unisphere GUI and CLI are no longer accessible after an attempt to change DNS settings. Restarting management services does not resolve the issue. ECOM either does not start on either SP or does not stay running for more than 10 minutes.
Attempting to use the following KB to restart MGMT does not resolve the issue:
Dell Unity: Unable to Access Unisphere The system is busy. Try again later (User Correctable)
https://www.dell.com/support/kbdoc/000056109
Command: svc_restart_service restart MGMT
Data collects showed that ECOM dump files were present.
Rebooting each SP per KB 000021439 did allow ECOM to work but only for 10 minutes at a time.
Dell Unity: How To recover or troubleshoot when the management service (ECOM) is not running on either SP (User Correctable)
https://www.dell.com/support/kbdoc/000021439
Cause
The issue occurs when the DNS command "papi_clust_set.sh dns xxx" times out and causes an ECOM panic. It can also occur when Unisphere is used to change DNS settings or to remove a new DNS server.
Note: IPMI Tool must be used to connect to either SP to troubleshoot because ECOM is down.
Triage and review of logs showed hung Batch jobs from a failed attempt to make DNS changes to the Unity system.
Command used to check: uemcli /sys/task/job show -detail
Internal logs showed the following:
cemtracer_sysapi logs
18 Jul 2023 21:51:36 - [SYSAPI] ERROR - {0:777251:881779993}[1053|3741|f70d6b40][doTimeoutAction @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/ConfigMgr.cpp:403] Timeout action (poll): abort
Aborting the system.
The ECOM dumps can show signs similar to the following:
Search for "error" in cemtracer_sysapi.log showed:
xx Nov xxxx 13:27:52 - [SYSAPI] ERROR - {0:24803979:204377483}[18921|28516|f70d6b40][_watchDogRoutine @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/UpdateManagerImpl.cpp:1511] Watch dog poll request timeout occured. Now:24803979204 TimeGap:900621 Is in Poll:0
xx Nov xxxx 13:27:52 - [SYSAPI] ERROR - {0:24803979:205104121}[18921|28516|f70d6b40][_watchDogRoutine @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/UpdateManagerImpl.cpp:1512] PerfStatReport:
xx Nov xxxx 13:27:52 - [SYSAPI] ERROR - {0:24803979:206876577}[18921|28516|f70d6b40][_watchDogRoutine @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/UpdateManagerImpl.cpp:1516] dependencyMap:
xx Nov xxxx 13:27:52 - [SYSAPI] ERROR - {0:24803979:206905718}[18921|28516|f70d6b40][doTimeoutAction @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/ConfigMgr.cpp:403] Timeout action (poll): abort
xx Nov xxxx 23:16:07 - [SYSAPI] ERROR - {0:1379:344272630}[2919|6108|f17ffb40][Poll @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/TLDPollManager.cpp:383] Admin PEER poll request failed.Error Code = 7e110000.
xx Nov xxxx 23:16:58 - [SYSAPI] ERROR - {0:1429:454592292}[2919|20591|d79ffb40][performRequestBase @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/util/TLDUtils.cpp:346] Admin failed to process request (err = 2115043355):
TAG_K10_ERROR_PACKET (0x10004)
TAG_K10_ERROR_CODE (0x10005) num: 2115043355 (0x7e11001b) str: "...~" hex: 1b:0:11:7e
xx Nov xxxx 23:31:15 - [SYSAPI] ERROR - {0:2286:652151017}[2919|6535|f70d6b40][_watchDogRoutine @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/UpdateManagerImpl.cpp:1511] Watch dog poll request timeout occured. Now:2286652 TimeGap:900650 Is in Poll:0
xx Nov xxxx 23:31:15 - [SYSAPI] ERROR - {0:2286:652504752}[2919|6535|f70d6b40][_watchDogRoutine @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/UpdateManagerImpl.cpp:1512] PerfStatReport:
xx Nov xxxx 23:31:15 - [SYSAPI] ERROR - {0:2286:653466119}[2919|6535|f70d6b40][_watchDogRoutine @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/UpdateManagerImpl.cpp:1516] dependencyMap:
xx Nov xxxx 23:31:15 - [SYSAPI] ERROR - {0:2286:653504952}[2919|6535|f70d6b40][doTimeoutAction @ /c4_working/Unity_PullRequest_Build_Driver_Sles15_RTM_1.1/clariion/components/adapters/SystemAPI/framework/src/ConfigMgr.cpp:403] Timeout action (poll): abort
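The signature in the log excerpt above can be searched for offline in a copy of cemtracer_sysapi.log taken from a data collect. The following is a minimal Python sketch, not a Dell tool: the function name scan_sysapi_log and the summary keys are hypothetical, and the patterns are copied from the sample lines in this article (including the log's own spelling "occured"), so other OE releases may word these messages differently.

```python
import re
import sys

# Patterns copied from the sample log lines in this article (assumption:
# the wording is the same in the log being scanned).
WATCHDOG = re.compile(r"Watch dog poll request timeout occured\. Now:(\d+) TimeGap:(\d+)")
ABORT = re.compile(r"Timeout action \(poll\): abort")
ERROR_CODE = re.compile(r"TAG_K10_ERROR_CODE \(0x10005\) num: (\d+)")


def scan_sysapi_log(lines):
    """Summarize watchdog timeouts, abort actions, and K10 error codes."""
    summary = {"watchdog_time_gaps": [], "abort_count": 0, "error_codes": []}
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            # TimeGap is kept as the raw integer; its unit is not stated here.
            summary["watchdog_time_gaps"].append(int(match.group(2)))
        if ABORT.search(line):
            summary["abort_count"] += 1
        match = ERROR_CODE.search(line)
        if match:
            # Show the decimal code in hex, as the log itself does.
            summary["error_codes"].append(hex(int(match.group(1))))
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_sysapi.py /path/to/cemtracer_sysapi.log
    with open(sys.argv[1], errors="replace") as log_file:
        print(scan_sysapi_log(log_file))
```

A watchdog timeout followed by "Timeout action (poll): abort" at the same timestamp matches the pattern shown in this article. The script only reports matches; it does not confirm the root cause.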
Resolution
This is resolved in Unity OE 5.2.0 and higher. If hung jobs are found, contact Dell Technical Support and mention this article. Hung jobs can be seen in Unisphere in the Event/Jobs section. Support will assist with clearing the hung jobs using the Internal section of KB 000059274.
Dell Unity: Deleting stuck or suspended batch jobs, Error Code: 0x7d13151 (Dell Correctable)
https://www.dell.com/support/kbdoc/en-us/000059274/dell-emc-unity-deleting-stuck-or-suspended-batch-jobs-error-code-0x7d13151
After the hung jobs are removed, the ECOM service must be restarted using KB 000019807.
Dell Unity: How to perform a management services (ECOM) failover (Dell Correctable)
https://www.dell.com/support/kbdoc/000019807
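To check whether a system already runs a release containing the fix, the OE version string can be compared against 5.2.0. The sketch below is illustrative only: the helper names parse_oe_version and has_fix are hypothetical, and it assumes the version is a dot-separated numeric string whose first three fields are major, minor, and patch (for example "5.1.3.0.5.003").

```python
# First release stated in this article to contain the fix.
FIXED_IN = (5, 2, 0)


def parse_oe_version(version):
    """Return the first three numeric fields of a dotted version string."""
    parts = []
    for field in version.strip().split("."):
        if not field.isdigit():
            break  # stop at the first non-numeric field
        parts.append(int(field))
    # Pad short strings such as "5.2" with zeros, then keep three fields.
    return tuple((parts + [0, 0, 0])[:3])


def has_fix(version):
    """True if the version is at or above the fixed release."""
    return parse_oe_version(version) >= FIXED_IN


if __name__ == "__main__":
    for candidate in ("5.1.3.0.5.003", "5.2.0.0.5.173"):
        print(candidate, has_fix(candidate))
```

A system that reports a fixed release but still shows hung jobs from an earlier DNS change would still need those jobs cleared as described above.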
Reference: UnityD-54308, UnityD-59297, UEE-16306, UEE-17969
Affected Products
Dell EMC Unity