
PowerFlex Experiencing High Latency On Datastores And Hanging ESXi Hosts

Summary: vSphere ESXi hosts randomly hang and become inaccessible after experiencing high latency on the datastores.


Symptoms

Scenario
Because of an incompatibility in the ixgbe Intel NIC driver, the VxFlex system experienced multiple SDS disconnections, which caused data unavailability (DU) and All Paths Down (APD) conditions on the ESXi nodes. The APD condition caused the ESXi hostd service to hang, which made the nodes inaccessible.

Symptoms
VMkernel logs:

2019-04-10T04:47:02.092Z cpu43:946022)NetLB: 2233: Driver claims supporting 15 TX queues, and 15 queues are accepted.
2019-04-10T04:47:02.092Z cpu43:946022)NetLB: 2237: Driver claims supporting 15 RX queues, and 15 queues are accepted.
...
2019-04-10T04:47:02.100Z cpu55:946022)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
2019-04-10T04:47:02.100Z cpu55:946022)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
2019-04-10T04:47:02.100Z cpu55:946022)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
2019-04-10T04:47:02.104Z cpu55:946022)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-04-10T04:47:02.104Z cpu55:946022)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-04-10T04:47:02.104Z cpu55:946022)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
....
2019-04-10T04:47:02.132Z cpu55:946022)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
...
2019-04-10T04:47:10.498Z cpu13:948008)WARNING: UserObj: 5436: vmkvsitools: Unimplemented operation on 0x439e817fc850/SOCKET_VMCI
2019-04-10T04:47:10.498Z cpu13:948008)WARNING: UserObj: 5436: vmkvsitools: Unimplemented operation on 0x439e817f3c40/SOCKET_VMCI
2019-04-10T04:47:11.587Z cpu40:66684)nsxt-switch-security: SwSecDelVmi:1121: [nsx@6876 comp="nsx-esx" subcomp="swsec"]Filter 67112517Deleting vmi: 2 vlanId = 0 mac = 02:50:56:00:70:e8 ip = 10.255.15.33
2019-04-10T04:47:11.587Z cpu40:66684)nsxt-switch-security: SwSecDelVmi:1165: [nsx@6876 comp="nsx-esx" subcomp="swsec"]Filter 67112517After deleting: [0 0 0 0]
2019-04-10T04:47:12.407Z cpu34:948237)DLX: 4310: vol 'F2_DS1', lock at 174866432: [Req mode 1] Checking liveness:
2019-04-10T04:47:12.407Z cpu34:948237)[type 10c00001 offset 174866432 v 139, hb offset 3932160
gen 17, mode 1, owner 5c8e202a-f1c1830f-af9b-246e96c9cad0 mtime 446607

VMkernel log entries showing TX hangs:

2019-04-10T12:01:55.265Z cpu33:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4571: vmnic5 : scheduler(0x430acbf450e0)/device(0x4306fee843c0) 0/1 lock up [stopped=0]:
2019-04-10T12:01:55.265Z cpu33:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4602: vmnic5: packets completion seems stuck, issuing reset
2019-04-10T12:01:59.626Z cpu48:65693)<6>ixgbe 0000:05:00.1: vmnic5: Fake Tx hang detected with timeout of 5 seconds 

Possible CPU lockups are logged shortly before the driver reports the TX hang:

2019-04-10T12:01:54.547Z cpu10:73512)WARNING: Heartbeat: 794: PCPU 32 didn't have a heartbeat for 7 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu23:73050)WARNING: Heartbeat: 794: PCPU 45 didn't have a heartbeat for 7 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu10:73512)WARNING: Heartbeat: 794: PCPU 33 didn't have a heartbeat for 7 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu13:73515)WARNING: Heartbeat: 794: PCPU 35 didn't have a heartbeat for 7 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu10:73512)WARNING: Heartbeat: 794: PCPU 34 didn't have a heartbeat for 8 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu13:73515)WARNING: Heartbeat: 794: PCPU 36 didn't have a heartbeat for 7 seconds; may be locked up.

The logs also show messages to hostd timing out:

2019-04-10T21:05:31.753Z cpu1:65707)VmkEvent: 93: Msg to hostd failed with timeout, dropping function 2092 len 20
2019-04-10T21:05:35.011Z cpu2:65707)VmkEvent: 93: Msg to hostd failed with timeout, dropping function 2092 len 20
2019-04-10T21:05:35.046Z cpu0:65707)VmkEvent: 93: Msg to hostd failed with timeout, dropping function 2092 len 20
2019-04-10T21:05:37.795Z cpu26:65707)VmkEvent: 93: Msg to hostd failed with timeout, dropping function 2092 len 20

A later occurrence shows the same pattern, ending with hostd reported as non-responsive:

2019-04-14T22:32:35.007Z cpu5:95941)WARNING: Heartbeat: 794: PCPU 28 didn't have a heartbeat for 8 seconds; *may* be locked up.
...
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4571: vmnic5 : scheduler(0x430accad10e0)/device(0x4306fee843c0) 0/1 lock up [stopped=0]:
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4578: detected at 407655639 while last xmit at 407650438 and 39742 bytes in flight [window 86460 bytes]
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4583: and last enqueued/dequeued at 407652355/407655639 [stress 0]
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4586: with 394 pkts inflight
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4602: vmnic5: packets completion seems stuck, issuing reset
...
2019-04-14T22:55:33.715Z cpu50:66861)WARNING: Lock: 1675: (held by 2: Spin count exceeded 1 time(s) - possible deadlock.
...
2019-04-15T01:00:01.810Z cpu29:608949)ALERT: hostd detected to be non-responsive
...
2019-04-15T00:55:16.800Z cpu0:546988)WARNING: Heartbeat: 498: One or more PCPUs didn't perform a heartbeat check for 7 seconds.
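
The signatures in the excerpts above can be counted in a copy of a vmkernel log to check whether a host shows the same pattern. The following is a minimal sketch in Python, not a supported tool: the category names and the functions `count_signatures` and `scan_file` are hypothetical, and the patterns are taken only from the log lines quoted in this article.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts in this article.
# The category names on the left are illustrative, not official terms.
SIGNATURES = {
    "tx_hang": re.compile(r"Fake Tx hang detected"),
    "netsched_reset": re.compile(r"packets completion seems stuck, issuing reset"),
    "pcpu_heartbeat": re.compile(r"PCPU \d+ didn't have a heartbeat"),
    "hostd_msg_timeout": re.compile(r"Msg to hostd failed with timeout"),
    "hostd_unresponsive": re.compile(r"hostd detected to be non-responsive"),
}


def count_signatures(lines):
    """Count how many lines match each known symptom signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_file(path):
    """Scan a saved copy of a vmkernel log and return the signature counts."""
    with open(path, errors="replace") as handle:
        return count_signatures(handle)
```

Calling `scan_file` on a copy of the vmkernel log returns a counter keyed by category. Nonzero counts for `tx_hang` together with `hostd_msg_timeout` would be consistent with the symptoms described here, but they do not confirm the root cause on their own.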

 

Impact 

The driver issue causes network latency that affects the HCI VxFlex SVMs installed on the ESXi nodes, leading to APD conditions and hostd hangs on those nodes.

Cause

Intel NICs in Ready Nodes running ESXi require the native-mode driver (ixgben) rather than the ixgbe driver.

 

Resolution

Workaround

Change the Intel NIC driver from ixgbe to the native ixgben driver through a reinstall.
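
Before and after the change, it can help to confirm which driver each uplink is bound to. The sketch below is in Python and rests on assumptions: it parses text captured from `esxcli network nic list`, assumes the NIC name is the first column and the driver is the third, and uses a hypothetical helper name `nics_using_driver`. Verify the column layout against the output of your ESXi build.

```python
def nics_using_driver(nic_list_output, driver="ixgbe"):
    """Return the vmnic names whose Driver column equals `driver`.

    Assumes whitespace-separated columns with the NIC name first and the
    driver name third, as in captured `esxcli network nic list` output.
    """
    matches = []
    for line in nic_list_output.splitlines():
        fields = line.split()
        # Data rows start with the NIC name (for example vmnic5);
        # header and separator rows are skipped by this check.
        if len(fields) >= 3 and fields[0].startswith("vmnic"):
            if fields[2] == driver:
                matches.append(fields[0])
    return matches
```

An empty result for `driver="ixgbe"` after the reinstall suggests that no uplink is still bound to the ixgbe driver. Passing `driver="ixgben"` lists the uplinks that are using the native driver.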

Impacted Versions

N/A

Fixed In Version

N/A - Driver issue

Affected Products

PowerFlex Software, VxFlex Product Family, VxFlex Ready Node, Ready Node Series
Article Properties
Article Number: 000203027
Article Type: Solution
Last Modified: 18 Mar 2025
Version:  4