Update 2/1/2018
- This issue has been fixed in an updated release of the Mellanox CX4 firmware and driver.
In a hyper-converged cluster implemented using the Dell EMC Microsoft Storage Spaces Direct Ready Nodes with Dell EMC PowerEdge R740xd and Mellanox CX4 LX adapters for storage traffic, you may see SMB client errors (event ID 30803) in the Windows Event Viewer (Applications and Services Logs -> Microsoft -> Windows -> SMB client -> Connectivity) when a cluster node reboots.
While this is normal in a failover cluster during a node reboot, you may occasionally see these errors re-appear on the cluster nodes at a regular interval even after all cluster nodes are fully functional. This behavior is due to a failure in creating the SMB listeners for every storage interface in the node that restarted. These errors appear on the surviving nodes in the cluster and not on the node that restarted. The error description indicates the server to which the SMB client is trying to connect and the Server Address in the description indicates the node that just restarted.
In a normal functional state of the cluster nodes, after a node reboot, running netstat -xan should show an IPv4 and an IPv6 listener associated with every storage interface on the node. The following output of netstat.exe was gathered on a node with two storage adapters.
Active NetworkDirect Connections, Listeners, SharedEndpoints
Mode IfIndex Type Local Address Foreign Address PID
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:61476 0
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:62244 0
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:61988 0
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:62756 0
Kernel 4 Connection 10.128.100.101:12541 10.128.100.100:445 0
Kernel 4 Connection 10.128.100.101:12797 10.128.100.100:445 0
Kernel 4 Connection 10.128.100.101:14077 10.128.100.100:445 0
Kernel 4 Connection 10.128.100.101:14333 10.128.100.100:445 0
Kernel 14 Connection 10.128.100.133:445 10.128.100.132:27454 0
Kernel 14 Connection 10.128.100.133:445 10.128.100.132:27198 0
Kernel 14 Connection 10.128.100.133:2375 10.128.100.132:445 0
Kernel 14 Connection 10.128.100.133:62535 10.128.100.132:445 0
Kernel 14 Connection 10.128.100.133:62791 10.128.100.132:445 0
Kernel 14 Connection 10.128.100.133:64071 10.128.100.132:445 0
Kernel 14 Connection 10.128.100.133:64327 10.128.100.132:445 0
Kernel 4 Listener [fe80::4cae:cb05:4932:f226%4]:445 NA 0
Kernel 4 Listener 10.128.100.101:445 NA 0
Kernel 14 Listener 10.128.100.133:445 NA 0
Kernel 14 Listener [fe80::5180:55b6:c0f0:ae8d%14]:445 NA 0
Output Listing 1 - Fully functional SMB stack
However, when you start seeing the SMB client errors in the cluster, the node that rebooted may not have all the listeners associated with every storage interface in the system.
Active NetworkDirect Connections, Listeners, SharedEndpoints
Mode IfIndex Type Local Address Foreign Address PID
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:61476 0
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:62244 0
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:61988 0
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:62756 0
Kernel 4 Connection 10.128.100.101:12541 10.128.100.100:445 0
Kernel 4 Connection 10.128.100.101:12797 10.128.100.100:445 0
Kernel 4 Connection 10.128.100.101:14077 10.128.100.100:445 0
Kernel 4 Connection 10.128.100.101:14333 10.128.100.100:445 0
Kernel 14 Connection 10.128.100.133:2375 10.128.100.132:445 0
Kernel 14 Connection 10.128.100.133:62535 10.128.100.132:445 0
Kernel 14 Connection 10.128.100.133:62791 10.128.100.132:445 0
Kernel 14 Connection 10.128.100.133:64071 10.128.100.132:445 0
Kernel 14 Connection 10.128.100.133:64327 10.128.100.132:445 0
Kernel 4 Listener [fe80::4cae:cb05:4932:f226%4]:445 NA 0
Kernel 4 Listener 10.128.100.101:445 NA 0
Output Listing 2 - SMB stack missing a listener
Therefore, in the example above, an SMB client attempting to connect to the interface with index 14 will eventually receive connection refused messages, which result in RDMA-related SMB client errors (event ID 30803).
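The comparison between the two listings can be automated. The following is a minimal sketch, not part of the original procedure, that assumes the netstat -xan output has been captured as text and that each data row follows the Mode, IfIndex, Type, Local Address layout shown in the listings. The function names are hypothetical. It reports interface indexes that carry connections but have no listener row; an interface with neither connections nor listeners would not be detected this way.

```python
from collections import defaultdict


def parse_networkdirect_rows(netstat_text):
    """Yield (if_index, row_type, local_address) for each data row.

    Assumes rows shaped like: Kernel 14 Listener 10.128.100.133:445 NA 0
    Title and header lines are skipped because their second field is not a number.
    """
    for line in netstat_text.splitlines():
        fields = line.split()
        if (
            len(fields) >= 4
            and fields[1].isdigit()
            and fields[2] in ("Connection", "Listener")
        ):
            yield int(fields[1]), fields[2], fields[3]


def interfaces_missing_listener(netstat_text):
    """Return sorted interface indexes that have connections but no listener."""
    row_types = defaultdict(set)
    for if_index, row_type, _local in parse_networkdirect_rows(netstat_text):
        row_types[if_index].add(row_type)
    return sorted(i for i, types in row_types.items() if "Listener" not in types)


# A shortened excerpt of Output Listing 2.
SAMPLE = """\
Active NetworkDirect Connections, Listeners, SharedEndpoints
Mode IfIndex Type Local Address Foreign Address PID
Kernel 4 Connection 10.128.100.101:445 10.128.100.100:61476 0
Kernel 14 Connection 10.128.100.133:2375 10.128.100.132:445 0
Kernel 4 Listener [fe80::4cae:cb05:4932:f226%4]:445 NA 0
Kernel 4 Listener 10.128.100.101:445 NA 0
"""

print(interfaces_missing_listener(SAMPLE))  # [14]
```

On a cluster node, the text could come from redirecting netstat -xan to a file and reading that file back.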
Because the Dell EMC Microsoft Ready Node network architecture recommends two storage adapters per cluster node, cluster functionality is not disrupted when this issue occurs. The adapter that is missing a listener can still be used to send RDMA traffic. However, because there is no listener on that storage adapter, writes using RDMA cannot be performed on it. The adapter falls back to TCP for writes and for receiving traffic, which may result in lower write performance depending on the workload. There is no data loss or loss of functionality when this issue occurs.
This has been identified as a bug in the Mellanox CX4 LX WinOF2 driver versions 1.70 and earlier.
The SMB listener can be recreated by restarting the virtual storage adapter that has no associated SMB listener after a reboot. You can identify the right virtual adapter to restart by following the steps outlined below.
From the netstat -xan output, you can see that there is a listener missing for one of the storage adapters. The interface index for the missing adapter can be found using the Get-NetAdapter cmdlet.
PS C:\> Get-NetAdapter
Name                    InterfaceDescription                     ifIndex  Status        MacAddress         LinkSpeed
----                    --------------------                     -------  ------        ----------         ---------
vEthernet (Storage2)    Hyper-V Virtual Ethernet Adapter #3      14       Up            00-15-5D-09-C4-02  10 Gbps
vEthernet (Storage1)    Hyper-V Virtual Ethernet Adapter #2      4        Up            00-15-5D-09-C4-01  10 Gbps
vEthernet (Management)  Hyper-V Virtual Ethernet Adapter         10       Up            00-15-5D-09-C4-00  10 Gbps
Ethernet                Remote NDIS Compatible Device            9        Not Present   50-9A-4C-A7-F9-DF  0 bps
NIC2                    Intel(R) Ethernet 10G X710 rNDC          6        Disconnected  24-6E-96-52-CC-A4  10 Gbps
NIC4                    Intel(R) I350 Gigabit Network Connec...  15       Disconnected  24-6E-96-52-CC-C3  0 bps
NIC3                    Intel(R) I350 Gigabit Network Conn...#2  8        Disconnected  24-6E-96-52-CC-C2  0 bps
NIC1                    Intel(R) Ethernet 10G 4P X710/I350 rNDC  13       Disconnected  24-6E-96-52-CC-A2  10 Gbps
SLOT 1 Port 2           Mellanox ConnectX-4 Lx Ethernet Ad...#2  2        Up            24-8A-07-59-4C-69  10 Gbps
SLOT 1 Port 1           Mellanox ConnectX-4 Lx Ethernet Adapter  11       Up            24-8A-07-59-4C-68  10 Gbps
The netstat -xan output (shown in Output Listing 2) shows that the interface with index 14 has no listener associated with it. The Get-NetAdapter output shows that interface index 14 is the virtual adapter vEthernet (Storage2).
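The lookup from interface index to adapter name can also be scripted. The sketch below is one possible approach and rests on an assumption that is not in the original article: that the adapter list was exported as JSON, for example with Get-NetAdapter | Select-Object Name, ifIndex | ConvertTo-Json, so that each entry has Name and ifIndex keys. The function name is hypothetical, and the sample data is taken from the Get-NetAdapter output above.

```python
import json


def adapter_names_by_ifindex(adapters_json):
    """Map interface index -> adapter name from a JSON export of the adapter list.

    Assumes each entry has "Name" and "ifIndex" keys (an assumed export format).
    """
    data = json.loads(adapters_json)
    if isinstance(data, dict):
        # A single adapter may be serialized as one object rather than a list.
        data = [data]
    return {int(item["ifIndex"]): item["Name"] for item in data}


# Values copied from the Get-NetAdapter output in this article.
SAMPLE_JSON = """
[
  {"Name": "vEthernet (Storage2)", "ifIndex": 14},
  {"Name": "vEthernet (Storage1)", "ifIndex": 4},
  {"Name": "vEthernet (Management)", "ifIndex": 10}
]
"""

names = adapter_names_by_ifindex(SAMPLE_JSON)
print(names[14])  # vEthernet (Storage2)
```

Parsing JSON avoids depending on the column widths of the formatted table, which truncates long descriptions.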
Note: This network adapter name may be different based on how you have named the storage adapters in the management OS.
You can now restart the interface with the missing listener.
Restart-NetAdapter -Name 'vEthernet (Storage2)'
Once the adapter has restarted, run netstat -xan again to confirm that the listener has been created. This may take a few minutes. Once the listener is created, the cluster nodes start communicating normally over RDMA and new SMB client errors stop appearing in the event viewer.
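Because the listener can take a few minutes to come back, the check can be repeated in a loop. The following is a minimal polling sketch under stated assumptions: it runs on the Windows node with netstat on the PATH, rows follow the layout shown in the listings, and the 10-minute timeout and 15-second interval are arbitrary illustrative values rather than recommendations from Dell EMC or Mellanox. All function names are hypothetical.

```python
import subprocess
import time


def run_netstat():
    """Return the text output of `netstat -xan` (assumes a Windows host)."""
    result = subprocess.run(
        ["netstat", "-xan"], capture_output=True, text=True, check=True
    )
    return result.stdout


def has_listener(netstat_text, if_index):
    """Return True if any Listener row exists for the given interface index."""
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == str(if_index) and fields[2] == "Listener":
            return True
    return False


def wait_for_listener(if_index, fetch=run_netstat, timeout_s=600,
                      interval_s=15, sleep=time.sleep):
    """Poll until a listener appears on if_index.

    Returns True once a listener is seen, or False if timeout_s elapses first.
    fetch and sleep are injectable so the loop can be exercised without netstat.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if has_listener(fetch(), if_index):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For the example in this article, calling wait_for_listener(14) after Restart-NetAdapter would return True once a listener for interface index 14 appears in the netstat output.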