Failure of primary VLT peer causes major outages (S4048-ON OS9)

We have a four-node Microsoft Failover Cluster. Each server has a pair of NICs configured in a Switch Embedded Team (SET), and each NIC in the team connects to one of the two peers in the VLT domain over a single link. The VLT domain connects to an "access" switch via a VLT port-channel with LACP, which provides client access.

We have followed best practices and official documentation to ensure that SET and VLT are configured correctly. However, during fault/failure simulations, we consistently observe catastrophic outages affecting the cluster, but only when the tests are conducted against the "primary" VLT peer. These issues include nodes being dropped from the cluster, VMs failing, crashing, or entering a paused state, and Cluster Shared Volumes (CSVs) disconnecting.

For example, the following conditions will cause our cluster to enter a failed state and lose network connectivity for an unacceptable amount of time:

- Reloading the primary VLT peer, either by pulling the power or by issuing the reload command
- Administratively shutting down all server ports, the VLT port-channel uplink, and the VLTi on the primary VLT peer

By contrast, failing individual links to the servers results in a graceful failover. Killing the VLTi on the secondary VLT peer also results in a graceful failover, as does reloading the secondary VLT peer.
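To put a number on "unacceptable" and compare the primary and secondary cases, the outage length could be derived from timestamped reachability probes (for example, one ping per second against a cluster address during each test). The following is only a minimal sketch under that assumption: the probe format and the function names `outage_windows` and `longest_outage` are hypothetical, not part of any Dell or Microsoft tooling.

```python
from typing import List, Tuple

# One probe result: (timestamp in seconds, True if a reply was received).
Probe = Tuple[float, bool]


def outage_windows(probes: List[Probe]) -> List[Tuple[float, float]]:
    """Return (start, end) for each run of consecutive failed probes.

    start is the timestamp of the first failed probe in the run;
    end is the timestamp of the first successful probe after it
    (or of the last probe, if the run never recovers).
    """
    windows = []
    start = None
    for ts, ok in probes:
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends
            start = None
    if start is not None:
        # Still down when probing stopped.
        windows.append((start, probes[-1][0]))
    return windows


def longest_outage(probes: List[Probe]) -> float:
    """Length in seconds of the longest outage, or 0.0 if there was none."""
    return max((end - start for start, end in outage_windows(probes)), default=0.0)
```

Running the same probe against each failure scenario would give one figure per test, which makes the difference between the primary and secondary peer easier to report.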

We expect each peer to handle failures similarly, but they clearly do not. We’re out of ideas... and almost out of drywall to bang our heads against. Any assistance would be greatly appreciated.
