November 30th, 2014 18:00

VMware HA with VPLEX Stretch Cross-Connect Cluster HA Failure

We are in the process of setting up a VMware stretch cluster using VPLEX Metro
and a cross-connect configuration. Currently the VMware cluster is set up with
six (6) hosts at one site and six (6) hosts at another site.  There are
two (2) redundant SAN fabrics, and each fabric spans both sites via
DWDM-extended FC.  The link between the sites is 24 Gb/s per fabric.  We are
using PowerPath/VE for multipathing on the VMware hosts.  All hosts in the
VMware cluster see all LUNs from both VPLEX clusters.

From a VMware point of view (see the PowerCLI sketch after this list):

     1.  VMkernel.Boot.terminateVMOnPDL changed to Yes
     2.  DRS groups created for each site ("should" rules)
     3.  HA Admission Control set to 50%
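
For reference, a minimal PowerCLI sketch of how that host-level setting can
be pushed out; the vCenter address and cluster name are placeholders for our
environment:

    # Minimal sketch - vCenter and cluster names are placeholders.
    Connect-VIServer -Server vcenter.example.com

    Get-Cluster -Name 'StretchCluster' | Get-VMHost | ForEach-Object {
        # ESXi 5.5 exposes the PDL kill switch as a host advanced setting.
        # VMkernel.Boot.* changes require a host reboot to take effect.
        Get-AdvancedSetting -Entity $_ -Name 'VMkernel.Boot.terminateVMOnPDL' |
            Set-AdvancedSetting -Value $true -Confirm:$false
    }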


This past week we attempted two (2) failover tests:


     1.  VM running on a physical host at site A.  Unzoned all site-A hosts
         from all paths between the hosts and the VPLEX clusters.  Maintained
         paths for the site-B hosts.

     2.  VM running on a physical host at site A.  Removed all site-A hosts
         from the storage views on BOTH VPLEX clusters.  Maintained the
         site-B hosts in the storage views.

What we found is:


     Test 1.  If the physical server loses all paths to the VPLEX, HA will
              NOT occur from the VMware point of view.  Instead, the host goes
              into an APD state, and you have to manually initiate failover. 
              It appears to be how VMware designed this.  It is unfortunate.

     Test 2.  If the physical server is removed from the VPLEX storage view,
              but maintains its paths to the VPLEX, HA will occur, but it will
              question you first.  What you get in vCenter is a pop-up window
              that states: "The storage backing virtual disk
              TSTKRSW8K1v-TEST.vmdk has permanent device loss. You may be able
              to hot remove this virtual device from the virtual machine and
              continue after clicking button.retry.  Click button.abort to
              terminate this session."  And you have a choice of "Retry" or
              "Cancel".  If you select “Cancel” the HA will occur, BUT it is
              not automatic as hoped.  We thought that TerminateVMonPDL would
              suppress the question.

Has anyone else seen this? 

     1.  Is there a way to make VMware automatically HA a VM if it goes into
         an APD state?
     2.  Is there a way to make VMware automatically HA a VM if some hosts can
         still access the VPLEX cluster but cannot access the LUN?  A way to
         get around that question from Test 2 above?

Any thoughts?

Thanks,

Will

5 Practitioner  •  274.2K Posts

December 2nd, 2014 03:00

Hi Will,

The Permanent Device Loss (PDL) condition is communicated by the array/VPLEX to ESXi via a SCSI sense code. It indicates that a device (LUN) is unavailable, and more than likely permanently unavailable.

By issuing this sense code, the storage array informs ESXi of the status of the LUN, and action can then be taken if and when this is configured - in your case, you want to initiate an automatic HA action.

So based on this my 2cents would be:

In test 1


Without Cross-Connect

When you lose all your paths from the host to VPLEX, there is no way for ESXi to receive that SCSI sense code, so the behavior you are seeing is expected. The VM remains in an APD state.


With Cross-Connect

In theory you should lose all "local paths" to the site-A VPLEX but retain "remote paths" to the site-B VPLEX. The VM should continue to operate unaffected, as PowerPath/VE will fail over and rebalance, ideally via the autostandby feature.
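
If you want to sanity-check which paths each host actually holds to the two VPLEX clusters, something like the following PowerCLI sketch can help. This is only a sketch - the "StretchCluster" name is a placeholder, and devices owned by PowerPath/VE may report path state differently than NMP-owned ones:

    # Sketch: list path state per LUN so local vs. remote paths are visible.
    Get-Cluster -Name 'StretchCluster' | Get-VMHost | ForEach-Object {
        $esx = $_
        Get-ScsiLun -VmHost $esx -LunType disk | ForEach-Object {
            $lun = $_
            Get-ScsiLunPath -ScsiLun $lun |
                Select-Object @{N='Host';E={$esx.Name}},
                              @{N='LUN';E={$lun.CanonicalName}},
                              Name, State
        }
    }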

In test 2

You retain your paths, allowing the sense code to be received by the host and thus causing an action - in your case, the pop-up windows. While I can't say I've seen those pop-ups before, I would suggest you check whether a setting called das.maskCleanShutdownEnabled is set to "True". This is a vSphere HA advanced setting. It allows HA to trigger an automatic restart response for a virtual machine which has been killed automatically due to a PDL condition.
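
If it helps, that option can be set without clicking through the UI; a hedged PowerCLI sketch (the cluster name is a placeholder):

    # Sketch: das.* options are vSphere HA cluster-level settings, not host settings.
    $cluster = Get-Cluster -Name 'StretchCluster'

    New-AdvancedSetting -Entity $cluster -Type ClusterHA `
        -Name 'das.maskCleanShutdownEnabled' -Value 'True' -Confirm:$false

    # Read it back to confirm.
    Get-AdvancedSetting -Entity $cluster -Name 'das.maskCleanShutdownEnabled'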

More on the two settings:

disk.terminateVMOnPDLDefault

http://www.yellow-bricks.com/2013/08/28/vsphere-5-5-nuggets-changes-to-disk-terminatevmonpdldefault/

das.maskCleanShutdownEnabled

http://www.yellow-bricks.com/2012/04/25/what-is-das-maskcleanshutdownenabled-about/

Also relevant thread:

https://community.emc.com/thread/190016

*Note the subtle differences between vSphere 5.1 and 5.5.

Hope this helps,

Alan

89 Posts

December 5th, 2014 01:00

Hi Will,

Hope you are well.

I haven't read this document in detail myself yet, but we have a newly updated Best Practices document that discusses ESXi Path Loss Handling (Chapter 3, pages 15 through 20).

http://www.emc.com/collateral/technical-documentation/h13546-vplex-san-connectivity-best-practices.pdf

Hope that helps,

Gary

89 Posts

December 5th, 2014 14:00

Hi again Will,

I was curious, so I did some digging...

Some questions:

  • What version of vSphere are you using?  5.0, 5.1, 5.5, other?
    • I assume the vSphere version question also applies to SRM, if you're using it - the version there might be important.
  • Are you using any RDMs in this environment?

You said:

> Test 1.  If the physical server loses all paths to the VPLEX, HA will NOT occur from the VMware point of view.  Instead, the host goes into an APD state, and you have to manually initiate failover.  It appears to be how VMware designed this.  It is unfortunate.


Check this out:

vSphere 5.1 All Paths Down (APD) enhancements


"This brand new setting is called Misc.APDHandlingEnable. It can be set to 0 or 1. A value of zero means that ESXi will stick to the “old” method which is to always retry failed I/O’s. A value of 1 enables the new behavior. The behavior will allow ESXi to “fast-fail” I/Os. This will happen after 140 seconds by default. Fast-failing I/Os is what will prevent the host to be disconnected or frozen up.  This is configurable though through Misc.APDTimeout"


vSphere 5.1 Storage Enhancements – Part 4: All Paths Down (APD) | CormacHogan.com


In vSphere 5.1, a new timeout value for APD is being introduced. There will be a new global setting for this feature called Misc.APDHandlingEnable. If this value is set to 0, the current (5.0) behavior of retrying  failing I/Os forever will be used. If Misc.APDHandlingEnable is set to 1 (default), APD Handling will be enabled to follow the new model using the time out value Misc.APDTimeout.

This is set to 140 second timeout by default, tuneable. [The lower limit is 20 seconds but this is only for testing]. These settings (Misc.APDHandlingEnable & Misc.APDTimeout) are exposed in the vSphere UI. When APD is detected, the timer starts. After 140 seconds, the device is marked as APD Timeout.  Any further I/Os are fast-failed with a status of NO_CONNECT. This is the same sense code observed when an FC cable is disconnected from an FC HBA. This fast failing of I/Os prevents hostd from getting stuck waiting on I/O.  If any of the paths to the device recovers, subsequent I/Os to the device are issued normally and special APD treatment finishes.

So for newer vSphere versions (5.1 onward), APD will time out I/O after waiting a configurable amount of time (default 140 seconds, minimum 20 seconds).


My questions to you would be: was this enabled, and if so, what is the timeout?  And if you did wait for this to kick in, did you wait long enough for it to happen?  (You might want to try the lowest setting, 20 seconds, or something like 30 or 60 seconds, like a typical SCSI timeout value.)
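
To answer those questions across the whole cluster (and to experiment with a shorter window), a hedged PowerCLI sketch - the cluster name is a placeholder, and the 20-second floor is per the articles above:

    # Sketch: report the APD handling settings on every host.
    Get-Cluster -Name 'StretchCluster' | Get-VMHost | ForEach-Object {
        $esx = $_
        Get-AdvancedSetting -Entity $esx -Name 'Misc.APDHandlingEnable','Misc.APDTimeout' |
            Select-Object @{N='Host';E={$esx.Name}}, Name, Value
    }

    # To test with a shorter window (20-second minimum):
    Get-Cluster -Name 'StretchCluster' | Get-VMHost |
        Get-AdvancedSetting -Name 'Misc.APDTimeout' |
        Set-AdvancedSetting -Value 20 -Confirm:$false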



> Test 2.  If the physical server is removed from the VPLEX storage view, but maintains its paths to the VPLEX, HA will occur, but it will question you first.  What you get in vCenter is a pop-up window that states: "The storage backing virtual disk TSTKRSW8K1v-TEST.vmdk has permanent device loss. You may be able to hot remove this virtual device from the virtual machine and continue after clicking button.retry.  Click button.abort to terminate this session."  And you have a choice of "Retry" or "Cancel".  If you select "Cancel", HA will occur, BUT it is not automatic as hoped.  We thought that terminateVMOnPDL would suppress the question.


I Googled your error message (my usual troubleshooting method :-) and one or two hits came up:


VMware Site Recovery Manager 5.5 Release Notes


    • When protection site LUNs encounter All Paths Down (APD) or Permanent
      Device Loss (PDL), SRM might not recover raw disk mapping (RDM) LUNs in
      certain cases. During the first attempt at planned migration you might
      see the following error message when SRM attempts to shut down the
      protected virtual machine:

      Error - The operation cannot be allowed at the current time because the
      virtual machine has a question pending:
      'msg.hbacommon.askonpermanentdeviceloss: The storage backing virtual
      disk VM1-1.vmdk has permanent device loss. You might be able to hot
      remove this virtual device from the virtual machine and continue after
      clicking Retry. Click Cancel to terminate this session.'

      If the protected virtual machines have RDM devices, in some cases SRM
      does not recover the RDM LUN.

      Workaround:

      1. When LUNs enter APD/PDL, ESXi Server marks all corresponding virtual
         machines with a question that blocks virtual machine operations.
         In the case of PDL, click Cancel to power off the virtual machine.
         In the case of APD, click Retry.
         If you run planned migration, SRM fails to power off production
         virtual machines.
      2. If the virtual machines have RDM devices, SRM might lose track of
         the RDM device and not recover it. Rescan all HBAs and make sure
         that the status for all of the affected LUNs has returned from the
         APD/PDL state.
      3. Check the vCenter Server inventory and answer the PDL question that
         is blocking the virtual machine.
      4. If you answer the PDL question before the LUNs come back online, SRM
         Server on the protected site incorrectly detects that the RDM device
         is no longer attached to this virtual machine and removes the RDM
         device. The next time you run a recovery, SRM does not recover this
         LUN.
      5. Rescan all HBAs to make sure that all LUNs are online in vCenter
         Server inventory and power on all affected virtual machines. vCenter
         Server associates the lost RDMs with protected virtual machines.
      6. Check the Array Managers tab in the SRM interface. If all the
         protected datastores and RDM devices do not display, click Refresh
         to discover the devices and recompute the datastore groups.
      7. Make sure that Edit Group Settings shows all of the protected
         datastores and RDM devices and that the virtual machine protection
         status does not show any errors.
      8. Start a planned migration to recover all protected LUNs, including
         the RDM devices.
Does the line "During the first attempt at planned migration you might see the following error message when SRM attempts to shut down the protected virtual machine" imply that this pop-up only happens on the first attempt, or on all attempts?  If it were just the first, then maybe you could go and "clear it" somehow, but that still seems rather annoying and error-prone...

Now the workaround doesn't make much sense to me, but it does seem to indicate that perhaps it's related to RDMs, and not, say, VMFS volumes?

Best advice I can give on this one is to ask VMware what this means, and how to get around the pop-up.  The above statement seems to go back as far as the SRM 5.1 release notes.

Bottom line: I don't necessarily think these behaviors are specific to VPLEX; I think what you're looking at here is a lot of VMware-isms...

Hope that helps,

Gary

89 Posts

December 5th, 2014 21:00

Alas, it looks like APD is just a fact of life, and the setting I mentioned (Misc.APDHandlingEnable, in vSphere 5.1) merely improves the behavior so the ESXi host doesn't hang completely.

http://www.valcolabs.com/2012/08/25/solving-all-paths-down-apd-when-using-emc-vplex/

Seems like to solve your APD problem, failover does indeed need to be manual.  Darn indeed.  Maybe VMware is listening and can make improvements...

Gary

114 Posts

February 5th, 2015 08:00

Gary,

This is excellent indeed. How do you think this will affect the cross-connect requirement?

89 Posts

February 5th, 2015 08:00

Well, it looks like our prayers have been answered :-)

http://www.yellow-bricks.com/2015/02/04/whats-new-ha-vsphere-6-0/

Snippets from above:

    • VM Component Protection – This allows HA to respond to a scenario where the connection to the virtual machine’s datastore is impacted temporarily or permanently.
      • “Response for Datastore with All Paths Down”
      • “Response for Datastore with Permanent Device Loss”

VM Component Protection (VMCP) is in my opinion THE big thing that got added to vSphere HA. What this feature basically allows you to do is protect yourself against storage failures. There are two types of failures VMCP will respond to and those are PDL and APD. Before we look at some of the details, I want to point out that configuring is extremely simple… Just one tickbox to enable it.


[Screenshot: HA in vSphere 6.0]

In the case of a PDL (permanent device loss), this is something HA was already capable of doing when configured through the command line: a VM will be restarted instantly when a PDL signal is issued by the storage system. For an APD (all paths down) this is a bit different. A PDL more or less indicates that the storage system does not expect the device to return any time soon. An APD is more of an unknown situation: it may return... it may not... and no clue how long it takes. With vSphere 5.1 some changes were introduced to the way APD is handled by the hypervisor, and this mechanism is leveraged by HA to allow for a response. (Cormac wrote an excellent post about this APD handling here.)

When an APD occurs a timer starts. After 140 seconds the APD is declared and the device is marked as APD timeout. When the 140 seconds have passed, HA will start counting. The HA timeout is 3 minutes. When the 3 minutes have passed, HA can restart the virtual machine, but you can configure VMCP to respond differently if you want it to. You could for instance specify that events are issued when a PDL or APD has occurred. You can also specify how aggressively HA needs to try to restart VMs that are impacted by an APD. Note that aggressive/conservative refers to the likelihood of HA being able to restart VMs. When set to "conservative", HA will only restart the VM that is impacted by the APD if it knows another host can restart it. In the case of "aggressive", HA will try to restart the VM even if it doesn't know the state of the other hosts, which could lead to a situation where your VM is not restarted as there is no host that has access to the datastore the VM is located on.

It is also good to know that if the APD is lifted and access to the storage is restored during the total of roughly 5 minutes and 20 seconds it would take to restart the VM, HA will not do anything unless you explicitly configure it to do so. This is where the "Response for APD recovery after APD timeout" setting comes into play.
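
For what it's worth, PowerCLI 6.0 doesn't appear to ship a dedicated VMCP cmdlet, but the vSphere 6.0 API exposes it on the cluster's HA configuration. A hedged sketch for inspecting it (cluster name is a placeholder; property names per the 6.0 API):

    # Sketch: inspect VM Component Protection on a vSphere 6.0 cluster.
    $clusterView = Get-Cluster -Name 'StretchCluster' | Get-View

    # Is VMCP enabled at the cluster level ("enabled" / "disabled")?
    $clusterView.ConfigurationEx.DasConfig.VmComponentProtecting

    # Default per-VM responses for storage failures.
    $vmcp = $clusterView.ConfigurationEx.DasConfig.DefaultVmSettings.VmComponentProtectionSettings
    $vmcp.VmStorageProtectionForPDL    # e.g. "restartAggressive"
    $vmcp.VmStorageProtectionForAPD    # e.g. "restartConservative" or "restartAggressive"
    $vmcp.VmTerminateDelayForAPDSec    # the ~3-minute delay described above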



[Screenshot: HA in vSphere 6.0]

Oh happy day! :-)

Gary

114 Posts

February 5th, 2015 10:00

Very good, thank you. With VMCP, the cross-connect becomes complete overkill. And since with the "non-uniform" (non-cross-connect) design we can use the RR policy, it simplifies the config a great deal.

89 Posts

February 5th, 2015 10:00

Hi,

> How do you think this will affect the cross-connect requirement?

I wasn't aware of a cross-connect requirement being in place.  I thought cross-connect in a Metro environment was always optional.  Uniform or non-uniform, I believe we call it.

If cross-connect were in place, I believe an APD would require two failures - for example, a failure of the host's connection to the remote cluster's VPLEX front-end ports, and a failure of its connection to the local cluster's VPLEX front-end ports.  Without vSphere 6.0 the host would sit in the APD state indefinitely but take no automated action.  Now it seems like in 6.0 the VM has a chance to be failed over to another host, either at the local cluster or at the remote cluster, depending upon the nature of the failures.

So I guess, in my opinion, I see this new 6.0 APD behavior reducing the need for cross-connect.  Cross-connect has always been an extra cost and complexity (stretching fabrics across data centers).

I'd personally want to try out all the failure scenarios listed here again on vSphere 6.0:

VMware KB: Implementing vSphere Metro Storage Cluster (vMSC) using EMC VPLEX

I see a couple of them changing for 6.0.  The last entry for each scenario below is my take on how it could all work:

Scenario 1: ESXi host experiences APD (All Paths Down) - encountered when the
ESXi host loses access to its storage volumes (in this case, VPLEX volumes).

  VPLEX behavior: None.

  Impact/observed VMware HA behavior (5.x): In an APD scenario, the ESXi host
  must be rebooted to recover.  If the ESXi server is restarted, VMware HA
  will restart the failed virtual machines on other surviving ESXi servers
  within the VMware HA cluster.

  Impact/observed VMware HA behavior (6.0, to be confirmed): In an APD
  scenario with VMCP enabled, VMware HA will restart the failed virtual
  machines on other surviving ESXi servers within the VMware HA cluster.

Scenario 2: VPLEX cluster failure (the VPLEX at either site A or site B has
failed, but ESXi and other LAN/WAN/SAN components are intact).

  VPLEX behavior: I/O continues to be served on all the volumes at the
  surviving site.

  Impact/observed VMware HA behavior (5.x): The ESXi hosts located at the
  failed site experience an APD condition and need to be rebooted to recover
  from the failure.  In a uniform host access configuration, the virtual
  machines run without any impact since the ESXi host can still access the
  distributed volume through the preferred site.

  Impact/observed VMware HA behavior (6.0, to be confirmed): The ESXi hosts
  located at the failed site experience an APD condition.  In a non-uniform
  host access configuration with VMCP enabled, VMware HA will restart the
  failed virtual machines on other surviving ESXi servers within the VMware
  HA cluster.  In a uniform host access configuration, the virtual machines
  run without any impact since the ESXi host can still access the distributed
  volume through the preferred site.

Thanks,

Gary

114 Posts

February 5th, 2015 11:00

Yes, corrected

89 Posts

February 5th, 2015 11:00

Excellent point - without the need for cross-connect access anymore to protect against APD, you can be just fine with NMP RR doing I/O to the local cluster.
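
If anyone wants to flip the VPLEX devices to Round Robin in bulk, a hedged PowerCLI sketch.  The cluster name is a placeholder, and the Vendor/Model filter is an assumption (VPLEX front-end volumes have historically reported an EMC "Invista" inquiry string) - verify against your own LUNs first:

    # Sketch: set NMP Round Robin on what look like VPLEX LUNs.
    Get-Cluster -Name 'StretchCluster' | Get-VMHost | ForEach-Object {
        Get-ScsiLun -VmHost $_ -LunType disk |
            Where-Object { $_.Vendor -eq 'EMC' -and $_.Model -match 'Invista|VPLEX' } |
            Set-ScsiLun -MultipathPolicy RoundRobin
    }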

One always had the option to use PP/VE in a cross-connected setup to have it prefer local paths over remote paths, but some folks didn't like the extra expense of PP/VE.

Thanks,

Gary

BTW, uniform is cross-connect, and non-uniform is non-cross-connect.  Pretty sure, anyways :-)

  • Uniform Host Access (Cross-Connect) – This deployment involves establishing a front-end SAN across the two sites, so that the hosts at one site can see the storage cluster at the same site as well as the other site.

  • Non-uniform Host Access – This type of deployment involves the hosts at either site seeing the storage volumes through the same-site storage cluster only.

Source: ValCo Labs – VPlex Non-Uniform and Uniform Configurations

89 Posts

May 19th, 2015 09:00

FYI on a recent blog post by Drew Tonnesen regarding VMAX and VPLEX behavior and settings for APD/PDL in vSphere 6:

https://drewtonnesen.wordpress.com/2015/05/13/apdpdl-in-vsphere-6-with-vmax-and-vplex/

Thanks,

Gary
