VMware vSphere 6.5 adds some enhancements and new features to its clustering technology, High Availability (HA). The basic principle of HA has not changed: since vSphere 5, the Fault Domain Manager (FDM) agent has been responsible for communication between the hosts in a cluster.

We still have the master-slave system, where a single master is elected within the cluster and communicates with all "slaves" through the management network, with shared datastores (datastore heartbeating) serving as a second communication channel.

The master monitors the availability of the slave ESXi hosts and of the virtual machines (VMs) running on them. If a slave host fails, all VMs from that host are restarted on another host. I have detailed the configuration of a VMware HA cluster through the vSphere Client in a previous post.

In many environments, vCenter itself can be a single point of failure, but technologies like HA are designed to keep functioning even when vCenter is unavailable. So what are the latest innovations from VMware to improve the availability of vSphere environments?

The main new availability features of vSphere 6.5 are Proactive HA, Orchestrated HA, and better admission control. There is also a new HA active/passive clustering capability for VMware vCenter Server Appliance 6.5 (VCSA), but I'll talk about this new feature in my next post. For now, let's explore the three features mentioned above.

Proactive HA

Proactive HA may not be the obvious feature to start with, as it requires third-party (hardware) vendors to step in with hardware-specific modules for each host, along with a vCenter Server plugin. The vendor's modules are designed to inform vSphere of potential problems with a host; they trigger an HA event before the host actually runs into trouble. This provides a sort of "report and act before failure."

These modules can take into account things like SMART data from disks, and they integrate directly with some motherboards or storage controllers, informing the cluster that part of the motherboard or a storage module is malfunctioning or running in a degraded state.

Another case would be failing DIMM modules or an overheating chassis. A hardware manufacturer's modules can integrate all of these hardware-specific parts of a host more tightly than VMware can with its general software.

You can think of it as a bridge between the bare-metal hardware and the VMware layer that reports upcoming failures to VMware HA. In this situation, HA can take actions such as evacuating all the running VMs off the host and putting the host into maintenance mode.

VMware vSphere availability Proactive HA failures and responses

Think of remediation as a response to an event. If some event occurs (such as an error message, a SMART "red" event, or an overheating "red" alert), the system responds with a remedial action, even if the alert is only a warning rather than an actual failure (such as a host, network, or storage failure).

There is a new remediation mode called quarantine mode, which applies the same response to both moderate and severe failures. Quarantine mode allows the host to keep running VMs as long as their performance isn't affected. In larger clusters with plenty of free resources, this mode will simply avoid placing new workloads on the quarantined host (because there are enough healthy hosts to run them). But in an environment running many workloads on so-so capacity, the host can continue to run VMs even while quarantined.

You can also select mixed mode, which applies quarantine mode for moderate failures and maintenance mode for severe failures. The Proactive HA module is completely new in vSphere 6.5; previous releases provided only "traditional" HA, which triggered a restart of VMs on the remaining hosts within the HA cluster only after a failure had occurred.
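The mapping between remediation modes and failure severities can be sketched with a small, purely illustrative function. The mode and severity names mirror the text above; this is not the vSphere API, just a model of the decision:

```python
def remediation(mode: str, severity: str) -> str:
    """Map a provider-reported failure severity to a host remediation.

    mode: 'quarantine', 'maintenance', or 'mixed' (admin's Proactive HA setting)
    severity: 'moderate' or 'severe' (reported by the vendor's health provider)
    """
    if mode == "quarantine":
        return "quarantine"      # same response for both severities
    if mode == "maintenance":
        return "maintenance"     # always evacuate and enter maintenance mode
    if mode == "mixed":
        # moderate failures -> quarantine, severe failures -> maintenance
        return "quarantine" if severity == "moderate" else "maintenance"
    raise ValueError(f"unknown mode: {mode}")

print(remediation("mixed", "severe"))  # maintenance
```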

Quarantine mode options

Orchestrated HA

This option is designed especially for multi-tier apps. For instance, you might have a front-end tier with a web server, a middle tier for the core application, and a back-end database tier; these three VMs form one logical application that you must manage as a single system. Now imagine that one of your hosts fails. How do you restart each tier in the right order?

Before Orchestrated HA, vSphere would simply restart the VMs on the surviving hosts within your cluster without any regard to restart priority, which was quite rudimentary.

Now, with Orchestrated HA, you have the option to specify the restart order for your VMs. To this end, you can create groups of VMs and configure the restart order for those groups. For instance, you could have a situation where VM02 depends on VM01, and VM03 depends on VM02, which we can symbolize like this: VM01 < VM02 < VM03.
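The dependency chain above amounts to a topological ordering. The following sketch is illustrative only: a plain Python helper, not the vSphere API (Orchestrated HA itself works with VM groups and VM/host rules rather than per-VM pairs):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def restart_order(dependencies):
    """dependencies: (vm, depends_on) pairs; returns a valid start order."""
    ts = TopologicalSorter()
    for vm, dep in dependencies:
        ts.add(vm, dep)  # dep must be running before vm starts
    return list(ts.static_order())

# VM01 < VM02 < VM03 from the example above
print(restart_order([("VM02", "VM01"), ("VM03", "VM02")]))
# ['VM01', 'VM02', 'VM03']
```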

First, you would have to create a VM group by selecting Cluster > Configure > Configuration > VM/Host Groups.

Configure VM groups

Then you would create VM/host rules (see below) and configure the dependency of those groups.

Configure VM to VM dependency

Better admission control

While admission control has always been part of vSphere HA, the vSphere 6.5 release brings some significant enhancements and new options. Previously, the percentage-based policy was the preferred option, where you reserved a percentage of the cluster's CPU and memory capacity as spare failover capacity. In 6.5, you can instead simply define the number of host failures the cluster should tolerate, and vSphere computes the reserved percentage automatically.

You can configure admission control to specify whether VMs may start if they violate availability constraints. The cluster reserves resources so that all running VMs can fail over from the specified number of failed hosts. With admission control activated, once the cluster runs out of unreserved resources you can no longer power on newly created VMs, migrate VMs into the cluster, or even increase the memory or CPU reservation of a VM.

For a three-host cluster with 128 GB of RAM per host, N+1 redundancy reserves one host's worth of capacity (128 GB), so VMs will still start or restart as long as they consume no more than the remaining 256 GB.
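The arithmetic behind that N+1 example can be checked quickly. This is plain arithmetic, not an API call; admission control itself does this bookkeeping against per-VM reservations and the configured policy:

```python
hosts = 3
ram_per_host_gb = 128
tolerated_host_failures = 1  # N+1 redundancy

total_gb = hosts * ram_per_host_gb                        # 384 GB raw capacity
reserved_gb = tolerated_host_failures * ram_per_host_gb   # 128 GB held back for failover
usable_gb = total_gb - reserved_gb

print(usable_gb)  # 256 -> VMs can consume up to 256 GB and still fail over
```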

There's a new admission control option in vSphere 6.5 called Performance degradation VMs tolerate. This setting aims to preserve the same level of performance after an HA event restarts the VMs on other hosts within the cluster. If the cluster cannot satisfy the configured level of performance, the system issues a warning that there is insufficient failover capacity to guarantee the same performance after the VMs restart.
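A hedged sketch of the capacity check behind that warning. The real computation in vSphere works on per-VM reservations and covers both CPU and memory, so treat the function name, parameters, and formula here as illustrative assumptions, not the actual implementation:

```python
def degradation_warning(utilized_gb, capacity_after_failure_gb, tolerance_pct):
    """Warn when the capacity left after a host failure cannot cover current
    utilization minus the degradation the admin is willing to tolerate."""
    acceptable_loss_gb = utilized_gb * tolerance_pct / 100
    return (utilized_gb - acceptable_loss_gb) > capacity_after_failure_gb

# 300 GB in use, 256 GB left after one host fails:
print(degradation_warning(300, 256, 0))   # True  -> warn (0% degradation tolerated)
print(degradation_warning(300, 256, 25))  # False -> 225 GB demand fits in 256 GB
```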


Performance degradation VMs can tolerate after restarts

Wrap-up

VMware continues to add "fine-tuning" features to its already reliable HA stack. The admin has to worry less about potential hardware problems, as the software can react automatically by isolating the failing host within the cluster and restarting the workloads elsewhere. With the new Proactive HA options, the software can even vMotion VMs to another host (meaning no downtime) before the host actually fails.




© 4sysops 2006 - 2022

