Is cloud computing more reliable than your on-premises datacenter? I've heard from several sources about an interesting side effect of Spectre and Meltdown. High-availability (HA) projects that people had put on the back burner suddenly gained urgency when Azure restarted all virtual machines (VMs) to implement patches.

So is the cloud more reliable? The answer is "most likely," but it doesn’t happen automatically; it takes planning and design.

We'll focus on infrastructure as a service (IaaS) first, as this is where many businesses find themselves, especially after a lift-and-shift project. If you create a single VM using premium storage (SSD-based), it will have a 99.9% service-level agreement (SLA). Standard (HDD-based) storage for a single VM has no SLA.

Availability sets

If you group at least two VMs together in an availability set, you qualify for a 99.95% SLA. Behind the scenes, Azure spreads the VMs across racks in a Hyper-V cluster. Two concepts to understand are update domains (UDs) and fault domains (FDs).

UDs deal with planned downtime and maintenance events (such as the aforementioned Spectre and Meltdown patches). Where possible, Microsoft uses a technology known as preserving host updates (PHU). This pauses all VMs on a particular host, replaces or patches most components of the host OS without rebooting, and then resumes the VMs (in under 30 seconds).

But there are times when you actually need to reboot the VMs (and the host). An availability set receives five UD assignments and spreads the VMs in the set across these automatically. So if you have eight VMs, it will put the first in UD1, the second in UD2, and so forth, with the sixth in UD1, the seventh in UD2, etc. When you need to restart hosts and VMs, Azure will ensure it processes only one UD at a time.
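The round-robin placement described above can be sketched in a few lines of Python. This is only an illustration of the pattern in the text, not Azure's actual placement logic; the VM names and the `assign_update_domains` helper are hypothetical, and the UD numbering starts at 1 to match the example.

```python
from collections import defaultdict

UPDATE_DOMAINS = 5  # the number of UDs quoted in the text


def assign_update_domains(vm_names, domain_count=UPDATE_DOMAINS):
    """Place VMs into update domains round-robin, numbering UDs from 1."""
    domains = defaultdict(list)
    for index, name in enumerate(vm_names):
        domains[index % domain_count + 1].append(name)
    return dict(domains)


vms = [f"vm{i}" for i in range(1, 9)]  # eight hypothetical VMs
layout = assign_update_domains(vms)
for ud, members in sorted(layout.items()):
    print(f"UD{ud}: {', '.join(members)}")
```

With eight VMs, UD1 to UD3 each hold two VMs and UD4 and UD5 hold one each, so a restart that processes one UD at a time takes at most two of the eight VMs offline at once.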

FDs, on the other hand, concern single points of failure such as power and networking. Each availability set has at least two FDs, which ensures that Azure spreads your VMs across at least two racks. The FD count isn't set in stone, and your availability set may have more.

So if you have two SQL Server VMs or two domain controllers (DCs), for example, put them in an availability set. It's a bit counterintuitive, but that's how you let Azure know to "keep these apart." It'll give you higher availability without extra cost.
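To see why grouping VMs helps, here is a rough Python sketch of the worst case: how many VMs keep running when the fullest domain (a rack for FDs, a reboot batch for UDs) goes down. It assumes VMs are spread as evenly as possible across domains, which is a simplification; the function name is made up for this example.

```python
import math


def worst_case_survivors(vm_count, domain_count):
    """VMs still running if the most heavily loaded domain goes down,
    assuming VMs are spread as evenly as possible across domains."""
    largest_domain = math.ceil(vm_count / domain_count)
    return vm_count - largest_domain


# Two SQL Server VMs across two fault domains: a rack failure leaves one VM.
print(worst_case_survivors(2, 2))
# Eight VMs across five update domains: a UD reboot leaves six VMs.
print(worst_case_survivors(8, 5))
# A single VM has nowhere else to go.
print(worst_case_survivors(1, 2))
```

The last case is the point of the exercise: a lone VM always shares its fate with one rack and one update domain.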

The underlying storage in Azure is also durable. The normal locally redundant storage (LRS) gives you three copies of your data in the same datacenter. There are also other options such as zone-redundant storage classic (ZRS Classic) and geographically redundant storage (GRS), which gives you another three copies in another location. Note that ZRS Classic and GRS aren't suitable for IaaS VMs; for VM replication, Azure Site Recovery is a better choice (see below).

An availability set will protect you against an Azure rack failure and planned restarts, but it won't protect you against a datacenter outage.

Availability zones

Availability zones have been in preview since September 2017 and are available in select regions (East US 2, Central US, West Europe, and France Central). Each region (Azure currently has 36, with an additional 6 announced) consists of more than one datacenter. The point of availability zones is to spread your resources across datacenters in a single region. Currently the preview supports:

  • Linux VMs
  • Windows VMs
  • VM scale sets
  • Managed disks
  • Load balancing
  • Public IP addresses

Note that you'll need to test your particular workload. While each zone in a region is close geographically, there will be a higher latency for communication, which may affect performance. The promise of availability zones is that they'll offer a 99.99% SLA.
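It helps to translate the SLA percentages quoted so far (99.9% for a single VM on premium storage, 99.95% for an availability set, and the promised 99.99% for availability zones) into permitted downtime. The Python sketch below annualizes the figures purely for illustration; the SLA documents define the real measurement period and exclusions, and the tier labels are my own shorthand.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime a given SLA percentage permits over a period."""
    return period_hours * (1 - sla_percent / 100)


tiers = {
    "Single VM, premium storage": 99.9,
    "Availability set": 99.95,
    "Availability zones (promised)": 99.99,
}

for label, sla in tiers.items():
    hours = allowed_downtime_hours(sla)
    print(f"{label}: {sla}% -> {hours:.2f} h/year ({hours * 60:.0f} min)")
```

Roughly, that is under nine hours a year at 99.9%, a little over four hours at 99.95%, and under an hour at 99.99%.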

Other services in Azure are also starting to take advantage of availability zones. Recently (as of January 24, 2018), Microsoft released ZRS in preview. This spreads copies of your data in the storage backend across zones and writes data synchronously, which means zero data loss even if a datacenter or zone fails. It's similar to the earlier ZRS offering but gives you more control; Microsoft has renamed that earlier offering ZRS Classic.

You can create single VMs (detailed tutorial here) and specifically pick a particular numbered availability zone. Below you can see the choices I have in the East US 2 region. You can do this in the portal, in PowerShell, the cross-platform command line interface (CLI), or even use the representational state transfer (REST) application programming interface (API). To provide HA, you would create at least two VMs, each in a different availability zone, with a load balancer in front of them.

Availability zones for a single VM
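As a rough illustration of what pinning VMs to numbered zones looks like outside the portal, the Python sketch below builds partial resource descriptions. It assumes the Azure Resource Manager convention of a top-level `zones` array on the VM resource; the VM names and the `zonal_vm_fragment` helper are hypothetical, and a real request needs many more properties (size, image, disks, networking) that are left out here.

```python
import json


def zonal_vm_fragment(name, location, zone):
    """Partial VM description pinned to one zone (illustrative only;
    hardware, OS, disk, and network settings are deliberately omitted)."""
    return {"name": name, "location": location, "zones": [str(zone)]}


# Two hypothetical web VMs in East US 2, one per zone, to sit behind a load balancer.
fragments = [zonal_vm_fragment(f"web-{zone}", "eastus2", zone) for zone in (1, 2)]
print(json.dumps(fragments, indent=2))
```

The detail to notice is that the zone is a property of each VM, so spreading a workload means creating each VM with a different zone number.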


Alternatively, you can create a VM scale set (identical machines that Azure can scale out or in based on load) in an availability zone. Below are my choices in the Central US region. Again, you would front two or three VM scale sets in different availability zones with a load balancer, so even if a whole datacenter is out of action, the load balancer will direct clients to other VMs. Very recently (January 24, 2018), Microsoft released the capability, in preview, of creating a VM scale set that automatically scales across zones. This removes the need to pick zones manually.

Availability zones for a VM scale set
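A back-of-the-envelope model shows why fronting several zones with a load balancer pays off. The Python sketch below assumes zones fail independently and ignores the load balancer's own availability, both of which are strong simplifications; the 99.9% per-zone figure is a hypothetical input, not a published number.

```python
def combined_availability(per_zone, zones):
    """Probability that at least one zone is up, assuming independent failures."""
    return 1 - (1 - per_zone) ** zones


PER_ZONE = 0.999  # hypothetical availability of the workload in one zone

for zone_count in (1, 2, 3):
    availability = combined_availability(PER_ZONE, zone_count)
    print(f"{zone_count} zone(s): {availability:.7%}")
```

Each extra zone multiplies the chance of a total outage by the per-zone failure probability, which is why two or three zones are usually enough under this model.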


Alternatives

There are many ways to build HA in the cloud. Availability zones are a good addition to Azure. But if there's a natural disaster in the region that hosts your mission-critical workload, which could potentially take out all the datacenters in the region, availability zones aren't going to help.

If disaster recovery (DR) is your goal rather than downtime measured in seconds, Azure Site Recovery, which is based on popular technology, is your friend. You can use it to migrate workloads to the cloud (the first 31 days are free for each VM) from VMware, Hyper-V, and physical servers. You can also provide DR for your own datacenter by continuously replicating VMs to Azure. Azure Site Recovery now also supports replicating VMs from one Azure region to another.

Be aware that IaaS isn't the only game in town. Azure offers many platform-as-a-service (PaaS) options, such as Azure SQL Database and Cosmos DB for database services, App Service for web hosting, and Azure Functions or Kubernetes container orchestration for running your code. These services often provide very good HA while requiring a lot less management of the underlying infrastructure.

I've seen people being critical of availability zones (and Azure HA technologies in general) with comments along the lines of "they should have built it in for all workloads" or "it's too complex." I think these critics are missing the point: not all workloads are equal. I wouldn't want to pay for HA for my developers' test servers. And for some of my workloads, manual DR is absolutely preferable to a complex HA configuration. It all depends on the workload and the needs of the business.


Suffice it to say, however, if you really need HA for a service, building it in Azure gives you many different options for achieving resiliency and HA, with availability zones being one more tool in your belt.

1 Comment
  1. Arjun Bahree 5 years ago

    Very well articulated, concisely cut, and no-fluff information Paul.

    Hope you are able to update this article with further changes happening around native Azure IaaS HA options (and I hear a few arriving this year) instead of newer articles on newer Info since its more convenient to have all the relevant Info in one place.

© 4sysops 2006 - 2022

CONTACT US

Please ask IT administration questions in the forums. Any other messages are welcome.

Sending

Log in with your credentials

or    

Forgot your details?

Create Account