By Paul Schnackenburg
As any sysadmin knows, storage is key to a well-functioning virtualized data center. Traditionally, this has been SAN storage – Fibre Channel (FC) when you could afford it and iSCSI when you couldn't.
Back in Windows Server 2012, Microsoft introduced Storage Spaces. A replacement for SANs, this used Windows Server file servers ("Scale-Out File Servers" or SOFS) in front of shared SAS enclosures. While cost effective and easier to manage than SANs, Storage Spaces wasn't a runaway success. This was partly because of its limited scope, which covered only Hyper-V and SQL Server storage, and partly because very few hardware vendors wanted to sell solutions that directly competed with their pricey SAN offerings. Ironically, it turns out that admins often used SOFS in front of SANs to simplify access and limit the number of FC switches and host bus adapters needed.
What Microsoft lacked was an answer to VMware's vSAN. Storage Spaces is a good solution, but only for medium to large environments where separate server clusters and disk enclosures make economic sense.
Way back, we looked at the first iteration of Storage Spaces Direct (S2D) in Technical Preview 2. Back then, setting it up was very complex. However, between TP5 and the RTM release of Windows Server 2016, Microsoft consolidated all the manual steps into a single PowerShell cmdlet.
The building blocks
S2D takes internal storage in servers across a cluster and pools it together for fault tolerance and performance. This storage can take the form of ordinary SAS or SATA HDDs, SAS or SATA SSDs or NVMe. The latter is flash-based storage just like SSDs. However, instead of being attached to a SAS or SATA bus, it directly interfaces with the PCIe bus. This greatly enhances performance and throughput. Note that if you have SAS enclosures external to a server, you can still use them with S2D. Each one must connect to a single server only, effectively presenting itself as internal storage to that node.
One of the benefits of S2D over Storage Spaces is that it can use lower cost storage (SATA) as well as very high performing storage (NVMe). The minimum number of servers for an S2D cluster is 2, and the maximum is 16.
There are two ways of deploying S2D. One is disaggregated, where you have servers with internal storage running SOFS. These present file shares to other clusters running Hyper-V, with the VMs storing their VHDX files on file shares on the storage cluster. The other method is hyper-converged, which uses internal storage on the same servers that are running Hyper-V. The difference comes down to how you want to scale. In smaller shops, you're likely to want hyper-converged. If you need more VMs or more storage, simply buy another server. Conversely, larger environments will likely want to scale the two independently, adding storage nodes when they need more storage and compute nodes when they need more hosts to run VMs. Because there's a lot of data flowing between the servers, Microsoft strongly recommends using RDMA networking with a minimum of 10 Gbps bandwidth between the nodes.
Depending on your needs, you can design S2D systems with different characteristics. If you have modest IOPS and throughput needs, a mix of large HDDs for capacity and SSDs for caching and tiering would be appropriate. If you have high performance needs, you should consider an all-flash system. Also, realize that on a cost-per-GB basis, SSDs are cheaper than NVMe devices. However, if throughput is your aim, NVMe is more cost effective on a dollar-per-IOPS basis. It's also possible to combine all three types of storage media in a single S2D cluster. Note that S2D automatically uses the fastest storage present for caching. This cache dynamically binds to HDDs, adapting to changes in available disks or SSD failures. Data flows over the software storage bus between the disks in each node. This effectively treats all the storage as if it were in a single server by pooling it together.
Scaling out Storage Spaces enclosures was not an easy proposition. On the other hand, S2D scales out simply by adding nodes with more internal storage. It automatically claims and adds available drives in the new server(s) to the storage pool. A drive is available when it doesn't have a partition on it and it's not a system drive.
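The eligibility rule above amounts to a simple filter. The following is a minimal Python sketch of that rule only, not of how S2D itself claims drives; the `Drive` record and its field names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    # Hypothetical record; these fields are illustrative, not a Windows API.
    name: str
    partition_count: int
    is_system: bool


def claimable(drives):
    """Return the drives that have no partitions and are not system drives."""
    return [d for d in drives if d.partition_count == 0 and not d.is_system]


# Drives in a hypothetical new node joining the cluster.
new_node = [
    Drive("os-ssd", partition_count=3, is_system=True),
    Drive("hdd-1", partition_count=0, is_system=False),
    Drive("hdd-2", partition_count=1, is_system=False),  # leftover partition
]
eligible_names = [d.name for d in claimable(new_node)]
print(eligible_names)
```

In this example only `hdd-1` qualifies: `os-ssd` is a system drive and `hdd-2` still carries a partition, so it would need wiping before the pool could use it.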
Storage Spaces has a limitation of 80 drives per pool. This is primarily because each drive stores metadata about every disk in the pool. In S2D, some disks keep this information but not all of them. The current recommended maximum is 240 disks. I also recommend keeping a ratio between performance disks and capacity disks. For instance, if you have two SSDs in each host, have two, four or six hard drives. If you have three SSDs (or NVMe drives), pair them with three or six hard drives.
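The ratio recommendation can be read as "make the number of capacity drives a whole multiple of the number of performance drives in each host". Here is a small Python sketch of that reading; the function name is made up for this example, and the rule is my interpretation of the guidance above, not a check that S2D enforces.

```python
def is_balanced(performance_drives: int, capacity_drives: int) -> bool:
    """True when capacity drives are a whole multiple of performance drives."""
    if performance_drives <= 0 or capacity_drives <= 0:
        return False
    return capacity_drives % performance_drives == 0


# Per-host layouts as (performance drives, capacity drives).
for layout in [(2, 4), (2, 6), (3, 6), (3, 4)]:
    print(layout, is_balanced(*layout))
```

A layout such as three SSDs with four hard drives fails the check because the hard drives cannot be shared evenly among the SSDs.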
Resilient File System (ReFS) is the recommended file system. It provides many benefits over NTFS, particularly in metadata operations, which are lightning fast. The one thing not supported in ReFS is deduplication, but this is coming.
Fault tolerance
Occasionally, I come across an old hand IT pro who will dismiss Storage Spaces and S2D as Windows Server software RAID from back in the Windows NT/2000 days. This is totally wrong; neither is related to software RAID at all. Whether storing the data using mirroring or erasure coding (parity – see more below), the technology stripes data in slabs across drives. This provides excellent performance as opposed to the old world where a whole drive was mirrored to another drive.
As a basic statement, S2D is fault tolerant to two simultaneous drive failures or two simultaneous server failures. It does this by storing three copies of your data on different drives and different cluster nodes. This is a must when using internal storage because you'll need to restart servers for maintenance or patching. During that planned outage, you still need two copies/servers to protect you from an unexpected disk or server failure.
Announced at Ignite 2016 late last year, S2D now supports a minimum of two nodes in a cluster. In that scenario, you'll only have two copies of your data, leading to fault tolerance for a single drive or node only. Note that in a two-node cluster, you'll need to have a file share witness or an Azure cloud witness to provide quorum for the cluster. If you have three servers in the cluster, S2D will opt for three-way mirroring. This provides excellent read and write performance but at the cost of efficiency. After all, if you need 10 TB of available storage, you must buy 30 TB of HDD/SSD.
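The mirroring arithmetic is easy to check. This is a rough Python sketch that assumes the node-count rules described in this article (two copies on two nodes, three copies on three or more, 16 nodes at most); it ignores reserve capacity and cache drives, and the function names are invented for illustration.

```python
def mirror_copies(nodes: int) -> int:
    """Number of data copies kept by mirroring for a given cluster size."""
    if not 2 <= nodes <= 16:
        raise ValueError("an S2D cluster has between 2 and 16 nodes")
    return 2 if nodes == 2 else 3


def tolerated_failures(nodes: int) -> int:
    """Simultaneous drive or node failures that still leave one copy."""
    return mirror_copies(nodes) - 1


def raw_tb_for_mirror(usable_tb: float, nodes: int) -> float:
    """Raw capacity to buy for the requested usable capacity."""
    return usable_tb * mirror_copies(nodes)


print(raw_tb_for_mirror(10, nodes=3))  # three-way mirror
print(raw_tb_for_mirror(10, nodes=2))  # two-way mirror
```

For 10 TB of usable space, the three-node case works out to 30 TB of raw storage and the two-node case to 20 TB, with tolerance for two failures and one failure respectively.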
If you have four or more nodes, however, things start to get interesting. The most effective way to store data is using parity, which Microsoft calls erasure coding (logically equivalent to RAID 6). This means you can get from 60% to 80% disk utilization (see the chart below). Thus, instead of having to buy 30 TB of raw storage, you may only have to buy 12–14 TB. However, parity has a downside. Writing data is expensive in terms of CPU and IOPS because of the need to read the data, modify it, recalculate the parity and then write it out across the three drives. Parity and mirroring do not affect reading data.
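To see how efficiency translates into hardware, divide the usable capacity you need by the efficiency you expect. The Python sketch below applies that division to the 60% and 80% ends of the range quoted above; it is a simplification that ignores reserve capacity and any mirror portion of the volume, and the function name is hypothetical.

```python
def raw_tb_for_parity(usable_tb: float, efficiency: float) -> float:
    """Raw capacity needed when `efficiency` is the usable fraction (0-1)."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be a fraction between 0 and 1")
    return usable_tb / efficiency


best_case = round(raw_tb_for_parity(10, 0.80), 1)
worst_case = round(raw_tb_for_parity(10, 0.60), 1)
print(best_case, worst_case)
```

Compare this with three-way mirroring, where the efficiency is one third and the same 10 TB costs 30 TB of raw storage.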
The real magic (which wasn't available in Storage Spaces) of S2D on ReFS is where it combines the use of three-way mirroring and erasure coding on the same volume. Microsoft calls this accelerated erasure coding. S2D sets aside about 10–20% of the space for the mirror, which acts like a write cache for incoming writes. A VM writes some data to the three mirrored copies. As soon as this happens, the IO is complete as far as the VM is concerned. In the background, S2D will arrange the IOs for optimal, sequential writing and write them out to disk. As an interesting anecdote, during the Technical Preview period, another team in Microsoft contacted the S2D team to let them know that their Performance Monitor counters were incorrect. They were seeing 600 IOPS for a single HDD, which they said was impossible. What they were seeing was the effect of the inbuilt cache mirror and highly optimized sequential writing of data.
There are actually two types of parity schemes in use in S2D. If you have 11 servers or fewer, or an all-flash configuration, S2D uses only erasure coding, also known as Reed-Solomon error correction (RS). If you have 12 or more servers and you're mixing HDDs and SSDs, S2D uses a parity scheme that Microsoft Research invented: Local Reconstruction Codes (LRC).
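The selection rule can be written down as a small decision function. This Python sketch encodes only the thresholds stated in this article (parity from four nodes, LRC from 12 nodes with mixed media); it is not taken from Microsoft's implementation, and the function name is made up.

```python
def parity_scheme(nodes: int, all_flash: bool) -> str:
    """Return "RS" or "LRC" following the rule described in the article."""
    if nodes < 4:
        raise ValueError("parity needs four or more nodes")
    if nodes >= 12 and not all_flash:
        return "LRC"  # Local Reconstruction Codes
    return "RS"  # Reed-Solomon


print(parity_scheme(8, all_flash=False))
print(parity_scheme(12, all_flash=False))
print(parity_scheme(16, all_flash=True))
```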
In a large deployment where you might spread servers across racks or in a blade server chassis, you can tag individual nodes with rack or chassis information. S2D will make sure to spread your data accordingly so that it is fault tolerant to a rack or chassis failure.
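As a toy illustration of what rack awareness means for a three-way mirror, the Python sketch below picks one node per rack until it has a home for each copy. This is not S2D's actual placement algorithm, which the article does not describe; the node names, rack tags and function are all hypothetical.

```python
def place_copies(node_racks: dict, copies: int = 3) -> list:
    """Choose `copies` nodes so that no two share a rack."""
    chosen = []
    used_racks = set()
    for node in sorted(node_racks):
        rack = node_racks[node]
        if rack in used_racks:
            continue  # a copy already lives in this rack
        chosen.append(node)
        used_racks.add(rack)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough racks to keep every copy in its own rack")


# Hypothetical rack tags for a four-node cluster.
tags = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackC"}
placement = place_copies(tags)
print(placement)
```

Because `n1` and `n2` share a rack, only one of them receives a copy, so losing any single rack still leaves two copies of the data.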
When a drive does fail, the system will let you know which rack the server is in, which slot it's in and the serial numbers of both the drive and the server. Unlike Storage Spaces, which required a fair bit of manual work to retire a drive and get it ready for replacement, S2D only requires you to swap the drives. As soon as a drive fails (before you have replaced the failed drive), S2D will begin the repair using other drives in the pool. During this time, S2D will mark the volume as "incomplete" until it has made three copies. After you insert the replacement drive, S2D will rebalance the pool by spreading the data evenly across all drives.
In the next part of this two-part article, we'll look at networking and the steps to set up S2D clusters, along with how to monitor and manage them.