Since the release of Windows Server 2012, Microsoft has been building a credible alternative to costly SANs using file servers and commodity hardware. With major improvements in 2012 R2, and more coming in Windows Server 2016, it’s an interesting development. Several technology building blocks exist, and they can be used on their own or together for greater synergy.
Scale Out File Server (SOFS) consists of plain Windows file servers that are clustered (two, three, or four are common) for continuous availability. This means that, if a server or connection dies, the switchover of the handles to another node is so fast the Input/Output (IO) interruption doesn’t affect running workloads. (This feature is also known as Transparent Failover.)
Behind the SOFS is either a traditional SAN (Fibre Channel [FC] or iSCSI) or Storage Spaces with hard disks and SSD drives. The former is a convenient way to be able to add more connected nodes to the storage without having to wire each Hyper-V cluster node directly to the SAN, instead relying on normal networking for accessing the SOFS cluster. The latter, Storage Spaces, relies on simple Just a Bunch of Disks (JBOD) enclosures without any RAID backplanes connected to the SOFS nodes through external SAS cables.
The data protection (parity or two-way/three-way mirroring) is controlled by the Windows file servers instead of expensive SAN hardware and software. IO performance is provided by mixing ordinary SAS, near-line, large-capacity disks with SAS SSD drives through storage tiering. This moves hot blocks that require fast access to the SSD tier and colder blocks to the hard drives.
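To make the tiering idea concrete, here is a toy sketch of heat-based placement. It is not Microsoft's implementation: the function name `plan_tiers`, the access log, and the block-count capacity are all hypothetical, and real Storage Spaces tiering works on its own schedule and granularity. It only shows the principle of counting accesses and keeping the hottest blocks on the SSD tier.

```python
from collections import Counter


def plan_tiers(access_log, ssd_capacity_blocks):
    """Split block IDs into (ssd_blocks, hdd_blocks) by access count.

    access_log: iterable of block IDs, one entry per access (hypothetical input).
    ssd_capacity_blocks: how many blocks fit on the SSD tier (hypothetical unit).
    """
    heat = Counter(access_log)
    # Hottest first; ties are broken by block ID so the result is deterministic.
    ranked = sorted(heat, key=lambda block: (-heat[block], block))
    ssd_blocks = set(ranked[:ssd_capacity_blocks])
    hdd_blocks = set(ranked[ssd_capacity_blocks:])
    return ssd_blocks, hdd_blocks


if __name__ == "__main__":
    log = [1, 2, 1, 3, 1, 2, 4, 5, 2, 1]
    ssd, hdd = plan_tiers(log, ssd_capacity_blocks=2)
    print(sorted(ssd), sorted(hdd))  # [1, 2] [3, 4, 5]
```

Blocks 1 and 2 are accessed most often in the sample log, so they land on the SSD tier while the rarely touched blocks stay on the hard drives.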
Connecting the compute nodes (Hyper-V or SQL Server) to the storage uses the new SMB 3 protocol, which offers many features built into the protocol, such as multi-channel (which takes advantage of several NICs automatically), encryption, and more. The connection can use ordinary Ethernet networking (1 Gbps or 10 Gbps). However, to truly match the scalability and performance of high-end FC SANs, your NIC choice should be Remote Direct Memory Access (RDMA) capable.
Remote Direct Memory Access flavors
RDMA has its roots in the High Performance Computing (HPC) world (think large number of nodes crunching scientific or financial data) and provides very high-speed and low-latency networking with minimal CPU overhead. A simple description of the technology is that the data is taken from the memory of one node, transferred on the wire, and deposited in the right place in the memory of the receiving node.
Compared to the many layers of the traditional TCP/IP stack that process data in normal networking, it’s not hard to see how RDMA achieves phenomenal performance. It doesn’t hurt that 40 and 56 Gbps cards are commonplace, and even higher speeds (100 Gbps) were recently released. The part of SMB 3 in Windows that takes advantage of RDMA is called SMB Direct. It requires no configuration and will automatically be enabled if the underlying NIC supports RDMA.
You can purchase three flavors of RDMA today. Mellanox offers InfiniBand, which is quite expensive and requires a separate infrastructure (similar to how you have to use FC host bus adapters, cabling, and switches), negating some of the benefits of converging the infrastructure. The speeds (up to 100 Gbps) are quite phenomenal. To return to the concept of a single, converged networking infrastructure for everything, Mellanox also offers RDMA over Converged Ethernet (RoCE) solutions. For those with a SAN background, RoCE is similar to the Fibre Channel over Ethernet (FCoE) solution to run FC traffic over Ethernet networks. It’s worth mentioning that Azure IaaS offers InfiniBand networking for select virtual machines for workloads that require high-speed networking.
Finally, there’s Internet Wide Area RDMA Protocol (iWARP) from Chelsio. Currently topping out at 40 Gbps, iWARP has been around since 2007. Although InfiniBand has received more space in the limelight, it’s worth noting that Microsoft chose Chelsio RDMA network cards when it designed the Cloud Platform Solution (CPS), its “turnkey private cloud” in one to four racks.
Where you can use RDMA
In Windows Server 2012, RDMA/SMB Direct networking was only supported between the Hyper-V/SQL Server nodes and the SOFS file servers. In 2012 R2, this was extended to provide the ability to connect Hyper-V nodes and run Live Migrations over SMB Direct. It’s not recommended to run both traffic types on the same network interfaces; however, this is being resolved in Windows Server 2016, where you can run both the storage traffic and Live Migration traffic over the same networks.
Testing RDMA with Chelsio NICs
Chelsio kindly offered me two NICs with two 10 Gbps ports (T520-LL), along with two NICs with two 40 Gbps ports (T580-CR) and the required cabling. My test lab consists of four Hyper-V hosts (two Windows 2012 R2 and two Windows 2016 TP2), each with 32 GB of RAM, along with a DC/Virtual Machine Manager server and a Windows 2012 R2 file server with three hard drives and two SSD drives. The file server provides backend storage for the VMs hosted across the cluster and uses storage tiering with two-way mirroring.
I wanted to test two scenarios: the first one is a VM with a high IO load connected to the file server over RDMA networking, and the second is live migrating a “large” VM from one host to another over RDMA.
To test the speed of my storage, I installed the 40 Gbps cards in one Windows Server Hyper-V host and in the storage server, and I connected the two with the supplied 40 Gbps QSFP+ cables. Driver installation was simple with a single driver download that covers all Chelsio network adapters. Note that Windows automatically discovers that the NICs support RDMA and turns on SMB Direct (you can disable it if desired with Disable-NetAdapterRdma in PowerShell).
Checking RDMA status in PowerShell
I created a new VM running Windows Server 2012 R2, moved it to the file server, and pinned the VHDX file to the SSD tier. Finally, I ran DiskSPD inside the VM to test IO speeds at different IO sizes. Older Microsoft–based IO tests use SQLIO but, while it’s still available, DiskSPD is the preferred solution today. Results ranged from 300 MB per second for small IO to more than 530 MB per second for larger IO sizes. Latency ranged from 1.6 to 15 milliseconds, depending on the IO size.
Half a gigabyte per second isn’t bad for a couple of desktop SATA SSDs! And this was all done at about 5% to 6% CPU utilization. Had this been 10 Gbps of non-RDMA networking, CPU would have been 80%+ for the equivalent storage traffic. I also tested the same setup with the two 10 Gbps NICs, with very similar results.
DiskSPD script results
The storage IO test didn’t push the limits of the 2 x 40 Gbps network interfaces, however. I would have to have a LOT more SSD drives (around 40 drives) to provide the required storage performance to match the network speed. If you’re doing any kind of storage testing on your own hardware using SMB 3, make sure to not use file copy because the results will not represent the kind of performance you’re going to get with Hyper-V and SQL workloads.
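The "around 40 drives" estimate can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 Gbps = 125 MB/s), takes the best measured figure of 530 MB per second across the two SSDs, and assumes throughput scales linearly with drive count, which real hardware will not do perfectly. The function names are made up for illustration.

```python
import math


def mb_per_s_to_gbps(mb_per_s):
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_per_s * 8 / 1000


def drives_to_saturate(link_gbps, links, per_drive_mb_s):
    """Drives needed to fill the links, assuming perfectly linear scaling."""
    target_mb_s = link_gbps * links * 1000 / 8
    return math.ceil(target_mb_s / per_drive_mb_s)


if __name__ == "__main__":
    measured_mb_s = 530          # best DiskSPD result from the test above
    ssd_count = 2                # SSDs in the file server
    per_drive = measured_mb_s / ssd_count

    # Share of a single 40 Gbps port that the measured throughput represents.
    used_gbps = mb_per_s_to_gbps(measured_mb_s)
    print(f"{used_gbps:.2f} Gbps, {used_gbps / 40:.1%} of one 40 Gbps port")

    # Drives needed to fill both 40 Gbps ports at the same per-drive rate.
    print(drives_to_saturate(40, 2, per_drive))  # 38
```

At roughly 265 MB per second per SSD, the two 40 Gbps ports (about 10,000 MB per second combined) would need 38 such drives, which lines up with the estimate above.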
Heavy load consuming VM memory
For the next test, I moved the 40 Gbps cards from the file server to a second Windows Server 2012 R2 Hyper-V host. I then created another Windows 2012 R2 VM and allocated it a fixed 24 GB of memory. I configured both hosts to use SMB Direct for Live Migration traffic. Inside the VM, I ran HeavyLoad to ensure that the full 24 GB of memory was in use during the test and live migrated it from one host to another. The results were astounding, with a live migration taking just under 10 seconds.
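To put that number in perspective, here is a rough calculation of the effective throughput it implies. It assumes decimal gigabytes and treats the migration as a single 24 GB copy, ignoring setup time and any memory pages that have to be re-sent because they changed during the transfer, so treat the result as a ballpark only. The helper names are hypothetical.

```python
def effective_gbps(memory_gb, seconds):
    """Average throughput in Gbps for moving memory_gb gigabytes in seconds."""
    return memory_gb * 8 / seconds


def transfer_seconds(memory_gb, link_gbps):
    """Idealized time to move memory_gb gigabytes at the full link rate."""
    return memory_gb * 8 / link_gbps


if __name__ == "__main__":
    # 24 GB of VM memory migrated in just under 10 seconds.
    print(effective_gbps(24, 10))  # 19.2

    # Idealized best case for the same VM on slower links, for comparison.
    for link in (1, 10, 40):
        print(f"{link} Gbps link: {transfer_seconds(24, link):.1f} s")
```

Because the migration took just under 10 seconds, about 19 Gbps is a lower bound on the average rate. Even in the ideal case, a single 10 Gbps link would need roughly twice as long to move the same memory, and a 1 Gbps link more than three minutes.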
Although this seems like a cool feature for a demo with limited application in the real world, this is definitely not the case. Imagine a cluster with 20+ VMs running on each node and it’s time to patch the hosts. The ability to drain a single host in minutes instead of hours will make a huge difference in your sysadmin’s quality of life.
What’s next for RDMA
Microsoft has clearly made its bet on RDMA networking being able to compete with high-end FC SANs. In Windows Server 2012 R2, you must have one RDMA network connected to the backend SOFS storage and another connected to other hosts in the cluster for Live Migration traffic. In Windows Server 2016 TP2, both types of traffic can use the same network interface. Windows Server 2016 TP2 already has the drivers for Chelsio RDMA NICs built in.
More importantly, however, as Microsoft enters the hyper-converged arena with Storage Spaces Direct (S2D - no, that’s not a typo), RDMA will (most likely) be a requirement for connecting the storage/Hyper-V nodes together. Unfortunately, S2D requires a minimum of four nodes to test. Although this isn’t a problem with VMs, I didn’t have enough nodes available to do a full test.
My experience with these Chelsio NICs has been stellar. They’re simple to install and configure. And they just work, with excellent performance. The combination of Storage Spaces, SAS enclosures with HDD/SSD tiering, SOFS, and fast, low-latency storage and Live Migration networks built on Chelsio RDMA technology is very compelling for ease of setup, performance, and cost effectiveness.
It seems, however, that end users are beginning to make use of 10 Gb NICs. I know I am, in the form of two SFPs running on two Windows 10 Pro boxes with 20 TB SAS3 arrays.
The thing that has annoyed me is that I do believe RDMA would be useful to the end user, but Microsoft created yet another Windows SKU in an attempt to milk more money out of users.
Apparently Chelsio has gotten around this with something called iWARP, and you can indeed use their implementation of RDMA on Windows 10 non-business SKU PCs. Can you confirm this?
Thanks for your comments.
Chelsio does indeed support RDMA in the form of iWarp for Windows 10 – however it's only for Win 10 Enterprise and Pro for Workstations. See here https://www.chelsio.com/wp-content/uploads/resources/T5-ClientRDMA-Windows.pdf and here (at the end of the thread) https://social.technet.microsoft.com/Forums/en-US/27a1cc13-ee1e-4c9c-8156-271da0540c4b/why-dont-windows-client-os-support-rdma. I've dealt with Saqib at Chelsio (he posted the last comment in that thread), if you pursue this further with him you can mention my name.
But Enterprise and Pro for Workstations aren't non business SKUs. Note that Pro for Workstations isn't Pro (gotta love MS naming), https://www.howtogeek.com/321736/what-is-windows-10-pro-for-workstations-and-how-is-it-different/.
Hope that helps,
In other words, Microsoft have pretty much screwed end users out of RDMA unless they pay hundreds of dollars, pretty much the sort of old-thinking move that I would have expected from Microsoft.
"Oh I know end users couldn't possibly want to remotely access memory directly via high speed Ethernet so let's call it a Workstation functionality and charge 12x as much for it even though you can do it in Linux for free" … Genius move; after all, nobody will ever need more than 640K of memory either.
Thanks for the response at least now I know for certain that it's time to start looking for a well supported Linux distribution that lets me do what I want.