Disk / Data Deduplication is a feature new to Windows in Server 2012 and has recently been improved in Server 2012 R2. Data Deduplication is based on the idea that if you have multiple copies of the same file you can only actually write one to disk and then just provide pointers to the copy.
Server 2012 takes that idea one step further to be able to recognize duplication down to the level of chunks of files (32–128 KB). Further with Windows Server 2012 R2 they have improved the performance of deduplicated volumes to allow them to be a repository for Hyper-V VHDs.
So where does this really help? The use case for this technology is obviously fileservers, specifically those that have large files that aren’t modified very often. In my case, my two biggest usages are for Veeam backup files of my virtual machines and the images within the document retention system. According to Microsoft you can expect to see space savings up to 95% on virtual hard drives, 80 on ISOs and other deployment software and 50% on basic user documents. If you have a mix of all of the above the process can save you between 50-60% of the actual data size.
Install and enable Data Deduplication in GUI
Getting started with Windows Server 2012 R2 disk deduplication is not hard but it does require a few steps. To make this easy here’s a quick run down.
- Enable the Data Deduplication Role under File and Storage Services> File and iSCSI Services.
- Enable Deduplication on a volume under Server Manager > Volumes by right clicking on the volume and choosing “Configure Data Deduplication…”
- Configure the basic settings for the volume in the window that pops up. You can mess with these depending on your particular need, but I recommend not setting the “Deduplicate files older than…” setting any lower than the default of three days. More than that can have a negative impact on your I/O load.
- Finally you can modify the schedule in which deduplication occurs. By default Windows will just allow it to run at low priority and pause when the server is busy. In our environment backups are typically done by midnight so I allow it to run at normal priority between 12 AM and 6 AM in addition to the default.
After you set Data Deduplication up you can easily monitor how it’s doing, although I would probably give it at least a day to process on a mature file server before expecting to see real results. Graphically you can look at the properties either of the volume or any folder contained with it to see how it’s doing by comparing Size versus Size on disk. Know that that will reflect not only deduplication but also any compression you may have turned on as well. As you can see from the screenshot, the results can be rather staggering.
Verfiy Data Deduplication
Further you run Get-DedupeStatus as Administrator from the PowerShell prompt to see the output of just what Data Deduplication is doing for you.
Enable Data Deduplication with Powershell
So you want to do this in a few places at once? The good news is all the configuration above can be done via PowerShell if you would like to script it. The commands below will enable the role, configure Data Deduplication on a mythical volume F:, and configure the Minimum File Age to 5 days. There are other settings possible but these should get you going for most uses.
Add-WindowsFeature -name FS-Data-Deduplication
Set-Dedupvolume F: -MinimumFileAgeDays 5
For more information regarding when and where to use Data Deduplication Technet has a pretty good read on the subject.