Admins who need to manage file servers and storage understand the hassle that duplicate files may cause. When storage disks start running out of space, admins tend to look for various tools to find and delete duplicate files that unnecessarily waste storage. However, PowerShell provides all you need to deal with duplicate files.
Data loss warning
The procedure described in this post involves the removal of files, which could cause accidental data loss. Therefore, it is highly recommended to create a good backup of the source directory before proceeding.
Get-FileHash cmdlet
PowerShell offers the Get-FileHash cmdlet to compute the hash (or checksum) of one or more files. This hash can be used to uniquely identify a file. In this post, we will use the hash value to identify duplicate files. The syntax of the command is as follows:
Get-FileHash -Path file_path -Algorithm hashing_algorithm
To calculate the hash of a single file, you can run the command shown below:
Get-FileHash -Path 'D:\ISO\WinPE.iso' -Algorithm SHA512
When no hashing algorithm is specified, SHA256 is used by default, which is sufficient in most cases. The hash is used to verify file integrity, particularly for files downloaded from the Internet.
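For instance, to verify a download, you can compare the computed hash with the checksum published by the vendor. In the following sketch, the file path and the expected checksum are just placeholders:

# Compare a file's computed hash with the checksum published by the vendor (placeholder values)
$expected = 'PASTE_PUBLISHED_SHA256_HERE'
$actual   = (Get-FileHash -Path 'D:\Downloads\installer.iso' -Algorithm SHA256).Hash
if ($actual -eq $expected) { 'File is intact' } else { 'Hash mismatch - the file may be corrupted' }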
The good thing about Get-FileHash is that it accepts input from the pipeline, which allows us to use it with the Get-ChildItem cmdlet to compute the hash of each item that Get-ChildItem returns.
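For example, the following one-liner (the folder path is just an example) computes the hash of every file in a folder:

# Compute the hash of every file in a folder; the path is only an example
Get-ChildItem -Path 'D:\MyScripts' -File | Get-FileHash | Select-Object Path, Hash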
Find duplicate files based on the hash
The following command gets all the files in a folder recursively, passes them to Get-FileHash to calculate their hash, and finally groups them based on matching hash values.
$srcDir = "D:\MyScripts" Get-ChildItem $srcDir -File -Recurse ` | Get-FileHash | Group -Property Hash ` | where {$_.Count -gt 1} | foreach { $_.Group | select Path, Hash}
The command has one serious flaw, though. It works well for a few small files, but you will run into trouble if the source directory contains many large files. Hash computation is a resource-intensive operation, and the command above computes it for every file, regardless of size. It will therefore take ages to find duplicates in a directory with a large number of large files. In the next section, we will optimize this command further.
Find duplicate files based on length and hash
A necessary condition for duplicate files is that their size must match; files whose sizes differ cannot be duplicates. You probably know the Length property of the file objects that Get-ChildItem returns. Because the Length value is read from the file system metadata, no computation is required. The trick is to compute hashes only for files of the same length, because we already know that files with different lengths can't be duplicates. This significantly reduces the overall runtime of the command. The PowerShell commands below accomplish this:
$srcDir = "D:\ISO Files" Get-ChildItem -Path $srcDir -File -Recurse | Group -Property Length ` | where { $_.Count -gt 1 } | select -ExpandProperty Group | Get-FileHash ` | Group -Property Hash | where { $_.count -gt 1 }| foreach { $_.Group | select Path, Hash }
With the help of the Group cmdlet, we first group the files that match in size, and then pass those files to Get-FileHash to calculate their hash. The output is identical to the previous command, but we needed significantly less execution time to get the result. The screenshot below shows a performance comparison of both solutions:
You can see that the second command took just 36 seconds, whereas the first one took about 5 minutes. This difference could increase significantly, depending on the size and number of duplicate files in your source directory.
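If you want to reproduce such a comparison in your own environment, a simple approach is to wrap both variants in Measure-Command. This is just a sketch; the source directory is an example:

# Time both approaches with Measure-Command; $srcDir is an example path
$srcDir = "D:\ISO Files"
$hashOnly = Measure-Command {
    Get-ChildItem $srcDir -File -Recurse | Get-FileHash |
        Group -Property Hash | where { $_.Count -gt 1 }
}
$lengthFirst = Measure-Command {
    Get-ChildItem $srcDir -File -Recurse | Group -Property Length | where { $_.Count -gt 1 } |
        select -ExpandProperty Group | Get-FileHash |
        Group -Property Hash | where { $_.Count -gt 1 }
}
"Hash only: $($hashOnly.TotalSeconds) s; length first: $($lengthFirst.TotalSeconds) s"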
To speed things up even further, you could use the MD5 algorithm with the Get-FileHash cmdlet, which is considerably faster than the default SHA256. MD5 is not suitable for hashing secret information, but we are only using it here to compare files. The collision risk is relatively higher with MD5, but that hardly matters unless you're comparing billions of files.
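For example, switching the optimized command to MD5 only requires adding the -Algorithm parameter to Get-FileHash (the path is again just an example):

# Use MD5 instead of the default SHA256; fine for comparing files, not for protecting secrets
$srcDir = "D:\ISO Files"
Get-ChildItem -Path $srcDir -File -Recurse | Group -Property Length | where { $_.Count -gt 1 } `
    | select -ExpandProperty Group | Get-FileHash -Algorithm MD5 `
    | Group -Property Hash | where { $_.Count -gt 1 } | foreach { $_.Group | select Path, Hash }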
Removing duplicate files
Now that we have the right command to identify duplicate files in PowerShell, we need to handle them in a suitable manner. If you are working with important files, it is not recommended to delete duplicates straight away. Instead, you can move them to another directory, preferably on a different drive with enough free space. This is very easy with our new command; we just need to pipe its output to the Move-Item cmdlet. See the updated code below.
# Define source directory
$srcDir = "D:\ISO Files"

# Define destination directory
$targetDir = "E:\DuplicateFiles\$(Get-Date -Format 'yyyyMMdd')"

# Create destination directory
if(!(Test-Path -PathType Container $targetDir)){
    New-Item -ItemType Directory -Path $targetDir | Out-Null
}

# Move duplicate files to a different location
Get-ChildItem -Path $srcDir -File -Recurse | group -Property Length | where { $_.Count -gt 1 } `
    | select -ExpandProperty Group | Get-FileHash | group -Property Hash `
    | where { $_.Count -gt 1 } | foreach { $_.Group | select -Skip 1 } `
    | Move-Item -Destination $targetDir -Force -Verbose
The only change here is that I used -Skip 1 to leave one file in the source directory while moving the other duplicates to the specified target directory. Once they are moved, you can manually review them later and remove them if necessary. If you're working on a huge source directory with millions of files, it is a good idea to avoid the -Verbose parameter with Move-Item.
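If you want to see what -Skip 1 does in isolation, here is a minimal example with dummy file names:

# select -Skip 1 drops the first item, so one copy of each duplicate group stays in place
'copy1.txt', 'copy2.txt', 'copy3.txt' | select -Skip 1    # returns copy2.txt and copy3.txt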
If your directory contains only a handful of files, you may want to manually select which files to move and which to leave in the source directory. You can modify the above code as follows:
# Define source directory
$srcDir = "D:\ISO Files"

# Define destination directory
$targetDir = "E:\DuplicateFiles\$(Get-Date -Format 'yyyyMMdd')"

# Create destination directory
if(!(Test-Path -PathType Container $targetDir)){
    New-Item -ItemType Directory -Path $targetDir | Out-Null
}

# Manually choose duplicate files to move to the target directory
Get-ChildItem -Path $srcDir -File -Recurse | Group -Property Length `
    | where { $_.Count -gt 1 } | select -ExpandProperty Group | Get-FileHash `
    | Group -Property Hash | where { $_.Count -gt 1 } | foreach { $_.Group | select Path, Hash } `
    | Out-GridView -Title "Select the file(s) to move to `"$targetDir`" directory." -PassThru `
    | Move-Item -Destination $targetDir -Force -Verbose
Here, we are using the Out-GridView cmdlet with the -PassThru switch to display the duplicate files in a new window, so you can manually control what to leave in the source directory and what to move. To select multiple files, press and hold the Ctrl key while clicking the files. When you click OK, the selected files are moved to the target directory.
Furthermore, if you already have a good backup of your source directory and want to remove the duplicate files permanently, use the following code instead:
# Define source directory
$srcDir = "D:\ISO Files"

# Permanently delete duplicate files; use with caution
Get-ChildItem -Path $srcDir -File -Recurse | group -Property Length | where { $_.Count -gt 1 } `
    | select -ExpandProperty Group | Get-FileHash | group -Property Hash `
    | where { $_.Count -gt 1 } | foreach { $_.Group | select -Skip 1 } `
    | Remove-Item -Force -Verbose
You can see in the screenshot that four duplicate files were deleted successfully, and in the end, our source directory has only unique files.
You can use any variation of these code snippets to move (or remove) duplicate files according to your requirements. Again, just make sure you have a good backup of the source directory so that you can recover if an important file is accidentally deleted.
Thank you for this. Excellent. I checked your blog, too. Enough there to keep me busy for the rest of my life. I won’t need any other sources for what I want to know, looks like.
I’ve got big problems with duplicate files, I know. But I also know from experience that it is not a simple job to fix. You have to be very careful, right?
Blindly eliminating duplicates can wreck everything, as I think I have found in the past.
But that may have been because my duplicate elimination software was too simplistic – perhaps not checking file length, calling files ‘duplicates’ merely because they had the same name.
Which can destroy all your config.txt or similar in a flash.
Whatever it was I know I created a real hell for myself in the past by trying to organise files and eliminate duplicates.
‘Organising’ meant collecting like files together in one place. That didn’t work. Moving them is the same as deleting them to software that expects to find them there.
Part of the problem is hinted at here where you caution to make backups before doing anything. Well that’s just it. We make backups and then get interrupted before the whole task is complete – which can take a long time when using only ‘spare’ time – and with maybe as many as eight disks to work on, something greater than 10 Terabytes – and there we are during the process with now ‘extra’ duplicates created just for the sake of the exercise!
I very much would like a workable technique for this.
Currently I have lapsed into the ‘redundancy’ mode: i.e. I never delete anything and just keep buying hard drives.
There’s gotta be a better way, but it’s not simply running some duplicate finder/eliminator software, is it?
There has to be some kind of rigorous, careful, and foolproof procedure.
And I can’t devise it.
But this stuff is good. I like it very much.
I thank you for it.
There are multiple techniques to find duplicates, but the most challenging part is scanning directories efficiently. Get-ChildItem is sequential and limited to a 260-character path length (solved in PS7). To scan a disk with millions of files, I created a C# app that uses P/Invoke with FindFirstFile, thus solving the path length issue, along with async functions and a channel. Scanning a 12TB disk for files over 1MB takes less than three minutes (including the file system owner!). Most of the time, file name, date, and size return 90%+ of the duplicates. Hashing is an expensive operation on millions of files. Putting a threshold on the size helps with performance too.
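For readers who want to try a similar pre-filter in plain PowerShell rather than a custom C# scanner, a rough sketch might look like the following; the path and the 1 MB threshold are just examples:

# Cheap pre-filter: a size threshold plus grouping by name, size, and last write time, with no hashing
$srcDir = "D:\Data"
Get-ChildItem -Path $srcDir -File -Recurse | where { $_.Length -gt 1MB } |
    Group -Property Name, Length, LastWriteTime | where { $_.Count -gt 1 } |
    foreach { $_.Group | select FullName, Length, LastWriteTime }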
De-duplication can also be used so you don’t need to care about duplicates. It has a ‘cost’ of course.
Kudos for publishing this method in PowerShell.
Years ago, I wrote the same app in Delphi, but used CRC32 as the hashing tool.
Part of that app is a non-recursive file search that starts at the specified parent directory and then searches all child leaves below. Each filespec is recorded in a TOPAZ (DB4) flat-file database, which is a source-code-level toolbox for Delphi.
On the first pass, I keep track of all three file create/modified/accessed time stamps, the fully qualified file name, and the file size in the database.
The DB is indexed on file size and all identical-sized files are further examined.
Those with different file extensions are skipped.
Those with identical extensions then have their CRC32 calculated and stored in the database. This is a slow process when examining 30,000 to 50,000 files. I rewrote CRC32 in Assembler but got no appreciable boost in performance, as the Delphi compiler is very good at optimizing code. Profiling the code shows that CRC32 is indeed the slow point.
Last, I don’t delete duplicate files but rename them with a representation of their fully qualified file name, e.g., "root-mydocuments-pictures-vacation.img1234.jpg", and then move them all as uniquely named files to a common directory called !Duplicate Files for safekeeping until I am certain they can be deleted.
This is awesome!
Glad to know you found it interesting.
This is a great way to find duplicates. However, I have a significant number of small files (over 100,000) that I’m trying to scan, so for them, file size is a good enough indicator to move them and investigate further. But I can’t seem to modify what you’ve got above to use just the file size. It finds the duplicates and opens the selection window, but the move command fails because the Path is blank or null. What am I doing wrong?
Here’s my command:
Get-ChildItem -Path "C:\MyFiles" -Filter *.jpg -File -Recurse `
| Group -Property Length | where { $_.Count -gt 1 } `
| foreach { $_.Group | select Path, Name, Length } `
| Out-GridView -Title "Select files to move to R:\Dupes" -PassThru `
| Move-Item -Destination "R:\Dupes" -Force -Verbose
This should work:
Get-ChildItem -Path "C:\MyFiles" -Filter *.jpg -File -Recurse `
| Group -Property Length | where { $_.Count -gt 1 } `
| foreach { $_.Group } `
| Out-GridView -Title "Select files to move to R:\Dupes" -PassThru `
| Move-Item -Destination "R:\Dupes" -Force -Verbose
Thank you VERY much! Your script will save me a lot of time and effort, and I really appreciate it.
For anyone else reading the comments, upgrading to PowerShell v7 and using the SHA1 algorithm for hashing sped up my deduping by a HUGE margin. I think the first time I ran it, it took over 7 minutes. But with PowerShell v7 and changing from MD5 hashing to SHA1, it was down to seconds – approx. 30 seconds. That’s a huge improvement in speed.