Search for duplicated files using C# and LINQ

Over the years I have downloaded, copied, and moved my files around; sometimes I made many copies or put them in different directories. Now the time has come to clean up the duplicates. I took the easy way and quickly wrote a small application that scans my drive and flags files that share the same name and length.
The solution is simple. First I create a list of FileInfo objects and fill it by walking the directory tree recursively, skipping anything that raises a permission violation. Then I search the list for duplicates and write the possible duplicates to a file.
Here is the basic code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;

namespace ConsoleApplication1
{
    public class DuplicateFileFinderClass
    {
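        // Every file found so far; filled by ListDrive and read by ListDuplicates.
        public static List<FileInfo> files = new List<FileInfo> ();

        // Adds the files in the given directory to the list and, when
        // enumerateFolders is true, recurses into its subdirectories.
        public static void ListDrive (string drive, bool enumerateFolders)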
        {
            try
            {
                DirectoryInfo di = new DirectoryInfo (drive);
                foreach (FileInfo fi in di.EnumerateFiles ())
                {
                    files.Add (fi);
                }

                if (enumerateFolders)
                {
                    foreach (DirectoryInfo sdi in di.EnumerateDirectories ())
                    {
                        ListDrive (sdi.FullName, enumerateFolders);
                    }
                }


            }

            // Directories that cannot be read are skipped silently.
            catch (UnauthorizedAccessException) { }
        }

        // Groups the collected files by name and length and writes every group
        // with more than one member to DuplicatedFileList.txt.
        public static void ListDuplicates ()
        {
            var duplicatedFiles = files.GroupBy (x => new { x.Name, x.Length}).Where (t => t.Count () > 1).ToList ();

            // The second number counts groups of same-named, same-sized files, not single files.
            Console.WriteLine ("Total items: {0}", files.Count);
            Console.WriteLine ("Probable duplicates: {0}", duplicatedFiles.Count);

            // The using block flushes and closes the log even if an exception is thrown.
            using (StreamWriter duplicatesFoundLog = new StreamWriter ("DuplicatedFileList.txt"))
            {
                foreach (var filter in duplicatedFiles)
                {
                    duplicatesFoundLog.WriteLine ("Probably duplicated item: Name: {0}, Length: {1}",
                        filter.Key.Name,
                        filter.Key.Length);

                    // Looks up the members of this group again in the full list.
                    var items = files.Where (x => x.Name == filter.Key.Name &&
                        x.Length == filter.Key.Length).ToList ();

                    int c = 1;
                    foreach (var suspected in items)
                    {
                        duplicatesFoundLog.WriteLine ("{3}, {0} - {1}, Creation date {2}",
                            suspected.Name,
                            suspected.FullName,
                            suspected.CreationTime,
                            c);
                        c++;
                    }

                    duplicatesFoundLog.WriteLine ();
                }
            }
        }
    }
}
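
A side note on the grouping step: each group produced by GroupBy already contains its member files, so the second lookup in the full list is not strictly required. Below is a minimal sketch of that idea. FileEntry, DuplicateGrouping and FindProbableDuplicates are hypothetical names used only for illustration, and FileEntry stands in for FileInfo so the sketch runs without touching the disk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the three FileInfo properties the listing uses.
public sealed class FileEntry
{
    public string Name { get; }
    public long Length { get; }
    public string FullName { get; }

    public FileEntry (string name, long length, string fullName)
    {
        Name = name;
        Length = length;
        FullName = fullName;
    }
}

public static class DuplicateGrouping
{
    // Groups entries by (Name, Length) and keeps only the groups with more
    // than one member. Each returned list already holds the suspected
    // copies, so no second pass over the full list is needed.
    public static List<List<FileEntry>> FindProbableDuplicates (IEnumerable<FileEntry> files)
    {
        return files
            .GroupBy (f => new { f.Name, f.Length })
            .Where (g => g.Count () > 1)
            .Select (g => g.ToList ())
            .ToList ();
    }
}
```

Applied to the listing above, this would mean iterating filter itself in the inner loop instead of building items from files. On a drive with many groups that may shorten the logging step, though I have not measured it.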

From the console application I first call the ListDrive method, then the ListDuplicates method. I don’t claim it is the best or most elegant way, but it quickly served my needs. The whole process took around 31 seconds on a 500 GB HDD with over 6600 duplicates: 6 seconds to compile the list and 25 to create the log. All of that in less than 100 lines of code.
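
Matching on name and length only finds probable duplicates; two files can share both and still differ in content. One way to confirm a group is to compare content hashes. The following is a minimal sketch under that assumption, not part of the application above. DuplicateConfirmation, HashFile and ConfirmByContent are hypothetical names, and SHA-256 is just one reasonable choice of hash.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateConfirmation
{
    // Returns the SHA-256 digest of the file's content as a string
    // of hex pairs separated by dashes.
    public static string HashFile (string path)
    {
        using (var sha = SHA256.Create ())
        using (var stream = File.OpenRead (path))
        {
            return BitConverter.ToString (sha.ComputeHash (stream));
        }
    }

    // Takes the paths of one group of suspected duplicates and returns
    // the sets of paths whose content hashes match. Files whose content
    // is unique within the group drop out.
    public static List<List<string>> ConfirmByContent (IEnumerable<string> suspectedPaths)
    {
        return suspectedPaths
            .GroupBy (path => HashFile (path))
            .Where (g => g.Count () > 1)
            .Select (g => g.ToList ())
            .ToList ();
    }
}
```

Hashing reads every suspected file completely, so it would add noticeably to the run time on a large drive. Running it only on the groups that already match by name and length keeps that cost down.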

4 thoughts on “Search for duplicated files using C# and LINQ”

  1. Mike P says:

    Interesting

  3. Kevin Sheth says:

    This is very nice and concise. I wrote something a little more verbose that also does MD5 hashing to confirm the duplicates. I’ve also attempted a WPF GUI, but a lot of work is still needed. Check it out at https://github.com/kns98/ndupfinder
