Flat folder to folder comparison with several files of same name, size and date.

  • Altair8801
    Visitor
    • May 2009
    • 3

    Hello world!

    I have been a Beyond Compare 3 addict for the past six months. It all started with the need for a software application designed specifically to compare files. You see, I had this heavily loaded collection of all sorts of data that has accumulated over the past five years or so. A great part of it comes from the Web, a.k.a. the Internet, a.k.a. the Net. As such, there are plenty of duplicates of both audio files and other file types, as well as several unwanted versions of the same files and information. But it also contains virtually all my old backup copies of my personal and important information, which is irreplaceable. In the past I used to make backups on CD and later on DVD discs. I don't know the exact figures, because I forgot to note them down when I began this project. But this massive collection is on the gigabyte scale, that's for sure. We are talking about hundreds of gigabytes here; I would guess somewhere between 300 and 400 GB at least.

    Since the collection as a whole contains both useful and useless data, deleting it altogether and starting all over was out of the question. So I really had to dig into it, clean up and sort things out to get even close to an overview of it. I am usually an orderly person, and this will to sort these things out kind of shows it, doesn't it? I am also a collector, so I like to keep good old things, mostly because of the memories associated with them, and this is kind of reflected in my computer use as well. As such a person I have worked out a pretty damn good system for organizing my digital assets (data). But sometimes even I make mistakes, and other times I just neglect to keep up with the organization process. Add time to that, plus the deep and complex directory/folder structure I usually use, and you end up with a pretty messy, nested folder structure with no end to it and with no logic or overview. In the end you don't go through those folders, you just search for the files you need.

    This is where comparison software could come in handy. I was about to dig into this data mess, and while at it I wanted to get rid of the duplicates for good. Besides saving me disk space, that would also save me hundreds of hours of organizing files that I have already organized once and have a record of. No comparison software could do this whole job for me automatically, and even if there were such software I would be afraid to use it on sensitive data. However, it would be of great help to use comparison software to at least partially automate this long, never-ending process. (I have already messed up some of the less important data solely by using Beyond Compare.)

    Now it was official. I started planning how to perform this trick in the most practical way and in the shortest time possible. I also started to hunt for comparison software. One of my main requirements was that this software would need to have a GUI and be able to compare not only one file to another but also folders and whole sets of folders to other folders. So using traditional command-line diff tools was not an option. The Wiki articles "File comparison" and "Comparison of file comparison tools" were of great help here. That's how I found out about Beyond Compare. Other programs that stuck in my mind are Araxis Merge, ExamDiff Pro, Kompare, WinDiff, and WinMerge. Some of these are proprietary like Beyond Compare, and some are freeware or OSS. I am a fan of OSS, but I prefer what best conforms to my needs. So in the end I chose to go with Beyond Compare. But prior to that I also tested some of the above-mentioned tools, and I read some serious reviews of them and of Beyond Compare 3. I don't see Beyond Compare 3 as the winner among them, but it sure did stand the test and it has become one of my favorite tools now.

    I am still working on this project, and I am using Beyond Compare 3 (BC3). I estimate that I have done about two thirds (2/3) of the work, leaving 1/3 remaining. I have read most of the stuff written in the Help file for BC3, and by experimenting with the procedures described there I have come to understand it (so please no RTFM here, lol). So I am pretty good with BC3 now. But I just can't figure out how to do the following.

    There is a folder called Pictures which is located in the parent folder temp at %UserProfile%\Desktop\. It contains 1798 files and 147 sub-folders, totaling 203 MB. The files are mostly pictures, with very few exceptions. I would like to figure out which of these files I already have in my main pictures folder called Pictures (Windows special folder) at %UserProfile%. I would like to delete those that I already have from the %UserProfile%\Desktop\temp\Pictures location, keep those that I don't have, and then move them to the main Pictures folder. How do I do it?

    I know this sounds simple, and I have done similar things many times before, but the problem is that the main Pictures folder at %UserProfile% contains thousands of files that have the same file name and extension but are located in different sub-folders. This seems to give BC3 headaches. When I do a folder to folder comparison here, these are the settings I usually use:
    • Show same
    • Ignore folder structure (flatten)
    • Compare file size: yes
    • Compare timestamps: yes
    • 0 second tolerance
    • Ignore daylight saving difference (1 hour): no
    • Ignore timezone difference: no
    • Compare file attributes: archive, system, hidden, read-only
    • Compare contents: CRC comparison
    • Override quick test results: yes


    As the left folder I chose %UserProfile%\Desktop\temp\Pictures, and as the right folder I chose %UserProfile%\Pictures.

    The results?

    No results! No equal files found!

    Actually, BC3 did help me find at least half of the equal files using these same settings, but only up to some point. I think that when BC3 started finding more than two files of the same name, extension, size and date, it kind of hit a dead end. That's my interpretation of BC3's behavior. Two equal files are fine, it shows them. But if more than two files in a row are equal, it kind of doesn't know what to do, so it gives you nothing. It seems as if BC3 is relying too much on the names of the files.

    I am using Windows Vista as my primary OS, so I opened two instances of Windows Explorer. In one of them I chose to show the new advanced search features, and in the other I just showed the Pictures folder I want to compare. File by file, I took the file name, size, and date and entered them into the advanced search options. This way I was able to find out which files are duplicates so I could delete them. This proves that there are for sure duplicates of files in these two folders, but it takes so much time to find them this way, one by one. Why can't Beyond Compare 3 help me here?
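
    The manual Explorer search described above (look up each file by name, size, and date, regardless of which sub-folder it sits in) can be sketched as a small script. This is only a minimal sketch, assuming Python 3 is available; the function names are hypothetical and have nothing to do with Beyond Compare. It also assumes that names should match case-insensitively (as on Windows) and that timestamps are compared at whole-second precision, mirroring the 0 second tolerance setting.

```python
import os
from collections import defaultdict


def index_by_metadata(root):
    """Map (lowercased name, size, whole-second mtime) -> list of paths under root.

    The folder structure is ignored, so files with the same name in
    different sub-folders end up in the same list.
    """
    index = defaultdict(list)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            info = os.stat(path)
            # Assumption: case-insensitive names, timestamps truncated to seconds.
            key = (name.lower(), info.st_size, int(info.st_mtime))
            index[key].append(path)
    return index


def split_by_metadata(left_root, right_root):
    """Return (matched, unmatched) for the files under left_root.

    matched maps each left path to every right path sharing its key;
    unmatched lists the left paths that have no counterpart on the right.
    """
    right_index = index_by_metadata(right_root)
    matched, unmatched = {}, []
    for key, left_paths in index_by_metadata(left_root).items():
        for path in left_paths:
            if key in right_index:
                matched[path] = list(right_index[key])
            else:
                unmatched.append(path)
    return matched, unmatched
```

    Calling split_by_metadata with the temp Pictures folder as the left root and the main Pictures folder as the right root would list every candidate at once, including cases with three or more files of the same name. A metadata match is only a hint, not proof that the contents are identical, so nothing should be deleted on this basis alone.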

    I have made a few modifications to the settings in BC3, but they didn't help me find any duplicates. What do I do to make this right so that BC3 can help me find these files?

    I love BC3, but on this point it has disappointed me. I would really appreciate it if anyone could help me figure out how to do this comparison in BC3, in case I am missing something. If this is something that BC3 lacks in functionality, I hope to see it included in the future.

    Thanks in advance!
  • Michael Bulgrien
    Carpal Tunnel
    • Oct 2007
    • 1772

    #2
    Sorry, I didn't read your book... I just skim-read what caught my eye... Beyond Compare matches/aligns files by filename regardless of how you've configured your compare settings. So a flattened folder view will not identify duplicate files if they have different names. When I am looking for duplicate files, I sort by CRC and manually look for consecutive files with identical CRC values. I don't know if it is possible to generate a report that contains CRC values... I haven't tried it...
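
    The "sort by CRC and look for consecutive files with identical CRC values" idea can also be sketched outside of BC. This is a minimal sketch, assuming Python 3 and its standard zlib module; the function names are made up for illustration, and it does not reproduce anything BC does internally.

```python
import os
import zlib
from itertools import groupby


def crc32_of_file(path, chunk_size=1 << 20):
    """Compute the CRC-32 of a file by reading it in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def files_sorted_by_crc(*roots):
    """Return (crc, size, path) rows for every file under the roots, sorted by CRC."""
    rows = []
    for root in roots:
        for folder, _subdirs, files in os.walk(root):
            for name in files:
                path = os.path.join(folder, name)
                rows.append((crc32_of_file(path), os.path.getsize(path), path))
    rows.sort()
    return rows


def runs_of_equal_crc(rows):
    """Yield lists of paths that are adjacent in the sorted rows and share CRC and size."""
    for _key, run in groupby(rows, key=lambda row: (row[0], row[1])):
        paths = [path for _crc, _size, path in run]
        if len(paths) > 1:
            yield paths
```

    Each yielded list is one run of consecutive rows with the same CRC and size, whatever the file names are. CRC-32 values can collide, so a run is a strong hint rather than a guarantee of identical content.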
    BC v4.0.7 build 19761
    ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯


    • Altair8801
      Visitor
      • May 2009
      • 3

      #3
      Originally posted by Michael Bulgrien
      Sorry, I didn't read your book... I just skim-read what caught my eye... Beyond Compare matches/aligns files by filename regardless of how you've configured your compare settings. So a flattened folder view will not identify duplicate files if they have different names. When I am looking for duplicate files, I sort by CRC and manually look for consecutive files with identical CRC values. I don't know if it is possible to generate a report that contains CRC values... I haven't tried it...
      Thank you for your answer! Yeah, that's just me! I don't like to leave out the details.

      Yeah, I read about that somewhere, that's probably what makes it difficult to find duplicates. Is it possible to tell BC3 to stop matching/aligning the files by filename? That should do the trick. Well, at least half the trick anyway.

      For the rest of it, I think there has to be something more to this that needs to be done so that BC3 will find the rest of the duplicates. Because, as I explained above (if you had read it), I would estimate that about 95% of these duplicates have the same name, the same size, and even the same date and timestamp in both locations. But in cases where this holds true, there usually seems to be more than just one duplicate of the same file. There are sometimes two duplicates (three identical files in total) or more. And apart from the first part discussed above with filename matching, this seems to be one more point where BC3 bites the dust.

      I just came to think of one important thing that needs to be said about BC3. BC3 is essentially a file comparison software! As such, it is not primarily designed to hunt for duplicates of files, no matter if it's one, two, three or more duplicates.

      For finding duplicate files there are very specific software packages available that focus solely on that. One of them is appropriately called Duplicate File Hunter and is running on my desktop right now as I type. I just let it hunt for duplicate files in the two directory locations for about 9 minutes and it found no less than 1810 duplicates across these two locations and 49 groups (whatever those are). It's a very simple application, occupying no more than 380 KB of disk space (with the configuration file), and it requires no installation. I am using version 1.4, which dates back to November 2007. That's quite amazing for such an old application as far as compatibility is concerned, because I'm running it on a Vista 64-bit machine with no issues. Thanks go to its creator, Alexander G. Styopkin. On the toolbar it has only four buttons: search, delete, print, and help, and it has no traditional menu. Sadly, the results are displayed as a list with the file order number, file name, path, size, and crc32 columns.

      1810 duplicates found?! Take that BC3! That not only equals but it also exceeds the 1782 I originally had in the Pictures folder I wanted to compare to my main Pictures folder. This means it found duplicates within my main Pictures folder as well. Duplicate File Hunter (DFH) seems to be a great tool for finding duplicates, while BC3 is not essentially a duplicate file hunter. BC3 sort of requires you to know the locations of the files you want to compare, in order to figure out if they are exactly the same or not, i.e. if they are duplicates. DFH on the other hand does not have the extensive control and options found in BC3.

      These "find my file duplicates" kinds of tools, of which there are plenty on the Web, are usually not sufficient for this work. They are sure handy for small file collections, but on the big scale, when you're talking hundreds of gigabytes, they fall behind. That's why I focused on finding a file comparison tool instead of a duplicate file finder in the first place. That's why I chose BC3 in the first place. But I didn't expect it to be this insufficient for finding file duplicates, which it could and should handle quite flawlessly.

      I think that allowing the user to set the options to ignore file name and file name matching would give better response when hunting duplicates. I understand that BC3 is essentially a file comparer and not file duplicate hunter, but finding duplicates should be an interest for BC3 users and it's something that's within BC3's scope of application/functionality. I can confirm that the DFH software I mentioned does not care about the file names, it seems to focus only on the CRC values.
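
      To make the "ignore file names, look only at content" idea concrete for the two folders in question, here is a minimal sketch, assuming Python 3. All names in it (plan_cleanup, temp_root, main_root, and so on) are hypothetical, and it is not how BC3 or DFH work internally. It groups the main folder by file size, checks CRC-32 only where sizes match, and confirms each hit byte by byte with the standard filecmp module before calling it a duplicate. It only builds two lists and never deletes or moves anything itself.

```python
import filecmp
import os
import zlib
from collections import defaultdict


def file_crc32(path, chunk_size=1 << 20):
    """Compute the CRC-32 of a file by reading it in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def walk_files(root):
    """Yield the path of every file under root, ignoring folder structure."""
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            yield os.path.join(folder, name)


def plan_cleanup(temp_root, main_root):
    """Split the files under temp_root into (already_have, missing).

    already_have: content exists somewhere under main_root (any name, any folder).
    missing: no file with identical content was found under main_root.
    """
    main_by_size = defaultdict(list)
    for path in walk_files(main_root):
        main_by_size[os.path.getsize(path)].append(path)

    crc_cache = {}

    def cached_crc(path):
        if path not in crc_cache:
            crc_cache[path] = file_crc32(path)
        return crc_cache[path]

    already_have, missing = [], []
    for path in walk_files(temp_root):
        candidates = main_by_size.get(os.path.getsize(path), [])
        is_duplicate = any(
            cached_crc(other) == cached_crc(path)
            and filecmp.cmp(path, other, shallow=False)  # byte-by-byte confirmation
            for other in candidates
        )
        (already_have if is_duplicate else missing).append(path)
    return already_have, missing
```

      The first list would be the candidates for deletion from the temp folder and the second the files to move into the main Pictures folder. Reviewing both lists by hand before acting on them seems wise for data that cannot be replaced.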

      What about sorting the files by their CRC value? I mean, is it possible in BC3? I have already set the options to "compare contents: crc comparison", but it doesn't and could not have any effect if BC3 is persistently matching/aligning them by file name. Michael, could you explain your approach in more detail? Give me some details man, I love details. Details are what make up the overall picture. Could your approach be a possible solution in my case? I would like to try it before I give up on BC3.


      • Altair8801
        Visitor
        • May 2009
        • 3

        #4
        Why can't BC3 just match and then when done, just align the files by their respective CRC values?! End of story!

        I mean BC3 is already smart enough to calculate the CRC values. So why not make use of that information for the sake of comparing the files to each other with respect to their CRC values? Maybe some users would also like to be able to do the same but with other file information such as file size, or file date, et cetera. Why focus only on file name match and alignment when you can broaden the view and align files by virtually any file information that BC3 acquires? It's not like it's too much to ask, right?!...

        Maybe this is something to be seen in BC4? I mean, come on... what exactly do you base the phrase "beyond compare" on? Where is the "beyond"? BC3 is very much just "compare", with no or very little "beyond". That has to change for this name to be truthful.
        Last edited by Altair8801; 06-May-2009, 04:28 PM. Reason: Beyond Compare? Where is the "beyond"?


        • Zoë
          Team Scooter
          • Oct 2007
          • 2666

          #5
          You're not the first person (by a long shot) to ask that Beyond Compare support searching for duplicate files.

          Yes, Beyond Compare has a lot of features in common with duplicate file finders, but that doesn't mean we can just slap in an extra "Align by CRC" checkbox and call it a day. To do it well would require a new interface that supports lining up more than 2 files as a group, as well as new file operations designed to make corralling the duplicates easier. At least some of the existing utilities also support finding resized/rotated images, reencoded MP3s, etc, so it would take a lot of effort to actually compete with them.

          Finding duplicate files is on the wishlist, but there are other utilities out there that already fill that niche better than we could, and I personally think we're better off focusing elsewhere.
          Zoë P Scooter Software


          • Michael Bulgrien
            Carpal Tunnel
            • Oct 2007
            • 1772

            #6
            Originally posted by Craig
            there are other utilities out there that already fill that niche better than we could, and I personally think we're better off focusing elsewhere.
            If you need advanced dup finder functionality (rotated images, re-encoded sound files, and the like) then I agree that BC3 is not the right tool for the job and you're better off focusing elsewhere. But for true dupes (identical size and CRC) I believe it would be well worth the effort to implement. I don't want a separate app installed on my PC just for simple dup finding.
            BC v4.0.7 build 19761
            ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯


            • Michael Bulgrien
              Carpal Tunnel
              • Oct 2007
              • 1772

              #7
              Originally posted by Altair8801
              Michael, could you explain your approach in more detail? ... Could your approach be a possible solution in my case? I would like to try it before I give up on BC3.
              If you already have a dup finder installed, then I wouldn't bother trying to find dups with BC3 until it becomes a supported feature. Many BC users (myself included) have BC running on company hardware. Many companies (including mine) frown on users installing miscellaneous apps (like dup finders) if they are not on the approved software list. Rather than break company policy and install unapproved software, I use BC3 and manually sort through the files looking at CRC values. Not something I would recommend if you have over 1800 duplicate files on your system...
              BC v4.0.7 build 19761
              ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
