Find duplicate files with different names

  • frankcollins
    New User
    • Jan 2010
    • 1

    Find duplicate files with different names

    Hello all,

    We have bc3 here at my job. We have recently received a large number (90,000) of contracts in pdf. In viewing the contracts we realized that many are duplicated over and over again with different file names. So what I need to do is compare the folder's contents against itself, based on the document's contents NOT the file name.

    Is this something bc3 can handle?

    Thanks!
    Frank
  • Chris
    Team Scooter
    • Oct 2007
    • 5538

    #2
    Sorry, BC3 doesn't provide duplicate file searching.
    Chris K Scooter Software

    • peterr
      Fanatic
      • Nov 2004
      • 142

      #3
      I was going to ask a similar question.

      I'm in the process of cleaning up a lot of old files, and the ones that are in my email client (KMail on Ubuntu/Linux) are in different folders, and have different filenames. KMail creates one file for every email message, so it is quite impossible to track down the duplicate email messages by hand.

      However, I came across a nice little program called 'fdupes'. It runs recursively through any path and computes an MD5 for each file; then it must sort by MD5, and it shows the (possible) duplicates. I have manually checked a few files that 'fdupes' has picked up, and they are in fact duplicates, and they have very different filenames.
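The idea described above (hash every file, then group files whose hashes match) can be sketched in a few lines of Python. This is only an illustration of the technique, not how 'fdupes' itself is implemented; the function names `file_md5` and `find_duplicate_groups` are made up for this example, and grouping by file size first is an added shortcut to avoid hashing files that cannot match.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root):
    """Walk root recursively and return lists of paths sharing size and MD5."""
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            by_hash[(size, file_md5(path))].append(path)

    # Keep only groups with at least two files; filenames play no role.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicate_groups("/path/to/mail")` returns one list per set of likely duplicates, which could then be checked pairwise in a file compare.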

      Now, it would be great if there was some way to feed the results of 'fdupes' into BC3 ??

      I know BC3 folder compare works on exact filename match, but the BC3 filename compare, where there are 2 filenames, ..hmm, can that be run in batch mode, and a list of files fed into it somehow ??

      Peter

      • peterr
        Fanatic
        • Nov 2004
        • 142

        #4
        Frank, if fdupes would help, and you run Windooze, see if 'fdupes' will run under CYGWIN - http://www.cygwin.com/

        But even if it can pick up all the duplicate PDFs, I'd still want to run the results through BC3 somehow, just to be 100% sure the files were real duplicates.
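For that final "100% sure" step, a byte-by-byte comparison settles the question regardless of which tool produced the candidates. A minimal sketch using Python's standard `filecmp` module follows; the helper name `confirm_identical` is hypothetical, and this is a stand-in for a binary compare, not a BC3 feature.

```python
import filecmp


def confirm_identical(paths):
    """Return True only if every file in paths matches the first byte for byte."""
    if len(paths) < 2:
        return True  # nothing to compare against
    first, *rest = paths
    # shallow=False forces a content comparison instead of trusting
    # size and modification time alone.
    return all(filecmp.cmp(first, other, shallow=False) for other in rest)
```

Only groups for which this returns `True` would be safe to thin out.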

        HTH

        Peter

        • chrroe
          Pooh-Bah
          • Oct 2007
          • 588

          #5
          One way to do some kind of duplicate file search is to activate the CRC column in the folder view. You can then sort by CRC values and manually look for duplicates. Duplicates can be compared via the context menu "Compare to... (F7)".
          When the files are spread over different folders you can use the "Flatten Folders" feature.


          Bye
          Christoph

          • peterr
            Fanatic
            • Nov 2004
            • 142

            #6
            Thanks for the tips; some features there that I didn't know existed. I selected a fairly large path, and then sorted by CRC, and there were quite a few where there were duplicate CRCs; for example, see the attachment.

            Although these 2 files were in the same path, there were others that were in different paths, and had different filenames, but the same CRC.

            Now, the big question. Can I be 100% certain that where the same CRC is shown for 2 different filenames, I can then safely delete one of them (assuming that is the objective, to get rid of duplicate files, where the filename and/or path are different)?

            I guess rather than look for duplicate CRC's manually, I could save the folder compare report as plain text, and then parse the file through a script, and look for the CRC (where it is the same as previous line).

            Christoph, your solution would help Frank, to find those duplicate PDF's, where the filename is different.

            Thanks,

            Peter
            Last edited by peterr; 06-Jan-2010, 05:34 AM.

            • peterr
              Fanatic
              • Nov 2004
              • 142

              #7
              Originally posted by peterr
              I guess rather than look for duplicate CRC's manually, I could save the folder compare report as plain text, and then parse the file through a script, and look for the CRC (where it is the same as previous line).
              The path isn't shown as a column, in the Folder Compare Report.

              • Erik
                Team Scooter
                • Oct 2007
                • 437

                #8
                The Folder Compare Report will be fixed to include the path column when appropriate in a future release (probably 3.2).
                Erik Scooter Software

                • chrroe
                  Pooh-Bah
                  • Oct 2007
                  • 588

                  #9
                  Now, the big question. Can I be 100% certain that where the same CRC is shown for 2 different filenames, I can then safely delete one of them (assuming that is the objective, to get rid of duplicate files, where the filename and/or path are different)?
                  Using CRC alone gives roughly 99.99999999% certainty that two files are the same. But when you also consider the filesize and date+time, as in your screenshot, then you can be pretty sure.
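That percentage applies to a single pair of files. With many files, the number of pairs grows quickly, so it is worth working the arithmetic. The sketch below assumes the CRC column is a 32-bit CRC and that CRC values of different files behave like uniform random numbers; both are assumptions, not something confirmed in this thread, and the function names are made up for the example.

```python
import math

CRC_BITS = 32          # assumption: the CRC column is a 32-bit CRC
SPACE = 2 ** CRC_BITS  # number of possible CRC values


def expected_false_pairs(n_files):
    """Expected number of pairs of *different* files sharing a CRC by chance."""
    pairs = n_files * (n_files - 1) / 2
    return pairs / SPACE


def probability_any_false_match(n_files):
    """Approximate chance of at least one accidental CRC match (Poisson estimate)."""
    return 1 - math.exp(-expected_false_pairs(n_files))


if __name__ == "__main__":
    n = 90_000  # the number of contracts mentioned in the first post
    print(f"expected accidental pairs: {expected_false_pairs(n):.2f}")
    print(f"chance of at least one:    {probability_any_false_match(n):.2f}")
```

Under those assumptions, 90,000 distinct files would be expected to produce about 0.94 accidental CRC matches, so a CRC match alone is a strong hint rather than proof. Requiring the same file size as well, or finishing with a binary compare, removes nearly all of that residual risk.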

                  Years ago some users suggested adding MD5 calculation alongside CRC. But this feature request seems to be too small an entry on the famous internal wishlist of Scooter Software.


                  Bye
                  Christoph

                  • Michael Bulgrien
                    Carpal Tunnel
                    • Oct 2007
                    • 1772

                    #10
                    I agree. Filesize plus CRC is sufficient to confirm duplicates.
                    BC v4.0.7 build 19761

                    • Zoë
                      Team Scooter
                      • Oct 2007
                      • 2666

                      #11
                      Originally posted by chrroe
                      But this feature request seems to be too small an entry on the famous internal wishlist of Scooter Software.
                      It would be less infamous if our customers would stop having new suggestions so we could get caught up.
                      Zoë P Scooter Software

                      • peterr
                        Fanatic
                        • Nov 2004
                        • 142

                        #12
                        Originally posted by Erik
                        The Folder Compare Report will be fixed to include the path column when appropriate in a future release (probably 3.2).
                        Okay, thanks, that is good news.

                        I wonder if a snapshot includes path name ?

                        Peter

                        • peterr
                          Fanatic
                          • Nov 2004
                          • 142

                          #13
                          Originally posted by chrroe
                          Using CRC alone gives roughly 99.99999999% certainty that two files are the same. But when you also consider the filesize and date+time, as in your screenshot, then you can be pretty sure.
                          Originally posted by Michael Bulgrien
                          I agree. Filesize plus CRC is sufficient to confirm duplicates.
                          That's good, thanks Christoph and Michael.

                          Peter

                          • aussieboykie
                            Expert
                            • Oct 2009
                            • 55

                            #14
                            A related question. Several times recently I've had occasion to want to check for missing files in a scenario where left folder contains a bunch of files with original names and right folder contains a subset of the contents of left folder with original names modified by prefix or suffix. For example, using prefix...

                            File1.txt == January-File1.txt
                            File2.txt == January-File2.txt
                            File3.txt
                            File4.txt
                            File5.txt == January-File5.txt

                            Is there a way of transforming names on one side or the other - e.g. to add or subtract a prefix/suffix? In this case, it would be a simpler approach than using CRC.
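Outside of BC3, the name transformation asked about here is simple to script. The sketch below strips an assumed prefix and/or suffix from the right-side names (before the extension) and reports which left-side names have no counterpart. The function names and the `January-` prefix are taken from or invented for this example only.

```python
import os


def strip_affixes(name, prefix="", suffix=""):
    """Remove a prefix and/or suffix from the part of name before the extension."""
    stem, ext = os.path.splitext(name)
    if prefix and stem.startswith(prefix):
        stem = stem[len(prefix):]
    if suffix and stem.endswith(suffix):
        stem = stem[:-len(suffix)]
    return stem + ext


def missing_on_right(left_names, right_names, prefix="", suffix=""):
    """Return left-side names with no right-side match after stripping affixes."""
    normalized = {strip_affixes(name, prefix, suffix) for name in right_names}
    return sorted(name for name in left_names if name not in normalized)


if __name__ == "__main__":
    left = ["File1.txt", "File2.txt", "File3.txt", "File4.txt", "File5.txt"]
    right = ["January-File1.txt", "January-File2.txt", "January-File5.txt"]
    print(missing_on_right(left, right, prefix="January-"))
    # prints ['File3.txt', 'File4.txt']
```

In practice the two lists would come from `os.listdir()` on the left and right folders.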

                            Regards, AB

                            • aussieboykie
                              Expert
                              • Oct 2009
                              • 55

                              #15
                              Having looked at what's possible with Session --> Folder Compare Report, the one column label that is not listed/selectable is Name, presumably because it is assumed that a name comparison is always required. If Name were added as an option, and could therefore be unchecked, it would become trivial to compare on CRC/Size/Modified.

                              Consider this duly suggested.

                              Regards, AB
