Find duplicate files with different names


frankcollins
05-Jan-2010, 11:33 AM
Hello all,

We have BC3 here at my job. We have recently received a large number (90,000) of contracts in PDF format. In viewing the contracts we realized that many are duplicated over and over again with different file names. So what I need to do is compare the folder's contents against itself, based on the document's contents NOT the file name.

Is this something bc3 can handle?

Thanks!
Frank

Chris
05-Jan-2010, 02:48 PM
Sorry, BC3 doesn't provide duplicate file searching.

peterr
06-Jan-2010, 12:09 AM
I was going to ask a similar question.

I'm in the process of cleaning up a lot of old files, and the ones that are in my email client (KMail on Ubuntu/Linux) are in different folders, and have different filenames. KMail creates one file for every email message, so it is quite impossible to track down the duplicate email messages.

However, I came across a nice little program called 'fdupes'. It runs recursively through any path and stores an MD5 for each file, then it must sort by MD5, and shows the (possible) duplicates. I have manually checked a few files that 'fdupes' has picked up, and they are in fact duplicates, and they have very different filenames.
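The hash-and-group idea Peter describes can be sketched in a few lines of Python. This is only an illustration of the general technique, not how fdupes is actually implemented; the function names `file_md5` and `find_duplicate_groups` are made up for this example. Files are grouped by size first, so only files that share a size get hashed.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root):
    """Group files under root by (size, MD5); return only groups of 2+ paths."""
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            groups[(size, file_md5(path))].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Each returned group is a list of paths whose size and MD5 agree, which makes them candidates for a final manual or byte-for-byte check.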

Now, it would be great if there was some way to feed the results of 'fdupes' into BC3 ??

I know BC3 folder compare works on exact filename match, but the BC3 filename compare, where there are 2 filenames, ..hmm, can that be run in batch mode, and a list of files fed into it somehow ??

Peter

peterr
06-Jan-2010, 12:27 AM
Frank, if fdupes would help, and you run Windooze, see if 'fdupes' will run under CYGWIN - http://www.cygwin.com/

But even if it can pick up all the duplicate PDFs, I'd still want to pass the results through BC3 somehow, just to be 100% sure the files were real duplicates.
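For the "100% sure" step, a byte-for-byte comparison removes any doubt left by a checksum match. As a minimal sketch outside BC3, Python's standard `filecmp` module can do this; the wrapper name `is_exact_duplicate` is just an illustrative choice.

```python
import filecmp


def is_exact_duplicate(path_a, path_b):
    """Return True only if both files contain exactly the same bytes."""
    # shallow=False forces a content comparison instead of trusting
    # the size and modification time reported by os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Running this over each candidate pair reported by a hash-based tool confirms the match before anything is deleted.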

HTH

Peter

chrroe
06-Jan-2010, 04:02 AM
One way to do some kind of duplicate file search is to activate the CRC column in the folder view. You can then sort by CRC values and manually look for duplicates. Duplicates can be compared via the context menu "Compare to... (F7)".
When the files are spread over different folders you can use the "Flatten Folders" feature.
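The same "flatten, then sort by CRC" workflow can be imitated outside BC3. The sketch below uses CRC-32 from Python's `zlib`; whether that matches the value shown in BC3's CRC column is an assumption, and the function names are invented for this example.

```python
import os
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file as an unsigned 32-bit integer."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def crc_sorted_listing(root):
    """Return (crc, size, path) rows for every file under root, sorted by CRC."""
    rows = []
    for folder, _subdirs, names in os.walk(root):  # walking the tree "flattens" it
        for name in names:
            path = os.path.join(folder, name)
            rows.append((file_crc32(path), os.path.getsize(path), path))
    return sorted(rows)
```

In the sorted listing, files with the same CRC and size end up on adjacent rows regardless of their folder or name.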


Bye
Christoph

peterr
06-Jan-2010, 05:31 AM
Thanks for the tips; some features there that I didn't know existed. I selected a fairly large path, and then sorted by CRC, and there were quite a few where there were duplicate CRCs; for example, see the attachment.

Although these 2 files were in the same path, there were others that were in different paths, and had different filenames, but the same CRC.

Now, the big question. Can I be 100% certain that where the same CRC is shown for 2 different filenames, I can then safely delete one of them (assuming that is the objective, to get rid of duplicate files, where the filename and/or path are different)?

I guess rather than look for duplicate CRC's manually, I could save the folder compare report as plain text, and then parse the file through a script, and look for the CRC (where it is the same as previous line).
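A script along those lines might look like the sketch below. The real layout of BC3's plain-text Folder Compare Report is not shown in this thread, so the tab-separated format and the column order (name, size, CRC) used here are purely hypothetical and would need adjusting. Grouping in a dictionary also means the report does not have to be sorted by CRC first.

```python
import csv
from collections import defaultdict


def duplicate_crcs(lines, name_column=0, size_column=1, crc_column=2):
    """Group report rows by (size, CRC) and return groups with 2+ names.

    The column positions are hypothetical defaults, not BC3's real layout.
    """
    needed = max(name_column, size_column, crc_column)
    groups = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= needed:
            continue  # skip headers, blank lines, and malformed rows
        groups[(row[size_column], row[crc_column])].append(row[name_column])
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Passing an open text file (or any list of lines) returns a mapping from each repeated (size, CRC) pair to the names that share it.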

Christoph, your solution would help Frank, to find those duplicate PDF's, where the filename is different.

Thanks,

Peter

peterr
06-Jan-2010, 06:00 AM
I guess rather than look for duplicate CRC's manually, I could save the folder compare report as plain text, and then parse the file through a script, and look for the CRC (where it is the same as previous line).


The path isn't shown as a column, in the Folder Compare Report. :(

Erik
06-Jan-2010, 09:05 AM
The Folder Compare Report will be fixed to include the path column when appropriate in a future release (probably 3.2).

chrroe
06-Jan-2010, 10:27 AM
Now, the big question. Can I be 100% certain that where the same CRC is shown for 2 different filenames, I can then safely delete one of them (assuming that is the objective, to get rid of duplicate files, where the filename and/or path are different)?

Using CRC alone gives roughly 99.99999999% certainty that the files are the same. But when you also consider the file size and date+time, as in your screenshot, then you can be pretty sure.
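That figure holds for a single pair of files: if the CRC is 32 bits wide (an assumption about BC3's CRC column) and its values are spread evenly, two different files share a CRC about once in 2^32 cases. With many files the number of pairs grows quickly, which is why also checking the size matters. The sketch below works through the arithmetic for the 90,000 files mentioned at the start of the thread; the function names are illustrative.

```python
import math


def expected_colliding_pairs(file_count, bits=32):
    """Expected number of accidental checksum matches among distinct files."""
    pairs = file_count * (file_count - 1) // 2  # every file against every other
    return pairs / 2 ** bits


def chance_of_any_collision(file_count, bits=32):
    """Approximate probability of at least one accidental match (Poisson estimate)."""
    return 1.0 - math.exp(-expected_colliding_pairs(file_count, bits))


if __name__ == "__main__":
    print(expected_colliding_pairs(90_000))
    print(chance_of_any_collision(90_000))
```

For 90,000 distinct files this gives an expected value of a little under one accidental CRC match overall, so a false positive somewhere in the set is quite plausible when CRC is used alone. Restricting the comparison to files of equal size leaves far fewer pairs to collide.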

Years ago some users suggested including MD5 calculation besides CRC. But this feature request seems to be too small an entry on the famous internal wishlist of Scootersoftware. :p


Bye
Christoph

Michael Bulgrien
06-Jan-2010, 10:33 AM
I agree. File size plus CRC is sufficient to establish that files are duplicates.

Craig
06-Jan-2010, 10:34 AM
But this feature request seems to be too small an entry on the famous internal wishlist of Scootersoftware.

It would be less infamous if our customers would stop having new suggestions so we could get caught up. ;)

peterr
06-Jan-2010, 06:01 PM
The Folder Compare Report will be fixed to include the path column when appropriate in a future release (probably 3.2).

Okay, thanks, that is good news. :)

I wonder if a snapshot includes path name ?

Peter

peterr
06-Jan-2010, 06:04 PM
Using CRC alone gives roughly 99.99999999% certainty that the files are the same. But when you also consider the file size and date+time, as in your screenshot, then you can be pretty sure.

I agree. File size plus CRC is sufficient to establish that files are duplicates.

That's good, thanks Christoph and Michael. :)

Peter

aussieboykie
19-Jan-2010, 08:39 PM
A related question. Several times recently I've had occasion to want to check for missing files in a scenario where left folder contains a bunch of files with original names and right folder contains a subset of the contents of left folder with original names modified by prefix or suffix. For example, using prefix...

File1.txt == January-File1.txt
File2.txt == January-File2.txt
File3.txt
File4.txt
File5.txt == January-File5.txt

Is there a way of transforming names on one side or the other - e.g. to add or subtract a prefix/suffix? In this case, it would be a simpler approach than using CRC.
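Outside BC3, the missing-file check for a constant prefix comes down to stripping the prefix from one side and comparing name sets. A minimal sketch, with `missing_on_right` as an invented helper name:

```python
def missing_on_right(left_names, right_names, prefix):
    """Return left-side names that have no prefixed counterpart on the right."""
    stripped = {
        name[len(prefix):] for name in right_names if name.startswith(prefix)
    }
    return sorted(name for name in left_names if name not in stripped)
```

With the example above, passing the five left names, the three `January-` names, and the prefix `"January-"` would report `File3.txt` and `File4.txt` as missing.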

Regards, AB

aussieboykie
19-Jan-2010, 08:58 PM
Having looked at what's possible with Session --> Folder Compare Report, the one column label that is not listed/selectable is Name, presumably because it is assumed that a name comparison is always required. If Name was added as an option, and could therefore be unchecked, it would become trivial to compare on CRC/Size/Modified.

Consider this duly suggested. :)

Regards, AB

peterr
20-Jan-2010, 02:46 AM
Is there a way of transforming names on one side or the other - e.g. to add or subtract a prefix/suffix? In this case, it would be a simpler approach than using CRC.


Easy with Linux, just parse through a folder/path, and add a prefix/suffix to the filename.

Not sure if windooze can do it though ??

Either the second or fourth post at http://www.linuxforums.org/forum/linux-newbie/48034-rename-bash-script-help.html , will do it.
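For a variant that runs on Windows as well as Linux, the rename can also be sketched in Python. This is not the script from the linked post; `add_prefix` and its `dry_run` flag are illustrative. It defaults to a dry run that only reports what would be renamed, and it never overwrites an existing file.

```python
from pathlib import Path


def add_prefix(folder, prefix, dry_run=True):
    """Plan (and optionally perform) renaming each file to prefix + name."""
    planned = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.name.startswith(prefix):
            continue  # skip subfolders and files that already carry the prefix
        target = path.with_name(prefix + path.name)
        if target.exists():
            continue  # never overwrite an existing file
        planned.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return planned
```

Calling it once with the default `dry_run=True` and checking the returned list before running it again with `dry_run=False` keeps the rename reversible by inspection.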

Michael Bulgrien
20-Jan-2010, 07:39 AM
Is there a way of transforming names on one side or the other - e.g. to add or subtract a prefix/suffix?

So long as the prefix is constant throughout the folder, you can use:
Session \ Session Settings... \ Misc tab \ Alignment overrides

Click New...
Align left file: *
with right file: January-*

If your prefix can differ, then you would need to code a smarter alignment override using a regular expression

aussieboykie
20-Jan-2010, 01:41 PM
If your prefix can differ, then you would need to code a smarter alignment override using a regular expression
What I actually have is a bunch of digital images in one folder and some of the same images in another folder, renamed with a shooting date/time prefix, e.g.

IMG_4634.JPG == 2010_01_10_19_38_58_IMG_4634.JPG
IMG_4635.JPG == 2010_01_10_19_38_58_IMG_4635.JPG
IMG_4636.JPG == 2010_01_10_19_38_58_IMG_4636.JPG
IMG_4637.JPG == (missing)
IMG_4638.JPG == 2010_01_10_19_41_01_IMG_4638.JPG
etc.

The prefix is a constant length, but variable text. My level of competence with regular expressions is, at best, embarrassing. If some kind soul could point me in the right direction I'd be grateful.

Regards, AB

Aaron
20-Jan-2010, 02:14 PM
A couple of notes: the Alignment Overrides feature is only in BC3 Pro.

It also will not help with your current issue. The Regular Expression can be useful in defining the specific, matching text, but cannot match on the different text. In this case, your prefix (2010_01_10_19...) must be explicit while IMG_4634 can be a regular expression.

In this case you would align:
(IMG_\d*\.jpg)
with 2010_01_10_19_38_58_$1

We do not currently allow masking on the matchTo side. This is on our Customer Wishlist.

Michael Bulgrien
20-Jan-2010, 02:39 PM
Try this:

Align left file: [0-9,_]+(IMG_\d\d\d\d.JPG)
with right file: $1

It should work if your date qualified files are on the left.

aussieboykie
20-Jan-2010, 03:19 PM
Try this:

Align left file: [0-9,_]+(IMG_\d\d\d\d.JPG)
with right file: $1

It should work if your date qualified files are on the left.

Woohoo! Many thanks Michael. I may even take the time to try to understand the regular expression. This is very useful.

Regards, AB :)

Michael Bulgrien
20-Jan-2010, 04:15 PM
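[0-9,_] defines a set of valid characters. The square brackets simply enclose the set. The actual characters are 0 through 9, the comma, and the underscore character. A comma inside a set is matched literally rather than acting as a separator, so it is harmless here but not needed. Since \d also represents a numeric digit, this could also have been written [\d,_] (or [\d_] without the comma).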

+ means that the prior item occurs one or more times. So we are including any combination of numeric digits and underscores at the beginning of the regular expression.

(IMG_\d\d\d\d.JPG) The four \d explicitly define four numeric digits. This could also have been written (IMG_\d+.JPG) to indicate one or more digits without limiting it to four, or (IMG_\d{4}.JPG) with {4} meaning repeat the prior item exactly 4 times.

(IMG_\d\d\d\d.JPG) The ( ) indicate that whatever matches the expression inside should be assigned to a variable to be used later.

$1 is the variable being used later. It contains what was matched in the ( ) on the other side. If you had more than one set of ( ), you would have more than one variable assigned: $2, $3, etc.
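The pieces explained above can be tried out in Python's `re` module, whose syntax for these constructs is the same. This is only a way to experiment with the pattern; BC3's own regular expression engine and the way it anchors the pattern to the filename are not described in this thread, so the use of `fullmatch` (the whole name must match) is an assumption.

```python
import re

# The left-side pattern exactly as posted in the thread.
LEFT_PATTERN = re.compile(r"[0-9,_]+(IMG_\d\d\d\d.JPG)")


def aligned_name(left_name):
    """Return the text captured as $1 (the right-side name), or None."""
    match = LEFT_PATTERN.fullmatch(left_name)
    return match.group(1) if match else None
```

For `2010_01_10_19_38_58_IMG_4634.JPG` the captured group is `IMG_4634.JPG`, while a bare `IMG_4637.JPG` does not match because the set must match at least one character before `IMG_`. Note that the unescaped `.` matches any character, so writing `\.JPG` would be stricter.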

aussieboykie
20-Jan-2010, 09:11 PM
(IMG_\d\d\d\d.JPG) The ( ) indicate that whatever matches the expression inside should be assigned to a variable to be used later.

$1 is the variable being used later. It contains what was matched in the ( ) on the other side. If you had more than one set of ( ), you would have more than one variable assigned: $2, $3, etc.
At the risk of pushing my luck beyond reasonable bounds... ;)

I actually have images from more than one camera, so some are IMG_xxxx, some are DSCxxxxx, and so on. I would therefore like to have more than one set of ( ) and matching $1, $2, $3, etc.. I understand the thrust of what you are saying but am unsure of precisely what to code for left and right to replace your earlier simple case.


Align left file: [0-9,_]+(IMG_\d\d\d\d.JPG)
with right file: $1


Regards, AB

Michael Bulgrien
21-Jan-2010, 03:14 AM
Sorry, you won't be able to capture and use more than one backreference in Beyond Compare for the purpose of aligning different kinds of files.

Simply create a separate alignment override definition for each file type and you're done.

Or you could do something like this:

Align left file: [0-9,_]+([A-Z]{3}_*\d+\.JPG)
with right file: $1

[A-Z]{3} Three alphabetic characters will match both IMG and DSC.
_* Using an * instead of a + means zero or more instances instead of one or more instances of the previous character. This, then, will recognize the _ in the IMG format but not require one for the DSC format.
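To see how this broader pattern pairs files from both cameras, it can be exercised in Python. As before, this is a sketch for experimenting rather than a description of BC3's internals: the `align` helper is invented, and matching the whole filename with `fullmatch` is an assumption.

```python
import re

# The generalized left-side pattern as posted above.
PATTERN = re.compile(r"[0-9,_]+([A-Z]{3}_*\d+\.JPG)")


def align(left_names, right_names):
    """Map each date-prefixed left name to its right-side match, or None."""
    right = set(right_names)
    pairs = {}
    for name in left_names:
        match = PATTERN.fullmatch(name)
        target = match.group(1) if match else None
        pairs[name] = target if target in right else None
    return pairs
```

Left names that map to `None` are the ones with no counterpart on the right, which is the "missing file" case from earlier in the thread.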

aussieboykie
21-Jan-2010, 01:54 PM
Excellent Michael. Thanks for the code and the clear explanation. Much appreciated.

Regards, AB

Aaron
22-Jan-2010, 10:26 AM
Thanks for the creative Regular Expression solution and clear explanation, Michael. :)

peterr
05-Jul-2010, 08:19 AM
Could this be added to the BC wishlist ? That is, find duplicate files in a folder comparison, regardless of filename, just match on file size and CRC.

btw, when I searched for 'crc' I kept getting "Sorry - no matches. Please try some different terms.", had to use Google which told me 122 hits ? Possibly strings of length 3 are ignored ?

Peter

Chris
06-Jul-2010, 05:49 PM
Finding duplicate files is still on our wish list, we do keep track of how often it is requested.

Yes, the forum software that we're using has a minimum of 4 characters for search terms. I sometimes use Google myself to search our forums. In Google I enter the term I'm searching for plus "site:http://www.scootersoftware.com/vbulletin/" to limit the search to our forums.

peterr
07-Jul-2010, 06:05 AM
Finding duplicate files is still on our wish list, we do keep track of how often it is requested.

Okay, so does that mean if I post a request here each day, it will get on the list quicker ? :D

Aaron
07-Jul-2010, 04:24 PM
Okay, so does that mean if I post a request here each day, it will get on the list quicker ? :D

I think we'll notice if it's just you. But we have had several users request this. It is something we would like to do, but we have several other large projects already scheduled and being worked on, so it is still on the Wishlist for now.