How to match similar filenames of exact files to compare.

  • How to match similar filenames of exact files to compare.

    I'm looking for the best approach to comparing files that I believe are identical but which have different filenames. Comparison tools like BeyondCompare are great, but they don't yet handle identical files with different filenames: when comparing files in separate folders, they only attempt comparisons between files that have the same name on either side.

    MindGems Fast Duplicate File Finder can match files that have different names, in any location across several folder trees, but I believe it is based on CRC checks. I am using this tool but am only gradually coming to trust it: no faults so far, but I don't trust it as much as BeyondCompare yet. BeyondCompare offers the complete peace of mind of doing a full binary compare on the file.
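
    For a scripted version of that reassurance, a byte-for-byte check can be sketched with Perl's core File::Compare module. This is only a minimal sketch, not what either tool does internally; the helper name same_content is made up for illustration.

```perl
use strict;
use warnings;
use File::Compare;   # core module; exports compare()

# Hypothetical helper: true if the two files are byte-for-byte identical,
# false if they differ; dies if either file cannot be read.
sub same_content {
    my ( $first, $second ) = @_;

    # compare() returns 0 for equal, 1 for different, -1 on error.
    my $result = compare( $first, $second );
    die "Cannot compare $first and $second: $!" if $result < 0;

    return $result == 0;
}
```

    Calling same_content() on each pair a duplicate finder reports would confirm the match without relying on the checksum alone.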

    In my case the files tend to have similar names, differing in word order, punctuation, case, and missing words. So it's not easy to match the files with the regex filter that Beyond Compare already provides, because the filename substrings can be out of order.

    I'm looking for a way to match similar filenames before renaming the files to be the same and then 'feeding' them to BeyondCompare. Solutions could be scripts or perhaps in the form of an application.

    At the moment I have an idea for an algorithm (to implement in Perl) that matches filenames which are similar in the way described above.

    Can you suggest something better or a completely different approach?
    1. Find a list of files with exactly the same file size.
    2. Make a hash of alphanumeric substrings from the first filename, using non-alphanumeric characters or spaces as delimiters.
    3. Make a hash of alphanumeric substrings from the second filename, using non-alphanumeric characters or spaces as delimiters.
    4. Match occurrences between the two hashes.
    5. Find which filename has the higher number of substrings.
    6. Calculate a percentage score for the pair: the number of matches divided by the higher number of substrings.
    7. Repeat the comparison for each file against every other file with the same file size.
    8. Sort the pair comparisons by percentage score to get suggestions of files to compare.
    9. Rename one file in the pair so that its name is the same as the other's. Place the two files in separate folders.
    10. Run a comparison tool like BeyondCompare on the files, in folder comparison mode.
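
    Steps 2 to 8 could be sketched in Perl roughly as follows. This is an illustration under assumptions rather than a finished tool: the function names are made up, substrings are lower-cased so that case differences still match, and the list passed to rank_pairs is assumed to already contain only filenames of same-sized files (step 1).

```perl
use strict;
use warnings;

# Steps 2 and 3: hash of the alphanumeric substrings of a filename.
# Lower-casing is an added assumption to absorb case differences.
sub substrings_of {
    my ($filename) = @_;
    my %substrings;
    $substrings{ lc $_ } = 1 for grep { length } split /[^A-Za-z0-9]+/, $filename;
    return \%substrings;
}

# Steps 4 to 6: shared substrings divided by the higher substring count,
# expressed as a percentage.
sub match_score {
    my ( $first, $second ) = @_;
    my $first_set    = substrings_of($first);
    my $second_set   = substrings_of($second);
    my $matches      = grep { exists $second_set->{$_} } keys %$first_set;
    my $first_count  = keys %$first_set;
    my $second_count = keys %$second_set;
    my $highest      = $first_count > $second_count ? $first_count : $second_count;
    return $highest ? 100 * $matches / $highest : 0;
}

# Steps 7 and 8: score every pair and sort from best to worst match.
# Each result is [ score, first filename, second filename ].
sub rank_pairs {
    my @filenames = @_;
    my @pairs;
    for my $i ( 0 .. $#filenames - 1 ) {
        for my $j ( $i + 1 .. $#filenames ) {
            push @pairs, [ match_score( @filenames[ $i, $j ] ), @filenames[ $i, $j ] ];
        }
    }
    return sort { $b->[0] <=> $a->[0] } @pairs;
}
```

    With this sketch, match_score("Holiday Photo - Paris (2011).jpg", "paris_holiday_photo.JPG") gives 80: four of the five substrings match. The extension counts as a substring here, which raises every score a little; stripping it first would be a reasonable variation.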
    Last edited by arjaydavis; 30-Jan-2012, 08:03 PM.

  • #2
    Thanks for the suggestion, arjay. Matching based on other criteria (size, crc, etc) is on our Customer Wishlist. I'll add your notes and algorithm to the wishlist entry. Thanks for the feedback.
    Aaron P Scooter Software


    • #3
      MY SOLUTION RIGHT HERE!

      I've written a Perl script that renames the pairs of duplicate files found by MindGems Fast Duplicate File Finder, working from its exported CSV report.

      With the files renamed, the two folders can be run through Beyond Compare (with flatten folder mode ON) for a reassuring full binary comparison.



      Code:
      #!/usr/bin/perl

      use strict;
      use warnings;

      use File::Basename;

      # fixed
      # put matching string - i.e. some or all of the path of the file to keep here,
      # e.g. C:\\files\\keep\\ or just keep (it is matched literally)
      my $subpathOfFileToKeep = "keep";
      # e.g. jpg mp3 pdf etc. (declared, but not used by the script yet)
      my $fileExtToCompare = "jpg";

      my $csvFile = "fast_duplicate_filefinder_export_as_csv.csv";

      # changes
      my $currentGroup       = undef;  # duplicate group currently being collected
      my $filenameToKeep     = "";     # base name of the "keep" copy in that group
      my @filesToRenameArray = ();     # full paths of the other copies in that group

      open( my $fdffCsv, '<', $csvFile ) or die "Cannot open $csvFile: $!";

      while ( my $line = <$fdffCsv> )
      {
        # data rows start with a numeric group index; skip the header and anything else
        my ( $group ) = $line =~ /^\s*(\d+)\s*,/ or next;

        # the full path is the first double-quoted field on the row
        my ( $filename ) = $line =~ /"([^"]+)"/ or next;

        # group changed: the previous group is complete, so rename its files
        if ( defined $currentGroup && $group != $currentGroup )
        {
          match_the_filenames();
        }
        $currentGroup = $group;

        store_keep_and_rename( $filename );
      }

      close( $fdffCsv );

      # rename the files of the last group
      match_the_filenames();

      # Remember the name of the copy to keep, or queue the file for renaming.
      sub store_keep_and_rename
      {
        my ( $filename ) = @_;

        # fileparse returns ( base name, directory, suffix );
        # the suffix pattern takes the last dot and everything after it
        my ( $name, $path, $extension ) = fileparse( $filename, qr/\.[^.]*/ );

        # \Q...\E matches the setting literally, so backslashes in it are safe
        if ( $path =~ /\Q$subpathOfFileToKeep\E/ )
        {
          $filenameToKeep = $name.$extension;
        }
        else
        {
          push( @filesToRenameArray, $filename );
        }
      }

      # Rename every queued file of the current group to the name of the kept copy.
      sub match_the_filenames
      {
        # without a "keep" copy in the group there is no target name
        if ( $filenameToKeep ne "" )
        {
          foreach my $preRename ( @filesToRenameArray )
          {
            my $prePath    = ( fileparse( $preRename ) )[1];
            my $postRename = $prePath.$filenameToKeep;

            next if $preRename eq $postRename;  # already has the right name

            print STDOUT "Filename was: ".$preRename."\n";
            print STDOUT "Filename will be: ".$postRename."\n\n";

            # do not overwrite a different existing file; names that differ
            # only in case are let through so the case can still be changed
            if ( lc $preRename ne lc $postRename && -e $postRename )
            {
              warn "Skipped, target already exists: $postRename\n";
              next;
            }

            rename( $preRename, $postRename )
              or warn "Could not rename $preRename: $!\n";
          }
        }

        # start the next group with an empty list and no name to keep
        @filesToRenameArray = ();
        $filenameToKeep     = "";
      }
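
      The script depends on two assumptions about the exported report: each data row starts with a numeric group index, and the full path is the first double-quoted field on the row. A minimal sketch of just that parsing step, using a made-up row (the real column layout of the export may differ) and a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: returns ( group index, full path ) for a data row,
# or an empty list for a header or any row that does not fit the assumptions.
sub parse_row {
    my ($line) = @_;
    my ($group) = $line =~ /^\s*(\d+)\s*,/ or return;
    my ($path)  = $line =~ /"([^"]+)"/     or return;
    return ( $group, $path );
}

# Made-up example row, for illustration only.
my ( $group, $path ) = parse_row('1,"C:\photos\keep\Paris 2011.jpg",102400');
print "group $group: $path\n";
```

      Printing the parsed group and path for a few rows before running the rename is a cheap way to check that the export really has this shape.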
