Compare files with partially matching file names?

  • Aaron
    Team Scooter
    • Oct 2007
    • 16000

    #16
    Arjay,

    With the original examples Michael provided, I created pairs of text files, renamed them to your original examples (alpha123.txt) and they worked fine, verifying Michael's suggestion.

    Please also note that Michael Bulgrien is not a Scooter Software employee. He is a fellow customer who is posting in this thread with suggestions for you to try.

    With the updated example of "[email protected], Aaron@beyond compare email.com, Compare files with partially matching file names.eml" I created a pair of files that match those names, and used Beyond Compare to find the cursor position in the filename (displayed in the bottom Text Compare status bar) to determine where the truncation is taking place. I then put the suggested Alignment Override into the Misc tab of the Session Settings and it also aligned the file examples without issues.

    The reason for the extra .* after the initial match is the regular expression must "match" on the entire left side. So the definition of:
    .*\.eml would match on the left file. In order to 'find' the truncated beginning, the part we want to align by for the second file, we split it up into two parts:
    (.{95})
    followed by
    .*

    This allows us to pick out the Right File's name, and continue matching on the entire Left File Name.
    The (parentheses) let us reference the text captured on the left again on the right. Since the truncated version in your example was 95 characters long (before the .eml), matching on the first 95 characters, followed by .eml, works. The regular expression for the right side is $1.eml: $1 (the first 95 characters captured on the left) followed by .eml. You then need to check the Regular Expression checkbox when defining (or editing) the Alignment Override.

    This way, the Left expression matches the File Name that would be found on the left, and aligns it with the File Name you expect on the right.
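
To illustrate the mechanics described above, here is a minimal Python sketch. It uses Python's `re` module as a stand-in for Beyond Compare's regular expression engine, which is an assumption (the two may differ in details). The function name `expected_right_name` and the sample file names are hypothetical and only serve the illustration.

```python
import re

def expected_right_name(left_name, n, ext=".eml"):
    # Mimics an override of the form (.{n}).*\.eml -> $1.eml.
    # The left pattern must match the ENTIRE left file name, hence fullmatch.
    pattern = r"(.{%d}).*%s" % (n, re.escape(ext))
    m = re.fullmatch(pattern, left_name)
    if m is None:
        return None
    # $1 in the override corresponds to group 1 here.
    return m.group(1) + ext

# Hypothetical names, truncated to 7 characters before the extension.
print(expected_right_name("alpha123456.eml", 7))  # alpha12.eml
print(expected_right_name("abc.eml", 7))          # None (fewer than 7 characters before .eml)
```

The second call shows why a left name shorter than the truncation length is simply not touched by the override.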

    Michael's suggestion relies on the logic that your truncation happens at a specific character count (7 characters based on your first example, 95 on your second). I tried counting the number of characters in your screenshot, and they appear to be more than 95. Is the truncation count a variable number?

    If you are still having trouble, I would suggest emailing us at [email protected] with:
    1) Your Support.zip (from the Help menu -> Support; Export)
    2) A link back to this forum post for reference
    3) Two Snapshot files generated from the Tools menu -> Save Snapshot. One for each side of the comparison.

    A Snapshot is a virtual directory that can be loaded in Beyond Compare and contains all of your file names, but not the directories. With this information we would have a complete comparison and could test a few different Alignment Override regular expressions to see what trouble you are running into and why it isn't working for you.
    Aaron P Scooter Software

    Comment

    • arjaydavis
      Expert
      • Oct 2010
      • 53

      #17
      A precise step by step guide please (here's what I did 1/3)

      What I need is a precise step by step guide as to how to achieve the partial filename comparison.

      Here is what I tried, step by step. Specifically: please advise on where I am going wrong and add and/or edit the steps I am making so that I can do the comparison.

      I've split it into 3 posts because your forum system doesn't allow more than 4 images per post.


      Part 1 of 3

      (Version of Beyond Compare 3.2.4 (build 13298))

      Steps:
      1) On Desktop create 2 folders:

      New Folder
      a@b New Text Document New Text Document New Text Document 0123456789.eml


      the file name length (without .eml extension) is 72 characters long


      New Folder (2)
      a@b New Text Document New Text Document New Text Document 0123.eml

      the file name length (without .eml extension) is 66 characters long


      The files are identical and contain New Text Document string

      1) Run Beyond Compare

      2) Select folder compare option



      3) Drag in the created folders so that beyond compare sees them and attempts the compare



      4) Select all contents (I'm guessing this is supposed to tell beyond compare what to operate on)



      5) Go to Session Settings in the Session drop-down menu

      Comment

      • arjaydavis
        Expert
        • Oct 2010
        • 53

        #18
        A precise step by step guide please (here's what I did 2/3)

        Part 2 of 3:


        6) Click the Misc tab and enter the regex in Alignment Override



        7) Click OK to see the regex session override entry



        8) Re-select all contents



        9) Do a full refresh from the Edit drop-down menu

        10) The files don't line up (and are still blue, not black). So try Update Session Defaults instead



        Still no luck.

        Comment

        • arjaydavis
          Expert
          • Oct 2010
          • 53

          #19
          A precise step by step guide please (here's what I did 3/3)

          Part 3 of 3:


          Now, here's what I would like to see (but don't):

          A mock up:




          Both identical files line up and are shown in black because they match. The regex in the session Alignment Override has allowed Beyond Compare to match the files partially by filename so that it can go on and do the full binary comparison.


          I hope what I am trying to do is now absolutely crystal clear!

          Please advise on what I need to do to achieve the comparison of files that have filenames that partially match.

          Comment

          • arjaydavis
            Expert
            • Oct 2010
            • 53

            #20
            P.S. In step 6) of Part 2 of 3 (Click the Misc tab and enter the regex in Alignment Override), I used a length to match of 65 because both filename strings should be identical for at least the first 65 characters in my example.

            I'm assuming I can stick with one value, i.e. use it throughout to compare lots of files without varying it for each situation: provided the first 'n' or so characters match in every case, I would always use n.

            Alas, I still need some advice to get the solution working, please insert/edit my steps.

            Comment

            • Aaron
              Team Scooter
              • Oct 2007
              • 16000

              #21
              Hello Arjay,

              1) Determine the number of characters in the shorter filename on the right:
              When I copy and paste "a@b New Text Document New Text Document New Text Document 0123.eml" into BC3, I find that I can get to the 67th position *including* the extension. Since ".eml" is explicitly defined in the regular expression, you want to count only the characters in the main part of the filename, up to the "3". If the "a" is position 1, then place the cursor before the "3" to find that position number in the Text Compare (this assumes there is no preceding whitespace).

              2) In the Folder Compare, Session Settings dialog, Misc tab, define the Alignment Override:
              (.{62}).*\.eml
              with
              $1.eml
              X Regular Expression
              Works with this pair of example files. If you replace 65 with 62, does that work for you?
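
The count can also be double-checked outside Beyond Compare. The following is a minimal Python sketch using the two file names from this thread. Python's `re` module stands in for Beyond Compare's regular expression engine here, which is an assumption, and the helper name `right_name_for` is hypothetical.

```python
import re

left = "a@b New Text Document New Text Document New Text Document 0123456789.eml"
right = "a@b New Text Document New Text Document New Text Document 0123.eml"

# Length of the shorter (right) name without its ".eml" extension.
stem_len = len(right[:-len(".eml")])
print(stem_len)  # 62

def right_name_for(n):
    # Mimic (.{n}).*\.eml -> $1.eml applied to the left file name.
    m = re.fullmatch(r"(.{%d}).*\.eml" % n, left)
    return m.group(1) + ".eml" if m else None

print(right_name_for(62) == right)  # True
print(right_name_for(65) == right)  # False: 65 characters reach past the "3"
```

With 65, the captured text already includes "456" from the longer name, so the expanded right-side name does not exist on the right.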

              If you are still having any trouble, please email us at [email protected] with:
              1) Your Support.zip (from the Help menu -> Support; Export)
              2) A link back to this forum post for reference
              3) Two Snapshot files generated from the Tools menu -> Save Snapshot. One for each side of the comparison.

              This will help us quickly compare using your settings on your specific folder structure and see if we can reproduce any of the trouble you have been running into.
              Aaron P Scooter Software

              Comment

              • Aaron
                Team Scooter
                • Oct 2007
                • 16000

                #22
                In addition, I work with 2 tabs in Beyond Compare. The Folder Compare tab and Text Compare tabs are both open.

                In the Folder Compare, I
                1) Rename command the file on the left. With all the text of the filename highlighted, I press Ctrl+C to copy to clipboard.
                2) In the Text Compare tab, I click on the left pane, go to the File menu -> Open Clipboard
                3) Rename command the file on the right. Ctrl+C the right file name.
                4) In the Text Compare tab, I click on the right pane, and go to the File menu -> Open Clipboard.

                This opens a Text Compare session comparing the two file names, quickly showing the differences and where they occur.
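
The same check (where do the two names stop agreeing?) can be sketched in Python as a rough cross-check. This is only an illustration with the example names from this thread, not something Beyond Compare does internally.

```python
import os.path

left = "a@b New Text Document New Text Document New Text Document 0123456789.eml"
right = "a@b New Text Document New Text Document New Text Document 0123.eml"

# commonprefix compares character by character (not path component by component).
prefix = os.path.commonprefix([left, right])
print(len(prefix))       # 62
print(repr(prefix[-4:]))  # '0123'
```

The length of the shared prefix is the number to use inside the {…} of the Alignment Override for this pair.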
                Aaron P Scooter Software

                Comment

                • arjaydavis
                  Expert
                  • Oct 2010
                  • 53

                  #23
                  bingo!

                  Originally posted by Aaron
                  Hello Arjay,

                  (.{62}).*\.eml
                  with
                  $1.eml
                  X Regular Expression
                  Works with this pair of example files. If you replace 65 with 62, does that work for you?
                  Yes, thanks, BUT you have to know the length of the filename to do the partial match (62 in this case). If I set it lower, e.g. 50, where at least the first 50 characters should match, it doesn't work; it only works with the precise length of the shorter of the two filenames being compared. You cannot say, for example, (.{62,}).*\.eml, meaning match 62 or more.

                  Obviously I want to automate the comparison as much as possible, hence the partial match, so having to know each filename length defeats the object; I might as well do a manual Compare To comparison.
                  Last edited by arjaydavis; 03-May-2011, 04:10 PM.

                  Comment

                  • arjaydavis
                    Expert
                    • Oct 2010
                    • 53

                    #24
                    Any thoughts on how I can get it to match "at least" a number of characters? Or "1 or more" successfully? The Beyond Compare regex reference defines these, but they don't work in this situation.

                    Comment

                    • Aaron
                      Team Scooter
                      • Oct 2007
                      • 16000

                      #25
                      Hello,

                      The {62,} would be a greedy expression, and would match more than you intend. Since it then matches more of the left file, past the truncation, there would be no match on the right side. Adding a {62,}? makes it Non-greedy, but it could then still match on more of the left line than you want.

                      The goal of the {62} is to make a regular expression that is the length of the truncation and matches the text on the right side, so that it works when it is transplanted onto the right side with $1.eml. In most truncation scenarios, there is a hard character count limit where truncation will occur. If your left file is smaller than that limit, then it theoretically was not truncated on the right, both file names are of equal size and equal text, and it should align without assistance of an alignment override.
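
A small Python sketch can make the difference visible for the example pair from this thread. Python's `re` module is used as a stand-in for Beyond Compare's engine, which is an assumption, so treat the output as an illustration of greedy matching in general rather than a statement about Beyond Compare. The helper name `right_name_for` is hypothetical.

```python
import re

left = "a@b New Text Document New Text Document New Text Document 0123456789.eml"
right = "a@b New Text Document New Text Document New Text Document 0123.eml"

def right_name_for(quantifier):
    # Build (.<quantifier>).*\.eml and return what $1.eml would expand to.
    m = re.fullmatch(r"(." + quantifier + r").*\.eml", left)
    return m.group(1) + ".eml" if m else None

print(right_name_for("{62}") == right)  # True: exactly the truncated length
print(right_name_for("{62,}") == left)  # True: the greedy capture takes the whole left name
print(right_name_for("{50}") == right)  # False: the expanded name is too short
```

In the greedy case the expanded right-side name is identical to the left name, so there is nothing on the right to align with. In the too-short case the expanded name is only 50 characters plus .eml, which also does not exist on the right.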

                      Could you go into more detail on your truncation and folders? Do you expect your pair of folders to have files of variable truncation on the right side? If so, what is causing the variable length of the file names?
                      Aaron P Scooter Software

                      Comment

                      • arjaydavis
                        Expert
                        • Oct 2010
                        • 53

                        #26
                        Originally posted by Aaron
                        In most truncation scenarios, there is a hard character count limit where truncation will occur.
                        This makes sense: the filename would have been truncated to fit within the UDF 1.02 length limit before burning to disc. So all the filenames that were originally longer than that limit should now be the same length, and a fixed value (e.g. 62 or 95, as we've used) ought to apply.

                        But the other problem is that if the truncation results in 2 files with the same name, then the truncation has to append a unique identifier to make the filename unique. So even if we know that the length is always going to be 62, for example, we should also expect that the last 1, 2 or maybe 3 characters in the truncated filename will differ from (i.e. not be seen in) the original longer filename.

                        So our regex needs to be modified to account for the unique identifier appended to the truncated filename. I will take a look at what is actually being appended and come back with more info later.

                        Comment

                        • arjaydavis
                          Expert
                          • Oct 2010
                          • 53

                          #27
                          partial match with unique identifier/index added

                          So I accept that the truncation will be fixed every time at a certain value, so...

                          A specific example following on from my last comment would be:
                          left hand side:
                          original non-truncated filenames:
                          a@b New Text Document New Text Document New Text Document really quite long part a.eml
                          a@b New Text Document New Text Document New Text Document really quite long part b.eml
                          a@b New Text Document New Text Document New Text Document really quite long part c.eml

                          right hand side:
                          truncated files, with an index appended to make them unique (so that the truncating rename is possible, i.e. doesn't clash with the name of an existing file when the rename is attempted):
                          a@b New Text Document New Text Document New Text Document1.eml
                          a@b New Text Document New Text Document New Text Document2.eml
                          a@b New Text Document New Text Document New Text Document3.eml

                          How would we modify the regex so that the partial match was possible here?
                          Last edited by arjaydavis; 04-May-2011, 09:55 AM.

                          Comment

                          • Aaron
                            Team Scooter
                            • Oct 2007
                            • 16000

                            #28
                            Hello,

                            Unfortunately, this is the scenario I was worried about in my earlier reply to this forum thread. If
                            "a@b New Text Document New Text Document New Text Document really quite long part a.eml"
                            shortened to
                            "a@b New Text Document New Text Document New Text Documenta.eml"
                            Then we could work on a regular expression to catch this.

                            If it shortens to:
                            "a@b New Text Document New Text Document New Text Document1.eml"
                            then the "1" is different text that must be explicitly defined.
                            The right side would be:
                            $1\1.eml
                            or
                            $1\2.eml

                            With multiple possible matches, it is not guaranteed that part a will align with 1; it may align to 2 or 3 since the "a" is not a part of the regular expression on the left. You could alter the regular expression to match:
                            (.{57}).*a\.eml
                            to
                            $1\1.eml

                            This assumes the "a" is literal. Is there an "a", or something similar, that could be defined and matched on with the Left regular expression?
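
The ambiguity can be illustrated with a minimal Python sketch using the three example names from this thread. Python's `re` module is a stand-in for Beyond Compare's engine (an assumption), and Python writes the replacement $1\1.eml as \g<1>1.eml. The helper name `expand` is hypothetical.

```python
import re

base = "a@b New Text Document New Text Document New Text Document really quite long part "
lefts = [base + "a.eml", base + "b.eml", base + "c.eml"]

def expand(pattern, template, name):
    # Return the right-side name the override would produce, or None if no match.
    m = re.fullmatch(pattern, name)
    return m.expand(template) if m else None

# Without a literal suffix, all three left names map to the same right-side name.
generic = {expand(r"(.{57}).*\.eml", r"\g<1>1.eml", n) for n in lefts}
print(len(generic))  # 1

# With the literal "a" in the left pattern, only "part a" matches.
hits = [expand(r"(.{57}).*a\.eml", r"\g<1>1.eml", n) for n in lefts]
print(hits[0])   # a@b New Text Document New Text Document New Text Document1.eml
print(hits[1:])  # [None, None]
```

This is why something literal on the left (the "a" here) is needed to tie one specific long name to one specific index on the right.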
                            Aaron P Scooter Software

                            Comment

                            • arjaydavis
                              Expert
                              • Oct 2010
                              • 53

                              #29
                              Originally posted by Aaron
                              This assumes the "a" is literal. Is there an "a", or something similar, that could be defined and matched on with the Left regular expression?
                              The appended value is variable so we are going to be unlucky with your suggestion as you feared.

                              However I had about 60 partially matched files to compare and I counted the fixed truncated length that they were at and came up with the following regex setup, based on the suggestion already made here:

                              left hand:
                              (.{123}).*\.eml

                              right hand:
                              $1.eml

                              This matched the files where the name on one side was a pure partial match of the name on the other.

                              The remaining unmatched files were those with a unique value appended to the truncated version, as discussed in my last comment.

                              I can test these for being identical (and therefore purge them) using MindGems Fast Duplicate File Finder, which I also have a license for.

                              I put both folders into the program: the first is the folder whose contents I want to keep intact, and the second is the folder in which I want to purge the duplicates, leaving only files that are not present in the first folder, which I will want to merge in.

                              I make sure the first folder stays intact by right-clicking on it and, in the pop-up, selecting disable auto-scan for this folder, so that the program doesn't remove files from that folder but from the other one, which I want to purge.
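
For readers without such a tool, the idea of matching by content instead of by name can be sketched in Python. This is a minimal illustration assuming that "duplicate" means byte-identical; it is not how Fast Duplicate File Finder or Beyond Compare work internally, and the function names `file_digest` and `duplicates_in` are hypothetical. The sketch only reports duplicates and deletes nothing.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    # Hash in chunks so large files need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicates_in(purge_dir, keep_dir):
    # Files under purge_dir whose bytes also exist in some file under keep_dir.
    kept = {file_digest(p) for p in Path(keep_dir).rglob("*") if p.is_file()}
    return sorted(p for p in Path(purge_dir).rglob("*")
                  if p.is_file() and file_digest(p) in kept)
```

Calling `duplicates_in(second_folder, first_folder)` would list the files in the second folder that are safe to purge under this assumption, regardless of how their names were truncated.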

                              Another preventative measure is to turn off Unicode when burning the DVD, so that single bytes are used for filenames. This doubles the available length, reducing or eliminating truncation. It is fine if the filenames are ASCII only, and it still complies with the UDF standard, because (I believe, from reading the ImgBurn forums) the standard actually has a bit that indicates whether Unicode is in use.

                              Another measure is to write a script that uniquely truncates the source files before burning. This can get complicated if the files are referenced from other files, because the references would be broken. For me this may not be an issue, because my files tend to be standalone .eml records of important or memorable emails I want to keep.
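
A minimal sketch of such a truncation step in Python might look like the following. The function name `truncate_unique` and the `limit` value are hypothetical (the real limit depends on the disc format and burning software), and the sketch only computes a name mapping; it does not rename anything on disk.

```python
import os.path

def truncate_unique(names, limit):
    # Map each name to one whose stem (the part before the extension) is at most
    # `limit` characters, replacing the tail with a counter when two truncated
    # names would collide. Assumes `limit` is larger than the counter's width.
    used = set()
    mapping = {}
    for name in names:
        stem, ext = os.path.splitext(name)
        candidate = stem[:limit] + ext
        counter = 1
        while candidate in used:
            suffix = str(counter)
            candidate = stem[:max(limit - len(suffix), 0)] + suffix + ext
            counter += 1
        used.add(candidate)
        mapping[name] = candidate
    return mapping

print(truncate_unique(["abcdefgh1.eml", "abcdefgh2.eml", "abc.eml"], 5))
# {'abcdefgh1.eml': 'abcde.eml', 'abcdefgh2.eml': 'abcd1.eml', 'abc.eml': 'abc.eml'}
```

Keeping the returned mapping (for example saved alongside the burned files) would make it possible to pair each truncated name with its original later, instead of relying on a regular expression.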



                              The best solution would be for Beyond Compare to have a more advanced file selection 'engine', perhaps in a future BC4 release, that can operate on partial matches, with the option for the user to click a button to cycle through the candidates when a single match cannot be determined. This could be coupled with a selection precedence system in which file size can take precedence over filename when finding a match (I've discussed this as a wish in another thread).

                              Therefore it's good that the Beyond Compare featureset hasn't plateaued and that there are opportunities for more releases - and revenue streams for you.

                              In the meantime, there is no silver bullet for my need, I would think, but there are several partial solutions.

                              I've outlined them here too:
                              http://superuser.com/questions/27840...tools-software

                              Comment

                              • Aaron
                                Team Scooter
                                • Oct 2007
                                • 16000

                                #30
                                Hello,

                                Thanks for the detailed summary and suggestions. We do have a content alignment method on our Customer Wishlist. I've added your current workflow as a workcase example. Thanks for all the details.
                                Aaron P Scooter Software

                                Comment
