An Apples to Oranges Comparison?

  • a_bc_user
    Journeyman
    • Jun 2009
    • 11

    How can BC, alone or together with a third-party utility or some Windows tweaking, run a Text-to-Folder session?

    My current understanding is that BC does Text file to Text file, or Folder to Folder, but NOT Text file to Folder sessions. I have a text file that contains a large list of phrases, each on its own line (e.g. titles of famous paintings and songs that I would like to obtain images and tracks of). I also have a huge folder in Windows with many such media files I have already obtained, sorted into deep levels of subfolders. I'd like to know which of the items in my list I have already obtained (the filenames usually contain some part of the phrase), so I can remove them from my list and not go after them again. Maybe there's some batch search utility I don't know about, but I'm trying to do this with BC since BC goes "beyond" comparisons.

    The search would have to be approximate (like Google/Windows Search), not exact, as the filenames in the folders are close to but not the same as the items in my list file. I noticed the Text comparison logic BC3 uses is good enough for that. Any ideas? I may sound like a power user, but I'm not, so I'd appreciate details. Thanks...
    Last edited by a_bc_user; 17-Jan-2010, 05:48 AM.
  • tlscales
    Expert
    • Oct 2007
    • 74

    #2
    A quick and dirty way would be to open up a command prompt, go to the folder you are interested in, and enter this command:

    dir > filenames.txt

    This writes a listing of everything in the folder to "filenames.txt", one entry per line (along with dates and sizes), which you can then compare to your existing file of phrases. This ought to be at least close to what you are looking for.


    • a_bc_user
      Journeyman
      • Jun 2009
      • 11

      #3
      Thanks tlscales, I think we're making progress with this approach, though it looks like it will be more of a slow and dirty way for me! I redirected the output into a text file with the syntax you provided, adding /s after the dir command to drill into subdirectories. I then loaded the two text files in a BC Text Compare session. It seems I have to tweak some BC settings for this to be useful: right now I'm getting all red text on both sides, with 1-3 coincidental letters matched in black rather than whole words or phrases. How can I change this? Also, I don't know if it matters for the comparison, but there's a lot of useless information in the dir output that has nothing to do with filenames (directory names, sizes, spaces, etc.).

      Left side example (my small list of items I am looking for):

      Carl Orff - O Fortuna
      Leonardo Davinci - Mona Lisa
      [ETC...]

      Right side example (directory output):

      Volume in drive E is LACIE-NTFS
      Volume Serial Number is XXXX-XXXX

      Directory of E:\MYDOCU~1\MYMEDI~1

      17-Jan-10 10:00 PM <DIR> .
      17-Jan-10 10:00 PM <DIR> ..
      23-Jun-09 03:30 PM <DIR> Paintings
      17-Jan-10 10:00 PM 0 output.txt
      14-Jan-10 02:48 PM <DIR> Operas
      1 File(s) 0 bytes

      Directory of E:\MYDOCU~1\MYMEDI~1\Paintings

      23-Jun-09 03:30 PM <DIR> .
      23-Jun-09 03:30 PM <DIR> ..
      05-Sep-08 12:24 AM 1,000,000 01-Henri Matisse Odalisque Watercolor.jpg
      05-Sep-08 12:24 AM 1,000,000 02-Mona Lisa and other Louvre Works.jpg
      2 File(s) 2,000,000 bytes

      Directory of E:\MYDOCU~1\MYMEDI~1\Operas

      23-Jun-09 03:30 PM <DIR> .
      23-Jun-09 03:30 PM <DIR> ..
      05-Sep-08 12:24 AM 1,000,000 8-Fortuna Imperatrix Mundi.wav
      05-Sep-08 12:24 AM 2,000,000 Fiddler on the Roof.wav
      [ETC...]

      So in this example I'd need BC to align the lines containing "Mona Lisa" and "Fortuna" on both sides on the same line and mark them accordingly as a match.

      Or a completely different approach to knowing which items in the left are already contained in the right (some kind of search maybe?).


      • Chris
        Team Scooter
        • Oct 2007
        • 5538

        #4
        If the file you're comparing the directory listing against only has filenames in it, using "DIR /B > filenames.txt" might help. /B is for bare, it gives a directory listing with only filenames.

        You can list all command line switches supported by DIR by entering DIR /? on the command line.

        If the names in your list aren't in alphabetical order, you might also want to use the "Sorted" file format to sort the filenames before they are compared.
        Chris K Scooter Software


        • a_bc_user
          Journeyman
          • Jun 2009
          • 11

          #5
          Ahh, you're bringing good ol' DOS back to me after 15 years. Good of MS to keep "cmd" in Windows 7, along with the nifty Windows PowerShell that understands Unix and DOS commands.

          So, the best I could do with the Dir command attributes is:
          dir /a:-D /B /O:N /s > list.txt
          The resulting file drills into subdirectories and lists files only with no directories, sorted alphabetically.

          Still, I'm not getting just the filename to be useful when looking at it in BC... the full path of every file is listed on each line, which is quite verbose as these 1000s of files go about 7 directory levels deep.

          Example: E:\Dir\Dir1\Dir1A\Dir1Aa\file.wav

          Also, the files are not sorted alphabetically as a whole: they are sorted only relative to the other files in their own subdirectory, and the order resets with each new subdirectory.

          If there isn't some syntax (or script) that makes the DOS command leave out the path, maybe there's a rule in BC to cut out everything before the last "\" on each line?
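
For anyone who would rather script this than fight with dir switches, here is a minimal Python sketch of the same idea: walk the folder tree, keep only the bare file names, and sort them across all subdirectories at once. The function names are made up for illustration, and it assumes Python is installed, which nobody in this thread has said.

```python
import os


def list_bare_filenames(root):
    """Return every file name under root, without its path, sorted as one list."""
    names = []
    # os.walk visits root and all of its subdirectories, however deep.
    for _dirpath, _dirnames, filenames in os.walk(root):
        names.extend(filenames)
    # Case-insensitive sort over the whole tree, not per subdirectory.
    return sorted(names, key=str.lower)


def write_listing(root, out_path):
    """Write one bare file name per line to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for name in list_bare_filenames(root):
            out.write(name + "\n")
```

Calling write_listing(r"E:\Dir", "list.txt") would produce a file with one name per line and no paths, which can then be loaded on the right side of a Text Compare session.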


          • Chris
            Team Scooter
            • Oct 2007
            • 5538

            #6
            It is possible to mark some of the path unimportant in BC3. See the following link for instructions: http://www.scootersoftware.com/suppo...mportantv3.php
            Chris K Scooter Software


            • Zoë
              Team Scooter
              • Oct 2007
              • 2666

              #7
              If you want to stick with a purely command-line script, I'd suggest passing your file through sed to strip off the paths, then using the Windows "sort" command to sort the resulting file. A plain text editor that supports search and replace using regular expressions would work too.

              The regular expressions to use would be:

              Find: .*\\([^\\]+)$
              Replace with: $1

              I tested it with EditPad Pro and it handled the replacement and has a sort command built-in.
              Zoë P Scooter Software
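
As an illustration of what that find/replace pair does, here is a small Python sketch that applies the same pattern to each line and then sorts the result. Python writes the backreference as \1 rather than $1; the helper names are hypothetical.

```python
import re

# Same pattern as in the post above: consume everything up to the last
# backslash, then capture the rest of the line (the bare file name).
PATH_PREFIX = re.compile(r".*\\([^\\]+)$")


def strip_path(line):
    """Return the line with any leading directory path removed."""
    return PATH_PREFIX.sub(r"\1", line.rstrip("\r\n"))


def strip_and_sort(lines):
    """Strip paths from all non-blank lines and sort case-insensitively."""
    names = [strip_path(line) for line in lines if line.strip()]
    return sorted(names, key=str.lower)
```

With the example path from earlier in the thread, strip_path(r"E:\Dir\Dir1\Dir1A\Dir1Aa\file.wav") gives file.wav. Lines that contain no backslash are left unchanged.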


              • a_bc_user
                Journeyman
                • Jun 2009
                • 11

                #8
                OK, I've tried both approaches and decided to strip the paths first and then load the file into BC. The regular expression provided worked beautifully. Of course there are always some unexpected leftovers that need stripping: some filenames are preceded by track numbers, and I had trouble forming the right [1-9][0-9] syntax for those.
                I'll keep learning about regex.
                Perhaps even without sorting it is good enough now after paths have been stripped to look at in BC... at least all the filenames fit on the right screen.

                In BC, the compare is still turning up a lot of red, as it is trying to match individual letters (shown in black) rather than whole words. Is there some way to make it match whole words only, using the Grammar or another regular expression within BC?


                • Aaron
                  Team Scooter
                  • Oct 2007
                  • 15997

                  #9
                  Could you give some specific examples? You may just need to tweak your Alignment settings to avoid mismatches, such as disabling Align Similar or enabling Never Align Mismatches.
                  Aaron P Scooter Software


                  • a_bc_user
                    Journeyman
                    • Jun 2009
                    • 11

                    #10
                    Sure,

                    Left side (list of items I am looking for):

                    Blah
                    Carl Orff - O Fortuna
                    Blah Blah
                    Leonardo Davinci - Mona Lisa
                    Blah Blah Blah
                    [ETC...]

                    Right side (actual directory output I am looking in):

                    22-Mona Lisa and other Louvre Works.jpg
                    Mumbo
                    Mumbo Jumbo
                    8-Fortuna Imperatrix Mundi.wav
                    Coco Mumbo Jumbo
                    [ETC...]

                    What is currently happening is:
                    - The text on both sides is all mainly red.
                    - The lines on both sides are aligned when just 1 or 2 letters match.
                    - Those corresponding letters are in black.
                    I think the "l" in the first left Blah would be black as is the first "L" in 22-Mona Lisa and other Louvre Works.jpg, but it may be that this is only marked as a match when the letters occupy the same position in the phrase.

                    What should happen is:
                    - The text on both sides is all mainly red
                    - The lines on both sides are aligned when 1 complete word matches (e.g., "Mona" or "Fortuna"), irrespective of the position of the letters on that line. That word is marked black. Something similar to:

                    Left:
                    Carl Orff - O Fortuna
                    Leonardo Davinci - Mona Lisa
                    Blah
                    Blah Blah
                    Blah Blah Blah

                    Right:
                    8-Fortuna (Imperatrix Mundi).wav
                    22-Mona Lisa Displayed at the Louvre.jpg
                    Mumbo
                    Mumbo Jumbo
                    Coco Mumbo Jumbo

                    Basically, I just want the simplest way to catch which titles I already have corresponding files for.
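
The "match on one complete word" behaviour described above can also be sketched outside BC. The Python below is only a rough illustration, not something BC does internally: it treats two lines as a match when they share at least one significant word. The stopword list and the three-letter minimum are arbitrary assumptions you would tune for your own list.

```python
import os
import re

# Hypothetical list of words too common to count as a match.
STOPWORDS = {"and", "the", "other"}


def significant_words(text):
    """Lower-cased alphabetic words of 3+ letters, minus stopwords."""
    found = re.findall(r"[^\W\d_]+", text.lower())
    return {w for w in found if len(w) >= 3 and w not in STOPWORDS}


def find_matches(titles, filenames):
    """Map each title to the file names sharing at least one significant word."""
    matches = {}
    for title in titles:
        title_words = significant_words(title)
        matches[title] = [
            name for name in filenames
            # Drop the extension so "wav" or "jpg" never counts as a word.
            if title_words & significant_words(os.path.splitext(name)[0])
        ]
    return matches
```

Run against the sample lines from this post, "Carl Orff - O Fortuna" pairs with "8-Fortuna Imperatrix Mundi.wav", "Leonardo Davinci - Mona Lisa" pairs with "22-Mona Lisa and other Louvre Works.jpg", and "Blah" gets an empty list. Titles with an empty list are the ones still missing from the folder.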


                    • Aaron
                      Team Scooter
                      • Oct 2007
                      • 15997

                      #11
                      Hello,

                      Disable Align Similar lines. Also, create a grammar element for ^\d+-, which will match on the number at the beginning of the line. You can then mark this grammar element as Unimportant.

                      If you can define grammar elements for any Unimportant sections, you can then enable Never Align Mismatches, which will only take Important sections into account.

                      Also, try the Alternate method of Alignment to see if it provides better results.
                      Aaron P Scooter Software
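
To see what the ^\d+- expression covers, here is a tiny Python sketch that applies the same pattern outside BC. It removes the prefix rather than marking it Unimportant, so it is an illustration of the pattern only; the function name is made up.

```python
import re

# Same expression as the suggested grammar element: one or more digits
# at the start of the line, followed by a hyphen.
TRACK_PREFIX = re.compile(r"^\d+-")


def strip_track_number(name):
    """Remove a leading track number such as '8-' or '01-' if present."""
    return TRACK_PREFIX.sub("", name)
```

Applied to "8-Fortuna Imperatrix Mundi.wav" this leaves "Fortuna Imperatrix Mundi.wav", while a name with no leading number, such as "Fiddler on the Roof.wav", passes through untouched.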


                      • a_bc_user
                        Journeyman
                        • Jun 2009
                        • 11

                        #12
                        Thanks to all who got me to this stage. It's helped me a lot. I admit what I'm looking for is idealistic (a multiple simultaneous file search in one shot through BC, with quick, meaningful results), but with your assistance I'm now at least 70% there. Kudos to this great product...


                        • Aaron
                          Team Scooter
                          • Oct 2007
                          • 15997

                          #13
                          Thanks.

                          If you can pre-process your data with a command line helper program, you can automatically call this helper program each time the files are opened:
                          http://www.scootersoftware.com/suppo...rnalconversion
                          Aaron P Scooter Software
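
A helper of that kind might look roughly like the Python sketch below. It assumes the helper is started with two arguments, the path of the file to read and the path of the file to write; how BC actually passes those paths is covered in the linked article, so treat the calling convention and the script name clean_listing.py as assumptions.

```python
import re
import sys


def clean_listing(lines):
    """Strip paths and leading track numbers, drop blank lines, then sort."""
    cleaned = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        name = line.rsplit("\\", 1)[-1]    # keep only the text after the last backslash
        name = re.sub(r"^\d+-", "", name)  # drop a leading track number like "8-"
        cleaned.append(name)
    return sorted(cleaned, key=str.lower)


def main(source, target):
    """Read the raw listing from source and write the cleaned one to target."""
    with open(source, encoding="utf-8", errors="replace") as f:
        result = clean_listing(f)
    with open(target, "w", encoding="utf-8") as f:
        f.write("\n".join(result) + "\n")


# Assumed invocation: python clean_listing.py SOURCE TARGET
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

The original file is never modified; the cleaned copy is what would be shown in the comparison.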
