ignoring word wrapping

  • jon bondy
    Visitor
    • Mar 2015
    • 7

    I would like the following two paragraphs to be seen as matching. Is there any way to achieve this?

    one two three four
    five six seven
    eight nine

    one two three
    four five six
    seven eight
    nine
  • Aaron
    Team Scooter
    • Oct 2007
    • 15941

    #2
    Hello,

    BC4 always determines line breaks to be important. The current workaround for this is to perform an external conversion that normalizes where these breaks occur. We have a few for download for specific formats like Java and HTML on our website. We also have a general guide for setting up any command line tidy app and plugging it into Beyond Compare:
    http://www.scootersoftware.com/suppo...rnalconversion

    Comparing across line breaks is on our wishlist. What file format type are you comparing?
    Aaron P Scooter Software
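
To illustrate the kind of normalization such a conversion could perform, here is a minimal sketch in Python. It is not one of the conversions offered for download; the function name `unwrap_paragraphs` and the rule that paragraphs are separated by blank lines are assumptions made for this example.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join the lines of each paragraph into a single line.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # str.split() with no argument collapses every run of whitespace,
    # including the line breaks inside a paragraph.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


# The two paragraphs from the first post, wrapped differently.
a = "one two three four\nfive six seven\neight nine\n"
b = "one two three\nfour five six\nseven eight\nnine\n"

print(unwrap_paragraphs(a) == unwrap_paragraphs(b))  # prints True
```

After this step both inputs become the single line "one two three four five six seven eight nine", so a line-based comparison would see them as equal.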


    • jon bondy
      Visitor
      • Mar 2015
      • 7

      #3
      Thanks!

      We are comparing HTML that has already been processed (by a plugin) to strip away everything except the underlying text. It seems that merging paragraph text into a single line would best be done within the plugin, since I hope that it will know what is a paragraph and what is not.

      Another approach would be to make the entire page a single line of text, but that would make it difficult to identify exactly how the observed differences corresponded to the original page.
      Last edited by jon bondy; 30-Mar-2015, 10:16 AM.


      • Aaron
        Team Scooter
        • Oct 2007
        • 15941

        #4
        Hello,

        We have an HTML Tidy format available in the "Alternatives" section of our file formats, here:
        http://www.scootersoftware.com/downl...kb_moreformats

        Once installed, go to the Tools menu -> File Formats and move it above or below the default HTML format in the list. Whichever is top-most is the one used automatically when scanning or opening HTML files. The other can be selected manually using the Formats dropdown menu on the toolbar or in the Session Settings.

        Please note that if you save the file, it will be saved with the new line breaks, as is.

        If you do find yourself with a single, long line that contains multiple differences, Ctrl+Shift+N will go to the Next Difference within a line.
        Aaron P Scooter Software


        • jon bondy
          Visitor
          • Mar 2015
          • 7

          #5
          Thanks. I have no idea how to install the HTML Tidy software. When I go to your link, and then the next link, and then the next link, I end up at a page that does not contain anything useful to download. Can you provide more explicit instructions please?

          My original problem seems to originate in the HTML To Text plugin which takes the original HTML with text on a single line and re-formats it so that the text wraps after 80 characters or so. While this does make things pretty, it also creates a huge number of artificial differences in the text, given BC's notion that a new line is significant. Is there any way to "fix" the HTML To Text plugin, or to make the wrapping optional?

          Thanks, again


          • Aaron
            Team Scooter
            • Oct 2007
            • 15941

            #6
            Hello,

            That would depend on the exact version of the format you are using, and if the title is exactly "HTML to Text". We have several "HTML" file format variants, and each formats differently. If you go to the above Alternatives link, then click Windows, then search "HTML" you should see several results. These include HTML Tables, HTML tidied, and HTML to Text.

            Note that HTML to Text is not included in the default installation. It must have been downloaded and installed from this website, or deployed by your IT department. The default format is "HTML" and is also in the list of file formats. You can switch from "HTML to Text" or disable it by unchecking it in the Tools menu -> File Formats dialog, which will then allow the default "HTML" behavior, which will not wrap at 80.

            HTML Tidied would reformat your HTML into unified line breaks. This also includes the HTML code (which HTML to Text removes). Did you need the code removed from the view?
            If you are still having trouble, it may be best to email your current settings (Help menu -> Support; Export) to [email protected]
            Please include a link back to this original forum thread, and an example file if you could.
            Aaron P Scooter Software


            • jon bondy
              Visitor
              • Mar 2015
              • 7

              #7
              The plugin I am using is exactly "HTML to Text" and I did download and install it.

              The default HTML behavior is not useful because I do not care whether the font changed; I only care if the visible text changed. With full HTML comparison, there are even MORE false differences shown.

              So HTML Tidied tidies everything up (nice) but includes the HTML (not helpful), while HTML to Text removes the HTML (nice) but inserts line breaks (not helpful)?

              Seems like nothing that you offer will help me. It is hard to believe that I am the first person in a decade to want this.

              I noticed that it is possible to attach an external text processor. Are there any tech notes about how to write one of these? Is it just a command line program with two parameters (input file and output file)?

              Thanks!


              • Aaron
                Team Scooter
                • Oct 2007
                • 15941

                #8
                Hello,

                If you could email your BCsupport.zip and sample files, I could verify it is set up correctly. For example, if the characters per line limit on the File Format's Conversion tab has been set to 80, this would explain the behavior you are hitting. The default value is 4096.

                And correct: the External program on the File Format's Conversion tab takes an input file and an output file (*.txt). We then open the output.txt file.
                Aaron P Scooter Software
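
As a rough sketch of what such an external program could look like, the Python script below reads an input file and writes an output file with each paragraph joined into one line. The script name `unwrap.py`, the argument order (input first, output second) and the UTF-8 encoding are assumptions for this example; the external conversion guide linked earlier in the thread is the place to check how the two paths are actually passed.

```python
import sys
from pathlib import Path


def convert(src: Path, dst: Path) -> None:
    """Write src to dst with every paragraph joined into a single line."""
    text = src.read_text(encoding="utf-8", errors="replace")
    paragraphs = []
    current = []  # words of the paragraph being collected
    for line in text.splitlines():
        if line.strip():
            current.extend(line.split())
        elif current:
            # A blank line ends the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    dst.write_text("\n\n".join(paragraphs) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Assumed invocation: python unwrap.py INPUT OUTPUT
    if len(sys.argv) == 3:
        convert(Path(sys.argv[1]), Path(sys.argv[2]))
    else:
        print("usage: unwrap.py INPUT OUTPUT", file=sys.stderr)
```

Because the output keeps one line per paragraph instead of one line per page, differences still map back to a recognizable part of the original text.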


                • Aaron
                  Team Scooter
                  • Oct 2007
                  • 15941

                  #9
                  Thanks for the sample files. With these text blocks, it does look like the downloadable HTML to Text format will introduce line breaks, presumably for readability. There don't appear to be any command line parameters to control this behavior either. If you can find another HTML2Text utility that produces content closer to what you would like to compare, I would recommend plugging it in as the External Conversion.

                  The External Conversion command line can run a program as if it were from the Windows Command Line, given an input.html file and an output.txt file as our parameters. We have documentation on this setup (using .resx as the example file type) here:
                  http://www.scootersoftware.com/suppo...rnalconversion
                  Aaron P Scooter Software
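
For anyone who wants to experiment, here is a minimal sketch of an HTML-to-text step that never wraps text: it emits one line per block element using only Python's standard `html.parser` module. This is not the downloadable HTML to Text format and not the plugin written later in this thread. The names `TextExtractor` and `html_to_text` and the choice of block tags are assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a line of output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Content of these tags is never visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []      # finished output lines
        self.words = []      # words of the line being built
        self.skip_depth = 0  # greater than 0 while inside script/style

    def flush(self):
        if self.words:
            self.lines.append(" ".join(self.words))
            self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self.skip_depth:
            # Splitting on whitespace discards line breaks in the source.
            self.words.extend(data.split())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.lines)
```

One known limitation of this sketch: inline tags are treated as word boundaries, so `foo<b>bar</b>` comes out as "foo bar" rather than "foobar".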


                  • jon bondy
                    Visitor
                    • Mar 2015
                    • 7

                    #10
                    I wrote my own plugin. Thanks


                    • Aaron
                      Team Scooter
                      • Oct 2007
                      • 15941

                      #11
                      Great to hear. If your plug-in is general purpose, please feel free to email it to us with a link back to this forum thread. We'll look it over and may be able to host it on the website.
                      Aaron P Scooter Software


                      • jon bondy
                        Visitor
                        • Mar 2015
                        • 7

                        #12
                        It took an hour to write. I am sure it is within the capabilities of Scooter Software.
