Comparing text files with more than 65535 characters per line

**Aaron** · 17-Aug-2011, 10:40 AM

Max. characters per line

BC3's current limit is 65536 characters per line. We cannot compare longer lines without wrapping them. If you can, set a lower limit so that lines wrap automatically at specific points, or use a pre-processing conversion utility to cut your lines at specific intervals shorter than the maximum line length.

We have a KB article that goes into more detail on pre-processing:
http://www.scootersoftware.com/suppo...rnalconversion
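
As a rough illustration of the "cut your lines at specific intervals" idea, here is a minimal Python sketch. It is not taken from the KB article: the names `wrap_fixed`, `convert_file` and `MAX_WIDTH`, the width of 60000 and the UTF-8 encoding are all assumptions made for this example.

```python
# Assumed chunk width; any value below the 65536-character limit would do.
MAX_WIDTH = 60000


def wrap_fixed(text, width=MAX_WIDTH):
    """Split every line of `text` into pieces of at most `width` characters."""
    pieces = []
    for line in text.split("\n"):
        if not line:
            pieces.append("")  # keep empty lines (and a trailing newline) intact
            continue
        pieces.extend(line[i:i + width] for i in range(0, len(line), width))
    return "\n".join(pieces)


def convert_file(src_path, dst_path, width=MAX_WIDTH):
    """Hypothetical helper: write a wrapped copy of src_path to dst_path."""
    # UTF-8 is an assumption; use whatever encoding the files really have.
    with open(src_path, encoding="utf-8") as src:
        wrapped = wrap_fixed(src.read(), width)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(wrapped)
```

Removing the inserted line breaks gives back the original text of a single-line file, so the wrapped copy is only meant as a temporary view for the comparison, not as a replacement for the source file.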

**MiroJ** · 16-May-2013, 10:16 AM

Hi Aaron,

The suggested preprocessing sounds interesting, but it is a workaround.

Problem:
Consider two XML files, A and B, each with approx. 1 MB of content and no line breaks, which differ only in 3 extra consecutive blanks in file “A”. Blanks are configured as an unimportant difference, but BC presents those 2 files as having plenty of important differences!

I suppose, and suggest, that this could be solved programmatically in BC.

Proposal:
If you look at the differences, then you first see the 3 unimportant spaces in the middle of the first differing line in A. But this line is also marked as having an important difference! The “important” difference is exactly 3 “extra” characters (letters) in the line from file B. Exactly these 3 characters at the beginning of the fictive “next line” of A are then claimed to be an important difference to B.

This scenario, with 3 characters on each fictive BC line, then repeats until the end of the real line or EOF.

Generally speaking, if the lines of A and B differ in length by X characters before BC breaks the line, and A has the X extra characters, then B shows exactly X extra characters of important difference at the end of the first fictive BC line. At the beginning of the next fictive line, A shows a difference in exactly the same string of length X as at the end of the previous fictive line. And so it propagates until the end of the line. Just give it a try in BC to get a better picture of this.
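
The shifting effect described above can be reproduced outside BC with a small Python sketch. This only imitates fixed-width wrapping and is not BC's actual algorithm; the sample data, the width of 20 (standing in for the real limit) and the function names are made up for the illustration.

```python
from itertools import zip_longest


def wrap_fixed(line, width):
    """Cut one long line into consecutive pieces of `width` characters."""
    return [line[i:i + width] for i in range(0, len(line), width)]


def differing_chunks(a, b, width):
    """Return the indices of the fixed-width chunks that are not identical."""
    pairs = zip_longest(wrap_fixed(a, width), wrap_fixed(b, width), fillvalue="")
    return [n for n, (x, y) in enumerate(pairs) if x != y]


# Made-up sample data: B is one unbroken line, A has 3 extra blanks in it.
WIDTH = 20  # stands in for the real 65536-character limit
line_b = "".join(f"<e{n:03d}/>" for n in range(20))
line_a = line_b[:28] + "   " + line_b[28:]

chunks_a = wrap_fixed(line_a, WIDTH)
chunks_b = wrap_fixed(line_b, WIDTH)

# Every chunk from the one containing the blanks onwards is reported as different.
print(differing_chunks(line_a, line_b, WIDTH))
# The 3 characters ending a chunk of B are the ones starting the next chunk of A.
print(repr(chunks_b[1][-3:]), repr(chunks_a[2][:3]))
```

Only the second chunk actually contains the blanks; all later chunks differ purely because A is shifted by 3 characters relative to B.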

Would it be possible to:
a) compare the whole lines in one piece? If not, then to
b) break the “shorter” line of B X characters sooner than the line of A? After all, you know the lines are the same up to the fictive line end minus X!
c) distinguish between fictive and real line breaks? I hope you are not limited by a 16-bit int…
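
To make idea b) more concrete, here is one possible sketch in Python. It is purely hypothetical and says nothing about how BC works or could implement this: `aligned_chunks` and `strip_ws` are names invented for the example, and the resynchronisation simply reuses `difflib.SequenceMatcher` from the standard library. When two chunks differ, both sides are cut right after their last common block, so the next pair of fictive lines starts at matching content again.

```python
from difflib import SequenceMatcher


def strip_ws(text):
    """Drop all whitespace, imitating 'blanks are unimportant'."""
    return "".join(text.split())


def aligned_chunks(a, b, width):
    """Yield chunk pairs of at most `width` characters that stay in sync."""
    i = j = 0
    while i < len(a) or j < len(b):
        ca, cb = a[i:i + width], b[j:j + width]
        end_a, end_b = len(ca), len(cb)
        if ca != cb and ca and cb:
            matcher = SequenceMatcher(None, ca, cb, autojunk=False)
            # The final entry is a zero-length sentinel, so filter it out.
            blocks = [m for m in matcher.get_matching_blocks() if m.size]
            if blocks:
                last = blocks[-1]
                # Cut both sides right after their last common block.
                end_a, end_b = last.a + last.size, last.b + last.size
        yield a[i:i + end_a], b[j:j + end_b]
        i += end_a
        j += end_b


# Made-up sample data: A has 3 extra blanks compared to B.
line_b = "".join(f"<e{n:03d}/>" for n in range(20))
line_a = line_b[:28] + "   " + line_b[28:]
pairs = list(aligned_chunks(line_a, line_b, 20))

different = [n for n, (x, y) in enumerate(pairs) if x != y]
important = [n for n, (x, y) in enumerate(pairs) if strip_ws(x) != strip_ws(y)]
print(different, important)
```

In this sample only the chunk pair that contains the blanks differs, and it differs only in whitespace. Note that `SequenceMatcher` can get slow on chunks of tens of thousands of characters, so this is an illustration of the idea rather than a practical tool.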

Background:
I am a colleague of Roland, and I am currently dealing with the differences between two sets of half a million XHTML documents. They are not “pretty formatted”, e.g. because the extra whitespace is unwanted and could also cause different spacing in the HTML pages. Also, because these are legal texts, recognizing differences is crucial and any manipulation of the data should be prohibited. Preprocessing is also considered a source of errors.

Unformatted XML is almost always one or a couple of huge lines. The documents we deal with are all smaller than 5 MB.

Bye
Miro

**Aaron** · 16-May-2013, 04:01 PM

Hello Miro,

Thanks for the suggestion. Comparing across line breaks and XML comparisons are on our wishlist. I'll add your ideas to our entry on the subject.
