ignoring CRLF

  • mikedgre
    Enthusiast
    • Jan 2018
    • 28

    ignoring CRLF

    I am comparing two data files from a mainframe. A data record can contain a CRLF; it shows in the record as hex value 15. When the compare results are displayed, the CRLF is interpreted and a line break is performed, so the record splits into two lines in the compare output. For example, if 10 lines each contain a CRLF, 20 lines show in the compare output.
    Can the CRLF be ignored and treated as part of the data? This doubling of lines is skewing the results.
    Many unexpected split lines show up. The encoding is ANSI.


    Thanks.
  • Aaron
    Team Scooter
    • Oct 2007
    • 15941

    #2
    If you use the Table Compare, using a format with a defined string can include the CRLF in the data of a cell.

    The Table Compare is column/row defined cells, with a default Key as Column 1. You can define different Key columns by right clicking any column header and setting each as Standard, Unimportant, or Key (and multiple Key columns act together as the Key).
    Aaron P Scooter Software


    • mikedgre
      Enthusiast
      • Jan 2018
      • 28

      #3
      update

      The file does not have delimiters (semicolon, comma, etc.).
      The CRLF is needed for source code compares because there a CRLF always means a line break.
      If I have 1 million records, hundreds of them can contain a CRLF when records hold over 1000 bytes of data.
      I tried the Table Compare but it doesn't work with a non-delimited file.

      If I replace each CRLF with a space before doing the compare, there will be no unneeded line breaks.
      This is not ideal since I have to copy the file and convert the copy to remove the CRLFs.
      It seems an option to skip the line break when BC sees a CRLF in data files would be a good feature to have.
      It would include the CRLF in the compare but not perform the line break in the compare result. Line breaking is fine for source code or report-type files though. I am only referring to non-delimited raw data files.
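The replace-before-compare workaround described above could be sketched roughly as follows. This is only an illustration, not a Beyond Compare feature: the function name `blank_embedded_breaks` is hypothetical, and it assumes every record has the same fixed length and is followed by a two-byte CRLF terminator that should be kept.

```python
def blank_embedded_breaks(data: bytes, record_len: int,
                          terminator: bytes = b"\r\n") -> bytes:
    """Replace CR and LF bytes inside each fixed-length record with spaces.

    Assumption: the file is laid out as record_len payload bytes followed
    by `terminator`, repeated. The terminator bytes are copied unchanged.
    """
    stride = record_len + len(terminator)
    table = bytes.maketrans(b"\r\n", b"  ")  # CR -> space, LF -> space
    out = bytearray()
    for start in range(0, len(data), stride):
        payload = data[start:start + record_len]
        out += payload.translate(table)
        # Copy whatever follows the payload (normally the terminator) as is.
        out += data[start + record_len:start + stride]
    return bytes(out)


if __name__ == "__main__":
    sample = b"AB\nD\r\nEFGH\r\n"  # two 4-byte records, first has an LF inside
    print(blank_embedded_breaks(sample, record_len=4))
```

Because the replacement is byte for byte, record lengths and offsets stay the same, but the converted copy no longer shows where the embedded break bytes were.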



      Last edited by mikedgre; 15-Feb-2018, 04:34 PM.


      • Aaron
        Team Scooter
        • Oct 2007
        • 15941

        #4
        Hello,

        Are your data files Fixed Width (every column is a specific number of characters)? Because the Table Compare format can be defined for either Delimited or Fixed style in the Format button (or Tools menu -> Formats dialog, new Table format).

        Is the hex value 15 the line break in the interior of your lines, or the one that should be respected? What differentiates the line breaks you need ignored from those that should break the line? If you can define any string to swallow the former, then the Table Compare can encapsulate the entire line as a single cell (if the Fixed style doesn't work).

        Would it be possible to send us example files (and note an example line number/character position of the line break you want to ignore vs. one you want preserved) to [email protected]?

        • mikedgre
          Enthusiast
          • Jan 2018
          • 28

          #5
          no progress...

          Yes. The hex is 15 for this CRLF. The field widths are fixed, but there are rare times where this may not be the case.
          I do not want to map the fields out every time. Some files have over 200 fields, which can be packed, binary, etc.
          I was hoping the single-cell approach would work, but it only shows a maximum number of bytes in the compare result. I only see part of the record, and changed bytes do not show if they are later in a long record.
          After looking further, this line splitting is not that much of a negative. Because I am only focused on differences, the chance of hitting CRLF splits is minimal. When I share this with others, I will just mention that this can happen. However, it would be nice to have a checkbox to suppress CRLF line breaks when the user is only doing a raw data file compare.

          The CRLF happens at random places and can be at any position in the record. It could be byte 17 in one record, byte 85 in another, or there could be 2, 3, 4, 5, 6 or more instances of CRLF in the record. Having BC not 'process' a CRLF would be very nice in this case. I would say the best solution is to treat the CRLF like any other data in the compare.


          Last edited by mikedgre; 16-Feb-2018, 02:26 PM.


          • Aaron
            Team Scooter
            • Oct 2007
            • 15941

            #6
            Hello,

            Ok, thanks. To clarify, what character does signify the end of your lines? It looks like you have mostly fixed width, except when it's not? Is there a different line ending character that you are using?

            • mikedgre
              Enthusiast
              • Jan 2018
              • 28

              #7
              I'm not sure how the end of line gets produced. All the records have the same length, so a CRLF must be attached to the end of each record. I set the maximum characters per line to 3000. Each record is 352 bytes long. On the mainframe, the record can contain a CRLF or hex value 15, randomly buried inside the record. Does your software detect the mainframe file's record length and insert a CRLF at the end of each record? I think telling BC to process the 352 bytes of each record 'as is', with only the CRLF at the end, would work. I'm just brainstorming.
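The "process 352 bytes of record as is" idea can be illustrated outside BC with a short script. This is a rough sketch under stated assumptions, not how Beyond Compare works internally: the names `records` and `diff_records` are made up for the example, and the layout (fixed-length payload followed by a two-byte terminator) is assumed from the description above.

```python
from typing import Iterator, List, Tuple


def records(data: bytes, record_len: int,
            terminator_len: int = 2) -> Iterator[bytes]:
    """Yield each fixed-length payload, skipping the terminator after it."""
    stride = record_len + terminator_len
    for start in range(0, len(data), stride):
        yield data[start:start + record_len]


def diff_records(left: bytes, right: bytes, record_len: int = 352,
                 terminator_len: int = 2) -> List[Tuple[int, List[int]]]:
    """Return (record index, differing byte offsets) for unequal records.

    CR and LF bytes inside a payload are compared like any other byte.
    Extra records at the end of the longer file are not reported.
    """
    result = []
    pairs = zip(records(left, record_len, terminator_len),
                records(right, record_len, terminator_len))
    for index, (a, b) in enumerate(pairs):
        if a != b:
            offsets = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
            result.append((index, offsets))
    return result


if __name__ == "__main__":
    old = b"AB\nD\r\nEFGH\r\n"  # 4-byte records for the demo
    new = b"AB\nD\r\nEFXH\r\n"
    print(diff_records(old, new, record_len=4))
```

Slicing by length rather than splitting on line-break bytes is what keeps an embedded 0A or 0D from starting a new record.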


              • Dave_L
                Veteran
                • Dec 2007
                • 348

                #8
                a CRLF or hex value 15
                I'm curious about what kind of environment this is, if it's not proprietary information.

                The conventional line terminator characters are CR (carriage return), which is hexadecimal D (decimal 13); and LF (line feed), which is hexadecimal A (decimal 10). These two characters occur by themselves or in the combination CR followed by LF. The terms "carriage return" and "line feed", of course, are taken from the old mechanical typewriters or printers, where they correspond to physical movements of the carriage.

                In this context, I've only seen "15" as the octal representation of hexadecimal D (decimal 13).
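To keep the number bases straight, the same digits "15" mean different values in octal and in hexadecimal. A small sketch (the variable names are illustrative only):

```python
# ASCII line terminators as described above.
CR = 0x0D  # carriage return, decimal 13
LF = 0x0A  # line feed, decimal 10

# The digits "15" read in two different bases.
as_octal = int("15", 8)   # 13, the same value as CR
as_hex = int("15", 16)    # 21, a different byte altogether

print(CR, LF)              # 13 10
print(oct(CR), hex(CR))    # 0o15 0xd
print(as_octal == CR)      # True
print(as_hex)              # 21
```

So a "15" read as octal is the carriage return, while a "15" read as hex is decimal 21.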


                • mikedgre
                  Enthusiast
                  • Jan 2018
                  • 28

                  #9
                  correction

                  I have a correction. Pardon me. The hex value that I see in the file is 15. Sorry about that. I used the chart at the following link to look up hex 15: https://www.ibm.com/support/knowledg...ef/asciit.html

                  It is 'new line' from what I see. New line must be causing the 'line break' and not CRLF.

                  Decimal Value = 21
                  Hex Value = 15
                  Control Character = Ctrl-U
                  ASCII Symbol = NAK
                  Meaning = negative acknowledge
                  EBCDIC Symbol = NL
                  Meaning = new-line





                  • Dave_L
                    Veteran
                    • Dec 2007
                    • 348

                    #10
                    mikedgre, thanks for the information.


                    • mikedgre
                      Enthusiast
                      • Jan 2018
                      • 28

                      #11
                      I also see hex 25 producing line breaks. This corresponds to line feed.

                      I think these records with hex 15 or 25 will show up in high-volume files, so it is not an issue for me. I only filter on differences, so I expect the percentage of the file having these breaks to be microscopic (0.1 percent or less). Thanks.
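To see how often these bytes occur, the raw records could be scanned before comparing. The sketch below assumes the file was transferred in binary mode, so the bytes are still EBCDIC and untranslated; the names `EBCDIC_BREAKS` and `find_breaks` are hypothetical, and the table only lists the byte values mentioned in this thread plus carriage return.

```python
from typing import List, Tuple

# EBCDIC control bytes that can end up displayed as line breaks.
EBCDIC_BREAKS = {
    0x0D: "CR",  # carriage return
    0x15: "NL",  # new-line
    0x25: "LF",  # line feed
}


def find_breaks(record: bytes) -> List[Tuple[int, str]]:
    """Return (offset, name) for every break byte found in one record."""
    return [(offset, EBCDIC_BREAKS[value])
            for offset, value in enumerate(record)
            if value in EBCDIC_BREAKS]


if __name__ == "__main__":
    # 0xC1 and 0xC2 are EBCDIC 'A' and 'B'.
    record = bytes([0xC1, 0x15, 0xC2, 0x25])
    print(find_breaks(record))
```

The scan knows nothing about field layout, so a 0x15 or 0x25 byte that is ordinary data inside a packed or binary field is reported as well.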


                      • Aaron
                        Team Scooter
                        • Oct 2007
                        • 15941

                        #12
                        It sounds like you might be using an EBCDIC mainframe over FTP? These usually perform the conversion themselves for the newline character for FTP clients. If you are in the Text Compare, and use the View menu -> Hex Details, what is the hex value at the end of the line, and is it the same hex value for the data you are trying to ignore?

                        • mikedgre
                          Enthusiast
                          • Jan 2018
                          • 28

                          #13
                          values shown in editor on line breaks...

                          Attached are 4 clippings of what I see. I captured the hex value on the mainframe. I put hex 15 in the first 2 records shown and hex 25 in the 3rd record in the clipping, 8 bytes into the record. I type the command 'hex on' on the command line in the editor to see this hex representation and enter these 2 values. For the BC editor in text mode, see the 'last char of lines' clip I attached. The pink circle with 4 marks protruding from it is the last character I see on all the line endings. I was only able to see zero in hex mode for this end-of-line value.

                          For the lines that have the random unexpected line break, I attached a snip showing the value. It is the pink 0A in hex mode. The text version looks like a J with a half moon at the top. In the chart, I see the control character representation as Control + J. If I press Control + J in Notepad, it jumps to the next line with no value.
                          I decided to test the hex representation of CRLF in the editor. I hit Enter somewhere in the line and 0D 0A shows as the hex (CRLF) value.

                          In my opinion, the hex 15 and hex 25 on the mainframe are showing as hex 0A in the editor. Do you know what is going on?


                          • Aaron
                            Team Scooter
                            • Oct 2007
                            • 15941

                            #14
                            Hello,

                            When acting as an FTP server, the mainframe will accept FTP commands from a client, including ASCII or binary transfers. These usually translate line ending characters into something an FTP client could read. It looks like your file has both LF (0A) and CRLF (0D0A) line ending characters in the file.
                            https://en.wikipedia.org/wiki/Newlin...specifications

                            We don't have explicit control over this, but should behave similarly to other plain FTP clients. We do have some options under the Tools menu -> Profiles dialog to edit settings (like Transfer type), although I'd expect to see a CRLF or LF when accessing an FTP Server.

                            BC4 will treat both as line endings, and handles files with mixed line endings (theoretically, a file should be generated with one or the other, not both, but this allows you to find and fix mixed-ending files). Graphically, we show two different symbols (circle and J), one for each type of line ending.
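For checking a transferred file for mixed line endings outside of BC, something along these lines could work. It is a minimal sketch, not part of Beyond Compare; the function name `count_line_endings` is made up for the example.

```python
import re
from typing import Dict

# Match CRLF first so its two bytes are not counted as a lone CR plus a lone LF.
_LINE_ENDING = re.compile(rb"\r\n|\r|\n")


def count_line_endings(data: bytes) -> Dict[str, int]:
    """Count CRLF pairs, lone CR bytes and lone LF bytes in a byte string."""
    counts = {"CRLF": 0, "CR": 0, "LF": 0}
    for match in _LINE_ENDING.finditer(data):
        token = match.group()
        if token == b"\r\n":
            counts["CRLF"] += 1
        elif token == b"\r":
            counts["CR"] += 1
        else:
            counts["LF"] += 1
    return counts


if __name__ == "__main__":
    print(count_line_endings(b"a\r\nb\nc\rd\r\n"))
```

If more than one count is non-zero, the file has mixed line endings, which would match the two different symbols shown in the editor.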

                            • mikedgre
                              Enthusiast
                              • Jan 2018
                              • 28

                              #15
                              more info and suggestion

                              The character that I see at the end of each line is the CRLF. That was the circle-with-4-marks symbol I was referring to.
                              I noticed this after hitting Enter inside the record. I also see that a carriage return (0D) by itself can give a line break.
                              So the combinations producing a line break are CR only (0D), LF only (0A), and CRLF (0D0A).

                              I copied each of these record scenarios into MS Word. All of these records do the line break there too. The editor interprets the CR, LF, and CRLF and does the line break. BC 4 follows this.
                              Notepad does NOT do any breaking when it gets CR, LF, and CRLF. The FTP tool picks up the record length and builds the records from the length it detects.

                              I am armed with knowledge about what is going on now. Do you think BC could have an option or enhancement for a hands-off approach like Notepad's? Notepad does not line break on CR, LF, or CRLF. I can see this being a plus in my case since I want to compare high-volume raw data files. Thanks for your help.


