Ignore accumulated Byte Order Markers (BOM)

  • lbeazley
    New User
    • Nov 2015
    • 2

    Ignore accumulated Byte Order Markers (BOM)

    I've been trying to get BC4 to ignore accumulated BOMs at the front of a text file, but I can't seem to find the right regular expression magic.

    I have UTF-16 text files in a build that accumulate UTF-16LE Byte Order Markers (BOM) at the front of the file due to a bug in a tool. A new raw file has a BOM pair, 0xFF, 0xFE, at the beginning of the file but due to the bug an additional pair or two may get added each time the file is processed. While we get the vendor to fix the tool, I'd like to create a file format/grammar for these files that ignores the accumulated BOM.

    Any education or suggestions on regular expressions involving one or more pairs of hex bytes and \x would be greatly appreciated.
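For readers wanting a concrete picture of "one or more pairs of hex bytes", here is a minimal sketch in Python. It uses Python's `re` module on raw bytes, not Beyond Compare's grammar engine, so the syntax is only an analogy; the helper names `count_leading_boms` and `keep_single_bom` are hypothetical and do not come from the thread.

```python
import re

# One or more UTF-16LE BOM byte pairs (0xFF, 0xFE) at the very start of the data.
LEADING_BOMS = re.compile(rb"\A(?:\xff\xfe)+")


def count_leading_boms(data: bytes) -> int:
    """Return how many BOM pairs sit at the front of the raw bytes."""
    match = LEADING_BOMS.match(data)
    return len(match.group(0)) // 2 if match else 0


def keep_single_bom(data: bytes) -> bytes:
    """Collapse a run of leading BOM pairs down to a single pair."""
    return LEADING_BOMS.sub(b"\xff\xfe", data, count=1)


# Hypothetical sample: three BOM pairs followed by UTF-16LE text.
sample = b"\xff\xfe" * 3 + "// header".encode("utf-16-le")
print(count_leading_boms(sample))
print(keep_single_bom(sample)[:4])
```

The non-capturing group `(?:...)+` is what expresses "one or more pairs"; the `\A` anchor restricts the match to the start of the file.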
  • Aaron
    Team Scooter
    • Oct 2007
    • 16002

    #2
    Hello,

    Our Text Compare shows the encoding in the status bar, but the BOM itself is not part of the comparison text below it, so a rules-based scan would normally ignore it. Are there also hex or binary characters inserted in the Text Compare's main text pane? If you enable View menu -> Hex Details, what does it show for the first line? Could we see a full-screen screenshot? You can post it here or email it to [email protected] along with a link back to this forum thread.

    Generally, you are correct: you would define an unimportant grammar element as a regular expression, using \x to specify a hex code.
    http://www.scootersoftware.com/suppo..._unimportantv3
    But you would want to verify that the hex data you are trying to ignore is actually represented in the main text pane.
    Aaron P Scooter Software
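As a rough way to check this outside Beyond Compare, the sketch below decodes raw bytes with Python's `utf-16` codec, which consumes only the first BOM and leaves any further ones in the text as U+FEFF characters. Whether Beyond Compare treats the extra BOMs the same way is an assumption here, and the function name `leading_bom_chars` is hypothetical.

```python
def leading_bom_chars(data: bytes) -> int:
    """Count U+FEFF characters left at the start of the decoded text."""
    # The "utf-16" codec uses the first BOM to pick the byte order and drops it;
    # any additional BOMs survive as ordinary U+FEFF characters.
    text = data.decode("utf-16")
    return len(text) - len(text.lstrip("\ufeff"))


# Hypothetical sample: one normal BOM plus two accumulated ones.
sample = b"\xff\xfe" * 3 + "// header".encode("utf-16-le")
print(leading_bom_chars(sample))
```

A non-zero count suggests the extra BOMs are part of the text content rather than encoding metadata, which is the case an unimportant-text rule would need to cover.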


    • lbeazley
      New User
      • Nov 2015
      • 2

      #3
      Thank you for your time (and the great tool).
      I attached a screenshot and an archive of the example files used in the screenshot.
      The files in the archive are of type .uni, which are UTF-16LE text files.
      I should add:
      File Example1.uni is an example of an accumulation of two additional pairs of BOM.
      File Example2.uni is an example of an accumulation of one additional pair of BOM.
      A normal file would only have one pair.
      Last edited by lbeazley; 12-Mar-2018, 03:34 PM.


      • Aaron
        Team Scooter
        • Oct 2007
        • 16002

        #4
        Hello,

        The quick answer is that little-endian requires inverting the byte order in the \x{NNNN} sequence: the bytes 0xFF, 0xFE together form the single code unit U+FEFF, so it would be:
        \x{FEFF}
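The byte swap can be checked with a short Python sketch. Note that Python's `re` module does not accept the `\x{FEFF}` form, so the pattern below uses the equivalent `\ufeff` escape; it illustrates the idea and is not Beyond Compare syntax. The function name `strip_leading_boms` is hypothetical.

```python
import re

BOM_BYTES = b"\xff\xfe"  # the byte pair as stored on disk in UTF-16LE

# Read as a little-endian 16-bit value, the pair is the code unit 0xFEFF.
code_unit = int.from_bytes(BOM_BYTES, byteorder="little")
assert code_unit == 0xFEFF
assert BOM_BYTES.decode("utf-16-le") == "\ufeff"

# One or more BOM characters at the start of already-decoded text.
LEADING_BOM_CHARS = re.compile(r"\A\ufeff+")


def strip_leading_boms(text: str) -> str:
    """Remove any run of U+FEFF characters from the start of the text."""
    return LEADING_BOM_CHARS.sub("", text)


print(hex(code_unit))
print(strip_leading_boms("\ufeff\ufeff// header"))
```

Matching against decoded text means the pattern works on code units, not raw bytes, which is why the value is written as FEFF even though the file stores FF first.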

        Another method is to select the literal invisible character(s): place the cursor just left of the first /, then use Shift+arrow to select the invisible characters (you can see the selection in View menu -> Hex Details below) and copy them to the clipboard. Then, in Session Settings, on the Importance tab, create a new Unimportant element and paste in the invisible, literal character. Make sure it *isn't* set as a Regular Expression, and click OK to ignore the literal (invisible) character from the clipboard.
        Aaron P Scooter Software
