
Ignore accumulated Byte Order Markers (BOM)


  • Ignore accumulated Byte Order Markers (BOM)

    I've been trying to get BC4 to ignore accumulated BOMs at the front of a text file, but I can't seem to find the right regular expression magic.

    I have UTF-16 text files in a build that accumulate UTF-16LE Byte Order Markers (BOM) at the front of the file due to a bug in a tool. A new raw file has a single BOM pair, 0xFF 0xFE, at the beginning of the file, but due to the bug an additional pair or two may get added each time the file is processed. Until the vendor fixes the tool, I'd like to create a file format/grammar for these files that ignores the accumulated BOMs.
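To make the symptom concrete, here is a minimal Python sketch that counts the BOM pairs at the start of a file's raw bytes and keeps only one. It works outside Beyond Compare, and the function names and the sample content are hypothetical, not taken from the affected build.

```python
# Sketch only: helper names and sample data are illustrative assumptions.
UTF16LE_BOM = b"\xff\xfe"  # the byte pair described above


def count_leading_boms(data: bytes) -> int:
    """Count consecutive UTF-16LE BOM pairs at the start of data."""
    count = 0
    while data.startswith(UTF16LE_BOM, count * 2):
        count += 1
    return count


def strip_extra_boms(data: bytes) -> bytes:
    """Keep exactly one leading BOM, dropping any accumulated extras."""
    n = count_leading_boms(data)
    return data[2 * (n - 1):] if n > 1 else data


# A file that has been through the buggy tool twice: three BOM pairs.
sample = UTF16LE_BOM * 3 + "// header".encode("utf-16-le")
```

Running `count_leading_boms` over the files in a build is one way to see how many of them are affected before deciding how to handle them in the comparison.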

    Any education or suggestions on regular expressions involving one or more pairs of hex bytes and \x would be greatly appreciated.

  • #2

    Our Text Compare will show the Encoding in the status bar, but the BOM itself is not in the comparison text below. A rules-based scan would normally ignore this information. Are there also hex or binary characters inserted in the Text Compare's main text pane? If you enable the View menu -> Hex Details, what does it show for the first line? Could we see a full-screen screenshot? You can post it here or email us, along with a link back to this forum thread.

    Generally, you are correct, you'd define an unimportant grammar RegEx with \x which can define a hex code.
    But you would want to verify the Hex info you are trying to ignore is actually represented in the main text pane.
    Aaron P Scooter Software


    • #3
      Thank you for your time (and the great tool)
      I attached a screenshot and an archive of the example files used in the screenshot.
      The files in the archive are of type .uni which are UTF-16LE Text Files
      I should add:
      File Example1.uni is an example of an accumulation of two additional pairs of BOM.
      File Example2.uni is an example of an accumulation of one additional pair of BOM.
      A normal file would only have one pair.
      Last edited by lbeazley; 12-Mar-2018, 03:34 PM.


      • #4

        The quick answer is Little Endian requires inverting the \x{NNNN} sequence, so it'd be:

        Another method is to select the literal blank character(s): place the cursor just left of the first /, then use Shift+arrow to select the invisible characters (you can see what is selected in the View menu -> Hex Details below) and copy them to the clipboard. Then, in the Session Settings, Importance tab, create a new Unimportant element and paste this invisible, literal character in. Make sure it *isn't* marked as a Regular Expression, and click OK to ignore the literal (invisible) character from the clipboard.
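Both approaches target the same invisible character. As a rough illustration in Python (this is not Beyond Compare's regex syntax, and the sample content is made up), the bytes 0xFF 0xFE on disk decode in little-endian order to the single code point U+FEFF, which is why the hex digits appear swapped relative to the byte order in the file:

```python
import re

# Assumed sample: three BOM pairs followed by some text, as UTF-16LE bytes.
raw = b"\xff\xfe" * 3 + "// header".encode("utf-16-le")

# Python's "utf-16" codec consumes only the first BOM to pick the byte order;
# each extra pair survives in the decoded text as the character U+FEFF.
text = raw.decode("utf-16")

# Match the run of leftover U+FEFF characters at the start of the text.
leading = re.match(r"\ufeff*", text).group(0)

# Strip them, leaving only the visible content.
cleaned = re.sub(r"\A\ufeff+", "", text)
```

Here `leading` holds the two leftover invisible characters and `cleaned` holds only the visible text, which mirrors what an Unimportant element is meant to ignore in the comparison.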
        Aaron P Scooter Software