  1. #1
    Join Date
    Nov 2015
    Posts
    2

    Ignore accumulated Byte Order Markers (BOM)

    I've been trying to get BC4 to ignore accumulated BOM at the front of a text file but can't seem to find the right regular expression magic.

    I have UTF-16 text files in a build that accumulate UTF-16LE Byte Order Markers (BOM) at the front of the file due to a bug in a tool. A new raw file has a BOM pair, 0xFF, 0xFE, at the beginning of the file but due to the bug an additional pair or two may get added each time the file is processed. While we get the vendor to fix the tool, I'd like to create a file format/grammar for these files that ignores the accumulated BOM.
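    To make the accumulation concrete, here is a minimal Python sketch that counts how many BOM pairs sit at the front of a byte string. The function name `count_leading_boms` and the sample contents are made up for illustration; they are not taken from the actual build files.

```python
import codecs

BOM = codecs.BOM_UTF16_LE  # the byte pair 0xFF, 0xFE


def count_leading_boms(data: bytes) -> int:
    """Return how many consecutive UTF-16LE BOM pairs start the data."""
    count = 0
    # Step through the data two bytes at a time while each pair is a BOM.
    while data.startswith(BOM, count * len(BOM)):
        count += 1
    return count


# Hypothetical sample contents, not the real files.
body = "// header".encode("utf-16-le")
new_file = BOM + body            # a new raw file: one BOM pair
processed_file = BOM * 3 + body  # after the tool added two more pairs

new_count = count_leading_boms(new_file)
processed_count = count_leading_boms(processed_file)
```

    A count above 1 would mark a file that has been through the buggy tool at least once.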

    Any education or suggestions on regular expressions involving one or more pairs of hex bytes and \x would be greatly appreciated.

  2. #2
    Join Date
    Oct 2007
    Location
    Madison, WI
    Posts
    11,953

    Hello,

    Our Text Compare shows the Encoding in the status bar, but the BOM itself is not part of the comparison text below, so a rules-based scan would normally ignore it. Are there also hex or binary characters inserted in the Text Compare's main text pane? If you enable the View menu -> Hex Details, what does it show for the first line? Could we see a full-screen screenshot? You can post it here or email it to support@scootersoftware.com along with a link back to this forum thread.

    In general you are correct: you would define an unimportant grammar element as a regular expression, using \x to specify a hex code.
    http://www.scootersoftware.com/suppo..._unimportantv3
    But you would want to verify that the hex data you are trying to ignore is actually represented in the main text pane.
    Aaron P Scooter Software

  3. #3
    Join Date
    Nov 2015
    Posts
    2

    Thank you for your time (and the great tool).
    I attached a screenshot and an archive of the example files used in the screenshot.
    The files in the archive are of type .uni, which are UTF-16LE text files.
    I should add:
    File Example1.uni is an example of an accumulation of two additional pairs of BOM.
    File Example2.uni is an example of an accumulation of one additional pair of BOM.
    A normal file would only have one pair.
    Last edited by lbeazley; 12-Mar-2018 at 03:34 PM.

  4. #4
    Join Date
    Oct 2007
    Location
    Madison, WI
    Posts
    11,953

    Hello,

    The quick answer is that Little Endian requires inverting the byte pair in the \x{NNNN} sequence: the bytes in the file are 0xFF, 0xFE, but the sequence names the character, so it'd be:
    \x{FEFF}
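
    The byte-order point above can be checked outside Beyond Compare with a short Python sketch. The sample bytes are hypothetical stand-ins for the attached .uni files, and Python's `re` module writes the code point as `\ufeff` rather than `\x{FEFF}`, so this illustrates the idea and is not the exact pattern syntax BC4 uses.

```python
import codecs
import re

# Hypothetical content: three UTF-16LE BOM pairs (one expected, two
# accumulated) followed by some text. Not the real .uni files.
raw = codecs.BOM_UTF16_LE * 3 + "// header".encode("utf-16-le")

# Decoding as plain little-endian keeps every BOM as the character U+FEFF.
text = raw.decode("utf-16-le")

# On disk the pair is FF FE, but the character it encodes is U+FEFF,
# which is why the pattern names FEFF rather than FFFE.
bom_bytes = "\ufeff".encode("utf-16-le")

# Match one or more U+FEFF characters at the very start of the text.
leading_boms = re.compile(r"^\ufeff+")
match = leading_boms.match(text)
bom_count = len(match.group(0))
remainder = leading_boms.sub("", text)
```

    The `+` quantifier is what lets a single pattern cover one, two, or more accumulated pairs.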

    Another method is to select the literal blank character(s): place the cursor just left of the first /, then use Shift+arrow to select the invisible characters (you can see the selection in the View menu -> Hex Details pane below), copy to the clipboard, and then in the Session Settings, Importance tab, create a new Unimportant element and paste this invisible, literal character in. Make sure it *isn't* marked as a Regular Expression, and click OK to ignore the literal (invisible) character from the clipboard.
    Aaron P Scooter Software
