Regular expression delimiters


  • Regular expression delimiters

    In some regular expression dialects, you can use \< and \> to delimit a word, e.g., \<map\> would find map by itself but not find map in remap or in mapper. This capability seems to be missing in the regular expressions in BC. Is there some way to emulate it? Also, the use of [ and ] seems to be missing, so it looks like I will have to use ( and ) for alternate character selections.

  • #2
    Re: Regular expression delimiters

    The use of [ ], such as [a-z] or [ABCD] to specify multiple characters is supported in BC's regular expression engine.

    Word boundaries are represented by the \b expression. The regular expression \bmap\b will match map but it won't match remap.
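
The behaviour described above can be checked in any engine that implements \b. The following is a minimal sketch using Python's `re` module, chosen only for illustration; it says nothing about what BC's own engine accepts. The lookaround pattern is a common way to spell the same test in engines that offer lookaround, and the sample strings are made up.

```python
import re

# \b matches at the edge between a word character and a non-word character.
boundary = re.compile(r"\bmap\b")

# Lookaround spelling of the same test, for engines that have lookaround.
lookaround = re.compile(r"(?<!\w)map(?!\w)")

# Hypothetical sample strings.
samples = ["map", "remap", "mapper", "a map here"]

boundary_hits = [s for s in samples if boundary.search(s)]
lookaround_hits = [s for s in samples if lookaround.search(s)]

print(boundary_hits)    # ['map', 'a map here']
print(lookaround_hits)  # ['map', 'a map here']
```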

    I'm not familiar with the \< and \> delimiters, but I don't think they're supported by BC's regular expression library.
    Chris K Scooter Software



    • #3
      Originally posted by Chris
      Re: Regular expression delimiters

      Word boundaries are represented by the \b expression. The regular expression \bmap\b will match map but it won't match remap.
      It doesn't work. I've tried to use it to match keywords (i.e. via a RegEx "\b(if|and|or)\b") but nothing matches at all.


      The problem would be solved if keyword/token lists were automatically treated by BC4 as word-bounded tokens (IMO this would be the easiest solution). But this is not the case: subwords are matched as false positives, e.g. "or" inside "door" is captured as a keyword.

      A similar problem occurs with number definitions, e.g. "\d+" causes the numeric part of an identifier to be seen as a number token (e.g. the "1" in "func1"). Again, using "\b\d+\b" stops any number from matching at all.
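
For reference, this is how the same patterns behave in an engine that does implement \b. It is a minimal sketch in Python's `re` module with made-up input strings; it shows the expected result, not what BC4 does.

```python
import re

# Unanchored patterns: match anywhere, including inside longer tokens.
loose_keywords = re.findall(r"(if|and|or)", "door or window")
loose_numbers = re.findall(r"\d+", "func1 + 23")

# Word-bounded patterns: match only whole tokens.
strict_keywords = re.findall(r"\b(if|and|or)\b", "door or window")
strict_numbers = re.findall(r"\b\d+\b", "func1 + 23")

print(loose_keywords)   # ['or', 'or']  (the first hit is inside "door")
print(strict_keywords)  # ['or']
print(loose_numbers)    # ['1', '23']   (the "1" belongs to "func1")
print(strict_numbers)   # ['23']
```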

      As of now, in BC4 v4.2.9, there seems to be no way to define a keyword list that won't lead to false positives on sub-words.

      I'm not sure if this is a bug or a limitation of the RegEx engine subset, but support for "\b" is definitely needed (since lookahead isn't supported either). I wouldn't worry too much about performance today, since CPUs are fast, so adding a full RegEx engine (e.g. Oniguruma) would be an improvement. Then, of course, it's up to end users to use RegEx patterns wisely and avoid expensive ones.

      But right now, there is very limited support for custom syntaxes.



      • #4
        Hello,

        In reference to BC2 and the original post from 2005, \b does function, but assumes whitespace to determine an exact match, so \b\d+\b matches on "123" but not "func123".

        If working with a newer version of BC4, \b is not supported as part of a Grammar Element, but is supported in the Session Settings, Importance tab, Unimportant Text definition. Expanding the Regular Expression engine for grammar elements is something on our wishlist.

        In your above example, the "LIST" type of BC4 grammar definition is a list of keywords that functions like your example (assuming whitespace matches). You can compare against the default examples (C#, Delphi, HTML, etc) to see how to create a list of keywords. If you do not have BC4 yet, you can test with the fully functional trial, here:
        http://www.scootersoftware.com/download.php
        Aaron P Scooter Software



        • #5
          Thanks for your reply Aaron,

          Originally posted by Aaron
          If working with a newer version of BC4, \b is not supported as part of a Grammar Element ... Expanding the Regular Expression engine for grammar elements is something on our wishlist.
          That's a pity, because grammars are an area that would have benefited greatly from more powerful RegEx (especially lookaround). Looking forward to its implementation.

          Originally posted by Aaron
          In your above example, the "LIST" type of BC4 grammar definition is a list of keywords that functions like your example (assuming whitespace matches). You can compare against the default examples (C#, Delphi, HTML, etc) to see how to create a list of keywords. If you do not have BC4 yet, you can test with the fully functional trial, here:
          http://www.scootersoftware.com/download.php
          I'm a long-time BC4 user; I just never posted on the forums before.

          As for the grammar examples, I've looked into C, etc., but extensive tests over the last few years have confirmed the above-mentioned problem.

          In a language where "in" and "or" are keywords, identifiers like "origin" will produce a false positive: ORigIN (where OR and IN are seen as keywords although they are part of a bigger token).

          I think that the tokens in the list should only match when they are preceded by a non-alphanumeric character; at least, this is the expected behaviour in most syntax highlighters and editor syntax definitions (having written quite a number of these, I find the system in BC4 rather impractical).

          With some custom BC4 grammars, I've managed to get results by excluding false positives via RegExs that match first as identifiers (which have higher priority than keywords).

          E.g.: "(\w+for|for\w+)" to prevent "forth" from being seen as the "for" keyword. But this can soon become quite painful (and computationally expensive). Still, I have to use this approach to prevent the 1 in "test1" from being seen as a number, i.e. capture "(\w+\d|\d\w+)" as an identifier before it's matched by the number regex.
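
The workaround above can be sketched as a first-match-wins scanner. This is a minimal Python model, assuming rules are tried in order at each position; the `RULES` table and `tokenize` function are hypothetical names for illustration, and BC4's real grammar engine may work differently.

```python
import re

# Rules are tried top to bottom at each position; the first match wins.
# The two "identifier" guards are the workaround patterns quoted above.
RULES = [
    ("identifier", re.compile(r"\w+for|for\w+")),  # guard for the keyword "for"
    ("identifier", re.compile(r"\w+\d|\d\w+")),    # guard for numbers
    ("keyword", re.compile(r"for")),
    ("number", re.compile(r"\d+")),
]

def tokenize(text):
    """Return (kind, lexeme) pairs; characters no rule matches are skipped."""
    tokens = []
    pos = 0
    while pos < len(text):
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            pos += 1  # no rule matched at this position
    return tokens

print(tokenize("for forth test1 7"))
# [('keyword', 'for'), ('identifier', 'forth'), ('identifier', 'test1'), ('number', '7')]
```

One side effect worth noting: because \w also matches digits, the guard "\w+\d" swallows multi-digit numbers too, so in this model `tokenize("42")` yields an identifier rather than a number.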

          Again, having \b or lookaround would have made all this a trivial task.

          Syntax highlighting can really help when sifting through long diffs, especially when integrating BC4 with Git to resolve three-way merges, where one has to visually spot the various elements to double-check which side of a conflicting hunk to choose.

          So it's more than mere aesthetics: it's functional, providing visual anchors when comparing the three panes of a diff (a highlighted digit can become the reference spot to start reading from).



          • #6
            Hello,

            Ok, thanks for confirming the version, since this was originally an old BC2 thread in the BC2 subforum.

            And I see the problem now: we do need to define both the Keyword and the Identifier in the grammar element list (similar to how our built-in Delphi, C#, and other formats define them as a pair). Make the Keyword (list) element the topmost in the Grammar tab's element list, but then also define an Identifier element directly below it, to help avoid the issue you are running into. Your regular expression is close, but a bit complex. Does one like this work for your files? Identifier, Type: Basic, Regular Expression:
            Code:
            [_a-z]\w*
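
One possible reading of how such a Keyword/Identifier pair could interact is a longest-match scanner in which ties go to the element listed first. The Python sketch below models that reading only; it is an assumption, not a description of BC4's actual matching algorithm, and the keyword list, `RULES` table, and `classify` function are hypothetical.

```python
import re

KEYWORDS = ["in", "or"]  # hypothetical keyword list

# Priority order: keyword list on top, identifier element directly below it.
RULES = [
    ("keyword", re.compile("|".join(re.escape(k) for k in KEYWORDS))),
    ("identifier", re.compile(r"[_a-z]\w*")),
]

def classify(text):
    """Longest match wins at each position; ties go to the earlier rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Replace the current best only if this match is strictly longer.
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        if best is None:
            pos += 1  # nothing matches here; skip the character
            continue
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens

print(classify("origin or in"))
# [('identifier', 'origin'), ('keyword', 'or'), ('keyword', 'in')]
```

Under this model "origin" is consumed whole by the identifier rule, so the embedded "or" and "in" are never offered to the keyword list, while a standalone "or" ties on length and stays a keyword.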
            Aaron P Scooter Software

