Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: Order match regular expression alternates from longest to shortest #629

Merged
merged 1 commit into from
Apr 5, 2018

Commits on Mar 25, 2018

  1. *: Order match regular expression alternates from longest to shortest

    The alternation operator (|) is commutative in the POSIX spec, which
    has [1]:
    
    > If the pattern permits a variable number of matching characters and
    > thus there is more than one such sequence starting at that point,
    > the longest such sequence is matched.
    
    Testing with a few engines, here are some that match POSIX:
    
      $ echo abc | grep -Eo '|a|ab'
      ab
      $ python -c 'import regex; print(regex.match("|a|ab", "abc", regex.POSIX).group(0))'
      ab  # Background in https://bitbucket.org/mrabarnett/mrab-regex/issues/150
      $ psql -Atc "select substring('abc' from '|a|ab')"
      ab
    
    The docs for PostgreSQL are somewhat complicated, but for two or more
    branches connected by the | operator they are always greedy [2].
    
    Here are some engines that prefer the left-most branch:
    
      $ python -c 'import re; print(re.match("|a|ab", "abc").group(0))'
      $ python -c 'import regex; print(regex.match("|a|ab", "abc").group(0))'
      $ node -e 'console.log("abc".match(/|a|ab/)[0])'  # [3]
      $ ruby -e 'print "abc".match(/|a|ab/); print "\n"'  # [4]
    
    Go's stdlib provides both left-most-branch [5] and longest-match [6]
    implementations.
    
    So the old order was compliant with POSIX EREs (as referenced in
    schema/ListedLicense.xsd), but this commit sorts the branches for
    longest-match-first for compatibility with engines that break POSIX
    and prefer the left-most branch.  This only really matters when a
    longer match includes a shorter match (e.g. 'ab' includes 'a', so we
    want 'ab|a' and not 'a|ab').  It doesn't matter when the longer match
    does not include the shorter match (e.g. 'a|bc' and 'bc|a' are
    equivalent regardless of regexp engine).  But in this commit I've
    ordered by decreasing match length regardless of inclusiveness,
    because that requires less thinking ;).
    
    [1]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_01_02
    [2]: https://www.postgresql.org/docs/10/static/functions-matching.html#POSIX-MATCHING-RULES
    [3]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#special-or
    [4]: https://ruby-doc.org/core-2.5.0/Regexp.html#class-Regexp-label-Alternation
         Although these docs don't have much to say on longest-match
         vs. left-most branch.
    [5]: https://golang.org/pkg/regexp/#Compile
    [6]: https://golang.org/pkg/regexp/#CompilePOSIX
    wking committed Mar 25, 2018
    Configuration menu
    Copy the full SHA
    f08ab22 View commit details
    Browse the repository at this point in the history