Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: Order match regular expression alternates from longest to shortest #629

Merged
merged 1 commit into from
Apr 5, 2018

Conversation

wking
Copy link
Contributor

@wking wking commented Mar 25, 2018

The alternation operator (|) is commutative in the POSIX spec, which has:

If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched.

Testing with a few engines, here are some that match POSIX:

$ echo abc | grep -Eo '|a|ab'
ab
$ python -c 'import regex; print(regex.match("|a|ab", "abc", regex.POSIX).group(0))'
ab  # Background in https://bitbucket.org/mrabarnett/mrab-regex/issues/150
$ psql -Atc "select substring('abc' from '|a|ab')"
ab

The docs for PostgreSQL are somewhat complicated, but for two or more branches connected by the | operator they are always greedy.

Here are some engines that prefer the left-most branch:

$ python -c 'import re; print(re.match("|a|ab", "abc").group(0))'
$ python -c 'import regex; print(regex.match("|a|ab", "abc").group(0))'
$ node -e 'console.log("abc".match(/|a|ab/)[0])'  # https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#special-or
$ ruby -e 'print "abc".match(/|a|ab/); print "\n"'  # https://ruby-doc.org/core-2.5.0/Regexp.html#class-Regexp-label-Alternation

Go's stdlib provides both left-most-branch and longest-match implementations.

So the old order was compliant with POSIX EREs (as referenced in ListedLicense.xsd), but this commit sorts the branches for longest-match-first for compatibility with engines that break POSIX and prefer the left-most branch. This only really matters when a longer match includes a shorter match (e.g. ab includes a, so we want ab|a and not a|ab). It doesn't matter when the longer match does not include the shorter match (e.g. a|bc and bc|a are equivalent regardless of regexp engine). But in this commit I've ordered by decreasing match length regardless of inclusiveness, because that requires less thinking ;).

This commit sorts everything in master turned up by git grep 'match="[^"]*|'.

My personal preference is to have tests that exercise all of our intended matches (spdx/license-test-files#3), but our current CI only excercises one.

Spun off from here.

The alternation operator (|) is commutative in the POSIX spec, which
has [1]:

> If the pattern permits a variable number of matching characters and
> thus there is more than one such sequence starting at that point,
> the longest such sequence is matched.

Testing with a few engines, here are some that match POSIX:

  $ echo abc | grep -Eo '|a|ab'
  ab
  $ python -c 'import regex; print(regex.match("|a|ab", "abc", regex.POSIX).group(0))'
  ab  # Background in https://bitbucket.org/mrabarnett/mrab-regex/issues/150
  $ psql -Atc "select substring('abc' from '|a|ab')"
  ab

The docs for PostgreSQL are somewhat complicated, but for two or more
branches connected by the | operator they are always greedy [2].

Here are some engines that prefer the left-most branch:

  $ python -c 'import re; print(re.match("|a|ab", "abc").group(0))'
  $ python -c 'import regex; print(regex.match("|a|ab", "abc").group(0))'
  $ node -e 'console.log("abc".match(/|a|ab/)[0])'  # [3]
  $ ruby -e 'print "abc".match(/|a|ab/); print "\n"'  # [4]

Go's stdlib provides both left-most-branch [5] and longest-match [6]
implementations.

So the old order was compliant with POSIX EREs (as referenced in
schema/ListedLicense.xsd), but this commit sorts the branches for
longest-match-first for compatibility with engines that break POSIX
and prefer the left-most branch.  This only really matters when a
longer match includes a shorter match (e.g. 'ab' includes 'a', so we
want 'ab|a' and not 'a|ab').  It doesn't matter when the longer match
does not include the shorter match (e.g. 'a|bc' and 'bc|a' are
equivalent regardless of regexp engine).  But in this commit I've
ordered by decreasing match length regardless of inclusiveness,
because that requires less thinking ;).

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_01_02
[2]: https://www.postgresql.org/docs/10/static/functions-matching.html#POSIX-MATCHING-RULES
[3]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#special-or
[4]: https://ruby-doc.org/core-2.5.0/Regexp.html#class-Regexp-label-Alternation
     Although these docs don't have much to say on longest-match
     vs. left-most branch.
[5]: https://golang.org/pkg/regexp/#Compile
[6]: https://golang.org/pkg/regexp/#CompilePOSIX
@jlovejoy jlovejoy added this to the 3.1 release milestone Apr 5, 2018
@jlovejoy jlovejoy merged commit 80b781b into spdx:master Apr 5, 2018
@wking wking deleted the left-most-branch-regexps branch April 5, 2018 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants