The alternation operator (|) is commutative in the POSIX spec, which
says [1]:
> If the pattern permits a variable number of matching characters and
> thus there is more than one such sequence starting at that point,
> the longest such sequence is matched.
Testing with a few engines, here are some that match POSIX:
$ echo abc | grep -Eo '|a|ab'
ab
$ python -c 'import regex; print(regex.match("|a|ab", "abc", regex.POSIX).group(0))'
ab # Background in https://bitbucket.org/mrabarnett/mrab-regex/issues/150
$ psql -Atc "select substring('abc' from '|a|ab')"
ab
The PostgreSQL docs are somewhat complicated, but a regular expression
consisting of two or more branches connected by the | operator is
always greedy [2].
Here are some engines that prefer the left-most branch (each command
prints an empty string, because the empty first branch matches):
$ python -c 'import re; print(re.match("|a|ab", "abc").group(0))'
$ python -c 'import regex; print(regex.match("|a|ab", "abc").group(0))'
$ node -e 'console.log("abc".match(/|a|ab/)[0])' # [3]
$ ruby -e 'print "abc".match(/|a|ab/); print "\n"' # [4]
Go's stdlib provides both left-most-branch [5] and longest-match [6]
implementations.
So the old order was compliant with POSIX EREs (as referenced in
schema/ListedLicense.xsd), but this commit sorts the branches for
longest-match-first for compatibility with engines that break POSIX
and prefer the left-most branch. This only really matters when a
shorter branch is a prefix of a longer one (e.g. 'a' is a prefix of
'ab', so we want 'ab|a' and not 'a|ab'). It doesn't matter when
neither branch is a prefix of the other (e.g. 'a|bc' and 'bc|a' are
equivalent regardless of regexp engine). But in this commit I've
ordered by decreasing match length regardless of prefix relationships,
because that requires less thinking ;).
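As a rough illustration of the ordering rule, here is a minimal Python
sketch. It assumes the branches are plain literal strings; the helper
names order_branches and build_pattern are hypothetical and are not
part of this commit or of the schema tooling.

```python
import re


def order_branches(branches):
    """Sort literal branches longest-first, breaking ties alphabetically."""
    return sorted(branches, key=lambda b: (-len(b), b))


def build_pattern(branches):
    """Join the sorted branches with | (hypothetical helper).

    re.escape is used on the assumption that every branch is a literal.
    """
    return "|".join(re.escape(b) for b in order_branches(branches))


branches = ["", "a", "ab"]

# Original order: a left-most-branch engine stops at the empty branch.
unsorted_pattern = "|".join(branches)
left_most = re.match(unsorted_pattern, "abc").group(0)

# Longest-first order: the same engine now finds the longest match.
sorted_pattern = build_pattern(branches)
longest = re.match(sorted_pattern, "abc").group(0)

print(repr(unsorted_pattern), "->", repr(left_most))
print(repr(sorted_pattern), "->", repr(longest))
```

With Python's re module, which prefers the left-most branch, the
unsorted pattern matches the empty string while the sorted pattern
matches 'ab', the same result a POSIX engine gives for either order.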
[1]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_01_02
[2]: https://www.postgresql.org/docs/10/static/functions-matching.html#POSIX-MATCHING-RULES
[3]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#special-or
[4]: https://ruby-doc.org/core-2.5.0/Regexp.html#class-Regexp-label-Alternation
Although these docs don't have much to say on longest-match
vs. left-most branch.
[5]: https://golang.org/pkg/regexp/#Compile
[6]: https://golang.org/pkg/regexp/#CompilePOSIX