*: Order match regular expression alternates from longest to shortest #629
The alternation operator (`|`) is commutative in the POSIX spec, which has:

Testing with a few engines, here are some that match POSIX:

The docs for PostgreSQL are somewhat complicated, but for two or more branches connected by the `|` operator they are always greedy.

Here are some engines that prefer the left-most branch:
Go's stdlib provides both left-most-branch and longest-match implementations.
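
As a minimal sketch of that difference, the following compares Go's default `regexp.MustCompile` (left-most branch) with `regexp.MustCompilePOSIX` (left-most longest) on the `a|ab` and `ab|a` patterns. The helper names `leftmostBranchMatch` and `longestMatch` are made up for this illustration and are not part of this repository.

```go
package main

import (
	"fmt"
	"regexp"
)

// leftmostBranchMatch (hypothetical helper) matches with Go's default
// Perl-like semantics, which prefer the left-most alternative.
func leftmostBranchMatch(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

// longestMatch (hypothetical helper) matches with POSIX ERE semantics,
// which prefer the left-most longest match.
func longestMatch(pattern, input string) string {
	return regexp.MustCompilePOSIX(pattern).FindString(input)
}

func main() {
	for _, pattern := range []string{"a|ab", "ab|a"} {
		fmt.Printf("%-5s left-most-branch=%q longest=%q\n",
			pattern,
			leftmostBranchMatch(pattern, "ab"),
			longestMatch(pattern, "ab"))
	}
}
```

With the input `ab`, only the left-most-branch engine given `a|ab` stops at `a`; the other three combinations match `ab`, which is why longest-first ordering gives the same result in both kinds of engine.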
So the old order was compliant with POSIX EREs (as referenced in `ListedLicense.xsd`), but this commit sorts the branches longest-match-first for compatibility with engines that break POSIX and prefer the left-most branch. This only really matters when a longer match includes a shorter match (e.g. `ab` includes `a`, so we want `ab|a` and not `a|ab`). It doesn't matter when the longer match does not include the shorter match (e.g. `a|bc` and `bc|a` are equivalent regardless of regexp engine). But in this commit I've ordered by decreasing match length regardless of inclusiveness, because that requires less thinking ;).

This commit sorts everything in master turned up by `git grep 'match="[^"]*|'`.

My personal preference is to have tests that exercise all of our intended matches (spdx/license-test-files#3), but our current CI only exercises one.
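
The longest-first ordering described above could be automated roughly as follows. This is a sketch only, not the procedure used for this commit: `sortAlternates` is a hypothetical name, and it assumes a flat alternation of literal branches (no groups, quantifiers, or escaped `|`), so that the length of a branch's text equals the length of what it matches.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortAlternates (hypothetical helper) reorders the branches of a flat
// alternation so that longer branches come first. Branches of equal
// length keep their original relative order.
func sortAlternates(pattern string) string {
	branches := strings.Split(pattern, "|")
	sort.SliceStable(branches, func(i, j int) bool {
		return len(branches[i]) > len(branches[j])
	})
	return strings.Join(branches, "|")
}

func main() {
	fmt.Println(sortAlternates("a|ab|abc")) // prints "abc|ab|a"
}
```

Sorting by length regardless of whether one branch includes another mirrors the "less thinking" choice made in this commit.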
Spun off from here.