Update required phrase generation
* This update decouples the creation of is_required_phrase rules from
  updating existing rules by moving it to a separate CLI. This makes it
  easier to control which rules are used as required phrases.

* This now skips more rules when adding required phrases to existing
  rules: any rule that cannot be matched approximately is skipped, which
  covers not only tiny rules but also many other rules.

* This checks that no rule gets a required phrase added that would
  break in the middle of a URL, email, or copyright. This is done by
  checking that no required phrase injection changes the set of
  ignorables of a rule, since such a change could break a URL and make
  it no longer a proper URL. The same applies to emails and copyrights.

* This extends "skipping" the collection of required phrases: a skipped
  rule is now excluded from both the collection of required phrases for
  generating new rules AND the injection of new required phrases in rule
  text. This allows exceptions to be handled more easily.

* The "is_required_phrase" rules creation now creates rules using
  improved content: the case and punctuation of the phrase text are
  preserved; the rule is created as "is_license_reference" which is
  going to be correct in the vast majority of the cases.

* When matched, the "is_required_phrase" rules are treated the same
  as continuous rules and can only be matched exactly.

* The "is_required_phrase" rules are now validated extensively to
  ensure that there is no conflict with other rule flags.

* The code to "trace" the source of a required phrase injection now uses
  the new standard "source" rule field, and the code related to handling
  this field has been simplified.

* Required phrases injection has not yet been tested as working.

Signed-off-by: Philippe Ombredanne <[email protected]>
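
The ignorables check described in the commit message can be pictured with a minimal sketch. This is not the code from this commit: the regular expressions, the helper names (collect_ignorables, inject_phrase, is_safe_injection) and the use of `{{...}}` as required phrase markers are assumptions made for illustration, and copyrights are left out. The idea is only to compare the set of ignorables before and after an injection and reject the injection if the set changed.

```python
import re

# Deliberately simple stand-in patterns (assumption): the real check relies
# on ScanCode's own detection of URLs, emails and copyrights.
URL_RE = re.compile(r"https?://[^\s{}]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def collect_ignorables(text):
    """Return the set of URLs and emails found in ``text``."""
    return set(URL_RE.findall(text)) | set(EMAIL_RE.findall(text))


def inject_phrase(text, start, end):
    """Return ``text`` with text[start:end] wrapped in {{ }} markers."""
    return text[:start] + "{{" + text[start:end] + "}}" + text[end:]


def is_safe_injection(text, start, end):
    """Return True if wrapping text[start:end] leaves the ignorables unchanged."""
    before = collect_ignorables(text)
    after = collect_ignorables(inject_phrase(text, start, end))
    return before == after


text = "Licensed under the MIT license, see https://example.com/mit for details."

# Wrapping plain prose leaves the URL intact.
start = text.index("MIT license")
print(is_safe_injection(text, start, start + len("MIT license")))

# Wrapping part of the URL changes the detected ignorables.
start = text.index("example")
print(is_safe_injection(text, start, start + len("example")))
```

In this sketch the first call reports a safe injection and the second an unsafe one, because the markers land inside the URL and the URL is no longer detected as before.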
pombredanne committed Oct 8, 2024
1 parent 8e16712 commit 1bcf3fc
Showing 4 changed files with 807 additions and 705 deletions.
1 change: 1 addition & 0 deletions setup.cfg
@@ -159,6 +159,7 @@ console_scripts =
     scancode-license-data = licensedcode.license_db:dump_scancode_license_data
     regen-package-docs = packagedcode.regen_package_docs:regen_package_docs
     add-required-phrases = licensedcode.required_phrases:add_required_phrases
+    gen-new-required-phrases-rules = licensedcode.required_phrases:gen_required_phrases_rules
 
 # These are configurations for ScanCode plugins as setuptools entry points.
 # Each plugin entry hast this form:
10 changes: 6 additions & 4 deletions src/licensedcode/match.py
@@ -2129,12 +2129,14 @@ def filter_matches_missing_required_phrases(
     A required phrase must be matched exactly without gaps or unknown words.
     A rule with "is_continuous" set to True is the same as if its whole text
-    was defined as a keyphrase and is processed here too.
+    was defined as a required phrase and is processed here too.
+    Same for a rule with "is_required_phrase" set to True.
     """
-    # never discard a solo match, unless matched to "is_continuous" rule
+    # never discard a solo match, unless matched to "is_continuous" or "is_required_phrase" rule
     if len(matches) == 1:
         rule = matches[0]
-        if not rule.is_continuous:
+        if not (rule.is_continuous or rule.is_required_phrase):
             return matches, []
 
     kept = []
@@ -2149,7 +2151,7 @@ def filter_matches_missing_required_phrases(
         if trace:
             logger_debug(' CHECKING KEY PHRASES for:', match)
 
-        is_continuous = match.rule.is_continuous
+        is_continuous = match.rule.is_continuous or match.rule.is_required_phrase
         ikey_spans = match.rule.required_phrase_spans
 
         if not (ikey_spans or is_continuous):
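
The effect of this change can be sketched in isolation. The following is a simplified model, not the code in match.py: the Rule and Match classes, the length and matched_positions fields, and the keep_match helper are hypothetical stand-ins. It only illustrates the decision this diff changes: a rule flagged "is_required_phrase" is treated like an "is_continuous" rule, so its whole text acts as a single required phrase. The real filter also rejects gaps and unknown words on the query side, which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    # Hypothetical, minimal stand-in for a license rule.
    length: int  # number of tokens in the rule text
    is_continuous: bool = False
    is_required_phrase: bool = False
    # Token position ranges that must be matched, e.g. [range(1, 3)].
    required_phrase_spans: list = field(default_factory=list)


@dataclass
class Match:
    # Hypothetical, minimal stand-in for a license match.
    rule: Rule
    matched_positions: set  # rule token positions that were matched


def keep_match(match):
    """Return True if every required position of the rule was matched."""
    rule = match.rule
    if rule.is_continuous or rule.is_required_phrase:
        # The whole rule text behaves as one required phrase.
        required_spans = [range(rule.length)]
    else:
        required_spans = rule.required_phrase_spans
    return all(set(span) <= match.matched_positions for span in required_spans)


phrase_rule = Rule(length=3, is_required_phrase=True)
print(keep_match(Match(phrase_rule, {0, 1, 2})))  # every token matched
print(keep_match(Match(phrase_rule, {0, 2})))     # one token missing

plain_rule = Rule(length=5, required_phrase_spans=[range(1, 3)])
print(keep_match(Match(plain_rule, {1, 2, 3})))   # required span covered
```

Under these assumptions, a partial match to an "is_required_phrase" rule is discarded even when it is the only match, which mirrors the updated solo-match guard above.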