Commit
* This update decouples the creation of is_required_phrase rules from the update of existing rules, which now lives in a separate CLI. This makes it easier to control which rules are used as required phrases.

* This now skips more rules when adding required phrases to existing rules: any rule that cannot be matched approximately is skipped, which covers not only tiny rules but also many other rules.

* This checks that no rule gets a required phrase added that would break in the middle of a URL, email, or copyright. This is done by checking that no required phrase injection changes the set of ignorables of a rule; such a change could, for example, break a URL so that it is no longer a proper URL. The same applies to emails and copyrights.

* This extends "skipping" the collection of required phrases so that a skipped rule is excluded from both required phrases collection for generating new rules AND injection of new required phrases in rule text. This allows exceptions to be handled more easily.

* The "is_required_phrase" rules creation now creates rules using improved content: the case and punctuation of the phrase text are preserved; the rule is created as "is_license_reference", which is going to be correct in the vast majority of cases.

* When matched, the "is_required_phrase" rules are treated the same as continuous rules and can only be matched exactly.

* The "is_required_phrase" rules are now validated extensively to ensure that there is no conflict with other rule flags.

* The code to "trace" the source of a required phrase injection now uses the new standard "source" rule field, and the code related to handling this field has been simplified.

* Required phrases injection has not yet been tested as working.

Signed-off-by: Philippe Ombredanne <[email protected]>
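
For illustration only, the ignorables check described above could look roughly like the sketch below. It is not the code from this commit. It assumes that required phrases are marked with `{{` and `}}` in rule text, and the function names (`collect_ignorables`, `inject_required_phrase`, `is_safe_injection`) and the simplified URL and email patterns are hypothetical; copyright detection is left out.

```python
import re

# Simplified, hypothetical patterns: real URL and email detection is
# far more thorough than this, and copyrights are not handled at all.
URL_RE = re.compile(r"https?://[^\s{}]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def collect_ignorables(text):
    """Return the set of URLs and emails found in text."""
    return set(URL_RE.findall(text)) | set(EMAIL_RE.findall(text))


def inject_required_phrase(text, start, end):
    """Return text with text[start:end] wrapped in {{ }} markers."""
    return text[:start] + "{{" + text[start:end] + "}}" + text[end:]


def is_safe_injection(text, start, end):
    """Return True if marking text[start:end] leaves the ignorables unchanged."""
    injected = inject_required_phrase(text, start, end)
    return collect_ignorables(text) == collect_ignorables(injected)


rule_text = (
    "Licensed under the MIT License, "
    "see https://example.com/mit-license for details."
)

# Marking the license name does not touch the URL.
name_start = rule_text.index("MIT License")
print(is_safe_injection(rule_text, name_start, name_start + len("MIT License")))

# Marking a span inside the URL changes what is detected as a URL.
url_part = rule_text.index("mit-license")
print(is_safe_injection(rule_text, url_part, url_part + len("mit-license")))
```

Comparing the ignorables before and after the injection, instead of inspecting the phrase position directly, means the same test covers any kind of ignorable for which a detector exists.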