Allow an option for \b regexp regression in JDK >= 19 #24

arysin · 2024-10-26T17:24:26Z

JDK 19 changed the default behavior of \b in regular expressions to be consistent with \w, so without Pattern.UNICODE_CHARACTER_CLASS it no longer treats non-ASCII letters as word characters.
The problem is that existing Java code whose regular expressions relied on \b for Unicode characters is pretty much broken on JDK >= 19. See languagetool-org/languagetool#9854.
The LanguageTool project mentioned above uses the segment project and has tons of \b in its segmentation rules for 25 languages.
The fix would be to add (?U) to each regexp that uses \b.
It would be nice to have an option in the net.loomchild.segment project to specify Pattern.UNICODE_CHARACTER_CLASS when compiling rules from segment.srx, so there's no need to adjust each rule separately. (We may not be able to turn on Pattern.UNICODE_CHARACTER_CLASS by default, as it may cause regressions for other projects, even though the probability is low.)
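
For illustration, here is a minimal sketch of the behavior described above. The class name, helper method, and sample text are made up for this example and are not taken from LanguageTool or segment. It compiles the same \b pattern three ways: with default flags, with the embedded (?U) flag, and with Pattern.UNICODE_CHARACTER_CLASS passed at compile time. The default-flags result is expected to differ between JDK versions, so the code prints it without assuming a value.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Returns true if the regex is found anywhere in the text,
    // using the given Pattern flags at compile time.
    static boolean found(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    // Sample text with Cyrillic words only (written as Unicode escapes so the
    // source compiles the same regardless of file encoding).
    static final String TEXT =
        "\u0446\u0435 \u0441\u043b\u043e\u0432\u043e \u0442\u0443\u0442";

    // A Cyrillic word from the sample text, wrapped in word boundaries.
    static final String REGEX = "\\b\u0441\u043b\u043e\u0432\u043e\\b";

    public static void main(String[] args) {
        // Default flags: the result depends on the JDK version in use.
        System.out.println("default flags: " + found(REGEX, TEXT, 0));

        // Embedded flag, as in the per-rule (?U) fix.
        System.out.println("(?U) prefix:   " + found("(?U)" + REGEX, TEXT, 0));

        // Compile-time flag, as in the requested option.
        System.out.println("compile flag:  "
            + found(REGEX, TEXT, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

The last two variants are equivalent ways of enabling the same flag. The difference is only where it is set: inside each rule, or once at the place where rules are compiled.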

loomchild · 2024-10-26T18:31:19Z

Hi, thanks for reporting this issue.

I see that LanguageTool also uses regular expressions for grammar rules. I think it would be more consistent to modify all instances of \b everywhere, or enable the option globally in LanguageTool.

If you think that adding an option to Segment makes sense as a long-term solution and not just as a temporary workaround, I'd be happy to add it.

arysin · 2024-10-26T20:00:36Z

Yes. For the grammar rules, we compile the regexps in LanguageTool code, so we were able to add the regexp option there and have it apply to all rules. There were also some changes to language-specific word tokenizer classes (in that branch). For the sentence tokenization rules, though, we don't have access to the regexp compilation, so for now you can see the commit contains tons of (?U) in segment.srx.
If we had an option in the segment module, it would take changes in only 2-3 places in the core and a few in some language tokenizers.
I am not sure whether the new option should be specific to this fix or more general, e.g. setRegexpFlags(). In either case it'd help LanguageTool make this transition smoother.

loomchild · 2024-10-27T16:47:40Z

OK, I understand, thanks for the info. I'll add the option to Segment.

loomchild · 2024-10-27T18:47:43Z

I have published a fix. It'll be present in segment 2.0.4 (should be available in Maven Central shortly).

You need to add a parameter when creating SrxTextIterator (I think it's in `languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java` in LanguageTool):

parameterMap.put(SrxTextIterator.DEFAULT_PATTERN_FLAGS_PARAMETER, Pattern.UNICODE_CHARACTER_CLASS);

Please let me know if it works.
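
To show the general idea of passing default pattern flags through a parameter map, here is a standalone sketch that uses only the JDK. It is not Segment's implementation: the class, the `FLAGS_KEY` name, and the `compileRule` helper are hypothetical stand-ins for whatever the library does internally when it reads the parameter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class PatternFlagsSketch {
    // Hypothetical key for this sketch only; not the value of any Segment constant.
    static final String FLAGS_KEY = "defaultPatternFlags";

    // Compiles a rule with the flags stored in the parameter map.
    // Falls back to 0 (no flags) when the key is missing or not an Integer.
    static Pattern compileRule(String regex, Map<String, Object> parameterMap) {
        Object value = parameterMap.get(FLAGS_KEY);
        int flags = (value instanceof Integer) ? (Integer) value : 0;
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        Map<String, Object> parameterMap = new HashMap<>();
        parameterMap.put(FLAGS_KEY, Pattern.UNICODE_CHARACTER_CLASS);

        // Cyrillic word between word boundaries, written as Unicode escapes.
        Pattern rule = compileRule("\\b\u0441\u043b\u043e\u0432\u043e\\b", parameterMap);

        // The flag is set once by the caller, so the rule itself needs no (?U).
        String text = "\u0446\u0435 \u0441\u043b\u043e\u0432\u043e \u0442\u0443\u0442";
        System.out.println(rule.matcher(text).find());
    }
}
```

With this shape, callers that pass no flags keep the previous behavior, which matches the concern about not enabling the flag by default.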

arysin · 2024-10-27T20:09:47Z

Thanks a bunch!! I've just pulled the new version and tried the new flag, and it worked beautifully.
Much appreciated!!

arysin closed this as completed Oct 27, 2024
