Allow an option for \b regexp regression in JDK >= 19 #24

Closed
arysin opened this issue Oct 26, 2024 · 5 comments

Comments

@arysin

arysin commented Oct 26, 2024

JDK 19 changed the default behavior of \b in regular expressions: to be consistent with \w, it now treats only ASCII characters as word characters unless Unicode character classes are enabled.
The problem is that existing Java code whose regular expressions rely on \b for Unicode characters is pretty much broken on JDK >= 19. See languagetool-org/languagetool#9854.
The LanguageTool project mentioned above uses the segment project and has tons of \b in its segmentation rules for 25 languages.
The fix would be to add (?U) to each regexp that uses \b.
It would be nice to have an option in the net.loomchild.segment project to specify Pattern.UNICODE_CHARACTER_CLASS when compiling the rules from segment.srx, so there's no need to adjust each rule separately. (We may not be able to turn on Pattern.UNICODE_CHARACTER_CLASS by default, as it may cause regressions for other projects, even though the probability is low.)
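
For readers who want to see the difference, here is a minimal sketch using only java.util.regex. The class name WordBoundaryDemo and the sample text are illustrative assumptions, not code from LanguageTool or segment. It compiles the same \b rule three ways: with default flags, with an inline (?U) prefix, and with Pattern.UNICODE_CHARACTER_CLASS passed at compile time. The result of the default-flags call is expected to depend on the JDK version, so no output is stated for it.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Ukrainian sample "kit spyt" ("the cat sleeps") in Cyrillic, written with
    // Unicode escapes so the source file stays ASCII-only.
    static final String TEXT = "\u043A\u0456\u0442 \u0441\u043F\u0438\u0442\u044C";

    // A rule of the form \b<word>\b, where <word> is the first word of TEXT.
    static final String RULE = "\\b\u043A\u0456\u0442\\b";

    // Compiles the regex with the given flags and reports whether it matches anywhere.
    static boolean finds(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        // Default flags: the outcome depends on which JDK runs this.
        System.out.println("default flags: " + finds(RULE, 0, TEXT));

        // Per-rule fix: prefix the rule with the inline (?U) flag.
        System.out.println("inline (?U):   " + finds("(?U)" + RULE, 0, TEXT));

        // Global fix: pass the flag once at the place where rules are compiled.
        System.out.println("compile flag:  "
                + finds(RULE, Pattern.UNICODE_CHARACTER_CLASS, TEXT));
    }
}
```

The last two calls enable the same behavior. The difference is only where the flag lives: in every rule, or at the single compile site.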

@loomchild
Owner

Hi, thanks for reporting this issue.

I see that LanguageTool also uses regular expressions for grammar rules. I think it would be more consistent to modify all instances of \b everywhere, or enable the option globally in LanguageTool.

If you think that adding an option to Segment makes sense as a long-term solution and not just as a temporary workaround, I'd be happy to add it.

@arysin
Author

arysin commented Oct 26, 2024

Yes. For the grammar rules, we compile the regexps in LanguageTool code, so we were able to add the regexp option there and have it apply to all rules; there were also some changes to language-specific word tokenizer classes (in that branch). For the sentence tokenization rules, though, we don't have access to the regexp compilation, so for now the commit contains tons of (?U) in segment.srx.
If we had an option from the segment module, the change would be only 2-3 places in the core plus a few changes in some language tokenizers.
I am not sure whether the new option should be specific to this fix or more general, e.g. setRegexpFlags(). In either case it'd help LanguageTool make this transition smoother.
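
To illustrate the more general variant, here is a rough sketch of a rule compiler that applies one set of default flags to every rule. The class DefaultFlagsRuleCompiler and its compileAll method are hypothetical names made up for this example; they are not part of segment or LanguageTool, and the real option may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: not an API of segment or LanguageTool.
public class DefaultFlagsRuleCompiler {
    private final int defaultFlags;

    public DefaultFlagsRuleCompiler(int defaultFlags) {
        this.defaultFlags = defaultFlags;
    }

    // Compiles every rule with the same default flags, so individual rules
    // do not need a (?U) prefix.
    public List<Pattern> compileAll(List<String> rules) {
        List<Pattern> compiled = new ArrayList<>();
        for (String rule : rules) {
            compiled.add(Pattern.compile(rule, defaultFlags));
        }
        return compiled;
    }

    public static void main(String[] args) {
        DefaultFlagsRuleCompiler compiler =
                new DefaultFlagsRuleCompiler(Pattern.UNICODE_CHARACTER_CLASS);

        // Two sample rules; the first targets a Cyrillic word via Unicode escapes.
        List<Pattern> patterns = compiler.compileAll(
                Arrays.asList("\\b\u0441\u043F\u0438\u0442\u044C\\b", "\\w+"));

        String text = "\u043A\u0456\u0442 \u0441\u043F\u0438\u0442\u044C";
        for (Pattern pattern : patterns) {
            System.out.println(pattern.pattern() + " -> " + pattern.matcher(text).find());
        }
    }
}
```

A general flags option like this keeps the rule files unchanged and leaves the choice of flags to the caller, at the cost of a slightly broader API than a single Unicode switch.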

@loomchild
Owner

OK, I understand, thanks for the info. I'll add the option to Segment.

@loomchild
Owner

I have published a fix. It'll be present in segment 2.0.4 (should be available in Maven Central shortly).

You need to add a parameter when creating SrxTextIterator (I think it's in languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java in LanguageTool):

parameterMap.put(SrxTextIterator.DEFAULT_PATTERN_FLAGS_PARAMETER, Pattern.UNICODE_CHARACTER_CLASS);

Please let me know if it works.

@arysin
Author

arysin commented Oct 27, 2024

Thanks a bunch!! I've just pulled the new version and tried the new flag, and it worked beautifully.
Much appreciated!!

arysin closed this as completed Oct 27, 2024