Allow an option for \b regexp regression in JDK >= 19 #24
Comments
Hi, thanks for reporting this issue. I see that LanguageTool also uses regular expressions for grammar rules. I think it would be more consistent to modify all instances of If you think that adding an option to Segment makes sense as a long-term solution and not just as a temporary workaround, I'd be happy to add it.
Yes. For the grammar rules, we compile the patterns in LanguageTool code, so we were able to add the regexp option there and have it apply to all rules. There were also some changes to the language-specific word tokenizer classes (in that branch). For the sentence tokenization rules, however, we don't have access to the regexp compilation, so for now the commit contains tons of (?U) in the segment.srx.
OK, I understand, thanks for the info. I'll add the option to Segment.
I have published a fix. It'll be present in segment You need to add a parameter when creating SrxTextIterator (I think it's in `languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java` in LanguageTool):
Please let me know if it works.
Thanks a bunch!! I've just pulled the new version and tried the new flag, and it worked beautifully.
JDK 19 changed the default behavior of \b in regular expressions (to be consistent with \w).
The problem is that existing Java code whose regular expressions rely on \b working with Unicode characters is pretty much broken on JDK >= 19. See languagetool-org/languagetool#9854.
The LanguageTool project mentioned above uses the segment project and has tons of \b in its segmentation rules for 25 languages.
The fix would be to add (?U) to each regexp that uses \b.
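To illustrate the per-regexp fix, here is a minimal sketch using only `java.util.regex`. The class name `WordBoundaryDemo`, the helper `finds`, and the sample text are made up for this example and are not part of segment or LanguageTool. It compiles the same \b rule three ways: with default flags, with an inline (?U) prefix, and with `Pattern.UNICODE_CHARACTER_CLASS` passed as a flag.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {

    // Returns true if the regex, compiled with the given flags, matches anywhere in the text.
    static boolean finds(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        // Sample French text "un (e-acute)t(e-acute) chaud", written with Unicode escapes
        // so the source file does not depend on the compiler's default encoding.
        String text = "un \u00e9t\u00e9 chaud";
        String rule = "\\b\u00e9t\u00e9\\b";

        // Default flags: the result depends on the JDK version, which is the issue reported here.
        System.out.println("default flags: " + finds(rule, 0, text));

        // Inline (?U) prefix: enables Unicode character classes for this one regexp.
        System.out.println("inline (?U):   " + finds("(?U)" + rule, 0, text));

        // Same effect through the compile flag, without touching the rule text.
        System.out.println("compile flag:  " + finds(rule, Pattern.UNICODE_CHARACTER_CLASS, text));
    }
}
```

The last two calls should agree with each other on any JDK; only the first one is expected to differ between JDK versions.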
It would be nice to have an option in the net.loomchild.segment project to specify Pattern.UNICODE_CHARACTER_CLASS when compiling rules from segment.srx, so there's no need to adjust each rule separately. (We may not be able to turn on Pattern.UNICODE_CHARACTER_CLASS by default, as it may cause regressions for other projects, even though the probability is low.)
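Roughly, the kind of option I have in mind could look like the sketch below. This is purely hypothetical: `RuleCompiler`, `defaultFlags`, and `compileAll` are names invented for illustration and are not segment's actual API. The idea is just that the caller supplies the pattern flags once and every rule is compiled with them, while passing 0 keeps today's behavior.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of net.loomchild.segment.
public class RuleCompiler {

    // Flags applied to every rule; 0 means "plain Pattern.compile(regex)" behavior.
    private final int defaultFlags;

    public RuleCompiler(int defaultFlags) {
        this.defaultFlags = defaultFlags;
    }

    // Compiles each rule string with the caller-supplied flags.
    public List<Pattern> compileAll(List<String> rules) {
        return rules.stream()
                .map(rule -> Pattern.compile(rule, defaultFlags))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A project that needs Unicode-aware \b opts in explicitly.
        RuleCompiler compiler = new RuleCompiler(Pattern.UNICODE_CHARACTER_CLASS);
        List<Pattern> patterns = compiler.compileAll(List.of("\\b\u00e9t\u00e9\\b"));
        System.out.println(patterns.get(0).matcher("un \u00e9t\u00e9 chaud").find());
    }
}
```

Making it opt-in like this would leave other projects that use segment unaffected unless they pass the flag themselves.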