Commit

[en] fix \b regex for JDK>=19
arysin committed Oct 25, 2024
1 parent b474be4 commit 5654786
Showing 1 changed file with 1 addition and 1 deletion.
@@ -113239,7 +113239,7 @@ The accident victim died from her injuries.
</rule>
<rulegroup id="UNIT_SPACE" name="Missing space between numeric value and unit (e.g., 25 km)" type="typographical" tags="picky">
<rule>
<regexp case_sensitive='yes'>(?&lt;![A-Z\$€£¥฿฿=]-?[0-9\.]{0,5})((\b|\-)[0-9]{1,5}[0-9,.]{0,5}(€|¥|฿|฿|°C|°F|°De?|°R[éeøa]?|(Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|f|z|y)[ΩΩm]|[ΩΩ]|(Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y)?N|[kKMGTPEZY]i?B|[kmµnp]g|[Mk]t|kWh|GWa|MWd|MWh)(?!\w))</regexp>
<regexp case_sensitive='yes'>(?U)(?&lt;![A-Z\$€£¥฿฿=]-?[0-9\.]{0,5})((\b|\-)[0-9]{1,5}[0-9,.]{0,5}(€|¥|฿|฿|°C|°F|°De?|°R[éeøa]?|(Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|f|z|y)[ΩΩm]|[ΩΩ]|(Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y)?N|[kKMGTPEZY]i?B|[kmµnp]g|[Mk]t|kWh|GWa|MWd|MWh)(?!\w))</regexp>
<message>Insert a space between the numerical value and the unit symbol.</message>
<suggestion><match no="1" regexp_match="((\-)?[0-9]+[0-9,.]*{1,30})" regexp_replace="$1&nbsp;"/></suggestion>
<example correction="25&nbsp;°C">The temperature is <marker>25°C</marker>.</example>

2 comments on commit 5654786

Member

@jaumeortola jaumeortola commented on 5654786 Oct 25, 2024

@arysin Do we know the changes that are needed for JDK>=19?
Can you document it here? #9854

Contributor Author

@arysin arysin commented on 5654786 Oct 25, 2024

In general, wherever a regular expression uses \b and the text can contain non-ASCII characters, you need to compile it with Pattern.UNICODE_CHARACTER_CLASS or put (?U) in the regex.
Ideally this would be fixed for the rules in common code, but it looks like @danielnaber hit some issues with that last year.
For [en] only one rule failed, so I've added (?U) to make the tests pass.
For Ukrainian I had to adjust the sentence and word tokenizers.
Unfortunately I don't know which other languages fail, so it's hard for me to work on a common fix.
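
For reference, a minimal Java sketch of the two options mentioned above: the inline `(?U)` flag, which is all you can use when only the pattern string is editable (as in an XML rule), and `Pattern.UNICODE_CHARACTER_CLASS` passed to `Pattern.compile`. The class, field, and method names are hypothetical and not part of LanguageTool. As far as I understand, on JDK 19 and later a plain `\b` follows the ASCII-only definition of `\w`, which is why the flag is needed. The sketch only exercises the Unicode-aware variants, whose behaviour should not depend on the JDK version.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not LanguageTool code.
public class UnicodeBoundaryDemo {

    // Test strings written as Unicode escapes so the source compiles under any file encoding.
    // GREETING is the Ukrainian "привіт світ" (two words), COMPOUND is "всесвіт" (one word).
    static final String GREETING =
            "\u043F\u0440\u0438\u0432\u0456\u0442 \u0441\u0432\u0456\u0442";
    static final String COMPOUND = "\u0432\u0441\u0435\u0441\u0432\u0456\u0442";

    // The word "світ" on its own.
    static final String WORD = "\u0441\u0432\u0456\u0442";

    // Option 1: inline flag inside the pattern string.
    static final Pattern INLINE = Pattern.compile("(?U)\\b" + WORD + "\\b");

    // Option 2: flag passed to Pattern.compile.
    static final Pattern FLAGGED =
            Pattern.compile("\\b" + WORD + "\\b", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean findsInline(String text) {
        return INLINE.matcher(text).find();
    }

    static boolean findsFlagged(String text) {
        return FLAGGED.matcher(text).find();
    }

    public static void main(String[] args) {
        // The standalone word is preceded by a space, so \b matches on both sides.
        System.out.println(findsInline(GREETING));  // true
        System.out.println(findsFlagged(GREETING)); // true
        // Inside the compound the word is preceded by a Cyrillic letter,
        // so a Unicode-aware \b does not match there.
        System.out.println(findsInline(COMPOUND));  // false
        System.out.println(findsFlagged(COMPOUND)); // false
    }
}
```

Both variants should behave the same. The inline form is the one that fits a `<regexp>` element, while the flag form is what a fix in common code would presumably use.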
