[pt] Enable multi-token spell-checking #10052

p-goulart · 2024-01-05T17:01:07Z

I've been having a look at the nightly results and making some changes to our dictionaries, and I think it is time to enable it, with v0.12 of the Portuguese dictionary binaries.

Global changes 🌍

@jaumeortola, in order to get the letter case to work properly (without needing to add all manner of weird variants to multiwords.txt), I made a change to the MultiWordChunk class. Now, in addition to checking Upcased tokens, we also check Titlecased Tokens. To get this to work well, I added a new StringTools method. The new tests should make the intention here clear, but do ask.

⚠️ The lists of titlecasing exceptions were done sort of slapdash, so please review them and approve once you have. @AzadehSafakish, @St-ac-y, @LucieSteib, @mark-baas: specifically, check the word lists in the StringTools class for your language. I am aware titlecasing is not trivial, but all we really need here is something that largely makes sense.

Portuguese changes 🇧🇷

@susanaboatto, changes to multiwords.txt have been tested up the wazoo and compared with the diffs, but still, there might be weirdness, so please do have a cursory look.

There's also a minor change to how we tokenise strings containing %. The current behaviour is just awful, and this change should improve that somewhat. Do note, though, that we may need to supplement it with some XMLage.

- improve handling of percent signs (was: [50%OFF], will be: [50%, OFF]); - add some tests due to the latest dictionary version.

- multitoken suggestions were failing because we were only checking if they were present in the dictionary by upcasing their first letter; - this failed to account for titlecasing (either naively or a little more smartly), which is relatively frequent; - cf. stuff like "The Lord of the Rings".

jaumeortola · 2024-01-05T17:52:39Z

I am not sure about this. It can be useful in some case, but maybe there are some cons.

If there is cordon blue in multiwords.txt, we will accept Cordon Blue, which is fine.
If there is Venus de Milo, accepting Venus De Milo is undesirable. Now, the multitoken spelling rules can provide the right letter casing for Venus de Milo. After this change, we will lose that.

Some non-exhaustive, non-exclusive ideas to make it less risky:

make it configurable per language: different languages can have different needs (including non-premium languages)
do StringTools.titlecaseGlobal, but not WordUtils.capitalize
capitalize only all-lower case expressions (cordon blue, but not Bernat de Ventadorn).

jaumeortola · 2024-01-05T18:01:25Z

The test failure in Catalan is an example of the problems I mentioned:

Test failure for rule CA_MULTITOKEN_SPELLING_THREE[1] in file /org/languagetool/rules/ca/grammar.xml (line 43707): "Ossama Bin Laden"
Errors expected: 1
Errors found   : 0
Message: Possible error d'ortografia.

p-goulart · 2024-01-05T18:03:09Z

@jaumeortola, yes, I realised that as soon as the Catalan tests failed. I've just come up with an alternative that only performs those checks for multiwords entries that begin with a lowercase. Those are the ones for which lettercase is likely to be less crucial.

That being said, I'm personally not convinced that accepting 'Venus De Milo' trumps not accepting 'Rock and Roll'...

jaumeortola · 2024-01-08T08:45:00Z

This will work in most cases. Still, I would prefer to apply the title case only to all-lowercase expressions, not to lowercase first letter.
Languages can have different needs and we have some picky contributors. Reaching a full agreement for all languages can be difficult. We could make it optional with something like this:

private final boolean allowTitleCase = false;
public void setAllowTitleCase(boolean allowTitleCase) {
  this.allowTitleCase = allowTitleCase;
}

p-goulart · 2024-01-08T09:14:24Z

I don't mind implementing either of your suggestions, but I am not convinced that the current behaviour is any more desirable.

The argument that we may have 'picky contributors' goes both ways, doesn't it? Much like someone may expect strict letter case checking, they may also expect smarter tolerance to letter case changes.

I don't see any concrete data, so in the end it seems like a matter of personal preference – what I'm proposing here adds more letter case variation, and should lead to fewer FPs. Limiting the accepted letter case variations should, in turn, lead to fewer FNs. From the UX point of view, I tend to be of the opinion that accepting capitalised prepositions occasionally is not as bad as flagging valid titlecased variants as orthographic errors.

p-goulart · 2024-01-08T09:48:38Z

@jaumeortola I'm updating the method signature in all locales that use the letter case instance variables based on the usages that IntelliJ can find. Not sure if there's a better way of doing this..? Running all tests now, let's see

- only Portuguese has it *on*, all other locales have it set to false; - add a simple StringTools method to check if all words in a multi-token string are lowercase (and tests).

jaumeortola · 2024-01-08T11:36:44Z

Not sure if there's a better way of doing this..?

It is completely fine. An alternative way is to use something like public void setAllowTitleCase(boolean allowTitleCase) {}.

LucieSteib

LGTM (and to Gilles too ;) )

St-ac-y · 2024-01-09T08:09:49Z

I do think it looks good, just wondering why "ob" is included here but other conjunctions are not?

private static final Set<String> GERMAN_TITLECASE_EXCEPTIONS = Collections.unmodifiableSet( new HashSet<>(Arrays.asList("von", "in", "im", "an", "am", "vom", "und", "oder", "dass", "ob", "der", "die", "das", "dem", "den", "des", "ein", "eines", "einem", "einen", "einer", "eine", "kein", "keines", "keinem", "keinen", "keiner", "keine"))

p-goulart · 2024-01-09T08:59:13Z

I do think it looks good, just wondering why "ob" is included here but other conjunctions are not?

No special reason, I just added some short conjunctions 🤷 Not married to this list.

LucieSteib · 2024-01-09T15:03:37Z

[Sorry I did something and some approval were removed 😬]

p-goulart · 2024-01-09T15:12:20Z

Okay, given everyone's more or less approved (even though some reviews have disappeared from the PR but can still be seen in the history), and the only missing one is form the 🇺🇸 team, I think we are safe to merge this.

For now, those word lists will only affect pt-BR anyway.

p-goulart added 7 commits January 5, 2024 16:53

[pt] Adapt resources to multiwords, dictionary fixes

6df212c

Add smart titlecase method to StringTools

ba0e768

[pt] Fix multiword prepositions

916e7a6

[pt] Update PT tokeniser

a8c4f4f

- improve handling of percent signs (was: [50%OFF], will be: [50%, OFF]); - add some tests due to the latest dictionary version.

[pt] Add speller tests due to latest dictionary

a77bd8d

[pt] Bump up dict to v0.12

d14d5b1

p-goulart added the Portuguese label Jan 5, 2024

p-goulart requested review from jaumeortola, AzadehSafakish, St-ac-y, susanaboatto, LucieSteib and mark-baas January 5, 2024 17:01

Improve titlecase logic in MultiWordChunker

57a9ef0

Add titlecasing option to multi-word chunker

1a0dbab

- only Portuguese has it *on*, all other locales have it set to false; - add a simple StringTools method to check if all words in a multi-token string are lowercase (and tests).

LucieSteib approved these changes Jan 8, 2024

View reviewed changes

jaumeortola approved these changes Jan 8, 2024

View reviewed changes

mark-baas approved these changes Jan 9, 2024

View reviewed changes

LucieSteib requested review from LucieSteib and mark-baas January 9, 2024 15:02

LucieSteib approved these changes Jan 9, 2024

View reviewed changes

p-goulart merged commit b882a8b into master Jan 9, 2024
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pt] Enable multi-token spell-checking #10052

[pt] Enable multi-token spell-checking #10052

p-goulart commented Jan 5, 2024 •

edited

Loading

jaumeortola commented Jan 5, 2024

jaumeortola commented Jan 5, 2024

p-goulart commented Jan 5, 2024 •

edited

Loading

jaumeortola commented Jan 8, 2024

p-goulart commented Jan 8, 2024

p-goulart commented Jan 8, 2024

jaumeortola commented Jan 8, 2024

LucieSteib left a comment

St-ac-y commented Jan 9, 2024

p-goulart commented Jan 9, 2024

LucieSteib commented Jan 9, 2024

p-goulart commented Jan 9, 2024 •

edited

Loading

[pt] Enable multi-token spell-checking #10052

[pt] Enable multi-token spell-checking #10052

Conversation

p-goulart commented Jan 5, 2024 • edited Loading

Global changes 🌍

Portuguese changes 🇧🇷

jaumeortola commented Jan 5, 2024

jaumeortola commented Jan 5, 2024

p-goulart commented Jan 5, 2024 • edited Loading

jaumeortola commented Jan 8, 2024

p-goulart commented Jan 8, 2024

p-goulart commented Jan 8, 2024

jaumeortola commented Jan 8, 2024

LucieSteib left a comment

Choose a reason for hiding this comment

St-ac-y commented Jan 9, 2024

p-goulart commented Jan 9, 2024

LucieSteib commented Jan 9, 2024

p-goulart commented Jan 9, 2024 • edited Loading

p-goulart commented Jan 5, 2024 •

edited

Loading

p-goulart commented Jan 5, 2024 •

edited

Loading

p-goulart commented Jan 9, 2024 •

edited

Loading