Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pt] Enable multi-token spell-checking #10052

Merged
merged 9 commits into from
Jan 9, 2024
Merged

Conversation

p-goulart
Copy link
Collaborator

@p-goulart p-goulart commented Jan 5, 2024

I've been having a look at the nightly results and making some changes to our dictionaries, and I think it is time to enable it, with v0.12 of the Portuguese dictionary binaries.


Global changes 🌍

@jaumeortola, in order to get the letter case to work properly (without needing to add all manner of weird variants to multiwords.txt), I made a change to the MultiWordChunk class. Now, in addition to checking Upcased tokens, we also check Titlecased Tokens. To get this to work well, I added a new StringTools method. The new tests should make the intention here clear, but do ask.

⚠️ The lists of titlecasing exceptions were done sort of slapdash, so please review them and approve once you have. @AzadehSafakish, @St-ac-y, @LucieSteib, @mark-baas: specifically, check the word lists in the StringTools class for your language. I am aware titlecasing is not trivial, but all we really need here is something that largely makes sense.

Portuguese changes 🇧🇷

@susanaboatto, changes to multiwords.txt have been tested up the wazoo and compared with the diffs, but still, there might be weirdness, so please do have a cursory look.

There's also a minor change to how we tokenise strings containing %. The current behaviour is just awful, and this change should improve that somewhat. Do note, though, that we may need to supplement it with some XMLage.

p-goulart added 7 commits January 5, 2024 16:53
 - improve handling of percent signs (was: [50%OFF], will be: [50%,
   OFF]);
 - add some tests due to the latest dictionary version.
 - multitoken suggestions were failing because we were only checking if
   they were present in the dictionary by upcasing their first letter;
 - this failed to account for titlecasing (either naively or a little
   more smartly), which is relatively frequent;
 - cf. stuff like "The Lord of the Rings".
@jaumeortola
Copy link
Member

I am not sure about this. It can be useful in some case, but maybe there are some cons.

If there is cordon blue in multiwords.txt, we will accept Cordon Blue, which is fine.
If there is Venus de Milo, accepting Venus De Milo is undesirable. Now, the multitoken spelling rules can provide the right letter casing for Venus de Milo. After this change, we will lose that.

Some non-exhaustive, non-exclusive ideas to make it less risky:

  • make it configurable per language: different languages can have different needs (including non-premium languages)
  • do StringTools.titlecaseGlobal, but not WordUtils.capitalize
  • capitalize only all-lower case expressions (cordon blue, but not Bernat de Ventadorn).

@jaumeortola
Copy link
Member

The test failure in Catalan is an example of the problems I mentioned:

Test failure for rule CA_MULTITOKEN_SPELLING_THREE[1] in file /org/languagetool/rules/ca/grammar.xml (line 43707): "Ossama Bin Laden"
Errors expected: 1
Errors found   : 0
Message: Possible error d'ortografia.

@p-goulart
Copy link
Collaborator Author

p-goulart commented Jan 5, 2024

@jaumeortola, yes, I realised that as soon as the Catalan tests failed. I've just come up with an alternative that only performs those checks for multiwords entries that begin with a lowercase. Those are the ones for which lettercase is likely to be less crucial.

That being said, I'm personally not convinced that accepting 'Venus De Milo' trumps not accepting 'Rock and Roll'...

@jaumeortola
Copy link
Member

This will work in most cases. Still, I would prefer to apply the title case only to all-lowercase expressions, not to lowercase first letter.
Languages can have different needs and we have some picky contributors. Reaching a full agreement for all languages can be difficult. We could make it optional with something like this:

private final boolean allowTitleCase = false;
public void setAllowTitleCase(boolean allowTitleCase) {
  this.allowTitleCase = allowTitleCase;
}

@p-goulart
Copy link
Collaborator Author

I don't mind implementing either of your suggestions, but I am not convinced that the current behaviour is any more desirable.

The argument that we may have 'picky contributors' goes both ways, doesn't it? Much like someone may expect strict letter case checking, they may also expect smarter tolerance to letter case changes.

I don't see any concrete data, so in the end it seems like a matter of personal preference – what I'm proposing here adds more letter case variation, and should lead to fewer FPs. Limiting the accepted letter case variations should, in turn, lead to fewer FNs. From the UX point of view, I tend to be of the opinion that accepting capitalised prepositions occasionally is not as bad as flagging valid titlecased variants as orthographic errors.

@p-goulart
Copy link
Collaborator Author

@jaumeortola I'm updating the method signature in all locales that use the letter case instance variables based on the usages that IntelliJ can find. Not sure if there's a better way of doing this..? Running all tests now, let's see

 - only Portuguese has it *on*, all other locales have it set to false;

 - add a simple StringTools method to check if all words in a
   multi-token string are lowercase (and tests).
@jaumeortola
Copy link
Member

Not sure if there's a better way of doing this..?

It is completely fine. An alternative way is to use something like public void setAllowTitleCase(boolean allowTitleCase) {}.

Copy link
Collaborator

@LucieSteib LucieSteib left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (and to Gilles too ;) )

@St-ac-y
Copy link
Collaborator

St-ac-y commented Jan 9, 2024

I do think it looks good, just wondering why "ob" is included here but other conjunctions are not?

private static final Set<String> GERMAN_TITLECASE_EXCEPTIONS = Collections.unmodifiableSet( new HashSet<>(Arrays.asList("von", "in", "im", "an", "am", "vom", "und", "oder", "dass", "ob", "der", "die", "das", "dem", "den", "des", "ein", "eines", "einem", "einen", "einer", "eine", "kein", "keines", "keinem", "keinen", "keiner", "keine"))

@p-goulart
Copy link
Collaborator Author

I do think it looks good, just wondering why "ob" is included here but other conjunctions are not?

No special reason, I just added some short conjunctions 🤷 Not married to this list.

@LucieSteib
Copy link
Collaborator

[Sorry I did something and some approval were removed 😬]

@p-goulart
Copy link
Collaborator Author

p-goulart commented Jan 9, 2024

Okay, given everyone's more or less approved (even though some reviews have disappeared from the PR but can still be seen in the history), and the only missing one is form the 🇺🇸 team, I think we are safe to merge this.

For now, those word lists will only affect pt-BR anyway.

@p-goulart p-goulart merged commit b882a8b into master Jan 9, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants