-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pt] Enable multi-token spell-checking #10052
Conversation
- improve handling of percent signs (was: [50%OFF], will be: [50%, OFF]); - add some tests due to the latest dictionary version.
- multitoken suggestions were failing because we were only checking if they were present in the dictionary by upcasing their first letter; - this failed to account for titlecasing (either naively or a little more smartly), which is relatively frequent; - cf. stuff like "The Lord of the Rings".
I am not sure about this. It can be useful in some case, but maybe there are some cons. If there is Some non-exhaustive, non-exclusive ideas to make it less risky:
|
The test failure in Catalan is an example of the problems I mentioned:
|
@jaumeortola, yes, I realised that as soon as the Catalan tests failed. I've just come up with an alternative that only performs those checks for multiwords entries that begin with a lowercase. Those are the ones for which lettercase is likely to be less crucial. That being said, I'm personally not convinced that accepting 'Venus De Milo' trumps not accepting 'Rock and Roll'... |
This will work in most cases. Still, I would prefer to apply the title case only to all-lowercase expressions, not to lowercase first letter.
|
I don't mind implementing either of your suggestions, but I am not convinced that the current behaviour is any more desirable. The argument that we may have 'picky contributors' goes both ways, doesn't it? Much like someone may expect strict letter case checking, they may also expect smarter tolerance to letter case changes. I don't see any concrete data, so in the end it seems like a matter of personal preference – what I'm proposing here adds more letter case variation, and should lead to fewer FPs. Limiting the accepted letter case variations should, in turn, lead to fewer FNs. From the UX point of view, I tend to be of the opinion that accepting capitalised prepositions occasionally is not as bad as flagging valid titlecased variants as orthographic errors. |
@jaumeortola I'm updating the method signature in all locales that use the letter case instance variables based on the usages that IntelliJ can find. Not sure if there's a better way of doing this..? Running all tests now, let's see |
- only Portuguese has it *on*, all other locales have it set to false; - add a simple StringTools method to check if all words in a multi-token string are lowercase (and tests).
It is completely fine. An alternative way is to use something like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (and to Gilles too ;) )
I do think it looks good, just wondering why "ob" is included here but other conjunctions are not?
|
No special reason, I just added some short conjunctions 🤷 Not married to this list. |
[Sorry I did something and some approval were removed 😬] |
Okay, given everyone's more or less approved (even though some reviews have disappeared from the PR but can still be seen in the history), and the only missing one is form the 🇺🇸 team, I think we are safe to merge this. For now, those word lists will only affect |
I've been having a look at the nightly results and making some changes to our dictionaries, and I think it is time to enable it, with
v0.12
of the Portuguese dictionary binaries.Global changes 🌍
@jaumeortola, in order to get the letter case to work properly (without needing to add all manner of weird variants to
multiwords.txt
), I made a change to theMultiWordChunk
class. Now, in addition to checkingUpcased tokens
, we also checkTitlecased Tokens
. To get this to work well, I added a newStringTools
method. The new tests should make the intention here clear, but do ask.Portuguese changes 🇧🇷
@susanaboatto, changes to
multiwords.txt
have been tested up the wazoo and compared with the diffs, but still, there might be weirdness, so please do have a cursory look.There's also a minor change to how we tokenise strings containing
%
. The current behaviour is just awful, and this change should improve that somewhat. Do note, though, that we may need to supplement it with some XMLage.