[pt] Dictionary update to v1.0.1 #10559

Merged
merged 12 commits into from
May 6, 2024
@p-goulart p-goulart commented May 2, 2024

In this version...

v1.0.0

v1.0.1 (bugfix)

  • 😩 fix an annoying bug where synthesising 'mesmo' also generated 'mesmíssimo' forms (the affected constructions are not detected by gGEC).

English spelling 🇬🇧 🇺🇸

We use the following heuristics:

  • we use a word list to decide if a specific token is likely to be part of a longer string of English words;
  • we tag adjacent words as English words...
    • if they are also in the word list, or
    • if they are untagged (i.e. probably absent from the Portuguese dictionary) and they exist in English (we use the English tagger to determine this).

If the English tagger's dictionary lookup fails, we treat the word as not valid in English and do not tag it as _english_ignore_. This means that the Portuguese speller is later run on such words.
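The heuristics above can be sketched roughly as follows. This is a minimal illustration in Python, not the code in this PR: the function name `tag_english_runs`, the plain `set` lookups standing in for the word list and the two taggers, and the choice to spread outwards from a word-list hit until a neighbour fails are all assumptions made for the example.

```python
ENGLISH_IGNORE = "_english_ignore_"


def tag_english_runs(tokens, seed_words, pt_dict, en_dict):
    """Return one tag per token: ENGLISH_IGNORE or None.

    seed_words: hypothetical stand-in for the English word list.
    pt_dict / en_dict: hypothetical stand-ins for the Portuguese and
    English tagger dictionaries (plain sets of lowercase words).
    """
    lowered = [t.lower() for t in tokens]
    tags = [None] * len(tokens)

    def qualifies(word):
        # A neighbour counts as English if it is in the word list, or if
        # Portuguese does not know it and English does.
        return word in seed_words or (word not in pt_dict and word in en_dict)

    for i, word in enumerate(lowered):
        if word not in seed_words:
            continue
        tags[i] = ENGLISH_IGNORE
        # Assumption: spread left and right while neighbours qualify.
        for step in (-1, 1):
            j = i + step
            while 0 <= j < len(lowered) and qualifies(lowered[j]):
                tags[j] = ENGLISH_IGNORE
                j += step
    return tags
```

With `seed_words={"the"}`, a Portuguese set containing "ele", "disse" and "ontem", and an English set containing "the", "quick", "brown" and "fox", the tokens "Ele disse the quick brwn fox ontem" get the tag on "the" and "quick" only. The misspelt "brwn" fails the English lookup, stays untagged, and would therefore be handed to the Portuguese speller, which is the limitation discussed next.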

The original version of my idea went even further. I wanted to tag all such words and, later, actually get suggestions for them from the English speller.

Currently, if a user misspells a word in English, they will probably get irrelevant suggestions in Portuguese, rather than the more relevant ones in English. This could be a future improvement.


Lol ignore the branch name. Originally this was going to be v0.16, but @maphjo pointed out it was probably time to revisit the versioning scheme. This wouldn't normally be a major release, as it introduces no breaking changes, but conceptually it makes sense to start with v1.0.0 finally, and we're not planning any breaking changes to the dictionary binaries in the near future.

@p-goulart p-goulart marked this pull request as ready for review May 2, 2024 09:53
@p-goulart p-goulart changed the title [pt] Dictionary update to v1.0.0 [pt] Dictionary update to v1.0.1 May 2, 2024
 - remove '_ORTHOGRAPHY' suffix to allow users more control;

 - add 'picky' and formal/academic tags;

 - this should only really work as of v0.16 of the dictionary, where
   'email' is accepted in pt-BR.
 - rule A_PREPOSITION, sub-rule 4 was removing the preposition tag from
   'a' even when the determiner reading had already been discarded,
   leading to a NULL-tagged 'a';

 - this had to be done here for... reasons... there was *one* new rule
   that was breaking because of this...
@p-goulart p-goulart merged commit cbedf0d into master May 6, 2024
3 checks passed
@p-goulart p-goulart deleted the pt/dict/v016 branch May 6, 2024 11:08
Successfully merging this pull request may close these issues.

"email" with no hyphen is also correct