[pt] Fix mispelled names #9983

Merged
merged 22 commits into from
Jan 4, 2024

@susanaboatto commented Dec 22, 2023
Collaborator

  • Some names apparently contain the wrong suggestion.

What I found in the PT diff:

Iuri Gagarine (wrong) suggested for Iuri Gagarin (correct);
Ulysses Guimaraes (wrong) suggested for Ulysses Guimarães (correct);
Waldir Maranhao (wrong) suggested for Waldir Maranhão (correct);
Jorge Vercilo (wrong) suggested for Jorge Vercillo (correct).

  • Probably a separate issue, but I have also added an extra S to the tags of the personality names I found in neighboring entries, i.e., changed NPMS000 to NPMSS00. The S is meant to distinguish people from places and organizations, but our words are barely tagged as such.

  • More names added to multiwords.txt, spelling.txt, and spelling_global.txt based on the PT diff findings.

  • There are some wrong suggestions in the PT_MULTITOKEN rules that will need more attention. For example, Pára-quedistas is corrected to para-quedistas, when it should be paraquedistas (post-1990). For now, I am adding these to spelling.txt, but a more in-depth fix for this will be needed.
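The tag change mentioned in the second bullet (NPMS000 becoming NPMSS00) can be sketched roughly as below. This is only an illustration: it assumes seven-character tags in which the fifth character is the slot that receives the S, and the function name `mark_as_person` is hypothetical, not something from the project.

```python
def mark_as_person(tag: str) -> str:
    """Turn a proper-noun tag such as NPMS000 into NPMSS00.

    Assumption: tags are seven characters long and the fifth character
    (index 4) is the slot that receives the extra S.
    """
    if len(tag) == 7 and tag.startswith("NP") and tag[4] == "0":
        return tag[:4] + "S" + tag[5:]
    # Anything else (other tag types, already-marked tags) is left untouched.
    return tag


print(mark_as_person("NPMS000"))  # NPMSS00
```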

@p-goulart feel free to revert, edit, or comment on this branch with your insights.

@jaumeortola
Member

I wouldn't add multitoken expressions to spelling.txt. They should go, preferably, to multiwords.txt.

Some person names could also be added to spelling_global.txt (but not Russian names like Gagarin).

@p-goulart
Collaborator

I'm doing several things here:

  • moving a bunch of multi-word entries added to spelling.txt to multiwords.txt;
  • removing all new prefixed words that are being added to spelling.txt and adding tests to the Morfologik rule to show that those words are already correctly recognised;
  • moving hyphenated words to the compounds dictionaries in the source files (new binaries to be deployed today);
  • the correction pára -> para is expected behaviour (and once applied Morfologik suggests the correct form); either way, the correct word is already in the dictionary and it makes no sense to duplicate it in spelling.txt. I'm adding pára- para and para- para as pairs to the Morfologik info file.
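To illustrate the idea behind the pairs in the last bullet, here is a toy sketch of how a prefix replacement can lead to the form that is already in the dictionary. It is not how Morfologik is implemented: the pairs are read as "typed form, replacement" (an assumption), and `KNOWN_WORDS` and `suggest` are hypothetical stand-ins for the real dictionary and speller.

```python
from typing import List

# Pairs named in the comment above, read as (typed prefix, replacement).
REPLACEMENT_PAIRS = [("pára-", "para"), ("para-", "para")]

# Toy stand-in for the real dictionary lookup (assumption).
KNOWN_WORDS = {"paraquedistas"}


def suggest(word: str) -> List[str]:
    """Return dictionary words reachable by applying one replacement pair."""
    lowered = word.lower()
    suggestions: List[str] = []
    for typed, replacement in REPLACEMENT_PAIRS:
        if lowered.startswith(typed):
            candidate = replacement + lowered[len(typed):]
            if candidate in KNOWN_WORDS and candidate not in suggestions:
                suggestions.append(candidate)
    return suggestions


print(suggest("Pára-quedistas"))  # ['paraquedistas']
```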

Be that as it may, we need to make sure we check what the speller can already handle before adding entries blindly to spelling.txt. And we definitely need to add and run Java tests.
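The triage described in this comment could look something like the sketch below. It is a simplification under stated assumptions: `known_words` is a toy stand-in for asking the real speller whether it already accepts a word, the returned strings are just labels rather than real paths, and `triage` is a hypothetical helper.

```python
from typing import Set


def triage(entry: str, known_words: Set[str]) -> str:
    """Pick a destination for a candidate entry.

    known_words is a toy stand-in for the real speller lookup (assumption).
    """
    if entry in known_words:
        # The speller already handles it, so nothing should be added.
        return "skip: already recognised"
    if " " in entry:
        # Multi-token expressions belong in multiwords.txt.
        return "multiwords.txt"
    if "-" in entry:
        # Hyphenated words go to the compounds dictionaries.
        return "compounds dictionary"
    return "spelling.txt"


print(triage("José Mouzinho de Albuquerque", set()))  # multiwords.txt
```

Checking `known_words` first reflects the point made above: what the speller can already handle is verified before anything is added.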

@p-goulart
Collaborator

I've moved a bunch of stuff off the spelling.txt and multiwords.txt lists and recompiled the dictionary. It was much more work than I hoped, but I think the overall result is pretty good. I mean, tests pass, and I don't think we're losing too much. I've had to do quite a bit of research on the terms there, and ended up making some choices, esp. with regard to hyphenation, juxtaposition, or using a space for English compound loanwords (e.g. taskforce, task-force, task force). We're losing very little here in terms of actual coverage, as these are already pretty rare words to begin with.

Those files are very close to being clean, though there are still a couple of hundred entries in spelling.txt that we prob. need to check against some kind of foreign terms rule, since those are mostly very rarely used (or used only in pt-PT) equivalents of loanwords that have already had their orthography adapted. Like coupé instead of cupê, vinaigrette instead of vinagrete, etc. It's not that many (just a few hundred) and sounds like it should be a separate task, since it touches upon the 'barbarism' rules.
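A check of the remaining entries against adapted spellings might be sketched as follows. Only the two pairs quoted above are used; the mapping `ADAPTED_FORMS` and the helper `flag_unadapted` are hypothetical, and a real list would have to come from the foreign-terms rule itself.

```python
# The two pairs below come from the comment above; a real list would be longer.
ADAPTED_FORMS = {"coupé": "cupê", "vinaigrette": "vinagrete"}


def flag_unadapted(entries):
    """Yield (entry, adapted form) for entries that have an adapted spelling."""
    for entry in entries:
        adapted = ADAPTED_FORMS.get(entry.lower())
        if adapted is not None:
            yield entry, adapted


print(list(flag_unadapted(["coupé", "taskforce", "Vinaigrette"])))
# [('coupé', 'cupê'), ('Vinaigrette', 'vinagrete')]
```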

Marnie Simpson
María Gabriela
Adrian Fernández
Member

This seems to be a misspelling in Spanish. It should be Adrián Fernández: https://en.wikipedia.org/wiki/Adri%C3%A1n_Fern%C3%A1ndez

Collaborator

I haven't added any new names, I thought these had all already been approved?

Member

Yes. That's true. Don't worry about them. We can fix them afterward.

Collaborator

I can fix the ones you mentioned here with the next rebase, it's no biggie.

José Blanc de Portugal
José Mouzinho d'Albuquerque
José Mouzinho de Albuquerque
Member

Both de + d'?

Miguel Burnier
Mike Penner
Mikhail Yuryevich
Member

Avoid Russian names in spelling_global.txt because different languages use different spellings.
Is this one? https://pt.wikipedia.org/wiki/Mikhail_L%C3%A9rmontov

@p-goulart p-goulart merged commit 5073687 into master Jan 4, 2024
3 checks passed
@p-goulart p-goulart deleted the pt/multitoken branch January 4, 2024 11:40
3 participants