[pt] Fix mispelled names #9983

susanaboatto · 2023-12-22T15:27:54Z

Some names apparently get the wrong suggestion.

What I found in the PT diff:

Iuri Gagarine (wrong) suggested for Iuri Gagarin (correct);
Ulysses Guimaraes (wrong) suggested for Ulysses Guimarães (correct);
Waldir Maranhao (wrong) suggested for Waldir Maranhão (correct);
Jorge Vercilo (wrong) suggested for Jorge Vercillo (correct).

Probably a separate issue, but I have also added an S to the tags of the person names I found in the neighboring entries, i.e., changed NPMS000 to NPMSS00. That position is meant to distinguish people from places and organizations, but our words are barely tagged that way.
More names added to multiwords.txt, spelling.txt, and spelling_global.txt based on the PT diff findings.
There are some wrong suggestions in the PT_MULTITOKEN rules that will need more attention. For example, Pára-quedistas is corrected to para-quedistas, when it should be paraquedistas (post-1990). For now, I am adding these to spelling.txt, but a more in-depth fix for this will be needed.
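The tag change mentioned above (NPMS000 becoming NPMSS00) could be sketched roughly as follows. This is only an illustration: the tab-separated `form / lemma / tag` line layout, the set of gender and number letters accepted by the pattern, and the function names are assumptions, not something taken from the dictionary sources.

```python
import re

# Assumption: proper-noun tags have 7 characters, "NP" + gender + number
# followed by three classification slots that default to "000".
# The letters allowed for gender/number here are illustrative guesses.
_UNCLASSIFIED_NP = re.compile(r"^(NP[MFC][SPN])0(00)$")


def mark_as_person(tag: str) -> str:
    """Put 'S' in the first classification slot; leave any other tag unchanged."""
    return _UNCLASSIFIED_NP.sub(r"\1S\2", tag)


def retag_line(line: str) -> str:
    """Retag one 'form<TAB>lemma<TAB>tag' line (hypothetical layout)."""
    form, lemma, tag = line.rstrip("\n").split("\t")
    return "\t".join((form, lemma, mark_as_person(tag)))
```

Tags that are not unclassified proper nouns pass through untouched, so the helper can be run over a whole list of person-name entries without affecting anything else.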

@p-goulart feel free to revert, edit, or comment on this branch with your insights.

jaumeortola · 2023-12-22T21:25:10Z

I wouldn't add multitoken expressions to spelling.txt. They should go, preferably, to multiwords.txt.

Some person names could also be added to spelling_global.txt (but not Russian names like Gagarin).
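As a rough illustration of the routing suggested here, a small helper could send any entry containing whitespace to multiwords.txt and everything else to spelling.txt. The function names are hypothetical, and skipping `#` lines as comments is an assumption about the list format.

```python
def target_file(entry: str) -> str:
    """Pick a destination list for one candidate entry."""
    # Internal whitespace means the entry spans more than one token.
    if len(entry.split()) > 1:
        return "multiwords.txt"
    return "spelling.txt"


def route(entries):
    """Group candidate entries by destination file, skipping blanks and comments."""
    buckets = {"multiwords.txt": [], "spelling.txt": []}
    for raw in entries:
        entry = raw.strip()
        if not entry or entry.startswith("#"):  # assumed comment marker
            continue
        buckets[target_file(entry)].append(entry)
    return buckets
```

This only covers the single-token versus multi-token split; whether a name also belongs in the global list is still a manual decision.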

p-goulart · 2023-12-27T10:20:17Z

I'm doing several things here:

moving a bunch of multi-word entries added to spelling.txt to multiwords.txt;
removing all new prefixed words that are being added to spelling.txt and adding tests to the Morfologik rule to show that those words are already correctly recognised;
moving hyphenated words to the compounds dictionaries in the source files (new binaries to be deployed today);
the correction pára -> para is expected behaviour (and once applied Morfologik suggests the correct form); either way, the correct word is already in the dictionary and it makes no sense to duplicate it in spelling.txt. I'm adding pára- para and para- para as pairs to the Morfologik info file.

Be that as it may, we need to make sure we check what the speller can already handle before adding entries blindly to spelling.txt. And we definitely need to add and run Java tests.
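One way to picture the "check what the speller can already handle" step is a filter that drops every candidate the speller already accepts. In this sketch `is_accepted` is a hypothetical callable standing in for the real speller, and the word set in the usage note is made-up sample data.

```python
def entries_worth_adding(candidates, is_accepted):
    """Return candidates the speller does not already accept.

    `is_accepted` is a stand-in for the real speller check (an assumption).
    Order is preserved and duplicates are dropped.
    """
    seen = set()
    missing = []
    for word in candidates:
        if word in seen or is_accepted(word):
            continue
        seen.add(word)
        missing.append(word)
    return missing
```

For example, with a stand-in set `known = {"paraquedistas"}`, calling `entries_worth_adding(["paraquedistas", "Vercillo", "Vercillo"], known.__contains__)` keeps only the one entry that is not already covered.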

p-goulart · 2024-01-03T16:05:54Z

I've moved a bunch of stuff off the spelling.txt and multiwords.txt lists and recompiled the dictionary. It was much more work than I had hoped, but I think the overall result is pretty good. I mean, tests pass, and I don't think we're losing too much. I've had to do quite a bit of research on the terms there, and ended up making some choices, esp. with regard to hyphenation, juxtaposition, or using a space for English compound loanwords (e.g. taskforce, task-force, task force). We're losing very little here in terms of actual coverage, as these are already pretty rare words to begin with.

Those files are very close to being clean, though there are still a couple of hundred entries in spelling.txt that we prob. need to check against some kind of foreign terms rule, since those are mostly very rarely used (or used only in pt-PT) equivalents of loanwords that have already had their orthography adapted. Like coupé instead of cupê, vinaigrette instead of vinagrete, etc. It's not that many (just a few hundred) and sounds like it should be a separate task, since it touches upon the 'barbarism' rules.

jaumeortola · 2024-01-04T07:53:48Z

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt

+Marnie Simpson
+María Gabriela
+Adrian Fernández


This looks like a misspelling in Spanish; it should be Adrián Fernández: https://en.wikipedia.org/wiki/Adri%C3%A1n_Fern%C3%A1ndez

p-goulart · 2024-01-04

I haven't added any new names; I thought these had all already been approved?

jaumeortola · 2024-01-04

Yes, that's true. Don't worry about them. We can fix them afterward.

p-goulart · 2024-01-04

I can fix the ones you mentioned here with the next rebase, it's no biggie.
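Entries like the one flagged in this thread differ from the correct name only by a diacritic, so a simple check could group list entries by their accent-stripped form and report groups with more than one spelling. This is a minimal sketch with hypothetical function names; it only finds a pair when both spellings are present in the input (for instance, the new entries combined with a reference list), and it cannot tell which spelling is the right one.

```python
import unicodedata
from collections import defaultdict


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_variants(entries):
    """Map each accent-stripped form to its spellings, keeping only real conflicts."""
    groups = defaultdict(set)
    for entry in entries:
        groups[strip_accents(entry)].add(entry)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```

A reviewer would still need to decide, per group, which form to keep.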

jaumeortola · 2024-01-04T07:55:39Z

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt

+José Blanc de Portugal
+José Mouzinho d'Albuquerque
+José Mouzinho de Albuquerque


Both de + d'?

jaumeortola · 2024-01-04T07:58:18Z

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt

+Miguel Burnier
+Mike Penner
+Mikhail Yuryevich


Avoid Russian names in spelling_global.txt, because different languages transliterate them differently.
Is this one? https://pt.wikipedia.org/wiki/Mikhail_L%C3%A9rmontov

susanaboatto requested review from jaumeortola and p-goulart December 22, 2023 18:42

jaumeortola approved these changes Dec 22, 2023

p-goulart force-pushed the pt/multitoken branch from 3957192 to 8cac7bf Compare December 27, 2023 09:36

p-goulart force-pushed the pt/multitoken branch from 9043ee6 to 9206fa3 Compare January 3, 2024 15:53

jaumeortola reviewed Jan 4, 2024

susanaboatto and others added 19 commits January 4, 2024 11:44

[pt] Fix mispelled names

29fb26a

[pt] Fix mispelled names

07eb8ca

[pt] more spellings

63ff816

[pt] more spellings

421478d

[pt] more spellings

698539a

[pt] fix build failure

2ae7833

[pt] add words to spelling to avoid wrong corrections

d95ee46

[pt] Prepare speller for next dict binary

c8718ab

[pt] Adapt grammar to tagging fixes

5463bb8

[pt] Clean up added.txt

31f23d5

[pt] Add contraction tests to tagger tests

b8bfd7f

[pt] Add disambiguation rule for adj cão

87c42af

[pt] Move multi-token ignores from disambig to multiwords.txt

84c6eed

[pt] Move multiwords from spelling.txt to multiwords.txt

b84e8da

[pt] Start moving remaining spelling.txt words

3aa11b9

[pt] Fix multiwords

8a5ae09

[pt] Continue moving spelling.txt entries

2fee557

[pt] Update speller tests

94ab1ba

[pt] Update dict version to 0.10

726e9cf

p-goulart added 3 commits January 4, 2024 11:45

[pt] Fix tagger tests for new tags

9ac1c9b

[pt] Update dict version to 0.11

66b8e35

[pt] Fix minor issues with multiwords/global_spelling

3930fb3

p-goulart force-pushed the pt/multitoken branch from 9206fa3 to 3930fb3 Compare January 4, 2024 11:01

p-goulart merged commit 5073687 into master Jan 4, 2024
3 checks passed

p-goulart deleted the pt/multitoken branch January 4, 2024 11:40
