[pt] Ignore probable English-language toponyms #10674

p-goulart · 2024-06-21T10:53:07Z

While feeding LT text from English-heavy Wikipedia articles to evaluate how well our English word detection was doing, I noticed that English-language toponyms were a big weakness.

The issue is that the detection generally expects English phrases (or complete sentences), whereas toponyms are often single, isolated words. They also rarely contain 'frequent' words.

The heuristics here might be improved in the future, but for now I think they strike a decent balance.

We need to note a few things:

- we're not validating whether the lower- and higher-level toponyms match; it'd be cool to check that, but it's not within the scope of the speller. This is just a heuristic to check whether some town name is in the English dictionary;
- one thing I dislike about this is that it recognises the town name when it is used right before a state/province/county (e.g. 'Keene, estado de Nova Hampshire' is fine) but not when it's used later (e.g. 'de volta a Keene...');
- I'm wondering if maybe a different tag (_english_toponym_ignore_?) could be used, and then... some magic on the Java side to get all words tagged with _english_toponym_ignore_ and tag them that way even outside the context initially used by the disambiguator... like, dynamically updating our tagger dictionary.
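To make the last idea concrete, here is a minimal two-pass sketch in Python (not the Java side of LT, and not anything this PR implements). It assumes the context rule has already marked some token positions; `propagate_toponym_tags`, its arguments, and the token list are all hypothetical names made up for illustration.

```python
def propagate_toponym_tags(tokens, context_tagged):
    """Spread a context-derived tag to every occurrence of the same word.

    tokens: list of token strings for the whole text.
    context_tagged: set of indices that the (hypothetical) context rule
        already tagged, e.g. a town name right before a state name.
    Returns a list of (token, tag_or_None) pairs.
    """
    # Pass 1: collect the surface forms that were tagged in context.
    known = {tokens[i] for i in context_tagged}
    # Pass 2: tag every occurrence of those forms, wherever it appears.
    tag = "_english_toponym_ignore_"
    return [(tok, tag if tok in known else None) for tok in tokens]


tokens = ["Keene", ",", "estado", "de", "Nova", "Hampshire", ".",
          "De", "volta", "a", "Keene"]
result = propagate_toponym_tags(tokens, context_tagged={0})
```

In this sketch the second 'Keene' picks up the tag even though no state name follows it, which is the behaviour the item above wishes for. Whether this is best done per document or by updating a shared dictionary is left open.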

- if unknown (i.e. untagged) words precede a US, Canadian, Australian, or British state, province, county or council area, we run the English check on them;
- if they are *valid* English, we tag them with _english_ignore_;
- this should help us stop correcting English-language toponyms;
- in the future, I suppose we could have a more global solution for all sorts of toponyms... why not have a list of German/Austrian Bundesländer, French départements, Italian... pizza slices, etc.
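The steps in this list can be sketched roughly as follows. This is an illustrative Python approximation, not the actual disambiguator rule: the word sets, the `CONNECTOR` pattern, and the function name `tag_english_toponyms` are assumptions standing in for the real dictionaries and rule syntax.

```python
import re

# Hypothetical stand-ins for the real resources (tiny samples only).
REGIONS = {"Nova Hampshire", "Ontário", "Queensland", "Kent"}
ENGLISH_WORDS = {"keene", "springfield", "dover"}
PORTUGUESE_WORDS = {"de", "volta", "a", "estado", "em", "fica"}

# Assumed optional connector between the town and the region name.
CONNECTOR = r"(?:estado|província|condado)\s+(?:de|do|da)\s+"


def tag_english_toponyms(text):
    """Return {word: tag} for unknown words that precede a known region."""
    tags = {}
    for region in REGIONS:
        pattern = re.compile(
            r"(\w+),\s+(?:" + CONNECTOR + r")?" + re.escape(region)
        )
        for match in pattern.finditer(text):
            word = match.group(1)
            # Only unknown (untagged) words are candidates.
            if word.lower() in PORTUGUESE_WORDS:
                continue
            # Run the English check; tag only valid English words.
            if word.lower() in ENGLISH_WORDS:
                tags[word] = "_english_ignore_"
    return tags


tagged = tag_english_toponyms("Keene, estado de Nova Hampshire")
untagged = tag_english_toponyms("de volta a Keene")
```

Here `tagged` contains 'Keene' while `untagged` is empty, mirroring the limitation noted in the PR description: the word is only recognised when it sits right before the region name.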

p-goulart added the Portuguese label Jun 21, 2024

p-goulart requested review from jaumeortola and susanaboatto June 21, 2024 10:53

p-goulart force-pushed the pt/grammar/improve_english_entities branch from ae7eaf2 to 8b4be06 Compare June 21, 2024 13:00

jaumeortola approved these changes Jun 21, 2024


p-goulart force-pushed the pt/grammar/ignore_probable_city_names branch from ad6571e to b9eaacb Compare June 21, 2024 15:16

susanaboatto approved these changes Jun 25, 2024


p-goulart force-pushed the pt/grammar/improve_english_entities branch from fb4b5c9 to 5f4f53d Compare June 25, 2024 10:08

Base automatically changed from pt/grammar/improve_english_entities to master June 25, 2024 13:20

p-goulart added 5 commits June 25, 2024 16:50

Add two TV shows from diffs to global sp

5c71812

Add major Seinfeld names to globalsp

f1025a6

[pt] Add UK counties & councils to spelling

44fb0b8

[pt] Add English-language toponym speller tests

646b344

p-goulart force-pushed the pt/grammar/ignore_probable_city_names branch from b9eaacb to 646b344 Compare June 25, 2024 14:56

p-goulart merged commit 190624a into master Jun 26, 2024
2 checks passed

p-goulart deleted the pt/grammar/ignore_probable_city_names branch June 26, 2024 05:39
