[pt] Ignore probable English-language toponyms #10674

Merged
merged 5 commits into from
Jun 26, 2024

Conversation

p-goulart
Collaborator

@p-goulart p-goulart commented Jun 21, 2024

While feeding LT text from English-heavy Wikipedia articles to evaluate how well our English word detection was doing, I noticed that English-language toponyms were a big weakness.

The issue was that we generally expected English phrases (or complete sentences), whereas toponyms are often single, isolated words. They also rarely contain 'frequent' words.

The heuristics here might be improved in the future, but for now I think they strike a decent balance.


We need to note a few things:

  • we're not validating whether the lower- and higher-level toponyms match. It'd be cool to check that, but it's outside the scope of the speller; this is just a heuristic to check whether some town name is in the English dictionary;
  • one thing I dislike about this is that it recognises the town name when it is used right before a state/province/county (e.g. 'Keene, estado de Nova Hampshire' is fine) but not when it's used later (e.g. 'de volta a Keene...');
  • I'm wondering if maybe a different tag (_english_toponym_ignore_?) could be used, and then some magic on the Java side could collect all words tagged with _english_toponym_ignore_ and tag them that way even outside the context initially used by the disambiguator, i.e. dynamically updating our tagger dictionary.
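
To make the idea in the last bullet concrete, here is a rough Python sketch of what "remember the words tagged in context, then tag them everywhere" could look like. It is only an illustration of the proposal, not anything implemented in this PR: `propagate_toponym_tags` is a hypothetical helper, and it assumes a first pass has already produced the indices of tokens tagged in a toponym context.

```python
TOPONYM_TAG = "_english_toponym_ignore_"  # tag name floated in the bullet above


def propagate_toponym_tags(tokens, context_tagged):
    """Re-tag every occurrence of a word that was tagged in context.

    tokens: list of token strings for the whole text.
    context_tagged: set of token indices tagged by the (hypothetical)
        context-dependent first pass.
    Returns a dict mapping token index -> tag.
    """
    # Collect the surface forms that the first pass recognised.
    remembered = {tokens[i] for i in context_tagged}
    # Second pass: tag every token with the same surface form
    # (case-sensitive, so "keene" would not match "Keene").
    return {
        i: TOPONYM_TAG
        for i, tok in enumerate(tokens)
        if tok in remembered
    }
```

With this, a 'Keene' tagged in 'Keene, estado de Nova Hampshire' would also be tagged in a later 'de volta a Keene' in the same text.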

@p-goulart p-goulart force-pushed the pt/grammar/ignore_probable_city_names branch from ad6571e to b9eaacb Compare June 21, 2024 15:16
@p-goulart p-goulart force-pushed the pt/grammar/improve_english_entities branch from fb4b5c9 to 5f4f53d Compare June 25, 2024 10:08
Base automatically changed from pt/grammar/improve_english_entities to master June 25, 2024 13:20
 - if unknown (i.e. untagged) words precede a US, Canadian, Australian,
   or British state, province, county or council area, we run the
   English check on them;

 - if they are *valid* English, we tag them with _english_ignore_;

 - this should help us stop correcting English-language toponyms;

 - in the future, I suppose we could have a more global solution for all
   sorts of toponyms... why not have a list of German/Austrian
   Bundesländer, French départements, Italian... pizza slices, etc.
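
The heuristic in the commit message above can be sketched roughly as follows. This is a simplified Python illustration, not the actual disambiguator rule: `REGIONS`, `ENGLISH_WORDS`, `KNOWN_PT_WORDS` and `tag_probable_toponyms` are hypothetical names, the tiny word sets stand in for the real region lists and dictionaries, and requiring a comma after the candidate word is an assumption made for the sketch.

```python
import re

# Hypothetical stand-ins for the real region list and dictionaries.
REGIONS = {"nova hampshire", "ontário", "queensland", "kent"}
ENGLISH_WORDS = {"keene", "springfield", "dover"}
KNOWN_PT_WORDS = {"estado", "de", "em", "volta", "a", "fica", "no", "na"}


def tokenize(text):
    # Split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def tag_probable_toponyms(text, window=4):
    """Return {token index: "_english_ignore_"} for probable English toponyms."""
    tokens = tokenize(text)
    tagged = {}
    for i, tok in enumerate(tokens):
        low = tok.lower()
        # Only unknown (untagged) alphabetic words are candidates.
        if not tok.isalpha() or low in KNOWN_PT_WORDS:
            continue
        # Assumption for this sketch: the candidate is followed by a comma.
        if i + 1 >= len(tokens) or tokens[i + 1] != ",":
            continue
        # Look for a known region name in the next few tokens.
        following = " ".join(t.lower() for t in tokens[i + 2 : i + 2 + window])
        has_region = any(
            re.search(r"\b" + re.escape(region) + r"\b", following)
            for region in REGIONS
        )
        # Tag only if the word is also valid English.
        if has_region and low in ENGLISH_WORDS:
            tagged[i] = "_english_ignore_"
    return tagged
```

In this sketch, 'Keene, estado de Nova Hampshire' tags 'Keene', while 'de volta a Keene' tags nothing, because no region follows the word.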
@p-goulart p-goulart force-pushed the pt/grammar/ignore_probable_city_names branch from b9eaacb to 646b344 Compare June 25, 2024 14:56
@p-goulart p-goulart merged commit 190624a into master Jun 26, 2024
2 checks passed
@p-goulart p-goulart deleted the pt/grammar/ignore_probable_city_names branch June 26, 2024 05:39
3 participants