[pt] Ignore probable English-language toponyms #10674
Merged
While feeding LT text from English-heavy Wikipedia articles to evaluate how well our English word detection was doing, I noticed that English-language toponyms were a big weakness.
The issue was that we generally expected English phrases (or complete sentences), but toponyms are often single, isolated words. Furthermore, they rarely contain 'frequent' words.
The heuristics here might be improved in the future, but for now I think they strike a decent balance.
We need to note a few things:
_english_toponym_ignore_
?) could be used, and then... some magic on the Java side to get all words tagged with _english_toponym_ignore_
and tag them that way even outside the context initially used by the disambiguator... like, dynamically updating our tagger dictionary.
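To make the "dynamically updating our tagger dictionary" idea a bit more concrete, here is a rough sketch of what the Java side might look like. It is only an illustration under assumptions: `DynamicToponymIgnore`, `TaggedToken`, `learn`, and `isIgnored` are hypothetical names made up for this example and are not part of LT's actual API. The only thing taken from this PR is the `_english_toponym_ignore_` tag.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: remembers words the disambiguator tagged as
// probable English toponyms, so they can be ignored in other contexts too.
public class DynamicToponymIgnore {
    // Tag name taken from this PR; everything else here is an assumption.
    public static final String IGNORE_TAG = "_english_toponym_ignore_";

    // Minimal stand-in for a tagged token (not LT's real token class).
    public static final class TaggedToken {
        public final String word;
        public final String tag; // may be null when the token has no tag

        public TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    // Words seen with IGNORE_TAG so far in the current text.
    private final Set<String> learned = new HashSet<>();

    // Collect every word that carries the ignore tag.
    public void learn(List<TaggedToken> tokens) {
        for (TaggedToken token : tokens) {
            if (IGNORE_TAG.equals(token.tag)) {
                learned.add(token.word);
            }
        }
    }

    // True if the word was tagged at least once, regardless of the
    // context in which it appears now. Matching is case-sensitive.
    public boolean isIgnored(String word) {
        return learned.contains(word);
    }
}
```

A caller would run `learn` over the disambiguated tokens of a text first, then consult `isIgnored` for later occurrences of the same word. Whether that state should live per text, per session, or in the tagger dictionary itself is left open here.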