-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pt] Fixed tons of FPs "nouns -> verbs" in disambiguation.xml #10926
Conversation
WalkthroughThe changes in this pull request focus on the Changes
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (5)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml (5)
228-240
: Approved: New rule for disambiguating noun/verb ambiguityThis new rule in the RARE_POS group is a valuable addition for disambiguating words that can be both nouns and verbs. The pattern effectively identifies contexts where the ambiguous word is more likely to be a noun, improving the accuracy of part-of-speech tagging.
Consider adding a comment explaining the rationale behind this rule, including examples of words it's designed to handle. This would improve maintainability and make it easier for other contributors to understand and extend the rule if needed.
Line range hint
1068-1570
: Approved: Comprehensive updates to English word identification rulesThe modifications and additions to the IGNORE_ENGLISH_WORDS rule group significantly enhance LanguageTool's ability to identify and label English words within Portuguese text. The rules cover a wide range of scenarios, including single words, phrases, contractions, and possessives. The use of the IsEnglishWordFilter class adds an extra layer of validation, reducing false positives.
To further improve this rule group, consider implementing a mechanism to handle false positives for words that exist in both English and Portuguese (e.g., "coma", "pasta"). This could involve creating a list of such words and adding exceptions to the relevant rules. Additionally, you might want to add rules to handle common English acronyms and initialisms that might appear in Portuguese text.
Line range hint
1572-1665
: Approved: New rule group for contextual English word identificationThe new IGNORE_WORDS_AROUND_ENGLISH_WORDS rule group is an excellent addition to the disambiguation process. It effectively extends the identification of English words by considering the context around already identified English words. This approach can significantly improve the accuracy of English word detection in Portuguese text, especially for less common words or proper nouns that might not be in the primary English word list.
To further enhance this rule group, consider adding a rule to handle hyphenated words where one part is already identified as English. For example, if you encounter a word like "cloud-based" and "cloud" is already identified as English, it would be beneficial to mark the entire hyphenated word as English. This could help catch compound words or technical terms that are commonly used in English but might not be in the primary English word list.
Line range hint
1667-1675
: Approved: Enhanced handling of words with national prefixesThe modification to the NATIONAL_PREFIXES rule group is a valuable improvement. By ignoring spelling for words that start with national prefixes followed by a hyphen, the system can better handle compound words that may not be in the dictionary but are likely to be correct. This is particularly useful for newly formed words or less common combinations that are nevertheless valid in Portuguese.
To make this rule even more robust, consider adding a check for the second part of the compound word (after the hyphen). You could verify if this part exists as a standalone word in the Portuguese dictionary. This additional check would help prevent false positives where a national prefix is followed by a non-existent or misspelled word. For example:
<rule> <pattern> <marker> <token regexp="yes" postag_regexp="yes" postag="(N|A).+">(&national_prefixes;)-<match regexp="[a-záàâãéêíóôõúç]+"/></token> </marker> </pattern> <filter class="org.languagetool.rules.patterns.DictionaryExistsFilter" args="2"/> <disambig action="ignore_spelling"/> </rule>This modification would ensure that both parts of the compound word are valid, further improving the accuracy of the spell-checking process.
Line range hint
1-1675
: Overall assessment: Significant improvements to Portuguese disambiguation rulesThe changes made to this disambiguation.xml file represent a substantial enhancement to LanguageTool's ability to process Portuguese text. The additions and modifications focus on several key areas:
- Improved disambiguation of words that can be both nouns and verbs.
- Enhanced detection and handling of English words within Portuguese text.
- Better processing of words with national prefixes.
These improvements will likely result in more accurate part-of-speech tagging, better handling of multilingual content, and improved overall performance of LanguageTool for Portuguese texts.
As the rule set continues to grow and become more complex, consider implementing a system for automated testing of these rules. This could involve creating a comprehensive test suite with various edge cases and multilingual scenarios. Such a system would help ensure that future modifications don't inadvertently break existing functionality and would make it easier to validate the effectiveness of new rules.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml (1 hunks)
🧰 Additional context used
Hello @jaumeortola , @susanaboatto and @p-goulart , I have done major fixes in the disambiguator regarding nouns that appear incorrectly as verbs. I had to comment out examples in grammar.xml and style.xml broken by the new changes. This is a major improvement which should fix tons and tons and tons of false positives. Please take your time to review the code. Thanks! Dummy rule to test it:
|
@marcoagpinto What happens with the examples that you commented out? Are they now FPs? Can you fix them? |
They are false positives now, I believe. I have been terribly stressed and spend the nights rewriting rules. Maybe they should be left commented out if they are false positives, and then somewhere in time they are replaced? Right now, I am finishing rewriting and testing another rule. |
I have been basically rewriting pt-PT rules because @p-goulart told me I can approve them myself, so I spend the nights rewriting and approving. |
Anyway, I believe the patch only broke three (3) examples? There is more gain than loss, I believe. |
Thank you! ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ I was napping. 😛 😛 😛 😛 😛 😛 😛 I spend the nights awaken and try to sleep during the days. |
@marcoagpinto
We need to revert the disambiguation rule, and do more tests before adding it. |
Please revert it if you can. I have been since 4am fixing the FPs, but they are too many for me to handle all at once. It may take more than a day... my guess is that it may take days to review everything. Thank you! |
I found the "revert" button and created a pull request to revert it. I am looking at the results and I need a few days to fix them. Thanks! |
Heya, @jaumeortola , @susanaboatto and @p-goulart
This is a disambiguator rule to fix nouns appearing as verbs.
I have fixed 4686 false positives while testing against 950 000 sentences.
I just gave a brief look at the .txt, and it appears to be fine, but here it is anyway.
Dummy rule for testing:
_0.txt
Summary by CodeRabbit
New Features
Improvements
Documentation
Formatting