-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix and simplify the rule "CONFUSION_ER_E_PAR" #9875
Conversation
Thanks for your contributions. This change is too big to be accepted. Please keep this rulegroup as it is and write a new rulegroup, with a new ID, for example CONFUSION_ER_E_PAR2. Add the attribute
All tests must pass before a commit is merged. There are several ways of running tests. For French, you can run, for example, There are these errors in your commit:
|
@jaumeortola Thank you for your feedback. I'll finish setting up a Java dev environment and fix the issues. Where can I write new tests? Also, I don't think that the current rule should be preserved. Some parts really seem wrong. I think it's better to provide less information than fake information. Don't you agree? |
You can add new tests with
New and modified rules go into production every day automatically. We cannot take any risk. So for new rules, always use |
Ok, but the other rule will interfere with the new one. Is there a way to disable the old one during the test phase? |
78b06a2
to
cd34976
Compare
No, both rules will be testing independently against the test corpus. |
I'd need help here too with generating the suggestions. How can I create a suggestion that would replace "cameras" with "caméras" and "tractes" with "tracts" ? Adding one "suggestion" entry per case doesn't seem to work because, apparently, each suggestion must fix the sentence. Is there a way to add a conditional suggestion? |
We don't have a ready-made solution for this. You will need a rule per word, or a rule per change. You could use one rule for all the suggestions where the change is just e->é, another one for z->s... If the list of words and suggestions is going to grow, then we could put them in an external text file, and add a
|
|
From what I understand of LanguageTool, the rule CAMERA doesn't handle the plural (cameras/caméras), does it? Also, PRONSUJ_NONVERBE suggests that tractes should be replaced with "tractés" or "tractent" while CONFUSION_ER_E_PAR suggests the noun "tracts"
I went to use a rule per word. This way, the message can be very specific and very helpful for the user. |
I'm afraid that your example won't work with "les", but I might be wrong. |
Is this MR ok for you? Is there something I should work on? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Sharcoux,
Thanks for your PR.
Could you kindly explain why there are changes on rules that have nothing to do with CONFUSION_ER_E_PAR?
languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/grammar.xml
Outdated
Show resolved
Hide resolved
languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/grammar.xml
Outdated
Show resolved
Hide resolved
I removed the changes. |
@LucieSteib is it ok now? |
Hey guys, happy new year! Should we do something about this MR? @LucieSteib @jaumeortola ? |
Hi Sharcoux and happy new year! I think we can merge this, you'll be able to take a look at the diff tomorrow and make changes to the rule if you encounter false positives. Link to the diff: https://internal1.languagetool.org/regression-tests/via-http/2024-01-07/fr/index.html (you'll need to change the date to 2024-01-08) Oh, and by the way, the diff from the other PR can be found here: https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html |
Hi @GillouLT , great news. I'd be happy to investigate the changes to make sure that everything is fine, but I'm not sure about how I can read the table at https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html and https://internal1.languagetool.org/regression-tests/via-http/2024-01-08/fr/index.html My main interrogation is about the sentence that generated each row, and also about how to know what comes from the previous version and what comes from the new one. |
Hi @Sharcoux, happy new year!
|
Ok. Then, there are many problems with the previous rules. Most of them are because LanguageTool incorrectly identify the role of some words in the sentence. For instance:
There is also a problem that I should solve about verbs when followed by a proper noun. I'll fix that. But I don't know how to fix the other ones. I'm also not sure about how I should fix the previous rule now the PR has been merged. Any advice, @GillouLT @LucieSteib ? |
You can work on the rules even after they are merged, just work on them until you think you would have handled most of the false positives, and then request a new PR, add us as reviewers. |
@LucieSteib I'm not sure about how the system works. I have this in the rule:
But here is the analysis I get for "classe": From what I understand, "classe" can be a verb or a noun here, which is correct as it is a noun. But I don't understand why the rule is being triggered, as it can be a noun. How should I write the rule to be triggered only if the word can only be a verb? |
@LucieSteib Can you help me a bit with this? I don't know the xml of the project well enough. |
Hi Sharcoux! If I'm not mistaken, you would like to prevent the rule from triggering when the token is a verb but is also tagged as a noun (or else). In this case, you can use the
More info on the exception tag: Hope this will help :) |
There is something suspicious. How can Also, If I want to match only |
The same issue would arise with a rule targeting
It's a common way to handle it, but we don't generally need too many exceptions, in this case
Yes, the rule editor's corpus is very helpful here, to see if we have some edge cases where the token isn't correctly classified for example, and we can improve the rule based on it As for why a token sometimes gets multiple tags: For a practical look, experimenting with sentences that include "classe" or "manque" in the Text Analysis tool shows how context changes tagging. For instance, "Je classe..." is tagged as a verb (and a verb only), while "La classe de troisième B" is seen as a noun (and noun only). Some cases, like "Classe 1ère," get multiple tags due to lack of specific disambiguator rules. In fact, in the Text Analysis tool, you can see the disambiguator log, which is where we can see how the token gets correctly tagged (or not): The disambiguation file for FR is located here in the project: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/disambiguation.xml Some more documentation about the disambiguator: https://dev.languagetool.org/developing-a-disambiguator.html |
@GillouLT I can't access anymore the file at https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html I guess this is normal, but I can't investigate the rule. |
Sure, you can still get access to the full FR log where all matches are stored: https://internal1.languagetool.org/regression-tests/via-http/2024-03-10/fr_full/index.html |
I opened a new PR that should fix most of the problems. However, the German CI is broken, but there is no relation with my changes. If someone can have a look @GillouLT @LucieSteib ? |
The original rule, the way it was written, was summing up a set of 25 rules, ending up with way more complexity than necessary, and leading to false positives, like "L'eau entre par la bouche du poisson et ressort par les opercules." where "entre" would have been flagged as false while it's correct. The new set of rules should achieve the same result, but simplify the cases to limit the risks of false positive.
Notice that my third rule is only checking for "les + xxx" to detect a verb that should have likely been a noun. We could probably improve this by adding "des", "quels", plural adjectives, etc
I didn't understand yet how to test my changes. I'm still trying to figure it out.