Fix and simplify the rule "CONFUSION_ER_E_PAR" #9875

Merged
6 commits merged into languagetool-org:master on Jan 8, 2024

Conversation

Sharcoux
Copy link
Contributor
Contributor

The original rule, as written, combined a set of 25 sub-rules, which added far more complexity than necessary and led to false positives such as "L'eau entre par la bouche du poisson et ressort par les opercules.", where "entre" would have been flagged as wrong even though it is correct. The new set of rules should achieve the same result while simplifying the cases to limit the risk of false positives.

Notice that my third rule only checks for "les + xxx" to detect a verb that should likely have been a noun. We could probably improve this by also covering "des", "quels", plural adjectives, etc.
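For illustration only, here is a rough sketch of the shape such a pattern rule takes in grammar.xml; the ID, message, and tags below are hypothetical, and examples/suggestions are omitted, so this is not the exact rule from this PR:

<rule id="LES_PLUS_VERBE_SKETCH" name="les + verbe (sketch)" default="temp_off">
  <pattern>
    <token>les</token>
    <marker>
      <!-- a word tagged as a verb right after "les"; it should likely have been a noun -->
      <token postag="V.*" postag_regexp="yes"/>
    </marker>
  </pattern>
  <message>Après « les », un nom semble plus probable qu'un verbe.</message>
</rule>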

I haven't yet figured out how to test my changes; I'm still working on it.

@jaumeortola
Copy link
Member
Member

jaumeortola commented Dec 11, 2023

Thanks for your contributions.

This change is too big to be accepted. Please keep this rulegroup as it is and write a new rulegroup, with a new ID, for example CONFUSION_ER_E_PAR2. Add the attribute default="temp_off" (temporarily off) so that the rule can be tested properly before it goes into production for users.

<rulegroup id="CONFUSION_ER_E_PAR2" name="commencer par (commencé par)" default="temp_off">

All tests must pass before a commit is merged. There are several ways of running tests. For French, you can run, for example, mvn --projects languagetool-language-modules/fr --also-make test

Your commit currently produces these errors:

[ERROR] org.languagetool.rules.fr.FrenchPatternRuleTest.testRules  Time elapsed: 36.604 s  <<< ERROR!
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule CONFUSION_ER_E_PAR[2] in file /org/languagetool/rules/fr/grammar.xml (line 54184): Incorrect match position markup (expected match position: 10 - 19, actual: 10 - 23) in sentence: Les cours envisages par cet élève ne paraissent pas pertinents.
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:502)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:658)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$2(PatternRuleTest.java:537)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

[ERROR] org.languagetool.rules.fr.FrenchPatternRuleTest.testRules  Time elapsed: 36.604 s  <<< ERROR!
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule CONFUSION_ER_E_PAR[3] in file /org/languagetool/rules/fr/grammar.xml (line 54195): Incorrect suggestions: Expected 'caméras', got: 'caméras|cameras' on input: 'Il fournit les cameras.'
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:502)
        at org.languagetool.rules.patterns.PatternRuleTest.assertSuggestions(PatternRuleTest.java:769)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:661)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$2(PatternRuleTest.java:537)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

[ERROR] org.languagetool.rules.fr.FrenchPatternRuleTest.testRules  Time elapsed: 36.604 s  <<< ERROR!
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: 
Test failure for rule CONFUSION_ER_E_PAR[3] in file /org/languagetool/rules/fr/grammar.xml (line 54195): Incorrect input:
  Il fournit les cameras.
Corrected sentence:
  Il fournit les cameras.
The correction triggered an error itself:
  CONFUSION_ER_E_PAR[3]:15-22:Un mot correctement orthographié semble plus approprié.

        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:502)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:671)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$2(PatternRuleTest.java:537)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)


@Sharcoux
Copy link
Contributor Author
Contributor Author

@jaumeortola Thank you for your feedback. I'll finish setting up a Java dev environment and fix the issues. Where can I write new tests?

Also, I don't think that the current rule should be preserved. Some parts really seem wrong. I think it's better to provide less information than false information. Don't you agree?

@jaumeortola
Copy link
Member
Member

jaumeortola commented Dec 11, 2023

Where can I write new tests?

You can add new tests with <example correction="...."></example> for sentences with some error to be detected, and <example></example> for sentences where no error should be detected.
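For instance, inside a rule the two kinds of test sentences look roughly like this (the sentences below are made up for illustration; <marker> surrounds the text the rule is expected to flag, and correction lists the expected suggestions):

<example correction="commencé">Il a <marker>commencer</marker> par la fin.</example>
<example>Il veut commencer par la fin.</example>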

Also, I don't think that the current rule should be preserved. Some parts really seem wrong. I think it's better to provide less information than false information. Don't you agree?

New and modified rules go into production every day automatically. We cannot take any risk. So for new rules, always use default="temp_off". For a major change in an existing rule, keep the old rule as it is, and create a new one with default="temp_off". We'll decide whether and when to switch to the new version after testing. We run a daily test with a large corpus. You can see the results here: https://internal1.languagetool.org/regression-tests/via-http/2023-12-10/

@Sharcoux
Copy link
Contributor Author
Contributor Author

Ok, but the other rule will interfere with the new one. Is there a way to disable the old one during the test phase?

@LucieSteib
Copy link
Collaborator
Collaborator

No, both rules will be tested independently against the test corpus.

@Sharcoux
Copy link
Contributor Author
Contributor Author

I'd need help here too with generating the suggestions. How can I create a suggestion that would replace "cameras" with "caméras" and "tractes" with "tracts" ?

Adding one "suggestion" entry per case doesn't seem to work because, apparently, each suggestion must fix the sentence. Is there a way to add a conditional suggestion?

@jaumeortola
Copy link
Member
Member

I'd need help here too with generating the suggestions. How can I create a suggestion that would replace "cameras" with "caméras" and "tractes" with "tracts" ?

We don't have a ready-made solution for this. You will need a rule per word, or a rule per change. You could use one rule for all the suggestions where the change is just e->é, another one for z->s...
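One possible shape for such a per-change rule is to derive the suggestion from the matched word with a regular-expression replacement in the <match> element; this is only a sketch, assuming the flagged word is the second token of the pattern and that the change is the word-final er → é case:

<message>Le participe passé semble plus approprié ici.</message>
<!-- rebuild the suggestion from token 2, replacing the final "er" with "é" -->
<suggestion><match no="2" regexp_match="(.+)er$" regexp_replace="$1é"/></suggestion>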

If the list of words and suggestions is going to grow, then we could put them in an external text file, and add a <filter> (a Java extension) to the XML rule. Or maybe we can re-use an existing filter. This is used in the rule PREP_VERBECONJUGUE:

<filter class="org.languagetool.rules.fr.FindSuggestionsFilter" args="wordFrom:2 desiredPostag:[NJZ].*|V.ppa.*|V.*inf"/>

@LucieSteib
Copy link
Collaborator
Collaborator

Cameras and tractes are already taken care of by LT (CAMERA[1] and PRONSUJ_NONVERBE[117]), by the way.

@Sharcoux
Copy link
Contributor Author
Contributor Author

Cameras and tractes are already taken care of by LT (CAMERA[1] and PRONSUJ_NONVERBE[117]), by the way.

From what I understand of LanguageTool, the rule CAMERA doesn't handle the plural (cameras/caméras), does it? Also, PRONSUJ_NONVERBE suggests that "tractes" should be replaced with "tractés" or "tractent", while CONFUSION_ER_E_PAR suggests the noun "tracts".

We don't have a ready-made solution for this. You will need a rule per word, or a rule per change. You could use one rule for all the suggestions where the change is just e->é, another one for z->s...

I went with a rule per word. This way, the message can be very specific and helpful for the user.

@LucieSteib
Copy link
Collaborator
Collaborator

Well, another rule will handle the plural... which doesn't mean there's no room for a unified rule, but we usually avoid adding more rules to a case that already gets a true positive with a correct suggestion ;)
[screenshot: example match]

@Sharcoux
Copy link
Contributor Author
Contributor Author

Well, another rule will handle the plural...

I'm afraid that your example won't work with "les", but I might be wrong.

@Sharcoux
Copy link
Contributor Author
Contributor Author

Is this MR ok for you? Is there something I should work on?

@LucieSteib
Collaborator


Hi Sharcoux,

Thanks for your PR.
Could you kindly explain why there are changes to rules that have nothing to do with CONFUSION_ER_E_PAR?

@Sharcoux
Copy link
Contributor Author
Contributor Author

I removed the changes.

@Sharcoux
Copy link
Contributor Author
Contributor Author

@LucieSteib is it ok now?

@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Jan 8, 2024

Hey guys, happy new year!

Should we do something about this MR? @LucieSteib @jaumeortola ?

@GillouLT
Copy link
Collaborator
Collaborator

GillouLT commented Jan 8, 2024

Hi Sharcoux and happy new year!

I think we can merge this; you'll be able to take a look at the diff tomorrow and make changes to the rule if you encounter false positives.

Link to the diff: https://internal1.languagetool.org/regression-tests/via-http/2024-01-07/fr/index.html (you'll need to change the date to 2024-01-08)

Oh, and by the way, the diff from the other PR can be found here: https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html

GillouLT merged commit e9b317d into languagetool-org:master on Jan 8, 2024
2 checks passed
@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Jan 8, 2024

Hi Sharcoux and happy new year!

I think we can merge this; you'll be able to take a look at the diff tomorrow and make changes to the rule if you encounter false positives.

Link to the diff: https://internal1.languagetool.org/regression-tests/via-http/2024-01-07/fr/index.html (you'll need to change the date to 2024-01-08)

Oh, and by the way, the diff from the other PR can be found here: https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html

Hi @GillouLT, great news. I'd be happy to investigate the changes to make sure everything is fine, but I'm not sure how to read the tables at https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html and https://internal1.languagetool.org/regression-tests/via-http/2024-01-08/fr/index.html

My main questions are which sentence generated each row, and how to tell what comes from the previous version and what comes from the new one.

@LucieSteib
Copy link
Collaborator
Collaborator

LucieSteib commented Jan 8, 2024

Hi @Sharcoux, happy new year!

  • There's a field in which you can type the ID of the rule(s) you want to check (see the screenshot below).
  • Everything that's green is new compared to the previous day (the OS is checked every night, so it's the date you've selected against the day before). Pink (or red) entries are matches that were removed (not there anymore).
  • The sentences come from our test corpora (the huge text file on which the rules are tested each night).
    [screenshot: rule ID filter field]

@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Jan 8, 2024

Ok. Then, there are many problems with the previous rules. Most of them arise because LanguageTool incorrectly identifies the role of some words in the sentence. For instance:

Cycle de vie : Classe 1ère-Durée 3h30 => here, "classe" is a noun, but is identified as a verb.

Frédérique Puissat est 1ère vice présidente du conseil départemental de l'Isère, en charge de la famille, enfance et santé => Here "1ère" is an adjective, but is identified as a noun.

There is also a problem I need to solve with verbs followed by a proper noun. I'll fix that. But I don't know how to fix the other ones. I'm also not sure how I should fix the previous rule now that the PR has been merged.

Any advice, @GillouLT @LucieSteib ?

@LucieSteib
Copy link
Collaborator
Collaborator

You can keep working on the rules even after they are merged: work on them until you think you've handled most of the false positives, then open a new PR and add us as reviewers.
As for the part-of-speech tagging, the system is what it is, and it's also largely dependent on the correctness of the sentences in the first place (the tagger is sometimes confused by atypical sentences like lists or titles...).
Remember, you can always check the PoS tags using https://community.languagetool.org/analysis/analyzeText.

@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Jan 9, 2024

@LucieSteib I'm not sure how the system works. I have this in the rule:

<token postag="V.*" postag_regexp="yes" inflected="yes">

But here is the analysis I get for "classe":

[screenshot: analysis output for « classe »]

From what I understand, "classe" can be a verb or a noun here, which is correct as it is a noun. But I don't understand why the rule is being triggered, as it can be a noun. How should I write the rule to be triggered only if the word can only be a verb?

@Sharcoux
Copy link
Contributor Author
Contributor Author

@LucieSteib Can you help me a bit with this? I don't know the project's XML well enough.

@GillouLT
Copy link
Collaborator
Collaborator

Hi Sharcoux! 

If I'm not mistaken, you would like to prevent the rule from triggering when the token is a verb but is also tagged as a noun (or something else). In this case, you can use the <exception> tag, and it should look like:

<token postag="V.*" postag_regexp="yes" inflected="yes"><exception postag="N.*" postag_regexp="yes"/></token>

More info on the exception tag:
https://dev.languagetool.org/how-to-avoid-mistakes-when-writing-rules.html
https://forum.languagetool.org/t/about-exception/3673
https://dev.languagetool.org/tips-and-tricks#various-forms-of-negation

Hope this will help :)

@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Feb 12, 2024

There is something suspicious. How can postag="V.*" match V.* AND N.* but postag="N.*" would match only N.* but NOT V.*?

Also, if I want to match only V.*, basically, what you're saying is that I should use <token postag="V.*" postag_regexp="yes" inflected="yes"> and then use an exception for all the other possible types? Then what's the point of V.*? If I have only A|B|C as possibilities, saying "I want only A but not B and not C" is the same as saying "I don't want B nor C". But in any case, that requires knowing the list of possible values and listing them all, which is quite troublesome...

@GillouLT
Copy link
Collaborator
Collaborator

GillouLT commented Feb 12, 2024

There is something suspicious. How can postag="V." match V. AND N.* but postag="N." would match only N. but NOT V.*?

The same issue would arise with a rule targeting postag="N.*": in some edge cases we would get a match on an inflected verb (or something else), like "classe" or "manque". Depending on the issue, we can fix it by adding an exception, adding a new antipattern, or changing the scope of the pattern.

what you're saying is that I should use <token postag="V.*" postag_regexp="yes" inflected="yes"> and then use an exception for all the other possible types?

It's a common way to handle it, but we generally don't need too many exceptions; in this case <exception postag="N.*" postag_regexp="yes"/> would most probably cover the majority of false positives.
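For reference, the antipattern route mentioned above would look roughly like this (a hypothetical sketch, not a rule from this PR; an antipattern placed before the <pattern> suppresses matches that overlap it, here any word after "les" that also has a noun reading):

<antipattern>
  <token>les</token>
  <!-- any word that also has a noun reading, e.g. "classe", "manque" -->
  <token postag="N.*" postag_regexp="yes"/>
</antipattern>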

But in any case, that requires knowing the list of possible values and listing them all, which is quite troublesome...

Yes, the rule editor's corpus is very helpful here, to see if we have some edge cases where the token isn't correctly classified for example, and we can improve the rule based on it

As for why a token sometimes gets multiple tags: For a practical look, experimenting with sentences that include "classe" or "manque" in the Text Analysis tool shows how context changes tagging. For instance, "Je classe..." is tagged as a verb (and a verb only), while "La classe de troisième B" is seen as a noun (and noun only). Some cases, like "Classe 1ère," get multiple tags due to lack of specific disambiguator rules. In fact, in the Text Analysis tool, you can see the disambiguator log, which is where we can see how the token gets correctly tagged (or not):

[screenshot: disambiguator log for « Je classe »]

The disambiguation file for FR is located here in the project: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/disambiguation.xml

Some more documentation about the disambiguator: https://dev.languagetool.org/developing-a-disambiguator.html
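For reference, disambiguation rules use a similar pattern syntax; this is only a hedged sketch of the general shape (not an actual rule from disambiguation.xml):

<rule id="JE_CLASSE_SKETCH" name="je + classe → verbe (sketch)">
  <pattern>
    <token>je</token>
    <marker>
      <token>classe</token>
    </marker>
  </pattern>
  <!-- keep only the readings of the marked token whose POS tag matches V.* -->
  <disambig action="filter" postag="V.*"/>
</rule>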

@Sharcoux
Copy link
Contributor Author
Contributor Author

@GillouLT I can no longer access the file at https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html. I guess this is expected, but it means I can't investigate the rule.

@GillouLT
Copy link
Collaborator
Collaborator

Sure, you can still get access to the full FR log where all matches are stored: https://internal1.languagetool.org/regression-tests/via-http/2024-03-10/fr_full/index.html
You should be able to search for the rule ID from there and get the same results as last time.

@Sharcoux
Copy link
Contributor Author
Contributor Author

I opened a new PR that should fix most of the problems. However, the German CI is broken, which is unrelated to my changes. Could someone have a look, @GillouLT @LucieSteib?
