Fix and simplify the rule "CONFUSION_ER_E_PAR" #9875

Merged
6 commits merged into languagetool-org:master on Jan 8, 2024

Conversation

Sharcoux
Copy link
Contributor
Contributor

The original rule, as written, combined a set of 25 sub-rules, which added far more complexity than necessary and led to false positives such as "L'eau entre par la bouche du poisson et ressort par les opercules.", where "entre" would have been flagged as wrong even though it is correct. The new set of rules should achieve the same result while simplifying the cases to limit the risk of false positives.

Notice that my third rule only checks for "les + xxx" to detect a verb that should likely have been a noun. We could probably improve this by also covering "des", "quels", plural adjectives, etc.
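For illustration only, here is a rough sketch of the shape such a pattern rule takes in grammar.xml; the ID, message, and tags below are hypothetical, and examples/suggestions are omitted, so this is not the exact rule from this PR:

<rule id="LES_PLUS_VERBE_SKETCH" name="les + verbe (sketch)" default="temp_off">
  <pattern>
    <token>les</token>
    <marker>
      <!-- a word tagged as a verb right after "les"; it should likely have been a noun -->
      <token postag="V.*" postag_regexp="yes"/>
    </marker>
  </pattern>
  <message>Après « les », un nom semble plus probable qu'un verbe.</message>
</rule>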

I haven't yet figured out how to test my changes; I'm still working on it.

@jaumeortola
Copy link
Member
Member

jaumeortola commented Dec 11, 2023

Thanks for your contributions.

This change is too big to be accepted. Please keep this rulegroup as it is and write a new rulegroup, with a new ID, for example CONFUSION_ER_E_PAR2. Add the attribute default="temp_off" (temporarily off) so that the rule can be tested properly before it goes into production for users.

<rulegroup id="CONFUSION_ER_E_PAR2" name="commencer par (commencé par)" default="temp_off">

All tests must pass before a commit is merged. There are several ways of running tests. For French, you can run, for example, mvn --projects languagetool-language-modules/fr --also-make test

Your commit currently produces these errors:

[ERROR] org.languagetool.rules.fr.FrenchPatternRuleTest.testRules  Time elapsed: 36.604 s  <<< ERROR!
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule CONFUSION_ER_E_PAR[2] in file /org/languagetool/rules/fr/grammar.xml (line 54184): Incorrect match position markup (expected match position: 10 - 19, actual: 10 - 23) in sentence: Les cours envisages par cet élève ne paraissent pas pertinents.
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:502)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:658)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$2(PatternRuleTest.java:537)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

[ERROR] org.languagetool.rules.fr.FrenchPatternRuleTest.testRules  Time elapsed: 36.604 s  <<< ERROR!
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule CONFUSION_ER_E_PAR[3] in file /org/languagetool/rules/fr/grammar.xml (line 54195): Incorrect suggestions: Expected 'caméras', got: 'caméras|cameras' on input: 'Il fournit les cameras.'
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:502)
        at org.languagetool.rules.patterns.PatternRuleTest.assertSuggestions(PatternRuleTest.java:769)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:661)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$2(PatternRuleTest.java:537)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

[ERROR] org.languagetool.rules.fr.FrenchPatternRuleTest.testRules  Time elapsed: 36.604 s  <<< ERROR!
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: 
Test failure for rule CONFUSION_ER_E_PAR[3] in file /org/languagetool/rules/fr/grammar.xml (line 54195): Incorrect input:
  Il fournit les cameras.
Corrected sentence:
  Il fournit les cameras.
The correction triggered an error itself:
  CONFUSION_ER_E_PAR[3]:15-22:Un mot correctement orthographié semble plus approprié.

        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:502)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:671)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$2(PatternRuleTest.java:537)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)


@Sharcoux
Copy link
Contributor Author
Contributor Author

@jaumeortola Thank you for your feedback. I'll finish setting up a Java dev environment and fix the issues. Where can I write new tests?

Also, I don't think that the current rule should be preserved. Some parts really seem wrong. I think it's better to provide less information than false information. Don't you agree?

@jaumeortola
Copy link
Member
Member

jaumeortola commented Dec 11, 2023

Where can I write new tests?

You can add new tests with <example correction="...."></example> for sentences with some error to be detected, and <example></example> for sentences where no error should be detected.
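For instance, inside a rule the two kinds of test sentences look roughly like this (the sentences below are made up for illustration; <marker> surrounds the text the rule is expected to flag, and correction lists the expected suggestions):

<example correction="commencé">Il a <marker>commencer</marker> par la fin.</example>
<example>Il veut commencer par la fin.</example>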

Also, I don't think that the current rule should be preserved. Some parts really seem wrong. I think it's better to provide less information than false information. Don't you agree?

New and modified rules go into production every day automatically. We cannot take any risk. So for new rules, always use default="temp_off". For a major change in an existing rule, keep the old rule as it is, and create a new one with default="temp_off". We'll decide whether and when to switch to the new version after testing. We run a daily test with a large corpus. You can see the results here: https://internal1.languagetool.org/regression-tests/via-http/2023-12-10/

@Sharcoux
Copy link
Contributor Author
Contributor Author

Ok, but the other rule will interfere with the new one. Is there a way to disable the old one during the test phase?

@LucieSteib
Copy link
Collaborator
Collaborator

No, both rules will be tested independently against the test corpus.

@Sharcoux
Copy link
Contributor Author
Contributor Author

I'd need help here too with generating the suggestions. How can I create a suggestion that would replace "cameras" with "caméras" and "tractes" with "tracts" ?

Adding one "suggestion" entry per case doesn't seem to work because, apparently, each suggestion must fix the sentence. Is there a way to add a conditional suggestion?

@jaumeortola
Copy link
Member
Member

I'd need help here too with generating the suggestions. How can I create a suggestion that would replace "cameras" with "caméras" and "tractes" with "tracts" ?

We don't have a ready-made solution for this. You will need a rule per word, or a rule per change. You could use one rule for all the suggestions where the change is just e->é, another one for z->s...
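One possible shape for such a per-change rule is to derive the suggestion from the matched word with a regular-expression replacement in the <match> element; this is only a sketch, assuming the flagged word is the second token of the pattern and that the change is the word-final er → é case:

<message>Le participe passé semble plus approprié ici.</message>
<!-- rebuild the suggestion from token 2, replacing the final "er" with "é" -->
<suggestion><match no="2" regexp_match="(.+)er$" regexp_replace="$1é"/></suggestion>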

If the list of words and suggestions is going to grow, then we could put them in an external text file, and add a <filter> (a Java extension) to the XML rule. Or maybe we can re-use an existing filter. This is used in the rule PREP_VERBECONJUGUE:

<filter class="org.languagetool.rules.fr.FindSuggestionsFilter" args="wordFrom:2 desiredPostag:[NJZ].*|V.ppa.*|V.*inf"/>

@LucieSteib
Copy link
Collaborator
Collaborator

Cameras and tractes are already taken care of by LT (CAMERA[1] and PRONSUJ_NONVERBE[117]), by the way.

@Sharcoux
Copy link
Contributor Author
Contributor Author

Cameras and tractes are already taken care of by LT (CAMERA[1] and PRONSUJ_NONVERBE[117]), by the way.

From what I understand of LanguageTool, the rule CAMERA doesn't handle the plural (cameras/caméras), does it? Also, PRONSUJ_NONVERBE suggests that "tractes" should be replaced with "tractés" or "tractent", while CONFUSION_ER_E_PAR suggests the noun "tracts".

We don't have a ready-made solution for this. You will need a rule per word, or a rule per change. You could use one rule for all the suggestions where the change is just e->é, another one for z->s...

I went with a rule per word. This way, the message can be very specific and helpful for the user.

@LucieSteib
Copy link
Collaborator
Collaborator

Well, another rule will handle the plural... which doesn't mean there's no room for a unified rule, but we usually avoid adding more rules to a case that already gets a true positive with a correct suggestion ;)
[screenshot: example match]

@Sharcoux
Copy link
Contributor Author
Contributor Author

Well, another rule will handle the plural...

I'm afraid that your example won't work with "les", but I might be wrong.

@Sharcoux
Copy link
Contributor Author
Contributor Author

Is this MR ok for you? Is there something I should work on?

@LucieSteib
Collaborator


Hi Sharcoux,

Thanks for your PR.
Could you kindly explain why there are changes to rules that have nothing to do with CONFUSION_ER_E_PAR?

@Sharcoux
Copy link
Contributor Author
Contributor Author

I removed the changes.

@Sharcoux
Copy link
Contributor Author
Contributor Author

@LucieSteib is it ok now?

@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Jan 8, 2024

Hey guys, happy new year!

Should we do something about this MR? @LucieSteib @jaumeortola ?

@GillouLT
Copy link
Collaborator
Collaborator

GillouLT commented Jan 8, 2024

Hi Sharcoux and happy new year!

I think we can merge this; you'll be able to take a look at the diff tomorrow and make changes to the rule if you encounter false positives.

Link to the diff: https://internal1.languagetool.org/regression-tests/via-http/2024-01-07/fr/index.html (you'll need to change the date to 2024-01-08)

Oh, and by the way, the diff from the other PR can be found here: https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html

GillouLT merged commit e9b317d into languagetool-org:master on Jan 8, 2024
2 checks passed
@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Jan 8, 2024

Hi Sharcoux and happy new year!

I think we can merge this; you'll be able to take a look at the diff tomorrow and make changes to the rule if you encounter false positives.

Link to the diff: https://internal1.languagetool.org/regression-tests/via-http/2024-01-07/fr/index.html (you'll need to change the date to 2024-01-08)

Oh, and by the way, the diff from the other PR can be found here: https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html

Hi @GillouLT, great news. I'd be happy to investigate the changes to make sure everything is fine, but I'm not sure how to read the tables at https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html and https://internal1.languagetool.org/regression-tests/via-http/2024-01-08/fr/index.html

My main questions are which sentence generated each row, and how to tell what comes from the previous version and what comes from the new one.

@LucieSteib
Copy link
Collaborator
Collaborator

LucieSteib commented Jan 8, 2024

Hi @Sharcoux, happy new year!

  • There's a field in which you can type the ID of the rule(s) you want to check (see the screenshot below).
  • Everything that's green is new compared to the previous day (the OS is checked every night, so it's the date you've selected against the day before). Pink (or red) entries are matches that were removed (not there anymore).
  • The sentences come from our test corpora (the huge text file on which the rules are tested each night).
    [screenshot: rule ID filter field]

@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Jan 8, 2024

Ok. Then, there are many problems with the previous rules. Most of them arise because LanguageTool incorrectly identifies the role of some words in the sentence. For instance:

Cycle de vie : Classe 1ère-Durée 3h30 => here, "classe" is a noun, but is identified as a verb.

Frédérique Puissat est 1ère vice présidente du conseil départemental de l'Isère, en charge de la famille, enfance et santé => Here "1ère" is an adjective, but is identified as a noun.

There is also a problem I need to solve with verbs followed by a proper noun. I'll fix that. But I don't know how to fix the other ones. I'm also not sure how I should fix the previous rule now that the PR has been merged.

Any advice, @GillouLT @LucieSteib ?

@LucieSteib
Copy link
Collaborator
Collaborator

You can keep working on the rules even after they are merged: work on them until you think you've handled most of the false positives, then open a new PR and add us as reviewers.
As for the part-of-speech tagging, the system is what it is, and it's also largely dependent on the correctness of the sentences in the first place (the tagger is sometimes confused by atypical sentences like lists or titles...).
Remember, you can always check the PoS tags using https://community.languagetool.org/analysis/analyzeText.

@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Jan 9, 2024

@LucieSteib I'm not sure how the system works. I have this in the rule:

<token postag="V.*" postag_regexp="yes" inflected="yes">

But here is the analysis I get for "classe":

[screenshot: analysis output for « classe »]

From what I understand, "classe" can be a verb or a noun here, which is correct as it is a noun. But I don't understand why the rule is being triggered, as it can be a noun. How should I write the rule to be triggered only if the word can only be a verb?

@Sharcoux
Copy link
Contributor Author
Contributor Author

@LucieSteib Can you help me a bit with this? I don't know the project's XML well enough.

@GillouLT
Copy link
Collaborator
Collaborator

Hi Sharcoux! 

If I'm not mistaken, you would like to prevent the rule from triggering when the token is a verb but is also tagged as a noun (or something else). In this case, you can use the <exception> tag, and it should look like:

<token postag="V.*" postag_regexp="yes" inflected="yes"><exception postag="N.*" postag_regexp="yes"/></token>

More info on the exception tag:
https://dev.languagetool.org/how-to-avoid-mistakes-when-writing-rules.html
https://forum.languagetool.org/t/about-exception/3673
https://dev.languagetool.org/tips-and-tricks#various-forms-of-negation

Hope this will help :)

@Sharcoux
Copy link
Contributor Author
Contributor Author

Sharcoux commented Feb 12, 2024

There is something suspicious. How can postag="V.*" match V.* AND N.* but postag="N.*" would match only N.* but NOT V.*?

Also, if I want to match only V.*, basically, what you're saying is that I should use <token postag="V.*" postag_regexp="yes" inflected="yes"> and then use an exception for all the other possible types? Then what's the point of V.*? If I have only A|B|C as possibilities, saying "I want only A but not B and not C" is the same as saying "I don't want B nor C". But in any case, that requires knowing the list of possible values and listing them all, which is quite troublesome...

@GillouLT
Copy link
Collaborator
Collaborator

GillouLT commented Feb 12, 2024

There is something suspicious. How can postag="V." match V. AND N.* but postag="N." would match only N. but NOT V.*?

The same issue would arise with a rule targeting postag="N.*": in some edge cases we would get a match on an inflected verb (or something else), like "classe" or "manque". Depending on the issue, we can fix it by adding an exception, adding a new antipattern, or changing the scope of the pattern.

what you're saying is that I should use <token postag="V.*" postag_regexp="yes" inflected="yes"> and then use an exception for all the other possible types?

It's a common way to handle it, but we generally don't need too many exceptions; in this case <exception postag="N.*" postag_regexp="yes"/> would most probably cover the majority of false positives.
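For reference, the antipattern route mentioned above would look roughly like this (a hypothetical sketch, not a rule from this PR; an antipattern placed before the <pattern> suppresses matches that overlap it, here any word after "les" that also has a noun reading):

<antipattern>
  <token>les</token>
  <!-- any word that also has a noun reading, e.g. "classe", "manque" -->
  <token postag="N.*" postag_regexp="yes"/>
</antipattern>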

But in any case, that requires knowing the list of possible values and listing them all, which is quite troublesome...

Yes, the rule editor's corpus is very helpful here, to see if we have some edge cases where the token isn't correctly classified for example, and we can improve the rule based on it

As for why a token sometimes gets multiple tags: For a practical look, experimenting with sentences that include "classe" or "manque" in the Text Analysis tool shows how context changes tagging. For instance, "Je classe..." is tagged as a verb (and a verb only), while "La classe de troisième B" is seen as a noun (and noun only). Some cases, like "Classe 1ère," get multiple tags due to lack of specific disambiguator rules. In fact, in the Text Analysis tool, you can see the disambiguator log, which is where we can see how the token gets correctly tagged (or not):

[screenshot: disambiguator log for « Je classe »]

The disambiguation file for FR is located here in the project: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/disambiguation.xml

Some more documentation about the disambiguator: https://dev.languagetool.org/developing-a-disambiguator.html
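For reference, disambiguation rules use a similar pattern syntax; this is only a hedged sketch of the general shape (not an actual rule from disambiguation.xml):

<rule id="JE_CLASSE_SKETCH" name="je + classe → verbe (sketch)">
  <pattern>
    <token>je</token>
    <marker>
      <token>classe</token>
    </marker>
  </pattern>
  <!-- keep only the readings of the marked token whose POS tag matches V.* -->
  <disambig action="filter" postag="V.*"/>
</rule>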

@Sharcoux
Copy link
Contributor Author
Contributor Author

@GillouLT I can no longer access the file at https://internal1.languagetool.org/regression-tests/via-http/2023-12-18/fr/index.html. I guess this is expected, but it means I can't investigate the rule.

@GillouLT
Copy link
Collaborator
Collaborator

Sure, you can still get access to the full FR log where all matches are stored: https://internal1.languagetool.org/regression-tests/via-http/2024-03-10/fr_full/index.html
You should be able to search for the rule ID from there and get the same results as last time.

@Sharcoux
Copy link
Contributor Author
Contributor Author

I opened a new PR that should fix most of the problems. However, the German CI is broken, which is unrelated to my changes. Could someone have a look, @GillouLT @LucieSteib?
