[pt] Fixed FPs in rule ID:TOMAR_ASSUMIR #9247

Merged: 2 commits into master, Sep 12, 2023

marcoagpinto
Member

Heya, @susanaboatto and @p-goulart

I was looking at the nightly results and there were several false positives, so I have fixed them.

Here is the fix.

I will check the next nightly results and, if the FPs are all gone, remove the temp_off.

Thanks!

@@ -8316,18 +8316,21 @@ USA

<rule id='TOMAR_ASSUMIR' name="[Universitário][Científico] V. Tomar → V. Assumir" tone_tags="academic" default="temp_off">
<pattern>
<token postag='AQ.+|NC.+|NP.+|CS|CC' postag_regexp='yes'>
<exception postag_regexp='no' postag='RG'/> <!-- Add more exceptions here as they are found -->
</token>
Collaborator

It appears to me that the problem is the double meaning of “tomar”—the rule doesn't work when it is used literally (e.g., for drinks, medication, etc.). So the question is, does this exception fix this issue? Or what does it exclude?

We'd need quite a long list of exceptions in <token postag='AQ.+|NC.+' postag_regexp='yes'/> to avoid the literal false alarms. Some I can think of: drinks|bebidas|cafés|su[mc]os|vinhos|cervejas|sorvetes|gelados|remédios|medicações|medicamentos|shots|chás|táxis|carros|ônibus|autocarros|trens|comboios|voos, etc.

@marcoagpinto
Member Author

@susanaboatto

The RG exception fixes this sentence:
Já tomei vários remédios e tem havido vários ajustes.

Here "já" is tagged RG (adverb).

https://internal1.languagetool.org/regression-tests/via-http/2023-09-04/pt-BR/result_style_TOMAR_ASSUMIR%5B1%5D.html

All other FPs in the link above were fixed by restricting (shortening) the list of possible words and by adding the following as the start token:
<token postag='AQ.+|NC.+|NP.+|CS|CC' postag_regexp='yes'>

This is the full rule:

        <rule id='TOMAR_ASSUMIR' name="[Universitário][Científico] V. Tomar → V. Assumir" tone_tags="academic" default="temp_off">
            <pattern>
                <token postag='AQ.+|NC.+|NP.+|CS|CC' postag_regexp='yes'>
                    <exception postag_regexp='no' postag='RG'/> <!-- Add more exceptions here as they are found -->
                </token>
                <marker>
                    <token inflected='yes'>tomar
                        <exception scope='previous' postag_regexp='yes' postag='V.+|PP.+'/>
                        <exception scope='previous' regexp='yes'>decis(ão|ões)</exception>
                    </token>
                </marker>
                <token min='0' max='2' postag='SPS00|(SPS00:)?[DP][ADIPRT].+|RG' postag_regexp='yes'/>
                <token regexp='yes'>cert[ao]s?|determinad[ao]s?|diferentes?|divers[ao]s?|enormes?|imens[ao]s?|inúmer[ao]s|múltipl[ao]s|vári[ao]s|variad[ao]s</token>
                <token postag='AQ.+|NC.+' postag_regexp='yes'/>
            </pattern>
            <message>Num contexto formal/científico, é preferível escrever &quot;assumir&quot;.</message>
            <suggestion><match no='2' postag='V.+' postag_regexp='yes'>assumir</match></suggestion>
            <example correction="assume">O crime é perigoso e o seu financiamento <marker>toma</marker> diversas formas.</example>
            <example correction="assume">O crime é perigoso e o seu financiamento <marker>toma</marker> as mais diversas formas.</example>
        </rule>

Could you try one of the drink examples with the rule?

Thanks!

❤️ ❤️ ❤️ ❤️ ❤️ ❤️

@marcoagpinto
Member Author

Ahhhh... and if needed, we can add as many exceptions as turn out to be useful.

@susanaboatto
Collaborator

I see that sentences where "tomar" is used correctly, such as "Tomei vários drinks em sua honra esta noite.", "Tomei um grande café da manhã.", and "Já tomei vários remédios e tem havido vários ajustes.", are no longer matched, but

<token postag='AQ.+|NC.+|NP.+|CS|CC' postag_regexp='yes'>
    <exception postag_regexp='no' postag='RG'/>
</token>

doesn't address the root cause, which seems to be contextual to me. For example, this still matches cases like:

E tomei vários drinks em sua honra esta noite.

And it fails to detect:

Tomei várias formas naquela noite.

It appears only exceptions to <token postag='AQ.+|NC.+' postag_regexp='yes'/> can address these cases.
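
One way to keep these literal cases from regressing, sketched here as a suggestion rather than as part of this pull request, is to list them as plain `<example>` elements inside the rule. In LanguageTool rule files, an `<example>` without a `correction` attribute or `<marker>` is treated as a correct sentence that the rule must not flag, so the rule tests should fail for as long as any of these still matches. The sentences are the ones quoted in this thread.

```xml
<!-- Sketch only: correct sentences on which TOMAR_ASSUMIR should stay silent (literal "tomar") -->
<example>Tomei vários drinks em sua honra esta noite.</example>
<example>E tomei vários drinks em sua honra esta noite.</example>
<example>Já tomei vários remédios e tem havido vários ajustes.</example>
```

These would sit next to the two existing `<example correction="assume">` lines at the end of the rule.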

@marcoagpinto
Member Author

@susanaboatto

What shall I do?

😋

@marcoagpinto
Member Author

"Tomei várias formas naquela noite."

This is easy to improve: I can add SENT_START to the first token.

@marcoagpinto
Member Author

@susanaboatto

Give me some tips, and tonight I will improve the rule and create another pull request.

@marcoagpinto
Member Author

Thanks!

@susanaboatto
Collaborator

I'd say delete

<token postag='AQ.+|NC.+|NP.+|CS|CC' postag_regexp='yes'>
    <exception postag_regexp='no' postag='RG'/> <!-- Add more exceptions here as they are found -->
</token>

...unless this restriction removes false positives that I'm not seeing, other than those caused by the literal uses of the verb tomar mentioned above (i.e., with drinks|bebidas|cafés|su[mc]os|vinhos|cervejas|sorvetes|gelados|remédios|medicações|medicamentos|shots|chás|táxis|carros|ônibus|autocarros|trens|comboios|voos|barcos|ferry|canoa|barca, etc.)

Then add those words as an exception to the last token.
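
For illustration, the change described above could look roughly like the sketch below. It is not the merged rule: the word list is copied as-is from this comment and is untested, and the switch from `no='2'` to `no='1'` in the suggestion is an assumption based on `<match no='…'>` counting pattern tokens from 1, so that "tomar" becomes the first token once the leading token is removed.

```xml
<!-- Sketch only, not the merged rule: leading token removed, literal nouns excluded on the last token -->
<pattern>
    <marker>
        <token inflected='yes'>tomar
            <exception scope='previous' postag_regexp='yes' postag='V.+|PP.+'/>
            <exception scope='previous' regexp='yes'>decis(ão|ões)</exception>
        </token>
    </marker>
    <token min='0' max='2' postag='SPS00|(SPS00:)?[DP][ADIPRT].+|RG' postag_regexp='yes'/>
    <token regexp='yes'>cert[ao]s?|determinad[ao]s?|diferentes?|divers[ao]s?|enormes?|imens[ao]s?|inúmer[ao]s|múltipl[ao]s|vári[ao]s|variad[ao]s</token>
    <token postag='AQ.+|NC.+' postag_regexp='yes'>
        <!-- Word list copied from the comment above; untested -->
        <exception regexp='yes'>drinks|bebidas|cafés|su[mc]os|vinhos|cervejas|sorvetes|gelados|remédios|medicações|medicamentos|shots|chás|táxis|carros|ônibus|autocarros|trens|comboios|voos|barcos|ferry|canoa|barca</exception>
    </token>
</pattern>
<!-- Assumption: "tomar" is now pattern token 1, so the match reference moves from 2 to 1 -->
<suggestion><match no='1' postag='V.+' postag_regexp='yes'>assumir</match></suggestion>
```

With the leading token gone, the rule no longer depends on what precedes "tomar" (apart from the two `scope='previous'` exceptions), so both "Tomei várias formas naquela noite." and "E tomei várias formas naquela noite." would be candidates for a match.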

@marcoagpinto
Member Author

@susanaboatto

Heya, Susana,

I have improved the rule:

  1. Without the first block, like you suggested:
Portuguese (Portugal): 8 total matches
Portuguese (Portugal): 811110 total sentences considered
Portuguese (Portugal): ø0.00 rule matches per sentence

0_900000sentences.txt

  2. With just the first line of the first block, without the RG exception, and with all the exceptions you suggested plus numerous others, sorted alphabetically by category:
Portuguese (Portugal): 5 total matches
Portuguese (Portugal): 811110 total sentences considered
Portuguese (Portugal): ø0.00 rule matches per sentence

1_900000sentences.txt

I had to keep the first line to remove false positives such as:
O The Sun, contudo, tomou diversos cuidados para confirmar a autenticidade dos documentos, incluindo:
but I added “SENT_START” to it so that the rule also works for sentences that start with the verb “tomar”.

So, here is the rule as it stands now:

        <rule id='TOMAR_ASSUMIR' name="[Universitário][Científico] V. Tomar → V. Assumir" tone_tags="academic" default="temp_off">
            <pattern>
                <token postag='SENT_START|AQ.+|NC.+|NP.+|CS|CC' postag_regexp='yes'/>
                <marker>
                    <token inflected='yes'>tomar
                        <exception scope='previous' postag_regexp='yes' postag='V.+|PP.+'/>
                        <exception scope='previous' regexp='yes'>decis(ão|ões)</exception>
                    </token>
                </marker>
                <token min='0' max='2' postag='SPS00|(SPS00:)?[DP][ADIPRT].+|RG' postag_regexp='yes'/>
                <token regexp='yes'>cert[ao]s?|determinad[ao]s?|diferentes?|divers[ao]s?|enormes?|imens[ao]s?|inúmer[ao]s|múltipl[ao]s|vári[ao]s|variad[ao]s</token>
                <token postag='AQ.+|NC.+' postag_regexp='yes'>
                    <exception regexp='yes' inflected='yes'>bebida|café|caneca|cerveja|chá|colher|copo|drink|frasco|garfo|garrafa|garrafão|xícara|shot|su[mc]o|vinho|gelado|sorvete|blíster|caixa|comprimido|contracetivo|embalagem|medicação|medicamento|pílula|remédio|autocarro|automóvel|avião|carrinha|carro|jato|ônibus|táxi|veículo|comboio|trem|voo|barc[ao]|bote|canoa|ferry|banho|duche</exception> <!-- Add more words as they are found -->
                </token>
            </pattern>
            <message>Num contexto formal/científico, é preferível escrever &quot;assumir&quot;.</message>
            <suggestion><match no='2' postag='V.+' postag_regexp='yes'>assumir</match></suggestion>
            <example correction="assume">O crime é perigoso e o seu financiamento <marker>toma</marker> diversas formas.</example>
            <example correction="assume">O crime é perigoso e o seu financiamento <marker>toma</marker> as mais diversas formas.</example>
        </rule>

Thanks!

😋 😛 ❤️ 🤗

@marcoagpinto
Member Author

@susanaboatto @p-goulart

???

@susanaboatto
Collaborator

I still don't understand the first token limitation. Can you give examples of the sentences they are limiting?

@marcoagpinto marcoagpinto merged commit 665c999 into master Sep 12, 2023
1 check passed
@marcoagpinto marcoagpinto deleted the lt_marcoagpinto_20230905_0704 branch September 12, 2023 07:38
@marcoagpinto
Member Author

"I still don't understand the first token limitation. Can you give examples of the sentences they are limiting?"

@susanaboatto

I created the rule two or three weeks ago and can't remember exactly what the first-token restriction was excluding, but it is better to have a stricter rule than to have false positives.
