
ROUGE: fixed sentences calculation and some minor refactoring #272

Merged
merged 8 commits into from
Oct 24, 2023

Conversation

rssdev10
Collaborator

Hi, I started looking at the ROUGE-N/L evaluation components and found code that was sometimes inefficient and sometimes looked buggy. This is a first commit with some improvements and fixes.

First of all, I opened the original article https://aclanthology.org/W04-1013.pdf and found an incorrect mention of the "jackknifing procedure". The code at https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/utils.jl#L7 contains an implementation named jackknife_avg, but it is not an average. What it implements looks like the argmax literally mentioned in the paper, but it carries the wrong description. See https://en.wikipedia.org/wiki/Jackknife_resampling: the jackknife average coincides with the plain average and is never used in this role.
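To make the mismatch concrete, here is a small illustrative sketch (in Python, function names hypothetical) contrasting a true jackknife average with what the current jackknife_avg effectively computes, a best-of-leave-one-out, i.e. an argmax:

```python
# Hypothetical sketch; these are not TextAnalysis.jl functions.
from statistics import mean

def jackknife_average(scores):
    """True jackknife: average of the leave-one-out averages.
    Algebraically this equals the plain mean of `scores`."""
    n = len(scores)
    loo_means = [mean(scores[:i] + scores[i + 1:]) for i in range(n)]
    return mean(loo_means)

def best_of_leave_one_out(scores):
    """What the current code effectively does: take the *best*
    leave-one-out average, i.e. an argmax, not an average."""
    n = len(scores)
    return max(mean(scores[:i] + scores[i + 1:]) for i in range(n))
```

For scores [0.2, 0.5, 0.8] the jackknife average is 0.5 (same as the plain mean), while the best-of-leave-one-out is 0.65 because it drops the worst score, which is why calling it an "average" is misleading.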

Moreover, I found the original implementation, ROUGE-1.5.5.pl, and in its changelog I see:

#              1.4
#              (1) Remove internal Jackknifing procedure:
#                  Now the ROUGE script will use all the references listed in the
#                  <MODEL></MODEL> section in each <EVAL></EVAL> section and no
#                  automatic Jackknifing is performed. Please see RELEASE-NOTE.txt
#                  for more details.

In the Perl implementation I see a very simple solution:

        if ( $scoreMode eq "A" ) {
            $totalLCSHit    += $lcsHit;
            $totalLCSCount  += $lcsBase;
            $totalLCSCountP += $peer_1grams{"_cn_"};
        }
        elsif ( $scoreMode eq "B" ) {
            if ( $lcsScore > $lcsScoreBest ) {

                # only take a better score (i.e. better match)
                $lcsScoreBest   = $lcsScore;
                $totalLCSHit    = $lcsHit;
                $totalLCSCount  = $lcsBase;
                $totalLCSCountP = $peer_1grams{"_cn_"};
            }
        }
        else {
            # use average mode
            $totalLCSHit    += $lcsHit;
            $totalLCSCount  += $lcsBase;
            $totalLCSCountP += $peer_1grams{"_cn_"};
        }

If there are no objections, my proposal is to remove the jackknife_avg function and implement scoring in a similar way.
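The Perl snippet above reduces to a simple per-reference accumulation. A rough Python sketch of the same three score modes (the (hit, base, peer_count) triples per reference are assumed inputs; names are illustrative):

```python
# Sketch of the ROUGE-1.5.5.pl score modes: "A" and the average mode
# accumulate counts over all references, "B" keeps only the best match.
def combine_lcs(per_ref, mode="A"):
    total_hit = total_count = total_count_p = 0
    best = -1.0
    for hit, base, peer_count in per_ref:
        if mode == "B":
            score = hit / base if base else 0.0
            if score > best:  # only take a better score (better match)
                best = score
                total_hit, total_count, total_count_p = hit, base, peer_count
        else:  # "A" and the average mode accumulate
            total_hit += hit
            total_count += base
            total_count_p += peer_count
    return total_hit, total_count, total_count_p
```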

I also found a very suspicious place in the rouge_l_summary method. There is a loop for ref_sent in ref_sent_list over the sentences of the reference text, but ref_sent is never used in the inner loop. I fixed it, but the unit tests now fail because the final score is lower than before.
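For reference, a simplified sketch of what the fixed loop should look like: the union-LCS for each reference sentence must actually use ref_sent in the inner loop. This is an illustrative Python version (set-based union over matched tokens, which simplifies the real position-aware union-LCS), not the package's code:

```python
# Illustrative sketch only; simplifies union-LCS to a token-set union.
def lcs_tokens(a, b):
    """Set of tokens of `a` that appear in one LCS of `a` and `b`."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i+1][j+1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j+1], dp[i+1][j])
    out, i, j = [], m, n          # backtrack to recover one LCS
    while i and j:
        if a[i-1] == b[j-1] and dp[i][j] == dp[i-1][j-1] + 1:
            out.append(a[i-1]); i -= 1; j -= 1
        elif dp[i-1][j] >= dp[i][j-1]:
            i -= 1
        else:
            j -= 1
    return set(out)

def rouge_l_summary_hits(ref_sent_list, cand_sent_list):
    hits = 0
    for ref_sent in ref_sent_list:
        union = set()
        for cand_sent in cand_sent_list:
            union |= lcs_tokens(ref_sent, cand_sent)  # ref_sent is used here
        hits += len(union)
    return hits
```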

@djokester and @Ayushk4 how did you calculate the values for the unit test?

Next, I found an inefficient implementation of the weighted_lcs method: it creates an array of arrays of Float64 instead of a single matrix of Int32. I also split this method into two to provide type stability, since the original weighted_lcs returns either an Int64 or a String. That is another performance issue.
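The optimization being described is just the standard LCS dynamic program over one dense integer buffer instead of nested float arrays. An illustrative sketch (Python stand-in for the Julia code; a flat list plays the role of the single Int32 matrix):

```python
# Sketch of the single-matrix LCS length; names are illustrative.
def lcs_length(a, b):
    m, n = len(a), len(b)
    width = n + 1
    dp = [0] * ((m + 1) * width)  # one flat integer buffer, row-major
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i * width + j] = dp[(i - 1) * width + (j - 1)] + 1
            else:
                dp[i * width + j] = max(dp[(i - 1) * width + j],
                                        dp[i * width + (j - 1)])
    return dp[m * width + n]
```

Keeping this length-only function separate from the one that reconstructs the matched subsequence is what gives each function a single concrete return type.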

At the same time, all of the methods rouge_n, rouge_l_sentence, and rouge_l_summary are type unstable and return either a single score or an array of scores depending on their arguments. We could annotate the return type as Union{Float64, Vector{Float64}}, but do we really need that? For now it is the only non-API-breaking way to fix it.

An alternative is to return an array and provide an external average/argmax function. Or return a named tuple containing the single score, the array, and maybe separate recall, precision, and F values.
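The alternative shape being proposed could look roughly like this (Python sketch of the idea; the names mirror the discussion, not a final API):

```python
# Sketch: one score record per reference, aggregation left to the caller.
from dataclasses import dataclass

@dataclass(frozen=True)
class Score:
    precision: float
    recall: float
    fmeasure: float

def average(scores):
    """Plain average over per-reference scores."""
    n = len(scores)
    return Score(sum(s.precision for s in scores) / n,
                 sum(s.recall for s in scores) / n,
                 sum(s.fmeasure for s in scores) / n)

def argmax_fmeasure(scores):
    """Best single reference, by F-measure (the old 'jackknife' behavior)."""
    return max(scores, key=lambda s: s.fmeasure)
```

With this shape every ROUGE function has one concrete return type (a vector of Score), and the average-vs-best choice moves outside the scoring functions.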

Any thoughts on this?

@rssdev10 rssdev10 marked this pull request as draft September 12, 2023 17:18
@aviks
Member

aviks commented Oct 4, 2023

If the best way to fix the rouge_n methods is to break API compatibility, I think that should be fine. We can tag a new minor release with that, or even a 1.0.

@rssdev10 rssdev10 force-pushed the fix/ROUGE branch 2 times, most recently from 0d358ae to 177e5d4 on October 16, 2023 00:58
@rssdev10
Collaborator Author

Added initial changes:

  • added a Score structure with precision, recall, and fmeasure
  • all functions return an array of Score structures
  • argmax by fmeasure, or averaging, may be taken outside
  • removed jackknife_avg as a completely wrong approach
  • found a "novelty" not described in the initial code:
    increment = (f(k + 1)) - (f(k))
    with weights calculated via the sqrt function. The initial Perl-based implementation ROUGE-1.5.5.pl contains a power function:
sub wlcsWeight {
    my $r     = shift;
    my $power = shift;

    return $r**$power;
}

sub wlcsWeightInverse {
    my $r     = shift;
    my $power = shift;

    return $r**( 1 / $power );
}

But Google's implementation doesn't use weights at all: https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py#L210
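So there are three weighting conventions in play. A sketch of all three side by side (Python stand-ins for the Perl/Julia functions; the power=2 default is just for illustration):

```python
# Sketch of the weighting conventions discussed in this thread.
def wlcs_weight(r, power=2.0):
    return r ** power           # as in ROUGE-1.5.5.pl wlcsWeight

def wlcs_weight_inverse(r, power=2.0):
    return r ** (1.0 / power)   # as in ROUGE-1.5.5.pl wlcsWeightInverse

identity = lambda r: r          # Google's rouge_scorer: no weighting
```

Note that wlcs_weight_inverse with power=2 is exactly sqrt, so the previous hard-coded sqrt in the Julia code is the power-function convention with one fixed parameter.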

The method rouge_l_sentence(reference_summaries, candidate_summary, 1) still gives a different result than Google's implementation, but there is a difference in the tokenizers (the dot symbol is a token in our case).

One test fails under Julia 1.3 because I used @kwdef for decoration purposes. Do we really need to support Julia 1.3?

Any feedback?

@aviks
Member

aviks commented Oct 16, 2023

We don't need 1.3. Can do 1.6 minimum.

@rssdev10 rssdev10 force-pushed the fix/ROUGE branch 2 times, most recently from c22bfbb to c1fbdbd on October 18, 2023 02:29
@rssdev10
Collaborator Author

rouge_l_sentence now gives the same result as Google's implementation, but only because the previous weighting is turned off.

And the last open question: do we need sqrt as the weighting function, or should we use the same implementation as ROUGE-1.5.5.pl, a power function with argument 0.5 for this case?

@rssdev10
Collaborator Author

Added a minor change: the weighting function can now be passed as an argument if necessary, so at least both variants (Google's without weighting, and the previous implementation with a fixed sqrt) can be used. If there are no other ideas, I'm willing to merge it.

@rssdev10 rssdev10 marked this pull request as ready for review October 24, 2023 01:32
@aviks
Member

aviks commented Oct 24, 2023

> If there are no other ideas, I'm willing to merge it.

I'm OK with that plan.

@rssdev10 rssdev10 merged commit 14533f0 into JuliaText:master Oct 24, 2023
7 checks passed
rssdev10 added a commit to rssdev10/TextAnalysis.jl that referenced this pull request Oct 24, 2023
…ext#272)

ROUGE: 
* fixed sentences calculation and some minor refactoring
* changed output types, optimized, aligned with ROUGE-1.5.5 and checked with Google Research implementation
* docs updates
* rouge_l_sentence - added an argument with weighted function.