
Add geometric mean normalization for scores #239

Conversation

@martin-gaievski (Member) commented Aug 1, 2023

Description

Adding the geometric mean technique, which is a generalization of the mean based on the product and the N-th root of N values (more details here). Weights are supported similarly to how it is done in the arithmetic mean. Example of a pipeline with the processor config:

{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "combination": {
                    "technique": "geometric_mean",
                    "parameters": {
                        "weights": [
                            0.4, 0.7
                        ]
                    }
                }
            }
        }
    ]
}
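
For intuition, the weighted geometric mean of scores s_i with weights w_i is exp( Σ w_i · ln(s_i) / Σ w_i ). Below is a minimal sketch in Java (names are illustrative, not the plugin's actual implementation); working in log space avoids underflow from multiplying many small scores:

import java.util.List;

public class GeometricMeanSketch {

    /**
     * Weighted geometric mean: exp( sum(w_i * ln(s_i)) / sum(w_i) ).
     * Non-positive scores are skipped, since ln is undefined for them.
     */
    public static float weightedGeometricMean(List<Float> scores, List<Double> weights) {
        double sumOfWeights = 0.0;
        double weightedLogSum = 0.0;
        for (int i = 0; i < scores.size(); i++) {
            float score = scores.get(i);
            if (score <= 0.0f) {
                continue;
            }
            double weight = weights.get(i);
            sumOfWeights += weight;
            weightedLogSum += weight * Math.log(score);
        }
        return sumOfWeights == 0.0 ? 0.0f : (float) Math.exp(weightedLogSum / sumOfWeights);
    }

    public static void main(String[] args) {
        // Two sub-query scores combined with the weights from the example above.
        float combined = weightedGeometricMean(List.of(0.5f, 0.8f), List.of(0.4, 0.7));
        System.out.println(combined); // ≈ 0.67
    }
}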

In addition to the main changes, there is some refactoring of the integ tests. I had to put it into this PR because, with a few new tests added for geometric mean, the auto-redeploy feature started acting more aggressively and the tests became flaky.

  • move from a global model id to reading it at the beginning of each test case (see the sketch after this list); this is required because the model can be redeployed by the auto-redeploy feature of ml-commons
  • split one large class into three smaller ones based on area of responsibility
  • clean up outdated tests
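
A rough sketch of the per-test model id change described above (class and helper names are hypothetical, for illustration only, and do not reflect the actual ml-commons API):

public class GeometricMeanIT {

    // Hypothetical helper standing in for a real lookup against ml-commons;
    // it returns whatever model id is currently deployed.
    private String getDeployedModelId() {
        return "current-model-id";
    }

    public void testGeometricMeanCombination() {
        // Resolve the id at the start of each test instead of caching it in a
        // static field, so a model auto-redeployed by ml-commons between tests
        // is picked up under its new id.
        String modelId = getDeployedModelId();
        assert modelId != null;
        // run the hybrid query and assertions using modelId here
    }
}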

Issues Resolved

#228, part of the solution for #126

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Aug 2, 2023

Codecov Report

Merging #239 (f2fdcbe) into feature/normalization (6ad641a) will increase coverage by 3.80%.
Report is 1 commit behind head on feature/normalization.
The diff coverage is 93.93%.

@@                     Coverage Diff                     @@
##             feature/normalization     #239      +/-   ##
===========================================================
+ Coverage                    82.43%   86.23%   +3.80%     
- Complexity                     323      337      +14     
===========================================================
  Files                           26       28       +2     
  Lines                          979      981       +2     
  Branches                       153      153              
===========================================================
+ Hits                           807      846      +39     
+ Misses                         108       69      -39     
- Partials                        64       66       +2     
Files Changed Coverage Δ
...ation/ArithmeticMeanScoreCombinationTechnique.java 89.47% <88.88%> (-2.84%) ⬇️
...ination/HarmonicMeanScoreCombinationTechnique.java 94.11% <91.66%> (+94.11%) ⬆️
...nation/GeometricMeanScoreCombinationTechnique.java 94.11% <94.11%> (ø)
...ch/processor/combination/ScoreCombinationUtil.java 95.45% <95.45%> (ø)
...processor/combination/ScoreCombinationFactory.java 100.00% <100.00%> (ø)

* Verify score correctness by using an alternative formula for the geometric mean, the n-th root of the product of weighted scores;
* more details here: https://en.wikipedia.org/wiki/Weighted_geometric_mean
*/
private float geometricMean(List<Float> scores, List<Double> weights) {
@heemin32 (Collaborator) commented Aug 2, 2023
I still have doubts about the effectiveness of this test code. I believe we don't need test code based on random numbers. I would like to hear other opinions, though.

@navneet1v (Collaborator) commented Aug 3, 2023

I don't have any concerns as long as the test is able to fail if there are changes in the formula. Let's just make sure that it doesn't become flaky because of floating-point precision loss.
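
To make the precision concern concrete, a tolerance-based check can compare two mathematically equivalent formulas with an explicit delta instead of exact equality. A minimal sketch (JUnit 4; the values are illustrative, not the PR's actual test):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class GeometricMeanVerificationTest {

    // An explicit tolerance absorbs float rounding so the check cannot flake.
    private static final float DELTA = 1e-5f;

    @Test
    public void logSpaceResultMatchesNthRootFormula() {
        double w1 = 0.4, w2 = 0.7; // weights from the pipeline example
        double s1 = 0.5, s2 = 0.8; // arbitrary normalized scores

        // Implementation-style formula: exp of the weighted average of logs.
        float fromLogs = (float) Math.exp((w1 * Math.log(s1) + w2 * Math.log(s2)) / (w1 + w2));

        // Alternative formula: n-th root of the product of weighted scores,
        // i.e. (s1^w1 * s2^w2)^(1 / (w1 + w2)).
        float fromNthRoot = (float) Math.pow(Math.pow(s1, w1) * Math.pow(s2, w2), 1.0 / (w1 + w2));

        assertEquals(fromNthRoot, fromLogs, DELTA);
    }
}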

@martin-gaievski merged commit b867e69 into opensearch-project:feature/normalization on Aug 3, 2023
14 checks passed