Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added validations for score combination weights in Hybrid Search #265

Conversation

martin-gaievski
Copy link
Member

Description

Enforcing checks and limits of weights for score normalization.
Following rules added:

  • each weight must be between 0.0 and 1.0
  • total sum of weights for score combination must be 1.0
  • number of sub-queries and number of weights must match

Weight value and total sum of all weights will be checked during search processor creation. Match between number of sub-queries and weights will be checked in search time.

Issues Resolved

#123

Check List

  • New functionality includes testing.
    • All tests pass
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added Enhancements Increases software capabilities beyond original client specifications backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.10 backport to 2.10 branch labels Aug 28, 2023
@codecov
Copy link

codecov bot commented Aug 28, 2023

Codecov Report

Merging #265 (67e19ca) into main (75b59cd) will increase coverage by 0.32%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main     #265      +/-   ##
============================================
+ Coverage     85.44%   85.77%   +0.32%     
- Complexity      370      376       +6     
============================================
  Files            30       30              
  Lines          1079     1104      +25     
  Branches        165      168       +3     
============================================
+ Hits            922      947      +25     
  Misses           83       83              
  Partials         74       74              
Files Changed Coverage Δ
...ation/ArithmeticMeanScoreCombinationTechnique.java 90.00% <100.00%> (+0.52%) ⬆️
...nation/GeometricMeanScoreCombinationTechnique.java 94.44% <100.00%> (+0.32%) ⬆️
...ination/HarmonicMeanScoreCombinationTechnique.java 94.44% <100.00%> (+0.32%) ⬆️
...ch/processor/combination/ScoreCombinationUtil.java 97.72% <100.00%> (+2.27%) ⬆️

@martin-gaievski martin-gaievski changed the title Added strong check on number of weights equals number of sub-queries for Hybrid Search Added validations for score combination weights in Hybrid Search Aug 28, 2023
@martin-gaievski martin-gaievski force-pushed the add-checks-weights-must-match-num-of-subqueries branch from 6624246 to 71dd2e7 Compare August 28, 2023 22:45
@heemin32
Copy link
Collaborator

heemin32 commented Aug 28, 2023

I wonder why each weight must be between 0.0 and 1.0. As long as we divide it by sum of weights during calculation, the the effective sum will be 1 and each weight will be between 0.0 and 1.0. Should we have to enforce it from customer input?

@martin-gaievski martin-gaievski force-pushed the add-checks-weights-must-match-num-of-subqueries branch from 71dd2e7 to eeb1af0 Compare August 28, 2023 23:17
@martin-gaievski
Copy link
Member Author

I wonder why each weight must be between 0.0 and 1.0. As long as we divide it by sum of weights during calculation, the the effective sum will be 1 and each weight will be between 0.0 and 1.0. Should we have to enforce it from customer input?

Mostly to keep consistency with other features in OpenSearch, I've been told that we consider that a user expectations that all weights sum up to 1.0. For instance, here https://opensearch.org/docs/latest/search-plugins/search-pipelines/personalize-search-ranking/ we use two weights, one is set to x and other will be automatically calculated as 1.0 - x.

Technically we can work with any number but I guess that can lead to a mess as there is no single standard then.

return;
}
if (scores.length != weights.size()) {
log.error(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need this logging.

scores.length
)
);
throw new IllegalArgumentException("number of weights must match number of sub-queries in hybrid query");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning size of weights and queries should be fine here. So security concern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I do it as a rule of thumb, but I guess that's ok here. Let me push a commit

private void validateWeights(final List<Float> weightsList) {
boolean isOutOfRange = weightsList.stream().anyMatch(weight -> !Range.between(0.0f, 1.0f).contains(weight));
if (isOutOfRange) {
log.error(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Returning submitted weights in exception should be fine rather than logging duplicated message because we already knows the value is float. And float won't cause any security risk by getting returned in exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

}
float sumOfWeights = weightsList.stream().reduce(0.0f, Float::sum);
if (!DoubleMath.fuzzyEquals(1.0f, sumOfWeights, DELTA_FOR_SCORE_ASSERTION)) {
log.error(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. No need an additional logging here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@martin-gaievski martin-gaievski force-pushed the add-checks-weights-must-match-num-of-subqueries branch from c408f44 to 84454f6 Compare August 29, 2023 20:54
@martin-gaievski martin-gaievski force-pushed the add-checks-weights-must-match-num-of-subqueries branch from 84454f6 to 614be2c Compare August 30, 2023 01:17
@martin-gaievski martin-gaievski force-pushed the add-checks-weights-must-match-num-of-subqueries branch from a3dc7bd to 67e19ca Compare August 30, 2023 01:42
@martin-gaievski martin-gaievski merged commit 685d5d6 into opensearch-project:main Aug 30, 2023
16 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 30, 2023
* Added strong check on number of weights equals number of sub-queries

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 685d5d6)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 30, 2023
* Added strong check on number of weights equals number of sub-queries

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 685d5d6)
martin-gaievski added a commit that referenced this pull request Aug 30, 2023
… (#268)

* Added strong check on number of weights equals number of sub-queries

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 685d5d6)

Co-authored-by: Martin Gaievski <[email protected]>
martin-gaievski added a commit that referenced this pull request Aug 30, 2023
… (#269)

* Added strong check on number of weights equals number of sub-queries

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 685d5d6)

Co-authored-by: Martin Gaievski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.10 backport to 2.10 branch Enhancements Increases software capabilities beyond original client specifications
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants