Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider more robust scoring mechanisms for relevance of multiple keywords #299

Open
joeytakeda opened this issue Apr 23, 2024 · 1 comment
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@joeytakeda
Copy link
Contributor

The issue of multiple term queries and ranking has come up a few times— E.g. if for a query for X and Y across two documents, A and B, if in Doc A, X = 40 and Y=0, but in Document B, X=15 and Y=10 , then Document A is ranked ahead of document B, even though it doesn't contain both query terms.

Of course, this can be solved by precise matching (+X +Y), but sometimes you don't necessarily want all of the query terms, but you would like to see the documents that have both of them come up first. This is the more intuitive people use searching, I'd argue — if I went to a library catalogue and looked up Keats Byron Shelley, I'd expect to see books about the Romantics above the ones about Keats or Byron or Shelley.

In part, TF-IDF was meant to help with that and it is useful for normalizing relevance in a collection of documents of varying lengths, but the pure sum of the TF-IDF measures doesn't really rectify the issue. So, doing some research has lead me to believe we might want to try implementing a third scoring algorithm—namely, BM25.

This is what ElasticSearch uses as their primary relevance mechanism and, thanks to the aforelinked guide, I was able to quickly put together the algorithm (based on the TF-IDF implementation we already had):

    <xsl:function name="hcmc:returnBM25" as="xs:double">
        <xsl:param name="rawScore" as="xs:integer"/>
        <xsl:param name="stemDocsCount" as="xs:integer"/>
        <xsl:param name="thisDocUri" as="xs:string"/>
        <!--Compute the average document length-->
        <!--TODO: While getTotalTermsInDoc is memoized, 
            this should be moved outside
            of the function-->
        <xsl:variable name="averageDocLength" 
            select="sum($tokenizedUris ! hcmc:getTotalTermsInDoc(.))
                        div 
                    count($tokenizedUris)" as="xs:double"/>
        
        <!--Get the total terms in the document-->
        <xsl:variable name="totalTermsInDoc" 
            select="hcmc:getTotalTermsInDoc($thisDocUri)" as="xs:integer"/>
        
        <!--The ratio of terms in this doc versus the average document length-->
        <xsl:variable name="lengthRatio" select="$totalTermsInDoc div $averageDocLength"/>
        
        <!--Now calculate the modified inverse document frequency —
            see https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
            If you had 10 documents and 3 in which this stem appeared, you'd get:
            ln(1 + ((10 - (3 + 0.5))/(3 + 0.5)))
            => ln(1 + (6.5 / 3.5))
            => ln(1 + 1.857) => ln(2.857) => 1.050
        
            [Note that math:log is, in XSLT, the natural log]
        -->
        <xsl:variable name="idf" 
            as="xs:double"
            select="math:log(1 + 
                (($tokenizedDocsCount - ($stemDocsCount + 0.5))
                    div 
                 ($stemDocsCount + 0.5)
                 ))"/>
        
        <!--For b and k, see          
            https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html.
            -->
        <!--"b" is a constant, which "controls to what degree document length normalizes tf values" -->
        <xsl:variable name="b" select="0.75"/>
        
        <!--"k1" is a constant, which "controls non-linear term frequency normalization (saturation)"-->
        <xsl:variable name="k1" select="1.2"/>
        
        <!--Splitting the numerator and denominator for readability-->
        <xsl:variable name="numerator" 
            select="$rawScore * ($k1 + 1)"/>
        <xsl:variable name="denominator" 
            select="$rawScore + $k1 * (1 - $b + $b * $lengthRatio)"/>
        
        <!--Now the final BM25 relevance score-->
        <xsl:variable name="BM25" select="$idf * ($numerator div $denominator)" as="xs:double"/>
        <xsl:message use-when="$verbose">Calculated BM25: <xsl:sequence select="$BM25"/></xsl:message>
        <xsl:sequence select="$BM25"/>
    </xsl:function>

I'm going to add this to a branch and start playing around with to see how it might improve things — some very quick tests yield good results (at least compared to tf-idf), but it's hard to test without running it on a larger document collection.

@joeytakeda joeytakeda self-assigned this Apr 23, 2024
@joeytakeda joeytakeda added the enhancement New feature or request label Apr 23, 2024
@martindholmes martindholmes added this to the Release 2.0 milestone May 10, 2024
@martindholmes
Copy link
Collaborator

@joeytakeda Is this still working if you merge dev into it? If so, should we add it for 2.0, or punt it to 2.1?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants