Consider more robust scoring mechanisms for relevance of multiple keywords #299

joeytakeda · 2024-04-23T01:43:14Z

The issue of multiple term queries and ranking has come up a few times— E.g. if for a query for X and Y across two documents, A and B, if in Doc A, X = 40 and Y=0, but in Document B, X=15 and Y=10 , then Document A is ranked ahead of document B, even though it doesn't contain both query terms.

Of course, this can be solved by precise matching (+X +Y), but sometimes you don't necessarily want all of the query terms, but you would like to see the documents that have both of them come up first. This is the more intuitive people use searching, I'd argue — if I went to a library catalogue and looked up Keats Byron Shelley, I'd expect to see books about the Romantics above the ones about Keats or Byron or Shelley.

In part, TF-IDF was meant to help with that and it is useful for normalizing relevance in a collection of documents of varying lengths, but the pure sum of the TF-IDF measures doesn't really rectify the issue. So, doing some research has lead me to believe we might want to try implementing a third scoring algorithm—namely, BM25.

This is what ElasticSearch uses as their primary relevance mechanism and, thanks to the aforelinked guide, I was able to quickly put together the algorithm (based on the TF-IDF implementation we already had):

    <xsl:function name="hcmc:returnBM25" as="xs:double">
        <xsl:param name="rawScore" as="xs:integer"/>
        <xsl:param name="stemDocsCount" as="xs:integer"/>
        <xsl:param name="thisDocUri" as="xs:string"/>
        <!--Compute the average document length-->
        <!--TODO: While getTotalTermsInDoc is memoized, 
            this should be moved outside
            of the function-->
        <xsl:variable name="averageDocLength" 
            select="sum($tokenizedUris ! hcmc:getTotalTermsInDoc(.))
                        div 
                    count($tokenizedUris)" as="xs:double"/>
        
        <!--Get the total terms in the document-->
        <xsl:variable name="totalTermsInDoc" 
            select="hcmc:getTotalTermsInDoc($thisDocUri)" as="xs:integer"/>
        
        <!--The ratio of terms in this doc versus the average document length-->
        <xsl:variable name="lengthRatio" select="$totalTermsInDoc div $averageDocLength"/>
        
        <!--Now calculate the modified inverse document frequency —
            see https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
            If you had 10 documents and 3 in which this stem appeared, you'd get:
            ln(1 + ((10 - (3 + 0.5))/(3 + 0.5)))
            => ln(1 + (6.5 / 3.5))
            => ln(1 + 1.857) => ln(2.857) => 1.050
        
            [Note that math:log is, in XSLT, the natural log]
        -->
        <xsl:variable name="idf" 
            as="xs:double"
            select="math:log(1 + 
                (($tokenizedDocsCount - ($stemDocsCount + 0.5))
                    div 
                 ($stemDocsCount + 0.5)
                 ))"/>
        
        <!--For b and k, see          
            https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html.
            -->
        <!--"b" is a constant, which "controls to what degree document length normalizes tf values" -->
        <xsl:variable name="b" select="0.75"/>
        
        <!--"k1" is a constant, which "controls non-linear term frequency normalization (saturation)"-->
        <xsl:variable name="k1" select="1.2"/>
        
        <!--Splitting the numerator and denominator for readability-->
        <xsl:variable name="numerator" 
            select="$rawScore * ($k1 + 1)"/>
        <xsl:variable name="denominator" 
            select="$rawScore + $k1 * (1 - $b + $b * $lengthRatio)"/>
        
        <!--Now the final BM25 relevance score-->
        <xsl:variable name="BM25" select="$idf * ($numerator div $denominator)" as="xs:double"/>
        <xsl:message use-when="$verbose">Calculated BM25: <xsl:sequence select="$BM25"/></xsl:message>
        <xsl:sequence select="$BM25"/>
    </xsl:function>

I'm going to add this to a branch and start playing around with to see how it might improve things — some very quick tests yield good results (at least compared to tf-idf), but it's hard to test without running it on a larger document collection.

The text was updated successfully, but these errors were encountered:

martindholmes · 2024-10-07T23:11:06Z

@joeytakeda Is this still working if you merge dev into it? If so, should we add it for 2.0, or punt it to 2.1?

joeytakeda self-assigned this Apr 23, 2024

joeytakeda added the enhancement New feature or request label Apr 23, 2024

joeytakeda added a commit that referenced this issue Apr 23, 2024

Initial implementation of BM25 for #299

b500dc1

martindholmes added this to the Release 2.0 milestone May 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider more robust scoring mechanisms for relevance of multiple keywords #299

Consider more robust scoring mechanisms for relevance of multiple keywords #299

joeytakeda commented Apr 23, 2024

martindholmes commented Oct 7, 2024

Consider more robust scoring mechanisms for relevance of multiple keywords #299

Consider more robust scoring mechanisms for relevance of multiple keywords #299

Comments

joeytakeda commented Apr 23, 2024

martindholmes commented Oct 7, 2024