You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The issue of multiple term queries and ranking has come up a few times— E.g. if for a query for X and Y across two documents, A and B, if in Doc A, X = 40 and Y=0, but in Document B, X=15 and Y=10 , then Document A is ranked ahead of document B, even though it doesn't contain both query terms.
Of course, this can be solved by precise matching (+X +Y), but sometimes you don't necessarily want all of the query terms, but you would like to see the documents that have both of them come up first. This is the more intuitive people use searching, I'd argue — if I went to a library catalogue and looked up Keats Byron Shelley, I'd expect to see books about the Romantics above the ones about Keats or Byron or Shelley.
In part, TF-IDF was meant to help with that and it is useful for normalizing relevance in a collection of documents of varying lengths, but the pure sum of the TF-IDF measures doesn't really rectify the issue. So, doing some research has lead me to believe we might want to try implementing a third scoring algorithm—namely, BM25.
<xsl:functionname="hcmc:returnBM25"as="xs:double">
<xsl:paramname="rawScore"as="xs:integer"/>
<xsl:paramname="stemDocsCount"as="xs:integer"/>
<xsl:paramname="thisDocUri"as="xs:string"/>
<!--Compute the average document length--><!--TODO: While getTotalTermsInDoc is memoized, this should be moved outside of the function-->
<xsl:variablename="averageDocLength"select="sum($tokenizedUris ! hcmc:getTotalTermsInDoc(.)) div count($tokenizedUris)"as="xs:double"/>
<!--Get the total terms in the document-->
<xsl:variablename="totalTermsInDoc"select="hcmc:getTotalTermsInDoc($thisDocUri)"as="xs:integer"/>
<!--The ratio of terms in this doc versus the average document length-->
<xsl:variablename="lengthRatio"select="$totalTermsInDoc div $averageDocLength"/>
<!--Now calculate the modified inverse document frequency — see https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables If you had 10 documents and 3 in which this stem appeared, you'd get: ln(1 + ((10 - (3 + 0.5))/(3 + 0.5))) => ln(1 + (6.5 / 3.5)) => ln(1 + 1.857) => ln(2.857) => 1.050 [Note that math:log is, in XSLT, the natural log]-->
<xsl:variablename="idf"as="xs:double"select="math:log(1 + (($tokenizedDocsCount - ($stemDocsCount + 0.5)) div ($stemDocsCount + 0.5) ))"/>
<!--For b and k, see https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html.--><!--"b" is a constant, which "controls to what degree document length normalizes tf values" -->
<xsl:variablename="b"select="0.75"/>
<!--"k1" is a constant, which "controls non-linear term frequency normalization (saturation)"-->
<xsl:variablename="k1"select="1.2"/>
<!--Splitting the numerator and denominator for readability-->
<xsl:variablename="numerator"select="$rawScore * ($k1 + 1)"/>
<xsl:variablename="denominator"select="$rawScore + $k1 * (1 - $b + $b * $lengthRatio)"/>
<!--Now the final BM25 relevance score-->
<xsl:variablename="BM25"select="$idf * ($numerator div $denominator)"as="xs:double"/>
<xsl:messageuse-when="$verbose">Calculated BM25: <xsl:sequenceselect="$BM25"/></xsl:message>
<xsl:sequenceselect="$BM25"/>
</xsl:function>
I'm going to add this to a branch and start playing around with to see how it might improve things — some very quick tests yield good results (at least compared to tf-idf), but it's hard to test without running it on a larger document collection.
The text was updated successfully, but these errors were encountered:
The issue of multiple term queries and ranking has come up a few times— E.g. if for a query for X and Y across two documents, A and B, if in Doc A, X = 40 and Y=0, but in Document B, X=15 and Y=10 , then Document A is ranked ahead of document B, even though it doesn't contain both query terms.
Of course, this can be solved by precise matching (
+X +Y
), but sometimes you don't necessarily want all of the query terms, but you would like to see the documents that have both of them come up first. This is the more intuitive people use searching, I'd argue — if I went to a library catalogue and looked upKeats Byron Shelley
, I'd expect to see books about the Romantics above the ones about Keats or Byron or Shelley.In part, TF-IDF was meant to help with that and it is useful for normalizing relevance in a collection of documents of varying lengths, but the pure sum of the TF-IDF measures doesn't really rectify the issue. So, doing some research has lead me to believe we might want to try implementing a third scoring algorithm—namely, BM25.
This is what ElasticSearch uses as their primary relevance mechanism and, thanks to the aforelinked guide, I was able to quickly put together the algorithm (based on the TF-IDF implementation we already had):
I'm going to add this to a branch and start playing around with to see how it might improve things — some very quick tests yield good results (at least compared to tf-idf), but it's hard to test without running it on a larger document collection.
The text was updated successfully, but these errors were encountered: