Reranks and expands Solr query returns using filtered clickstream data, providing a simple, flexible collaborative filtering framework. Clickthroughfilter-x.x.x.jar runs the filter as a query parser plugin for Solr/Lucene.
The filter samples click data (from a separate clicks core) for items returned by a query and uses it to:
1. | reorganise | boost items in proportion to their click traffic (primary items) |
2. | extend | inject new items not matching the query, but connected to the query's primary items by click traffic (secondary items) |
3. | customise | boost & inject selected item types and boost & inject using selected components of click traffic |
These elements can be used separately or in combination to improve search returns, support current awareness, identify related material, make personal recommendations, etc. (see example CTF queries below).
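The first two elements can be pictured with a toy model. The sketch below is an illustration only, not the plugin's scoring code: the function name, the additive boost, and the rule that a secondary item is any non-matching item clicked to from a primary item are all assumptions made for the example.

```python
from collections import Counter


def rerank_and_extend(matches, clicks, cb=1.0, cx=1.0):
    """Toy model of 'reorganise' and 'extend' (assumed behaviour, not the plugin's).

    matches: {docID: query score} for the primary (1y) items.
    clicks:  list of (fromDocID, toDocID) pairs from the clicks core.
    """
    # Reorganise: count clicks arriving at each primary item and add a boost.
    arrivals = Counter(to for _, to in clicks if to in matches)
    scores = {doc: score + cb * arrivals[doc] ** cx for doc, score in matches.items()}

    # Extend: inject secondary (2y) items that do not match the query but are
    # reached by a click from a primary item.
    linked = Counter(to for frm, to in clicks if frm in matches and to not in matches)
    for doc, n in linked.items():
        scores[doc] = cb * n ** cx

    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)


matches = {"a": 1.5, "b": 0.9}
clicks = [("x", "b"), ("y", "b"), ("b", "c")]
print(rerank_and_extend(matches, clicks))
```

Here item b overtakes a because of its click traffic, and c is injected even though it did not match the query.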
To minimise processing times, the filter acts either on the top n items of a query return ("base=matches": faster, but lower improvement), or looks deeper into the return and draws out the top n items in terms of click traffic ("base=clicks": potentially slower, but with potentially higher improvement).
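The difference between the two base modes can be sketched on toy data. This is a minimal illustration, assuming the query return is already ranked and click counts are available per item; the function name and data shapes are hypothetical, and the plugin's own sampling may differ.

```python
def sample_primary_items(ranked_docs, click_counts, cn, base="matches"):
    """Pick the cn primary (1y) items to work on.

    ranked_docs:  docIDs in query-score order (best match first).
    click_counts: {docID: number of clicks}; missing docIDs count as 0.
    """
    if base == "matches":
        # Cheap: take the top cn items of the query return as they stand.
        return ranked_docs[:cn]
    if base == "clicks":
        # Look through the whole return and draw out the cn most clicked items.
        by_clicks = sorted(ranked_docs, key=lambda d: click_counts.get(d, 0), reverse=True)
        return by_clicks[:cn]
    raise ValueError("base must be 'matches' or 'clicks'")


ranked = ["d1", "d2", "d3", "d4"]
counts = {"d3": 9, "d4": 4, "d1": 1}
print(sample_primary_items(ranked, counts, cn=2, base="matches"))
print(sample_primary_items(ranked, counts, cn=2, base="clicks"))
```

With base=clicks the heavily clicked d3 and d4 are picked up even though they sit below the top two matches, which is where the extra cost and the extra improvement both come from.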
For the most part, the filter can be built straight into many types of complex queries, including in conjunction with other parsers, methods, facets, etc. Where a CTF query is not directly compatible with another complex query (e.g. certain joins & group functions), you can usually work around it by rearranging your input query or writing your own request handler.
A "/ctf" request handler is provided to help with testing and tuning (see attached solrconfig.xml). This quantifies primary and secondary improvements relative to the unmodified return and provides other useful metrics.
For more detail about the CTF plugin go to https://www.slideshare.net/pontneo/better-search-implementation-of-click-through-filter-as-a-query-parser-plugin-for-apache-solr-lucene
1. | search in text field of your documents core for "melon" at your default CTF settings; |
q={!ctf}text:melon or q={!ctf v=$qq} with qq=text:melon | |
2. | weighted search for "melon" in title or body fields using only "botanist" click traffic but otherwise default settings; |
q={!ctf ctp="user:botanist" cts="user:botanist" v=$qq} with qq=title:melon^2 body:melon | |
3. | weighted search for "melon" or "cherry" blending click traffic & publication_date boosting; |
q={!boost b=recip(ms(NOW/HOUR,publication_date),3.16e-11,1,1)}{!ctf cb=5 cx=2 v=$qq} with qq=text:melon^3 text:cherry^2 | |
or q={!ctf cb=5 cx=2 v=$qq} with qq=({!boost b=recip(ms(NOW/HOUR,publication_date),3.16e-11,1,1)}body:melon^3 {!boost b=recip(ms(NOW/HOUR,publication_date),3.16e-11,1,1)}body:cherry^2) etc | |
4. | filtered search for "orange" with all returned items in category fruit; |
q={!ctf v=$qq} with qq=text:orange and fq=category:fruit | |
5. | filtered search for "orange" with just secondary items in category fruit; |
q={!ctf cf="category:fruit" v=$qq} with qq=text:orange | |
6. | filtered search for "orange" with just primary items in category fruit; |
q={!ctf v=$qq} with qq=+text:orange +category:fruit | |
7. | search for "lychee" with a high sensitivity to changes over time, recoiling quickly to the unmodified search return; |
q={!ctf cp=5 cd=2 ctp="time_stamp:[NOW-1DAY TO NOW]" v=$qq} with qq=text:lychee | |
8. | search for "lychee" in title field preserving original sort & including best 10% of secondary items with title word like "lychee"; |
q={!ctf reorder=false cf="title:lychee~0.5" cs=0.1 v=$qq} with qq=title:lychee | |
9. | last week's search for "kiwi" based on the 10 most clicked "kiwi" items, with the best 20% of non-pdf secondary items that are then pushed down the list to help maximise improvement; |
q={!ctf base=clicks cn=10 cz="NOW-7DAY" cs=0.2 cf="-doctype:pdf" cy=0.95 v=$qq} with qq=text:kiwi | |
10. | top 5 most visited "lemon" recipes by user types 2 & 3 over the last 7 days; |
q={!ctf cn=5 base=clicks restrict=true extend=false ctp="+user_type:(2 3) +time_stamp:[NOW-7DAY/DAY TO NOW]" v=$qq} with qq=recipe:*lemon* | |
11. | top 10 recommendations for userID:x, based on userID:x's up to 20 most visited items over the last month and click traffic through those items by any other user with an interest in "fruit" since userID:x's last visit; |
q={!ctf base=clicks only2y=true cn=20 ctp="userID:x AND time_stamp:[NOW-31DAY TO NOW]" cts="-userID:x AND user_interests:fruit AND time_stamp:[NOW-(last_visit)DAY TO NOW]" v=$qq} with qq=docID:* and rows=10 | |
To remove any recommendations that userID:x has visited before, add the following cf filter to the {!ctf ...} local parameters in q: | |
cf="-({!join from=toDocID to=docID fromIndex=clicks_core}userID:x {!join from=fromDocID to=docID fromIndex=clicks_core}userID:x)" | |
12. | next item in non-repeating discovery query with a simple fallback to avoid blind alleys; |
q={!ctf only2y=true cts="-sessionID:currentSessionID" v=$qq} with qq=docID:currentDocID^10 categoryID:currentCategoryID and rows=1 | |
13. | related material for a given document with a simple fallback query for where there are few direct connections; |
q={!ctf only2y=true base=clicks v=$qq} OR {!ctf base=clicks v=$qqq} with qq=docID:currentDocID^10 and qqq=categoryID:currentCategoryID |
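The example queries above can be assembled and URL-encoded from client code. The following is a minimal sketch using only the Python standard library; the helper names, the host, the core name "documents" and the use of Solr's standard /select handler are assumptions for illustration.

```python
from urllib.parse import urlencode


def ctf_local_params(**params):
    """Build a {!ctf ... v=$qq} local-params prefix from keyword arguments."""
    parts = ["!ctf"]
    for name, value in params.items():
        text = str(value)
        # Quote anything that is not a plain word or number, e.g. user:botanist.
        if not text.replace(".", "").isalnum():
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{name}={text}")
    parts.append("v=$qq")
    return "{" + " ".join(parts) + "}"


def ctf_select_url(core_url, qq, rows=10, **params):
    """Return a request URL whose q parameter wraps qq in the ctf parser."""
    query = {"q": ctf_local_params(**params), "qq": qq, "rows": rows}
    return core_url.rstrip("/") + "/select?" + urlencode(query)


# Hypothetical host and core name.
print(ctf_local_params(ctp="user:botanist", cts="user:botanist"))
print(ctf_select_url("http://localhost:8983/solr/documents", "text:melon", cb=5, cx=2))
```

Passing the user query separately as qq, as the examples do, keeps the local parameters and the query text from interfering with each other's quoting.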
1. | document core(s) with unique docIDs |
2. | a clicks core for storing clicks (see attached schema.xml) with at least the following fields: |
timestamp - timestamp of click | |
fromDocID - referrer docID (may be a null value where necessary) | |
toDocID - destination docID | |
other fields (such as userID, user_type, user_interests, to_posn_in_list, user_query etc.) are not required, but filtering on them is of course part of the point of using this plugin (see example queries above) |
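A click record meeting these minimum requirements might be built as follows. This is a sketch under assumptions: the field names follow the list above, the placeholder used for a missing referrer is hypothetical (the real value is whatever you configure as click_null_fromID_value), and the function name is illustrative.

```python
from datetime import datetime, timezone

# Hypothetical placeholder for clicks that have no referring document.
NULL_FROM_ID = "null"


def click_document(to_doc_id, from_doc_id=None, when=None, **optional_fields):
    """Return a dict holding the three required click fields plus any extras."""
    when = when or datetime.now(timezone.utc)
    doc = {
        # Solr date fields expect UTC in ISO 8601 form with a trailing Z.
        "timestamp": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "fromDocID": from_doc_id if from_doc_id is not None else NULL_FROM_ID,
        "toDocID": to_doc_id,
    }
    # Optional fields such as userID or user_type are what ctp/cts filter on.
    doc.update(optional_fields)
    return doc


print(click_document("doc42", from_doc_id="doc7", userID="x", user_type=2))
```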
ctf settings: | ||
solr_host_url | root containing your data cores | |
document_core_to_query | document core name | |
clicks_core_to_query | clicks core name | |
ctf mappings: | ||
document_ID_field_name | document core docID field name | |
click_fromID_field_name | clicks core fromDocID field name | |
click_null_fromID_value | clicks core null fromDocID value (for clicks without a fromID) | |
click_toID_field_name | clicks core toDocID field name | |
click_time_stamp_field_name | clicks core timestamp field name | |
ctf parameters: | ||
base | get primary (1y) items from best query matches or most clicked items (String matches or clicks) | |
restrict | show only items with click boosts (String true or false) | |
reorder | allow click boosts to affect score and sort (String true or false) | |
extend | include secondary (2y) items (String true or false) | |
only2y | show only 2y items (String true or false) | |
cn | number of 1y items to sample (int) | |
cd | average clicks per 1y item (double), controls click sample period | |
cp | time integration (int num clicks >0), low values = responsive, high = stable | |
cb | click boost = cb.(fn clicks)^cx (double cb values >0) | |
cx | click boost = cb.(fn clicks)^cx (double cx values >0) | |
ctp | click traffic type for 1y items - a filter query on click traffic (String use ctp="*" for any, ctp="user_type:2", ctp="userID:xxxx", ctp="time_stamp:[NOW-7DAY TO NOW]" or ctp="some function query" etc) | |
cts | click traffic type for 2y items - a filter query on click traffic (String as ctp) | |
cg | skews 2y scores from popular to specific (double value 0 to 1) | |
cs | proportion of 2y items to allow through, lowest & oldest traffic removed first (double value 0 to 1) | |
cf | 2y item type - a filter query on 2y items (String use cf="*" for any, cf="cat:2", cf="-doc_type:pdf", cf="published:[NOW-31DAY TO NOW]", cf="some function query" etc) | |
cy | position 2y items in list (double value between 0 (next to parent) and 1 (own click boost)) | |
cz | lookback parameter for observing past query returns at a certain time (Solr date use cz="NOW" or e.g. cz="NOW-7DAY" or cz="2015-07-14T11:32:00Z") |
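The cb and cx rows describe a boost of the form cb.(fn clicks)^cx, where cb scales the boost and cx sets how steeply it grows. The function of clicks is not specified here, so the sketch below takes it as an argument and uses math.log1p purely as a placeholder assumption; the function name click_boost is also hypothetical.

```python
import math


def click_boost(clicks, cb=1.0, cx=1.0, fn=math.log1p):
    """Compute cb * fn(clicks) ** cx.

    fn stands in for the plugin's unspecified function of clicks;
    log1p is an assumed placeholder, not the plugin's actual choice.
    """
    if cb <= 0 or cx <= 0:
        raise ValueError("cb and cx must be > 0")
    return cb * fn(clicks) ** cx


# Compare a gentle and an aggressive setting on the same click counts.
for n in (0, 1, 10, 100):
    print(n, click_boost(n, cb=1, cx=1), click_boost(n, cb=5, cx=2))
```

Raising cb lifts all clicked items by the same factor, while raising cx widens the gap between lightly and heavily clicked items, which is why the two are tuned together.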