Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use correct tokenizer when building termsetquery #4983

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

trinity-1686a
Copy link
Contributor

Description

fix #4945

How was this PR tested?

added integration test (also fix an integration test not cleaning up behind)

@fulmicoton
Copy link
Contributor

I think we want this to behave such as:

field IN [xxxx yyyy]

should behave exactly like

field:xxxx OR field:yyyy

Copy link

On SSD:

Average search latency is 0.998x that of the reference (lower is better).
Ref run id: 1998, ref commit: 5df205d
Link

On GCS:

Average search latency is 1.03x that of the reference (lower is better).
Ref run id: 1999, ref commit: 5df205d
Link

@@ -42,26 +43,43 @@ impl From<TermQuery> for QueryAst {
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick

fn get_field(field_path: &str, schema: &'a TantivySchema) -> Option<Field> {
    if let (field, _field_path) = schema.find_field(&self.field) {
        return Some(field);
    }
    schema.get_field(DYNAMIC_FIELD_NAME).ok()
}

Comment on lines +47 to +50
let field = schema
.find_field(&self.field)
.or_else(|| schema.find_field(DYNAMIC_FIELD_NAME))?
.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let field = schema
.find_field(&self.field)
.or_else(|| schema.find_field(DYNAMIC_FIELD_NAME))?
.0;
let field = get_field(&self.field, schema)?;

Comment on lines +65 to +71
self.build_ast(
schema,
tokenizer_manager,
self.get_tokenizer(schema)
.or(Some("raw"))
.map(ToString::to_string),
)
Copy link
Contributor

@fulmicoton fulmicoton Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.build_ast(
schema,
tokenizer_manager,
self.get_tokenizer(schema)
.or(Some("raw"))
.map(ToString::to_string),
)
let tokenizer: String = self.get_tokenizer(schema)
.unwrap_or("raw")
.to_string();
self.build_ast(
schema,
tokenizer_manager,
tokenizer
)

&self,
schema: &TantivySchema,
tokenizer_manager: &TokenizerManager,
tokenizer: Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is never called with None

_search_fields: &[String],
_with_validation: bool,
) -> Result<TantivyQueryAst, InvalidQuery> {
self.build_ast(schema, tokenizer_manager, Some("raw".to_string()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.build_ast(schema, tokenizer_manager, Some("raw".to_string()))
self.build_ast(schema, tokenizer_manager, "raw".to_string())


use super::{BuildTantivyAst, QueryAst};
use crate::query_ast::utils::DYNAMIC_FIELD_NAME;
Copy link
Contributor

@fulmicoton fulmicoton Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

term_query.rs should not have been modified at all.

This PR is obfuscating what it does with logic that only concerns term_set_query.

The only thing being factorized is build_ast as far as I could tell, and it is definitely not worth it.

@@ -56,15 +63,34 @@ impl TermSetQuery {
field: full_path.to_string(),
value: value.to_string(),
};
let ast =
term_query.build_tantivy_ast_call(schema, tokenizer_manager, &[], false)?;
let ast = term_query.ast_for_term_extraction(schema, tokenizer_manager)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not build a TermQuery at all and add
a function

build_ast_for_term(field_path: &str, value: &str, schema: &Schema, tokenizer_manager: &TokenizerManager)

Comment on lines +73 to +90
let str_term_count = terms.iter().filter(|term| is_term_str(term)).count();
if str_term_count <= 1 {
for term in terms {
all_terms.insert(term);
}
} else {
// we have a string, and it got split into multiple tokens, so we want an
// intersection of them
let mut phrase = Vec::with_capacity(terms.len());
for term in terms {
if is_term_str(&term) {
phrase.push(term);
} else {
all_terms.insert(term);
}
}
intersections.push(phrase);
}
Copy link
Contributor

@fulmicoton fulmicoton Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is rather odd here.

You implicitly rely on the fact that you know that the query ast is a disjunction over different field:

e.g.
(text query with possibly more than one term OR numerical query)

If you want to take that kind of shortcuts (after having considered other solutions), add comments to explain why it works.

Ideally, assert that the query is indeed what you think it is.

Ok(term_set_query.into())
use tantivy::query::{BooleanQuery, Query, TermQuery, TermSetQuery};

let (terms_it, intersections) = self.make_term_iterator(schema, tokenizer_manager)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need a comment here.

let typ = val.json_path_type().unwrap_or_else(|| val.typ());
typ == Type::Str
}

impl TermSetQuery {
fn make_term_iterator(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has a very non trivial purpose and a retrun type. It needs a comment.

Copy link
Contributor

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments in PR.
As far as I can tell, there are no reason to put code in the term_query.rs file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dynamic fields term set query not working
2 participants