Added support for exists query, as defined in Elasticsearch
The exists query does not consider types, only field names.
Field capabilities will unfortunately have to be handled differently.

This works by introducing an internal (but normal) "u64" field
that stores a postings list for field existence.

For performance/RAM reasons, the field's full path is not stored
as a string. Instead, we compute a u64 FNV hash over the
path from root to leaf.
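
As a rough illustration of the mechanism (not the actual implementation, which works over a tantivy Document), the sketch below keeps a toy postings list per hashed path. `Value`, `FieldPresenceIndex`, and `extend_hash` are hypothetical names, the hash is a hand-rolled FNV-1a standing in for the `fnv` crate, and recording only leaf paths is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// FNV-1a 64-bit parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Hypothetical stand-in for a JSON document: a leaf value or a nested object.
enum Value {
    Leaf,
    Object(Vec<(&'static str, Value)>),
}

/// Extends a path hash with one segment followed by a 0xFF separator byte.
fn extend_hash(mut state: u64, segment: &str) -> u64 {
    for &byte in segment.as_bytes().iter().chain(&[0xFFu8]) {
        state ^= u64::from(byte);
        state = state.wrapping_mul(FNV_PRIME);
    }
    state
}

/// Toy postings lists: path hash -> ids of the documents containing that path.
#[derive(Default)]
struct FieldPresenceIndex {
    postings: HashMap<u64, Vec<u32>>,
}

impl FieldPresenceIndex {
    /// Records every leaf path of `doc` under `doc_id`.
    fn add_doc(&mut self, doc_id: u32, doc: &Value) {
        self.visit(doc_id, doc, FNV_OFFSET_BASIS);
    }

    fn visit(&mut self, doc_id: u32, value: &Value, path_hash: u64) {
        match value {
            Value::Leaf => {
                let docs = self.postings.entry(path_hash).or_default();
                // Avoid duplicate entries when a path repeats within one doc.
                if docs.last() != Some(&doc_id) {
                    docs.push(doc_id);
                }
            }
            Value::Object(children) => {
                for (key, child) in children {
                    self.visit(doc_id, child, extend_hash(path_hash, key));
                }
            }
        }
    }

    /// An exists query hashes the requested path and looks up its postings list.
    fn exists(&self, path: &[&str]) -> &[u32] {
        let hash = path
            .iter()
            .fold(FNV_OFFSET_BASIS, |state, segment| extend_hash(state, segment));
        self.postings.get(&hash).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

With this toy index, `exists(&["actor", "login"])` returns the ids of the documents that contain a leaf at `actor.login`, whatever the type of that leaf.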

If the hash behaves ideally, collisions are very unlikely,
even when accounting for the birthday attack.
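
To put a rough number on this, the standard birthday approximation can be applied. The figure of one million distinct field paths below is an assumed example, not a measurement, and the estimate assumes the hash is uniformly distributed over 64 bits.

```latex
% n distinct field paths hashed uniformly into N = 2^{64} values
P(\text{collision}) \approx 1 - \exp\!\left(-\frac{n(n-1)}{2N}\right)
                    \approx \frac{n^2}{2^{65}}

% Assumed example: n = 10^6 distinct paths
\frac{(10^{6})^2}{2^{65}} \approx \frac{10^{12}}{3.7 \times 10^{19}} \approx 2.7 \times 10^{-8}
```

If a collision did occur, the expected effect would be a false positive: an exists query on one path would also match documents containing the colliding path.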

When dealing with complex JSON with the raw tokenizer, this feature can
double the number of tokens we deal with, which has an impact on
performance.

For this reason, it is added as an option in the DocMapper
(index_field_presence, false by default).

Like Elasticsearch, we only store field existence for indexed fields.
Also, in order to handle refinements like expand_dots,
we work over the built tantivy Document and reuse the existing
resolution logic.

On 1.4GB of gharchive (which is close to a worst-case scenario),
we see the following change in performance and index size:

With field_exists enabled
- Indexing Throughput: 41 MB/s
- Index size: 701M

With field_exists disabled
- Indexing Throughput: 46 MB/s
- Index size: 698M
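
Expressed as relative overhead, using only the four figures above:

```latex
\text{Throughput cost} = \frac{46 - 41}{46} \approx 10.9\%
\qquad
\text{Index size cost} = \frac{701 - 698}{698} \approx 0.43\%
```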
fulmicoton committed Aug 7, 2023
1 parent 8c2caf5 commit 8a59e05
Showing 29 changed files with 519 additions and 36 deletions.
4 changes: 4 additions & 0 deletions quickwit/Cargo.lock


1 change: 1 addition & 0 deletions quickwit/quickwit-common/Cargo.toml
@@ -16,6 +16,7 @@ async-trait = { workspace = true }
byte-unit = { workspace = true }
dyn-clone = { workspace = true }
env_logger = { workspace = true }
fnv = { workspace = true }
futures = { workspace = true }
home = { workspace = true }
hostname = { workspace = true }
3 changes: 3 additions & 0 deletions quickwit/quickwit-common/src/lib.rs
@@ -27,11 +27,13 @@ pub mod io;
mod kill_switch;
pub mod metrics;
pub mod net;
mod path_hasher;
mod progress;
pub mod pubsub;
pub mod rand;
pub mod rendezvous_hasher;
pub mod runtimes;
pub mod shared_consts;
pub mod sorted_iter;

pub mod stream_utils;
@@ -49,6 +51,7 @@ use std::str::FromStr;

pub use coolid::new_coolid;
pub use kill_switch::KillSwitch;
pub use path_hasher::PathHasher;
pub use progress::{Progress, ProtectedZoneGuard};
pub use stream_utils::{BoxStream, ServiceStream};
use tracing::{error, info};
68 changes: 68 additions & 0 deletions quickwit/quickwit-common/src/path_hasher.rs
@@ -0,0 +1,68 @@
// Copyright (C) 2023 Quickwit, Inc.
//
// Quickwit is offered under the AGPL v3.0 and as commercial software.
// For commercial licensing, contact us at [email protected].
//
// AGPL:
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
// published by the Free Software Foundation, either version 3 of the
// License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

use std::hash::Hasher;

/// Mini wrapper over the FnvHasher to incrementally hash nodes
/// in a tree.
///
/// The wrapper does not do much. Its main purpose is to
/// work around the lack of Clone in the fnv Hasher
/// and enforce a 0xFF byte separator between segments.
#[derive(Default)]
pub struct PathHasher {
    hasher: fnv::FnvHasher,
}

impl Clone for PathHasher {
    #[inline(always)]
    fn clone(&self) -> PathHasher {
        // The whole state of an FnvHasher is its current u64 hash value,
        // so seeding a fresh hasher with `finish()` reproduces that state.
        PathHasher {
            hasher: fnv::FnvHasher::with_key(self.hasher.finish()),
        }
    }
}

impl PathHasher {
    /// Helper function, mostly for tests.
    pub fn hash_path(segments: &[&[u8]]) -> u64 {
        let mut hasher = Self::default();
        for segment in segments {
            hasher.append(segment);
        }
        hasher.finish()
    }

    /// Appends a new segment to our path.
    ///
    /// In order to avoid natural collisions (e.g. &["ab", "c"] and &["a", "bc"]),
    /// we add a 0xFF byte after each segment as a separator.
    #[inline]
    pub fn append(&mut self, payload: &[u8]) {
        self.hasher.write(payload);
        // We use 255 (0xFF) as a separator: it can never appear in valid
        // UTF-8, since every UTF-8 byte has at least one 0 among its five
        // high bits.
        self.hasher.write(&[255u8]);
    }

    /// Returns the hash of the path appended so far.
    #[inline]
    pub fn finish(&self) -> u64 {
        self.hasher.finish()
    }
}
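
The two properties this file relies on, the separator and the cheap clone, can be checked in isolation. The sketch below is a dependency-free re-implementation for illustration only: `SketchPathHasher`, `hash_path`, and `hash_concatenated` are hypothetical names, and the FNV-1a loop is written by hand rather than taken from the `fnv` crate.

```rust
/// FNV-1a 64-bit parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Illustrative path hasher. Its state is a plain u64, so in this sketch
/// `Clone` and `Copy` can simply be derived.
#[derive(Clone, Copy)]
struct SketchPathHasher {
    state: u64,
}

impl SketchPathHasher {
    fn new() -> Self {
        SketchPathHasher { state: FNV_OFFSET_BASIS }
    }

    /// Plain FNV-1a: xor each byte into the state, then multiply by the prime.
    fn write(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.state ^= u64::from(byte);
            self.state = self.state.wrapping_mul(FNV_PRIME);
        }
    }

    /// Appends one path segment followed by the 0xFF separator.
    fn append(&mut self, segment: &[u8]) {
        self.write(segment);
        self.write(&[0xFF]);
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

/// Hashes a path with a separator after each segment.
fn hash_path(segments: &[&str]) -> u64 {
    let mut hasher = SketchPathHasher::new();
    for segment in segments {
        hasher.append(segment.as_bytes());
    }
    hasher.finish()
}

/// Counter-example: hashing the segments back to back, with no separator.
fn hash_concatenated(segments: &[&str]) -> u64 {
    let mut hasher = SketchPathHasher::new();
    for segment in segments {
        hasher.write(segment.as_bytes());
    }
    hasher.finish()
}
```

Without the separator, `["ab", "c"]` and `["a", "bc"]` both hash the byte string `abc` and therefore always collide. Copying the hasher at a parent node lets each child extend the parent's hash without rehashing the path from the root.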
21 changes: 21 additions & 0 deletions quickwit/quickwit-common/src/shared_consts.rs
@@ -0,0 +1,21 @@
// Copyright (C) 2023 Quickwit, Inc.
//
// Quickwit is offered under the AGPL v3.0 and as commercial software.
// For commercial licensing, contact us at [email protected].
//
// AGPL:
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
// published by the Free Software Foundation, either version 3 of the
// License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

/// Field name reserved for storing the hashes of the field paths present in a document.
pub const FIELD_PRESENCE_FIELD_NAME: &str = "_field_presence";
4 changes: 4 additions & 0 deletions quickwit/quickwit-config/src/index_config/mod.rs
@@ -66,6 +66,8 @@ pub struct DocMapping {
#[serde(default)]
pub store_source: bool,
#[serde(default)]
pub index_field_presence: bool,
#[serde(default)]
pub timestamp_field: Option<String>,
#[serde_multikey(
deserializer = Mode::from_parts,
@@ -433,6 +435,7 @@ impl TestableForRegression for IndexConfig {
)
.unwrap();
let doc_mapping = DocMapping {
index_field_presence: true,
field_mappings: vec![
tenant_id_mapping,
timestamp_mapping,
@@ -517,6 +520,7 @@ pub fn build_doc_mapper(
) -> anyhow::Result<Arc<dyn DocMapper>> {
let builder = DefaultDocMapperBuilder {
store_source: doc_mapping.store_source,
index_field_presence: doc_mapping.index_field_presence,
default_search_fields: search_settings.default_search_fields.clone(),
timestamp_field: doc_mapping.timestamp_field.clone(),
field_mappings: doc_mapping.field_mappings.clone(),
1 change: 1 addition & 0 deletions quickwit/quickwit-doc-mapper/Cargo.toml
@@ -32,6 +32,7 @@ utoipa = { workspace = true }

quickwit-datetime = { workspace = true }
quickwit-macros = { workspace = true }
quickwit-common = { workspace = true }
quickwit-query = { workspace = true }

[dev-dependencies]