Replace regexp tokenizer with recursive parser

In the current implementation, the RegexTokenizer uses regular expressions to tokenize the format string. However, regular expressions are not well suited to parsing nested structures such as nested brackets, because they cannot match patterns of arbitrary nesting depth. To address this, this change replaces the tokenizer with a custom recursive parser that handles nested optional brackets. The parser processes the format string character by character, building the format items and tracking the nesting of optional components.

This change allowed me to cover `strict_date_optional_time`, which has nested optional sections according to the examples in the [OS repo][2]. Additional tests are taken from [the doc][1].

[1]: https://opensearch.org/docs/latest/field-types/supported-field-types/date/
[2]: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/test/java/org/opensearch/common/time/DateFormattersTests.java
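The recursive approach described in the message can be illustrated with a minimal sketch. This is not the code from this commit: `FormatItem`, `parse_format`, `parse_items`, and `flush` are hypothetical names chosen for illustration, and the sketch assumes that `[` and `]` are the only special characters (no quoting of literals, no interpretation of format tokens).

```rust
use std::str::Chars;

/// One parsed piece of a format string (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum FormatItem {
    /// Text found outside of, or directly inside, a bracket pair.
    Literal(String),
    /// An optional section, which may itself contain optional sections.
    Optional(Vec<FormatItem>),
}

/// Parses `input` into a tree of items, treating `[...]` as optional sections.
fn parse_format(input: &str) -> Result<Vec<FormatItem>, String> {
    let mut chars = input.chars();
    parse_items(&mut chars, 0)
}

/// Consumes characters until the end of input (depth 0) or until the `]`
/// that closes the current optional section (depth > 0).
fn parse_items(chars: &mut Chars<'_>, depth: usize) -> Result<Vec<FormatItem>, String> {
    let mut items = Vec::new();
    let mut literal = String::new();
    while let Some(c) = chars.next() {
        match c {
            '[' => {
                // Emit any pending literal, then recurse into the nested section.
                flush(&mut literal, &mut items);
                let inner = parse_items(chars, depth + 1)?;
                items.push(FormatItem::Optional(inner));
            }
            ']' => {
                if depth == 0 {
                    return Err("unmatched closing bracket".to_string());
                }
                flush(&mut literal, &mut items);
                return Ok(items);
            }
            _ => literal.push(c),
        }
    }
    if depth > 0 {
        return Err("unclosed opening bracket".to_string());
    }
    flush(&mut literal, &mut items);
    Ok(items)
}

/// Moves a non-empty pending literal into `items`, leaving `literal` empty.
fn flush(literal: &mut String, items: &mut Vec<FormatItem>) {
    if !literal.is_empty() {
        items.push(FormatItem::Literal(std::mem::take(literal)));
    }
}
```

Because each `[` triggers a recursive call that returns at its matching `]`, the nesting depth of the result mirrors the nesting depth of the input, which is the property a single regular expression cannot provide. For example, calling `parse_format("yyyy[-MM[-dd]]")` on this made-up pattern yields a literal followed by an optional section that contains another optional section.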
quickwit-oss · Sep 28, 2024 · 0c6a5f4
1 parent 2bb2f8c
commit 0c6a5f4
Showing 5 changed files with 249 additions and 201 deletions.
diff --git a/quickwit/Cargo.lock b/quickwit/Cargo.lock
diff --git a/quickwit/quickwit-datetime/Cargo.toml b/quickwit/quickwit-datetime/Cargo.toml
@@ -13,7 +13,6 @@ license.workspace = true
 [dependencies]
 anyhow = { workspace = true }
 itertools = { workspace = true }
-regex = { workspace = true }
 serde = { workspace = true }
 serde_json = { workspace = true }
 tantivy = { workspace = true }