Add multilang tokenizer (#3608)
* Add multilang tokenizer.

* Bump lindera and tantivy version.

* Fix cargo and fmt.

* Add tokenizers bench.

* Take into account comments from review. Bump lindera version to 2.6.0.

* Increase postgresql timeout for tests.

* Put multilang in custom tokenizer, avoid copying lindera dictionaries.

* Remove lindera tantivy dep, clean.

* Fix test.

* Fix build.

* Add multilang in tokenizer manager.

* Fix lindera features.
fmassot authored Jul 17, 2023
1 parent 9586c5b commit 32a02a8
Showing 15 changed files with 1,488 additions and 378 deletions.
34 changes: 34 additions & 0 deletions config/tutorials/wikipedia/multilang-index-config.yaml
@@ -0,0 +1,34 @@
#
# Index config file for multilang wikipedia datasets.
#

version: 0.6

index_id: multilang-wikipedia

doc_mapping:
  tokenizers:
    - name: multilang
      type: multilang
  field_mappings:
    - name: title
      type: text
      tokenizer: multilang
      record: position
      stored: true
      fieldnorms: true
    - name: body
      type: text
      tokenizer: multilang
      record: position
      stored: true
      fieldnorms: true
    - name: url
      type: text
      tokenizer: raw

search_settings:
  default_search_fields: [title, body]

indexing_settings:
  commit_timeout_secs: 10
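
The config above routes `title` and `body` through a tokenizer named `multilang`, and the commit message mentions lindera, a dictionary-based morphological analyzer. The tokenizer's implementation is not part of this file, so the following is only a minimal sketch of the general idea behind a multi-language tokenizer: inspect the text, then dispatch to a strategy suited to its script. All names here (`is_dictionary_script`, `mostly_dictionary_script`, `tokenize`) are hypothetical, the Unicode ranges are a small non-exhaustive subset, and the per-character split stands in for real dictionary segmentation. It is not the code added by this commit.

```rust
/// Returns true for characters from scripts that are usually segmented with a
/// dictionary rather than by whitespace. Assumption: only a few Unicode blocks
/// are checked (Hiragana, Katakana, CJK Unified Ideographs, Hangul Syllables).
fn is_dictionary_script(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF | 0x4E00..=0x9FFF | 0xAC00..=0xD7AF
    )
}

/// Crude stand-in for language detection: true when characters from the
/// dictionary-segmented scripts outnumber all other alphanumeric characters.
fn mostly_dictionary_script(text: &str) -> bool {
    let mut dict = 0usize;
    let mut other = 0usize;
    for c in text.chars() {
        if is_dictionary_script(c) {
            dict += 1;
        } else if c.is_alphanumeric() {
            other += 1;
        }
    }
    dict > other
}

/// Dispatches to one of two placeholder strategies.
/// - Dictionary scripts: one token per alphanumeric character (a real
///   tokenizer would call a morphological analyzer here instead).
/// - Everything else: split on non-alphanumeric characters and lowercase.
fn tokenize(text: &str) -> Vec<String> {
    if mostly_dictionary_script(text) {
        text.chars()
            .filter(|c| c.is_alphanumeric())
            .map(|c| c.to_string())
            .collect()
    } else {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(|t| t.to_lowercase())
            .collect()
    }
}
```

Calling `tokenize("Hello, World!")` takes the whitespace-style branch, while a string made up mostly of kana or ideographs takes the per-character branch. The `url` field in the config bypasses all of this because it uses the `raw` tokenizer.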