
[opt](inverted index) optimize the space usage of the inverted index dictionary file and position information #41114

Draft · wants to merge 1 commit into master

Conversation

zzzxl1993
Contributor

Proposed changes

Issue Number: close #xxx

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the documentation has been moved to doris-website.
See Doris Document.

@zzzxl1993 zzzxl1993 marked this pull request as draft September 23, 2024 03:08
@Unalian

Unalian commented Sep 24, 2024

Has Doris's inverted index been compressed before this change? Why is it smaller than the inverted index produced by Lucene/Clucene? Thanks for sharing your ideas.

@zzzxl1993
Contributor Author

@Unalian

  1. Earlier versions of Doris only applied the PFOR compression algorithm to the inverted lists of the inverted index. We plan to use the PFOR algorithm for position information and Zstd compression for dictionary information, which will further reduce the space usage of Doris indexes.
  2. Doris only utilizes column storage and inverted indexes, resulting in less space usage compared to Elasticsearch.
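
For context, here is a minimal sketch of the frame-of-reference idea behind PFOR, applied to a block of sorted positions: delta-encode, then bit-pack every delta at the width of the largest delta. This is a simplified illustration only, not Doris's or CLucene's actual implementation, and it omits PFOR's exception handling for outlier values.

```cpp
// Simplified FOR/PFOR-style encoding sketch for a block of sorted positions.
// Illustration of the general idea only; not the Doris/CLucene implementation.
#include <algorithm>
#include <cstdint>
#include <vector>

// Number of bits needed to represent v (at least 1).
static uint32_t bits_required(uint32_t v) {
    uint32_t bits = 0;
    while (v) { ++bits; v >>= 1; }
    return bits == 0 ? 1 : bits;
}

// Delta-encode the sorted positions, then bit-pack all deltas with a single
// fixed width chosen from the largest delta in the block.
std::vector<uint8_t> encode_block(const std::vector<uint32_t>& positions) {
    std::vector<uint32_t> deltas;
    uint32_t prev = 0, max_delta = 0;
    for (uint32_t p : positions) {
        uint32_t d = p - prev;
        prev = p;
        deltas.push_back(d);
        max_delta = std::max(max_delta, d);
    }
    uint32_t width = bits_required(max_delta);

    std::vector<uint8_t> out;
    out.push_back(static_cast<uint8_t>(width));  // block header: bit width
    uint64_t buffer = 0;
    uint32_t buffered_bits = 0;
    for (uint32_t d : deltas) {
        buffer |= static_cast<uint64_t>(d) << buffered_bits;
        buffered_bits += width;
        while (buffered_bits >= 8) {              // flush full bytes
            out.push_back(static_cast<uint8_t>(buffer & 0xFF));
            buffer >>= 8;
            buffered_bits -= 8;
        }
    }
    if (buffered_bits > 0) out.push_back(static_cast<uint8_t>(buffer & 0xFF));
    return out;
}
```

Because position deltas within a document are typically small, packing them at a few bits each rather than as full integers is where the space saving comes from.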

@Unalian

Unalian commented Sep 24, 2024

> @Unalian
>
>   1. Earlier versions of Doris only applied the PFOR compression algorithm to the inverted lists of the inverted index. We plan to use the PFOR algorithm for position information and Zstd compression for dictionary information, which will further reduce the space usage of Doris indexes.
>   2. Doris only utilizes column storage and inverted indexes, resulting in less space usage compared to Elasticsearch.

Thank you for answering!

  1. I see PFOR compression applied to .frq files in the code. However, .frq files (and, per your plan, .prx files) are always tiny compared with .tis (as seen in Clucene). Does the PFOR compression algorithm on .frq/.prx have a big impact on the inverted index size? [screenshot of index file sizes] Or is there a big difference between the inverted index file sizes generated by Clucene and those generated by Doris?
  2. If we ignore all the data and only consider the size of the inverted index: I used Lucene to build an inverted index on the same data, and Doris uses less space. This makes me curious.

@zzzxl1993
Contributor Author

@Unalian

  1. Did you use the official Clucene for your tests?
  2. Doris has implemented some optimizations and modifications to Clucene. You can refer to this codebase: https://github.com/apache/doris-thirdparty/tree/clucene and the related pull request: [opt](inverted index) optimize the space usage of the inverted index dictionary file and position information doris-thirdparty#238.

@Unalian

Unalian commented Sep 24, 2024

> @Unalian
>
>   1. Did you use the official Clucene for your tests?
>   2. Doris has implemented some optimizations and modifications to Clucene. You can refer to this codebase: https://github.com/apache/doris-thirdparty/tree/clucene and the related pull request: [opt](inverted index) optimize the space usage of the inverted index dictionary file and position information doris-thirdparty#238.
  1. I used this version: git://clucene.git.sourceforge.net/gitroot/clucene/clucene. I set a simple analyzer, set the writer config STORE_NO, INDEX_NONORMS, INDEX_TOKENIZED, and set the config to make sure there is only one segment.
  2. Thank you! I am reading the code there. I see that you compress the .tis file using zstd in this PR; this may bring a good improvement.
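
For reference, a minimal sketch of what zstd compression of a .tis dictionary block could look like, using the standard libzstd C API. This is an illustration under the assumption that dictionary blocks are compressed as opaque byte ranges; the actual changes live in the doris-thirdparty CLucene fork linked above, and the function names here are hypothetical.

```cpp
// Hypothetical sketch: compress/decompress a term-dictionary (.tis) block
// with the standard libzstd C API. Not the actual doris-thirdparty code.
#include <zstd.h>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<char> compress_tis_block(const std::string& raw_block, int level = 3) {
    size_t bound = ZSTD_compressBound(raw_block.size());   // worst-case output size
    std::vector<char> compressed(bound);
    size_t written = ZSTD_compress(compressed.data(), bound,
                                   raw_block.data(), raw_block.size(), level);
    if (ZSTD_isError(written)) {
        throw std::runtime_error(ZSTD_getErrorName(written));
    }
    compressed.resize(written);
    return compressed;
}

std::vector<char> decompress_tis_block(const std::vector<char>& compressed,
                                       size_t original_size) {
    std::vector<char> raw(original_size);
    size_t read = ZSTD_decompress(raw.data(), original_size,
                                  compressed.data(), compressed.size());
    if (ZSTD_isError(read)) {
        throw std::runtime_error(ZSTD_getErrorName(read));
    }
    raw.resize(read);
    return raw;
}
```

Term dictionaries contain many shared prefixes and repeated strings, which is why a general-purpose compressor like zstd tends to shrink .tis files noticeably at a modest decompression cost.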
