
[opt](inverted index) optimize the space usage of the inverted index dictionary file and position information #41114

Draft · wants to merge 1 commit into master

Conversation

zzzxl1993
Contributor

Proposed changes

Issue Number: close #xxx

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the documentation has been moved to doris-website.
See Doris Document.

@zzzxl1993 zzzxl1993 marked this pull request as draft September 23, 2024 03:08
@Unalian

Unalian commented Sep 24, 2024

Has Doris's inverted index been compressed before this change? Why is it smaller than the inverted index produced by Lucene/Clucene? Thanks for sharing your ideas.

@zzzxl1993
Contributor Author

@Unalian

  1. Earlier versions of Doris only applied the PFOR compression algorithm to the inverted lists of the inverted index. We plan to use the PFOR algorithm for position information and Zstd compression for dictionary information, which will further reduce the space usage of Doris indexes.
  2. Doris only utilizes column storage and inverted indexes, resulting in less space usage compared to Elasticsearch.
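
For context, here is a minimal sketch of the frame-of-reference idea behind PFOR, applied to a block of sorted positions: delta-encode, then bit-pack every delta at the width of the largest delta. This is a simplified illustration only, not Doris's or CLucene's actual implementation, and it omits PFOR's exception handling for outlier values.

```cpp
// Simplified FOR/PFOR-style encoding sketch for a block of sorted positions.
// Illustration of the general idea only; not the Doris/CLucene implementation.
#include <algorithm>
#include <cstdint>
#include <vector>

// Number of bits needed to represent v (at least 1).
static uint32_t bits_required(uint32_t v) {
    uint32_t bits = 0;
    while (v) { ++bits; v >>= 1; }
    return bits == 0 ? 1 : bits;
}

// Delta-encode the sorted positions, then bit-pack all deltas with a single
// fixed width chosen from the largest delta in the block.
std::vector<uint8_t> encode_block(const std::vector<uint32_t>& positions) {
    std::vector<uint32_t> deltas;
    uint32_t prev = 0, max_delta = 0;
    for (uint32_t p : positions) {
        uint32_t d = p - prev;
        prev = p;
        deltas.push_back(d);
        max_delta = std::max(max_delta, d);
    }
    uint32_t width = bits_required(max_delta);

    std::vector<uint8_t> out;
    out.push_back(static_cast<uint8_t>(width));  // block header: bit width
    uint64_t buffer = 0;
    uint32_t buffered_bits = 0;
    for (uint32_t d : deltas) {
        buffer |= static_cast<uint64_t>(d) << buffered_bits;
        buffered_bits += width;
        while (buffered_bits >= 8) {              // flush full bytes
            out.push_back(static_cast<uint8_t>(buffer & 0xFF));
            buffer >>= 8;
            buffered_bits -= 8;
        }
    }
    if (buffered_bits > 0) out.push_back(static_cast<uint8_t>(buffer & 0xFF));
    return out;
}
```

Because position deltas within a document are typically small, packing them at a few bits each rather than as full integers is where the space saving comes from.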

@Unalian

Unalian commented Sep 24, 2024

> @Unalian
>
>   1. Earlier versions of Doris only applied the PFOR compression algorithm to the inverted lists of the inverted index. We plan to use the PFOR algorithm for position information and Zstd compression for dictionary information, which will further reduce the space usage of Doris indexes.
>   2. Doris only utilizes column storage and inverted indexes, resulting in less space usage compared to Elasticsearch.

Thank you for answering!

  1. I see PFOR compression applied to .frq files in the code. However, .frq files (and, per your plan, .prx files) are always tiny compared with .tis (as seen in Clucene). Does the PFOR compression algorithm on .frq/.prx have a big impact on the inverted index size? [screenshot of index file sizes] Or is there a big difference between the inverted index file sizes generated by Clucene and those generated by Doris?
  2. If we ignore all the data and only consider the size of the inverted index: I used Lucene to build an inverted index on the same data, and Doris uses less space. This makes me curious.

@zzzxl1993
Contributor Author

@Unalian

  1. Did you use the official Clucene for your tests?
  2. Doris has implemented some optimizations and modifications to Clucene. You can refer to this codebase: https://github.com/apache/doris-thirdparty/tree/clucene and the related pull request: [opt](inverted index) optimize the space usage of the inverted index dictionary file and position information doris-thirdparty#238.

@Unalian

Unalian commented Sep 24, 2024

> @Unalian
>
>   1. Did you use the official Clucene for your tests?
>   2. Doris has implemented some optimizations and modifications to Clucene. You can refer to this codebase: https://github.com/apache/doris-thirdparty/tree/clucene and the related pull request: [opt](inverted index) optimize the space usage of the inverted index dictionary file and position information doris-thirdparty#238.
  1. I used this version: git://clucene.git.sourceforge.net/gitroot/clucene/clucene. I set a simple analyzer, set the writer config STORE_NO, INDEX_NONORMS, INDEX_TOKENIZED, and set the config to make sure there is only one segment.
  2. Thank you! I am reading the code there. I see that you compress the .tis file using zstd in this PR; this may bring a good improvement.
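
For reference, a minimal sketch of what zstd compression of a .tis dictionary block could look like, using the standard libzstd C API. This is an illustration under the assumption that dictionary blocks are compressed as opaque byte ranges; the actual changes live in the doris-thirdparty CLucene fork linked above, and the function names here are hypothetical.

```cpp
// Hypothetical sketch: compress/decompress a term-dictionary (.tis) block
// with the standard libzstd C API. Not the actual doris-thirdparty code.
#include <zstd.h>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<char> compress_tis_block(const std::string& raw_block, int level = 3) {
    size_t bound = ZSTD_compressBound(raw_block.size());   // worst-case output size
    std::vector<char> compressed(bound);
    size_t written = ZSTD_compress(compressed.data(), bound,
                                   raw_block.data(), raw_block.size(), level);
    if (ZSTD_isError(written)) {
        throw std::runtime_error(ZSTD_getErrorName(written));
    }
    compressed.resize(written);
    return compressed;
}

std::vector<char> decompress_tis_block(const std::vector<char>& compressed,
                                       size_t original_size) {
    std::vector<char> raw(original_size);
    size_t read = ZSTD_decompress(raw.data(), original_size,
                                  compressed.data(), compressed.size());
    if (ZSTD_isError(read)) {
        throw std::runtime_error(ZSTD_getErrorName(read));
    }
    raw.resize(read);
    return raw;
}
```

Term dictionaries contain many shared prefixes and repeated strings, which is why a general-purpose compressor like zstd tends to shrink .tis files noticeably at a modest decompression cost.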
