(Feature request) Option to search specified headers #13
Comments
I'm going to give implementing this a shot. Per my comment on Kim Vandry's repository, I'm going to create a new table that will include all tokens from any of the headers that don't have dedicated tables in the database. The entries in the new table will be in the form of
Even if I don't end up changing how the records are stored, I don't think its impact on the size of the mairix database will be all that bad, provided I didn't make a mistake below:
For the sake of estimating, I'll pretend that the tokens account for all of the space being used and that the number of unique tokens in the headers will be equal to the number of unique tokens in the body. The average length of the body tokens is ~90B (40MB / 450,000 tokens), so my database grows from 50MB to 95MB (50MB + 40MB * 100B / 90B) in this scenario.
Since I already have 57,000 emails taking up just under 2GB of space, an extra 45MB is a non-issue for me. Indexing arbitrary headers could be toggled with a flag if it turns out that my estimate is too optimistic. As long as search queries are still performant, I personally would gladly sacrifice a couple of hundred megs of space to support searching through arbitrary headers.
Questions and constructive criticism welcomed.
Script I used for computing the average header length:
messages$ (export LC_ALL=C;
find -type f -name '*:*' -exec sh -c 'formail -X "" < "{}"' \; \
| egrep -o '^[^: \t]{2,}:' \
| awk '{ sum += length() }
END { print "N:", NR, "Avg:", sum / NR}')
N: 1511863 Avg: 10.543
CC: @vandry
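The arithmetic behind the 95MB figure can be reproduced with a short sketch. It assumes that the 100B in the estimate is the ~90B average body token plus the ~10B average header name measured by the script above; that reading is an inference from the numbers, not something the comment states. The function name `estimate_db_mb` is hypothetical.

```shell
#!/bin/sh
# Back-of-the-envelope estimate of the database size after indexing headers.
# Assumption (not stated in the thread): each header token costs the average
# body token size plus the average header name length.
estimate_db_mb() {
    # $1: current database size (MB)
    # $2: space used by body tokens (MB)
    # $3: average body token size (bytes)
    # $4: average header name length (bytes)
    LC_ALL=C awk -v db="$1" -v body="$2" -v tok="$3" -v hdr="$4" \
        'BEGIN { printf "%.1f\n", db + body * (tok + hdr) / tok }'
}

# Figures from the comment: 50MB database, 40MB of body tokens,
# ~90B per body token, ~10B per header name.
estimate_db_mb 50 40 90 10
```

With those inputs the sketch prints 94.4, which matches the roughly 95MB quoted above.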
My estimate was off; the database with the re-indexed messages takes up ~220 MB (vs ~95 MB expected) with my changes. Aside from that, the preliminary code seems to work, and there's no appreciable difference in speed for my test queries. I still plan on looking into normalizing the data to reduce the size of the database.
TODO:
Normalizing the header terms to cut down on the size of the database is going to be a fairly involved task, so I'm not going to try to do that for the first implementation of this feature. I will be adding a new configuration option that will let users toggle this feature if disk space is a concern.
Just caught up on your work, @ericpruitt. Awesome! As per the original feature request, one solution to lighten the load is to only activate alt. header indexing when requested in
The code I wrote does just that; it supports either searching all headers or a specific set of headers, as documented in mairix.c in my pull request:
+ " h:word : match word in the value of any minor header\n"
+ " h:X:Y:word : match word in the value of minor headers named \"X\" or \"Y\"\n"
I included some statistics comparing the size of the database before and after my change in vandry#12.
EDIT: Actually, my change doesn't work exactly as you described it because it wouldn't search the other headers like "From", "To", etc., but that could be changed easily enough, or an additional operator (maybe "H") could be added.
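To illustrate how the two documented forms differ, here is a minimal sketch that splits an `h:` expression into header names and a search word. It is not the parser from the pull request; it only assumes, based on the help text above, that the last colon-separated field is the word and any fields before it name headers. The function name `parse_header_query` and the `<any>` placeholder are hypothetical.

```shell
#!/bin/sh
# Illustrative only: split "h:word" or "h:X:Y:word" into its parts.
# Assumes the final colon-separated field is the search word.
parse_header_query() {
    query=${1#h:}          # drop the leading "h:" operator
    word=${query##*:}      # text after the last colon is the search word
    case $query in
        *:*) headers=${query%:*} ;;  # fields before the word name the headers
        *)   headers= ;;             # no names given: any minor header
    esac
    printf 'headers=%s word=%s\n' "${headers:-<any>}" "$word"
}

parse_header_query 'h:mutt'
parse_header_query 'h:X-Mailer:List-Id:mutt'
```

The first call reports `headers=<any> word=mutt`, and the second reports `headers=X-Mailer:List-Id word=mutt`.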
I tried out your fork. Works as expected. The database is large, but not unfeasibly large. Great! Thanks!
Seeing the size increase and wondering: wouldn't "naive full text indexing" of the headers result in a smaller database? Of course the queries would give less precise results, but that may be a good enough tradeoff in many cases. Also, it might be possible to do naive full text indexing but store the results for message bodies and headers in different tables to improve the specificity of the results.
Potentially, but I have no intention of spending any more time on this. You're welcome to build off my changes in vandry#12.
It would be very helpful to have the option to specify (in the config file) particular headers, other than To, From, etc., whose content would also be included when building the index. I.e., in .mairixrc:
Thanks for this great tool! I use it every day (with gnus).