-
Notifications
You must be signed in to change notification settings - Fork 686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode segmenter performance #1703
base: master
Are you sure you want to change the base?
Unicode segmenter performance #1703
Conversation
Thank you for submitting this @clipperhouse , we'll review it soon. |
Thanks @abhinavdangeti. I did some more testing of differences between segmenters here: https://github.com/clipperhouse/segmenter-repro |
Updated benchmark, using new sample text (~110K in size, multilingual). Note allocations.
|
@abhinavdangeti I think this is in good shape for your review. Happy to discuss. |
I see that the |
@abhinavdangeti OK, try those workflows again? Thanks. |
89f74ba
to
cf2065d
Compare
I rebased a bit for a cleaner merge. |
cf2065d
to
93cc997
Compare
(Rebased) |
@abhinavdangeti Friendly ping, let me know if you'd like to pursue this. |
a8db884
to
c8e26eb
Compare
d4de8e6
to
2782f9f
Compare
2782f9f
to
f944431
Compare
f944431
to
e7dd25c
Compare
e7dd25c
to
b6bca0a
Compare
Hiya @abhinavdangeti, the above pushes are just rebases, no updates here in a while. Would you like to run checks? |
Thanks @clipperhouse , re-running the checks. |
Looks good, thanks. Ready for review at your convenience. |
b6bca0a
to
9b41d64
Compare
226cf98
to
cccea71
Compare
cccea71
to
8fa8ed9
Compare
8fa8ed9
to
4bfba33
Compare
Replacing blevesearch/segment. ~2x throughput improvement. Refactor allocations, now ~O(1). Add tests & multilingual sample text to ensure identical behavior. Known differences from previous segmenter: - The original segmenter splits runs of spaces into separate tokens; uax29 concatenates runs into a single token. - The original segmenter doesn’t handle emoji skin tone modifiers, the new one does, attributable to Unicode version update.
4bfba33
to
042b2d8
Compare
In this PR, as an experiment I’ve swapped out the Unicode segmenter in Bleve with this one.
Benefits
Testing & compatibility