-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Constant space sort #61
base: master
Are you sure you want to change the base?
Conversation
Previously we used the Binary instance for Text to serialise the event name. This is wrong. We now first encode to UTF-8 and use this in the eventlog encoding.
73ef6f3
to
3cc5c14
Compare
3cc5c14
to
3f7eca6
Compare
Is it necessary to sort all the events as if they were completely unordered? I was assuming, at the time I wrote my patch, that the events of each capability would come in order within each capability, so all that was necessary was merging the capability streams back again. Perhaps there is some inconvenience in splitting the capability streams that I'm missing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have much time for today so I'll take a look again tomorrow.
I think facundominguez's point is valid. Also it seems that #14 gets in the way if we're going to replace the existing sortEvents with this constant space sorting. |
This introduces an interface providing constant-space sorting of eventlogs via on-disk merge-sort.
I have confirmed that the implementation indeed maintains constant-space behavior, requiring about 79s seconds to sort a 200MB eventlog while not exceeding 40 MBytes residency (using a chunk size of 1e5 events).
This is in contrast to the old in-memory codepath which requires 35 seconds and nearly 5GBytes of residency to sort this same eventlog.
If I increase the chunk size by an order of magnitude (to 1e6 events) then the constant-space sort has essentially the same runtime as the in-memory codepath but runs in merely 300MBytes.
Note: In testing this I realized that several of the encoding paths treated strings incorrectly. Consequently, this depends upon (and contains commits from) #62.
To-do
sortEvents
?