[FEAT] Monotonically Increasing Id for Swordfish #3180
+230
−26
Implements monotonically increasing id as a streaming sink with `max_concurrency = 1`. I tested multithreaded and single-threaded implementations and found no performance gain from multithreading. This is because monotonically increasing id is a memory-bound operator: all it does is allocate an array of ints for the ids, so multiple threads doing this in parallel are bottlenecked by memory bandwidth.
It is also much simpler to implement this as a single-threaded operation, since we only need to keep a running count of the lengths of the morsels seen so far. This is effectively just `row_number`.

Note: there was an issue in the `pyfunc_into_table_iter` function, which consumes Python iterators in scan tasks (used in `read_lance` and `read_generator`), where the consumer only calls `next()` on the iterator once. This PR fixes that.
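For reviewers who want the running-count idea in concrete form, here is a minimal Python sketch. It is not the code in this PR. It assumes morsels can be modelled as plain lists of rows, and `Morsel` and `assign_monotonic_ids` are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a morsel: a small batch of rows.
Morsel = List[dict]


def assign_monotonic_ids(
    morsels: Iterable[Morsel],
) -> Iterator[Tuple[List[int], Morsel]]:
    """Yield (ids, morsel) pairs, numbering rows in arrival order.

    A single consumer keeps one running offset, so no coordination
    between threads is needed.
    """
    offset = 0  # total number of rows seen in earlier morsels
    for morsel in morsels:
        # Allocate the id column for this morsel, starting at the offset.
        ids = list(range(offset, offset + len(morsel)))
        offset += len(morsel)
        yield ids, morsel


# Usage: four morsels of different sizes, including an empty one.
morsels = [[{"x": "a"}, {"x": "b"}], [{"x": "c"}], [], [{"x": "d"}, {"x": "e"}]]
all_ids = [i for ids, _ in assign_monotonic_ids(morsels) for i in ids]
print(all_ids)  # [0, 1, 2, 3, 4]
```

Because each morsel's ids depend only on the number of rows that came before it, the only per-morsel work is allocating the id array, which fits the memory-bound behaviour described above.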