[CHORE] Begin integrating Rust Logical Plan with Dataframe API #1207
Should we change the type annotation of the `plan` arg and the `_plan` property accessor to be a union of `LogicalPlan` and `RustLogicalPlanBuilder`?
Discussed offline; `LogicalPlan` has too many methods, so a `Union` will not work out of the box.
Some quick questions about the current "switching between the all-Python logical plan and the Rust-based logical plan builder" setup.

We previously talked about exposing a `LogicalPlanBuilder` interface that would have two implementations, the all-Python query planner (`PyLogicalPlanBuilder`) and the new Rust query planner (`RustLogicalPlanBuilder`). The `DataFrame` API layer would be rewritten to use that `LogicalPlanBuilder` interface, so we could limit the number of places we'd need to switch on `context.use_rust_planner`. E.g. the `DataFrame` constructor would take a `LogicalPlanBuilder` instance instead of a union of the all-Python `LogicalPlan` and the `RustLogicalPlanBuilder`, and `read_parquet()` would construct the chosen builder once, up front.

This should end up being a good bit cleaner than imperatively switching between implementations within each `DataFrame` API function/method, and we should be able to line up the builder interfaces (such as `builder.scan()` or `builder.filter()`), since they're adding the same logical op to each underlying logical plan implementation. This does, however, require more upfront changes to the `DataFrame` implementation, since each method would need to be ported to the `PyLogicalPlanBuilder` interface (although this should be straightforward).

Do you agree that this is a better long-term approach, and if so, are you thinking that deferring this refactor is the best choice for now?
Discussed offline -- yes and yes, to get Rust plan creation from DataFrames into our hands ASAP.
Instead of this type casting, we could make the `DataFrame` constructor take a `Union[LogicalPlan, RustLogicalPlanBuilder]`.
(Discussed offline; same as above)
Nit: Instead of having a `LogicalPlanBuilder` method per read API (e.g. for reading Parquet, CSV, JSON, etc.), we could still have a `source` method that also takes a `FileFormat` enum variant, which would then be incorporated into the `SourceInfo`.

On second thought, making the `LogicalPlanBuilder` methods 1:1 with the `DataFrame` APIs makes more sense to me than 1:1 with the logical operators, since the former leaks fewer representational details to the Python side and makes the builder abstraction more useful, so keeping format-specific `read_*` methods seems like the best choice to me! E.g. it would also allow us to hide the `FileFormat` enum from the Python side.
yeah, this was what I was hoping for as well!
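The contrast under discussion might look like the following sketch — only `RustLogicalPlanBuilder`, `FileFormat`, `SourceInfo`, and the `read_*` naming come from the thread; the `source` signature and the op-tuple stand-in are hypothetical:

```python
from enum import Enum


class FileFormat(Enum):
    PARQUET = "parquet"
    CSV = "csv"
    JSON = "json"


class RustLogicalPlanBuilder:
    def __init__(self, ops=()):
        self.ops = tuple(ops)

    # Generic alternative: a single source() taking the FileFormat enum,
    # which would leak the enum to every caller.
    def source(self, paths, file_format):
        return RustLogicalPlanBuilder(
            self.ops + (("source", file_format, tuple(paths)),)
        )

    # Preferred shape: format-specific methods, 1:1 with the DataFrame
    # read_* APIs, so FileFormat stays an internal detail.
    def read_parquet(self, paths):
        return self.source(paths, FileFormat.PARQUET)

    def read_csv(self, paths):
        return self.source(paths, FileFormat.CSV)
```

Either way the same source op lands in the plan; the difference is only whether callers have to know about `FileFormat`.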
Until there's deviation in what data we're holding in the `SourceInfo` structs across Parquet, CSV, JSON, etc. file types, I think that we should keep it as a `FilesInfo` enum variant containing a `FilesInfo` struct that holds an additional `FileFormat` enum.
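The snippet that followed this comment was dropped from the scrape. As a rough Python analogue of the described shape — the real definitions would be Rust enums/structs, and any names beyond `SourceInfo`, `FilesInfo`, and `FileFormat` are guesses:

```python
from dataclasses import dataclass
from enum import Enum, auto


class FileFormat(Enum):
    # One member per supported file type.
    PARQUET = auto()
    CSV = auto()
    JSON = auto()


@dataclass(frozen=True)
class FilesInfo:
    # Shared by all file types until their metadata needs diverge.
    file_format: FileFormat
    filepaths: tuple


@dataclass(frozen=True)
class SourceInfo:
    # On the Rust side this would be an enum with a single FilesInfo
    # variant; modeled here as a wrapper around that variant's payload.
    files_info: FilesInfo
```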
Done!