-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add arrow adapter #755
Add arrow adapter #755
Conversation
Summarizing chat from this morning:
|
1253248
to
e584c95
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is on track. I left some suggestions regarding the implementation.
…ppend_stream work
e8e0bf6
to
a605c92
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed some things in my last review.
Also, these two files should be removed:
main.py
(empty)test.arrow
(large binary file)
tiled/adapters/arrow.py
Outdated
else: | ||
return pyarrow.ipc.open_file(self._partition_paths[partition]) | ||
|
||
@property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love that this is a generator! I think it's not necessary to make it a @property
. Generally, functions that do I/O (or anything that could potentially take awhile) should be normal methods.
tiled/adapters/arrow.py
Outdated
for batch in data: | ||
file_writer.write_batch(batch) | ||
|
||
def read(self, *args: Any, **kwargs: Any) -> pandas.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently accepts any positional or keyword arguments and then ignores them. Notice that in CSV, we accept any arguments but we pass them into another method which accepts specific arguments and applies them:
Lines 203 to 217 in 15abbc9
def read( | |
self, *args: Any, **kwargs: Any | |
) -> Union[pandas.DataFrame, dask.dataframe.DataFrame]: | |
""" | |
Parameters | |
---------- | |
args : | |
kwargs : | |
Returns | |
------- | |
""" | |
return self.dataframe_adapter.read(*args, **kwargs) |
It should accept an optional fields
parameter and apply it if it is not None
.
Lines 226 to 227 in 15abbc9
if fields is not None: | |
df = df[fields] |
tiled/adapters/arrow.py
Outdated
The pyarrow table corresponding to a given partition and batch as pandas dataframe. | ||
""" | ||
df = self.reader_handle_partiton(partition) | ||
return df.read_all().to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to read()
this should apply the fields
parameter, which is currently just ignored.
@@ -0,0 +1,70 @@ | |||
import tempfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if you run pytest
it won't discover this file because it only discovers subpackages and modules that have the word test
in them. The directory _tests/adapters
needs to be renamed _tests/test_adapters
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added also MyPy check for the adapter/test_arrow.py
. I was thinking of migrating unit tests in relevant directories in the future and thought that we can type check them too.
Checklist
closes #744