Add arrow adapter #755

Merged
merged 15 commits into from
Aug 9, 2024

Conversation

skarakuzu
Contributor

@skarakuzu skarakuzu commented Jun 17, 2024

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

closes #744

@danielballan
Member

Summarizing chat from this morning:

  • Remove the dataframe_adapter property from arrow.py. Instead of adding code to table.py, put it directly in the read, read_partition (etc.) methods in arrow.py.
  • Make two classes in arrow.py, one addressing the "file" format (supports random access of row batches) and one addressing "stream" format (supports appending row batches to an existing file).

Member

@danielballan danielballan left a comment

This is on track. I left some suggestions regarding the implementation.

tiled/_tests/adapters/test_arrow.py (outdated)
tiled/adapters/arrow.py (outdated)
tiled/adapters/arrow.py (outdated)
tiled/adapters/arrow.py (outdated)
tiled/adapters/arrow.py (outdated)
@skarakuzu skarakuzu marked this pull request as ready for review August 9, 2024 17:35
Member

@danielballan danielballan left a comment

I missed some things in my last review.

Also, these two files should be removed:

  • main.py (empty)
  • test.arrow (large binary file)

else:
    return pyarrow.ipc.open_file(self._partition_paths[partition])

@property
Member

I love that this is a generator! I don't think it needs to be a @property, though. Generally, functions that do I/O (or anything that could potentially take a while) should be normal methods.
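To illustrate the suggestion, here is a small sketch using a hypothetical `BatchSource` class (a stand-in, not the adapter in this PR). The generator is exposed as a regular method, so the parentheses at the call site signal that work, possibly I/O, happens when it is invoked.

```python
from typing import Iterator, List


class BatchSource:
    """Hypothetical stand-in for an adapter that yields row batches."""

    def __init__(self, batches: List[List[int]]) -> None:
        self._batches = batches

    def iter_batches(self) -> Iterator[List[int]]:
        # A plain generator method: nothing runs until the caller iterates,
        # and the call syntax makes the potential cost visible.
        for batch in self._batches:
            yield batch


source = BatchSource([[1, 2], [3]])
collected = list(source.iter_batches())
```

With a property, `source.iter_batches` would look like cheap attribute access even though it may open files behind the scenes.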

for batch in data:
    file_writer.write_batch(batch)

def read(self, *args: Any, **kwargs: Any) -> pandas.DataFrame:
Member

This currently accepts any positional or keyword arguments and then ignores them. Notice that in CSV, we accept any arguments but we pass them into another method which accepts specific arguments and applies them:

tiled/tiled/adapters/csv.py

Lines 203 to 217 in 15abbc9

def read(
    self, *args: Any, **kwargs: Any
) -> Union[pandas.DataFrame, dask.dataframe.DataFrame]:
    """
    Parameters
    ----------
    args :
    kwargs :
    Returns
    -------
    """
    return self.dataframe_adapter.read(*args, **kwargs)

It should accept an optional fields parameter and apply it if it is not None.

if fields is not None:
    df = df[fields]
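Put together, a `read` with an explicit optional `fields` parameter might look like the sketch below. `InMemoryTableAdapter` is a hypothetical class that simply holds a DataFrame; it stands in for the Arrow adapter so the example stays self-contained.

```python
from typing import List, Optional

import pandas


class InMemoryTableAdapter:
    """Hypothetical adapter wrapping a DataFrame, for illustration only."""

    def __init__(self, df: pandas.DataFrame) -> None:
        self._df = df

    def read(self, fields: Optional[List[str]] = None) -> pandas.DataFrame:
        df = self._df
        # Apply the column selection only when the caller asked for one.
        if fields is not None:
            df = df[fields]
        return df


adapter = InMemoryTableAdapter(pandas.DataFrame({"x": [1, 2], "y": [3.0, 4.0]}))
subset = adapter.read(fields=["y"])
everything = adapter.read()
```

Naming the parameter in the signature, rather than accepting `*args` and `**kwargs`, also means an unsupported argument raises a `TypeError` instead of being silently dropped.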

The pyarrow table corresponding to a given partition and batch as pandas dataframe.
"""
df = self.reader_handle_partiton(partition)
return df.read_all().to_pandas()
Member

Similarly to read() this should apply the fields parameter, which is currently just ignored.

@@ -0,0 +1,70 @@
import tempfile
Member

I think that if you run pytest it won't discover this file because it only discovers subpackages and modules that have the word test in them. The directory _tests/adapters needs to be renamed _tests/test_adapters.

Contributor Author

I also added a MyPy check for adapter/test_arrow.py. I was thinking of migrating the unit tests into the relevant directories in the future, and thought that we could type-check them too.

@danielballan danielballan merged commit b3e47d4 into bluesky:main Aug 9, 2024
9 checks passed
Development

Successfully merging this pull request may close these issues.

Add an ArrowAdapter
2 participants