[IO] Improved IO with support for reading data from compressed files #308

Merged
koenvo merged 8 commits into PySport:master from feat/load-gzip
Jun 19, 2024

probberechts
Contributor

It is common practice to store data as compressed files to reduce storage requirements. With this PR, it is no longer necessary to decompress a file before loading its data with kloppy.

from kloppy import statsperform

dataset = statsperform.load(
    raw_data="ma25_tracking.txt.gz",
    meta_data="ma1_metadata.xml.gz",
)

Whether a file is compressed is inferred from its extension. The currently supported extensions are ".gz", ".xz" and ".bz2".

- Add support for opening gzip-, bzip2- or lzma-compressed files.
- Add tests for the io.open_as_file function.
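
The extension-based detection described above can be sketched with the standard library alone. This is a minimal illustration, not the code merged in this PR: the helper name `open_maybe_compressed` and the `_OPENERS` mapping are hypothetical, and kloppy's actual implementation in kloppy/io.py may differ.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Hypothetical mapping from file extension to a stdlib opener.
# Assumption: this mirrors the three extensions listed in the PR description.
_OPENERS = {".gz": gzip.open, ".xz": lzma.open, ".bz2": bz2.open}


def open_maybe_compressed(path, mode="rb"):
    """Open `path`, decompressing transparently based on its last extension."""
    suffix = Path(path).suffix.lower()
    # Fall back to the built-in open() for files without a known extension.
    opener = _OPENERS.get(suffix, open)
    return opener(path, mode)
```

All three stdlib openers accept the same `(filename, mode)` arguments as the built-in `open`, which is what makes a simple lookup table sufficient here.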

koenvo commented Apr 19, 2024

This should also work for non-local files, right? For example, https://some-url.com/file.xml.gz
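
As a sketch of why this is feasible in principle: `gzip.GzipFile` can wrap any binary file-like object, so a remote response body can be decompressed without first saving it to disk. The example below uses an in-memory `io.BytesIO` as a stand-in for a remote response, and `read_gzipped_stream` is a hypothetical helper, not part of kloppy.

```python
import gzip
import io


def read_gzipped_stream(stream):
    """Decompress a binary file-like object that holds gzip data."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as fh:
        return fh.read()


# Stand-in for a remote response body; a real HTTP response object
# (e.g. from urllib.request.urlopen) is also a binary file-like object.
remote_body = io.BytesIO(gzip.compress(b"<xml/>"))
payload = read_gzipped_stream(remote_body)
```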

koenvo commented May 27, 2024

Can you please merge master in to make sure the tests run again?

probberechts force-pushed the feat/load-gzip branch 2 times, most recently from 827ebad to a6a288a, May 27, 2024 20:39
@probberechts
Contributor Author

I couldn't get boto (to mock an S3 bucket) to work on GitHub Actions. The most recent version has this bug, and for older versions I can't find a set of version constraints between s3fs and boto that works on every Python version. Hence, I propose disabling these tests until the bug is fixed.

I also recently found the xopen library for opening compressed files. We could use it as a more efficient and robust replacement for the _open method that I implemented. Do you think it is worth adding another dependency? It could also be an optional one.
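
The optional-dependency idea could look roughly like the sketch below: use xopen when it is installed and fall back to the standard library otherwise. This is an illustration only. The helper name `open_compressed` is hypothetical, the fallback handles gzip alone to keep the example short, and nothing in the thread states that this approach was adopted.

```python
import gzip

try:
    # Optional third-party dependency; only used when installed.
    from xopen import xopen
except ImportError:
    xopen = None


def open_compressed(path, mode="rb"):
    """Open a gzip file with xopen if available, else with the stdlib."""
    if xopen is not None:
        return xopen(path, mode=mode)
    # Assumption: the stdlib fallback covers gzip only in this sketch.
    return gzip.open(path, mode)
```

With this pattern the package keeps working without the extra dependency, and users who install xopen get its faster code path automatically.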

probberechts changed the title from "[IO] Allow reading data from compressed file" to "[IO] Improved IO with support for reading data from compressed files", Jun 18, 2024
probberechts requested a review from koenvo, June 18, 2024 19:38
koenvo merged commit a3ca3f3 into PySport:master, Jun 19, 2024
19 checks passed

koenvo commented Jun 19, 2024

Thanks Pieter, great work!

koenvo added this to the 3.15 milestone, Jun 19, 2024
probberechts deleted the feat/load-gzip branch, June 20, 2024 11:01