Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance find zip archive #1664

Conversation

johandahlberg
Copy link
Contributor

This is an attempt to address #1662. Working on a code base that makes extensive us of zip archives containing parquet files that need to be read on the fly to load data, I noticed that find was a major bottleneck in that process.

I have made a zip specific implementation of find here that relies on the file list of the zip file, rather than explicitly walking.

Here is how I measured the performance:

from fsspec.implementations.zip import ZipFileSystem

# Example achieve with roughly 900 files, in some deep directory structures
file_system = ZipFileSystem("example.zip")
%timeit file_system.find("/")

Performance for find on current master

2.14 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Performance for find with this fix applied:

318 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Let me know what you think.

fsspec/implementations/zip.py Outdated Show resolved Hide resolved
fsspec/implementations/zip.py Outdated Show resolved Hide resolved
fsspec/implementations/zip.py Outdated Show resolved Hide resolved
fsspec/implementations/zip.py Outdated Show resolved Hide resolved
fsspec/implementations/zip.py Outdated Show resolved Hide resolved
fsspec/implementations/zip.py Outdated Show resolved Hide resolved
fsspec/implementations/zip.py Outdated Show resolved Hide resolved
fsspec/implementations/zip.py Outdated Show resolved Hide resolved
@johandahlberg
Copy link
Contributor Author

Thank you for the review @martindurant! I refactored the code quite a bit, to make use of the dir_cache instead of directly accessing the zip file. I think it made the code easier to follow.

Regarding removing the maxdepth that would be a change in the behavior compared to the find method on the AbstractFileSystem - since my initial strategy here was to try to implement something that had parity in the output I kept it in. If you still want me to remove it let me know and I'll do it (or feel free to push any changes you want to this branch if you prefer that).

I implemented it as suggested by you now, but just filtering for it at the end if it is set.

fsspec/implementations/zip.py Outdated Show resolved Hide resolved
fsspec/implementations/zip.py Show resolved Hide resolved
@martindurant
Copy link
Member

I wonder, should this code go up into AbstractArchiveFileSystem, since it may have a similar performance benefit for libarchive or tar filesystems? I'm not sure if find is slow there or if there's anything else to worry about.

@johandahlberg
Copy link
Contributor Author

@martindurant Let me know if you are happy with this or if there is anything else you'd like to see. When it comes to moving to evaluating this libarchive/tar file systems I unfortunately don't think that I will have the bandwidth to take that on.

@martindurant martindurant merged commit 7793ab8 into fsspec:master Aug 22, 2024
11 checks passed
@martindurant
Copy link
Member

I unfortunately don't think that I will have the bandwidth to take that on.

Understood - maybe someone else gets the itch if indeed speed is a problem for the other ones.

@johandahlberg
Copy link
Contributor Author

@martindurant thank you for a very enjoyable review process, and thanks from brining this in. It will help our use-case immensely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants