Improve performance find zip archive #1664

johandahlberg · 2024-08-16T09:53:00Z

This is an attempt to address #1662. While working on a code base that makes extensive use of zip archives containing parquet files that need to be read on the fly to load data, I noticed that find was a major bottleneck in that process.

I have made a zip specific implementation of find here that relies on the file list of the zip file, rather than explicitly walking.
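To illustrate the idea of listing from the archive's file list instead of walking directories, here is a minimal sketch using only the standard library `zipfile` module. It is not the code in this PR: `find_in_zip` is a hypothetical helper and the file names are made up for the example.

```python
import io
import zipfile


def find_in_zip(zf: zipfile.ZipFile, path: str = "") -> list[str]:
    """Return sorted file names under `path`, using only the archive's name list."""
    prefix = path.strip("/")
    names = []
    for name in zf.namelist():
        if name.endswith("/"):
            continue  # explicit directory entry, not a file
        if not prefix or name == prefix or name.startswith(prefix + "/"):
            names.append(name)
    return sorted(names)


# Build a tiny in-memory archive to try it out (contents are placeholders).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a/b/one.parquet", b"1")
    zf.writestr("a/two.parquet", b"2")
    zf.writestr("c/three.parquet", b"3")

with zipfile.ZipFile(buffer) as zf:
    all_files = find_in_zip(zf, "/")
    under_a = find_in_zip(zf, "a")
```

The point of the sketch is that a single pass over the name list replaces one listing call per directory, which is where a generic walk spends its time on deep archives.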

Here is how I measured the performance:

from fsspec.implementations.zip import ZipFileSystem

# Example archive with roughly 900 files, in some deep directory structures
file_system = ZipFileSystem("example.zip")
%timeit file_system.find("/")

Performance for find on current master:

2.14 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Performance for find with this fix applied:

318 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
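For anyone who wants to reproduce a comparable measurement, the following sketch builds a synthetic archive of about 900 files in nested directories. It is an assumption-laden stand-in for the real `example.zip`: `build_example_zip`, the directory naming scheme, and the empty file contents are all invented here, so absolute timings will differ from the numbers above.

```python
import io
import zipfile


def build_example_zip(target, n_files: int = 900, depth: int = 5) -> None:
    """Write `n_files` empty placeholder files into nested directories."""
    with zipfile.ZipFile(target, "w") as zf:
        for i in range(n_files):
            # Vary the nesting between 1 and `depth` levels.
            levels = i % depth + 1
            directory = "/".join(f"d{i % 7}_{level}" for level in range(levels))
            # Empty payloads are enough because find only looks at names.
            zf.writestr(f"{directory}/file_{i}.parquet", b"")


# Build into memory here; pass a file name such as "example.zip" to write to disk.
buffer = io.BytesIO()
build_example_zip(buffer)
with zipfile.ZipFile(buffer) as zf:
    file_count = len(zf.namelist())
```

Calling `build_example_zip("example.zip")` produces a file that can be opened with the timing snippet above.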

Let me know what you think.

johandahlberg · 2024-08-21T13:38:39Z

Thank you for the review @martindurant! I refactored the code quite a bit, to make use of the dir_cache instead of directly accessing the zip file. I think it made the code easier to follow.

Regarding removing maxdepth: that would be a change in behavior compared to the find method on AbstractFileSystem. Since my initial strategy here was to implement something that had parity in the output, I kept it in. If you still want me to remove it, let me know and I'll do it (or feel free to push any changes you want to this branch if you prefer that).

I have now implemented it as you suggested, just filtering for it at the end if it is set.
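As a rough illustration of that approach (collect every match from a cached listing first, then filter by maxdepth at the end), here is a minimal sketch. It is not the merged implementation: `find_in_cache` is a hypothetical helper, the cache layout (name mapped to a dict with a `"type"` key) is an assumption, and so is the depth convention, where `maxdepth=1` means files directly inside the search root.

```python
from typing import Optional


def find_in_cache(dir_cache: dict, path: str = "", maxdepth: Optional[int] = None) -> list:
    """Return sorted file names under `path`, optionally limited by depth."""
    if maxdepth is not None and maxdepth < 1:
        raise ValueError("maxdepth must be at least 1")

    # Normalise so that "dir/" and "dir" give the same result.
    root = path.strip("/")
    matches = [
        name
        for name, info in dir_cache.items()
        if info["type"] == "file"
        and (not root or name == root or name.startswith(root + "/"))
    ]

    # Depth filtering happens last, only when maxdepth is set.
    if maxdepth is not None:
        def depth(name: str) -> int:
            relative = name[len(root):].strip("/") if root else name
            return relative.count("/") + 1

        matches = [name for name in matches if depth(name) <= maxdepth]
    return sorted(matches)


# Hypothetical cache contents for the example.
cache = {
    "a": {"type": "directory"},
    "a/b": {"type": "directory"},
    "a/b/one.parquet": {"type": "file"},
    "a/two.parquet": {"type": "file"},
    "c/three.parquet": {"type": "file"},
}

with_slash = find_in_cache(cache, "a/")
without_slash = find_in_cache(cache, "a")
shallow = find_in_cache(cache, "a", maxdepth=1)
```

Filtering at the end keeps the main loop a plain prefix match, at the cost of collecting entries that are later discarded.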

martindurant · 2024-08-21T15:10:27Z

I wonder, should this code go up into AbstractArchiveFileSystem, since it may have a similar performance benefit for libarchive or tar filesystems? I'm not sure if find is slow there or if there's anything else to worry about.

johandahlberg · 2024-08-22T12:28:30Z

@martindurant Let me know if you are happy with this or if there is anything else you'd like to see. When it comes to evaluating this for the libarchive/tar file systems, I unfortunately don't think that I will have the bandwidth to take that on.

martindurant · 2024-08-22T19:59:19Z

> I unfortunately don't think that I will have the bandwidth to take that on.

Understood - maybe someone else gets the itch if indeed speed is a problem for the other ones.

johandahlberg · 2024-08-23T07:53:22Z

@martindurant thank you for a very enjoyable review process, and thanks for bringing this in. It will help our use-case immensely.

johandahlberg added 4 commits August 16, 2024 10:56

Adding tests find on zip archives

9f74f67

Improved find method on ZipFileSystem

39d5a84

Skip checking external_attr in tests

c274185

Skip checking for create_system

25f306a

martindurant reviewed Aug 20, 2024

johandahlberg and others added 5 commits August 21, 2024 07:54

Simplifying code

6629cc4

Suggestions from @martindurant in code review. Co-authored-by: Martin Durant <[email protected]>

Refactor find to use dir_cache

e228176

Make sure find("dir/") == find("dir")

3e2b3fe

Move results to make code a bit easier to follow

92f3aa4

Filter by maxdepth at the end

92132a1

martindurant reviewed Aug 21, 2024

Replace check with all

09f5b47

martindurant merged commit 7793ab8 into fsspec:master Aug 22, 2024
11 checks passed
