Writing an archive can be inefficient because of seeks #367
Comments
Update: It is sometimes possible for seeks near the current position to be performed within the buffer, without flushing the buffer. I made such a change in google/riegeli@1d03ff7. But this does not help when the entry crosses a buffer boundary. Also, POSIX
Comparing the dirent we wrote before writing the file data and only seeking and re-writing it if it changed sounds like a good optimisation. We'll implement it when we get to it, or you could give it a try and submit a pull request. Seeking within the file buffer is beyond what we are willing to do.
I do not expect libzip to seek within the buffer, since buffering is applied by the implementation of the source, i.e. in general by the client of libzip. I mentioned this as a partial mitigation which can be performed by the source. This mitigation is useful while libzip does seeks, and also for cases where the seek cannot be eliminated because the initial dirent does not contain all information (e.g. crc). This mitigation is partial because it is infeasible if the entry crosses buffer boundaries. The optimization should tentatively trust the client-supplied CRC for the initial dirent. Even if you prefer to recompute the CRC for uncompressed sources (per #359), the dirent can be rewritten if the CRC turns out not to match.
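To illustrate the partial mitigation described above, here is a minimal sketch of a buffered sink that serves a seek from its own buffer when the target lies inside the buffered range, and flushes only otherwise. All names (`sink_t`, `sink_write`, `sink_seek`, `BUF_CAP`) are hypothetical; this is neither libzip nor riegeli code, and the backing file is simulated with a small array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_CAP 16  /* tiny on purpose, so boundary crossings are easy to see */

/* Hypothetical buffered sink; the array `file` stands in for a slow file. */
typedef struct {
    unsigned char file[256];
    size_t file_len;             /* bytes that reached the backing file */
    unsigned char buf[BUF_CAP];
    size_t buf_start;            /* file offset of buf[0] */
    size_t buf_len;              /* valid bytes held in buf */
    size_t pos;                  /* current logical write position */
    int flushes;                 /* count of expensive flushes */
} sink_t;

/* Push buffered bytes to the backing file (the costly operation). */
static void sink_flush(sink_t *s) {
    if (s->buf_len == 0) return;
    size_t end = s->buf_start + s->buf_len;
    assert(end <= sizeof s->file);
    memcpy(s->file + s->buf_start, s->buf, s->buf_len);
    if (end > s->file_len) s->file_len = end;
    s->buf_len = 0;
    s->flushes++;
}

static void sink_write(sink_t *s, const void *data, size_t n) {
    const unsigned char *p = data;
    while (n--) {
        if (s->pos - s->buf_start == BUF_CAP) {  /* buffer full */
            sink_flush(s);
            s->buf_start = s->pos;
        }
        s->buf[s->pos - s->buf_start] = *p++;
        s->pos++;
        if (s->pos - s->buf_start > s->buf_len)
            s->buf_len = s->pos - s->buf_start;
    }
}

static void sink_seek(sink_t *s, size_t target) {
    if (target >= s->buf_start && target <= s->buf_start + s->buf_len) {
        s->pos = target;         /* inside the buffer: no flush needed */
        return;
    }
    sink_flush(s);               /* outside the buffer: flush, then move */
    s->buf_start = target;
    s->pos = target;
}
```

With a zero-initialized `sink_t`, writing a 4-byte header and 6 bytes of data, then seeking back to rewrite the header, stays entirely inside the buffer. An entry that straddles offset 16 forces a flush on the seek back, which is the limitation noted above.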
Yes, that's the plan.
Alternatively, one can write a 'streamed' ZIP, then seeking wouldn't be necessary. See #378.
Description
A new entry in an archive is written by:
This can be quite inefficient for an archive consisting of many small files if the archive’s zip_source_t flushes its buffers for seeking and has high latency of flushing the buffer, e.g. when working with a remote file abstraction.
We should be able to do better if ZIP_SOURCE_STAT provides enough information to write the correct directory entry immediately.
Even if libzip does not trust the crc provided by the entry’s zip_source_t, it can write the preliminary entry with the provided crc, and later check if the directory entry needs to be updated or not.
Solution
Avoid write seeks when possible (seeking to the current position is OK), i.e. if ZIP_SOURCE_STAT provides enough information to write the correct directory entry before writing file contents.
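The proposal can be sketched roughly as follows: write a preliminary header from the values the source claims, compute the real values while the data is written, and seek back only if they differ. This is a simplified illustration, not libzip's implementation. `entry_header_t`, `archive_t`, `archive_write`, `archive_seek` and `add_entry` are hypothetical names, and the two-field header is a stand-in for a real local header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified header: just CRC and size. */
typedef struct {
    uint32_t crc;
    uint32_t size;
} entry_header_t;

/* Hypothetical in-memory archive that counts real seeks. */
typedef struct {
    unsigned char data[512];
    size_t pos;   /* current write offset */
    int seeks;    /* seeks to a position other than the current one */
} archive_t;

/* Standard CRC-32 (reflected polynomial 0xEDB88320), bitwise variant. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *p, size_t n) {
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static void archive_write(archive_t *a, const void *p, size_t n) {
    assert(a->pos + n <= sizeof a->data);
    memcpy(a->data + a->pos, p, n);
    a->pos += n;
}

static void archive_seek(archive_t *a, size_t off) {
    if (off == a->pos) return;  /* seeking to the current position is free */
    a->seeks++;
    a->pos = off;
}

/* Returns 1 if the header had to be rewritten, 0 if no seek was needed. */
static int add_entry(archive_t *a, const entry_header_t *claimed,
                     const unsigned char *data, size_t n) {
    size_t header_off = a->pos;
    archive_write(a, claimed, sizeof *claimed);  /* preliminary header */
    archive_write(a, data, n);

    entry_header_t actual = { crc32_update(0, data, n), (uint32_t)n };
    if (actual.crc == claimed->crc && actual.size == claimed->size)
        return 0;                                /* claim was right */

    size_t data_end = a->pos;
    archive_seek(a, header_off);                 /* claim was wrong: fix it */
    archive_write(a, &actual, sizeof actual);
    archive_seek(a, data_end);
    return 1;
}
```

When the claimed values are correct, `add_entry` performs no seek at all. A wrong claim costs the same two seeks as unconditional rewriting, so trusting the source tentatively never makes things worse.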