Implement Stream Decompression for tar #183

Merged
tempusfrangit merged 3 commits into main from streaming-decompress-tar on Apr 27, 2024

Member

@tempusfrangit tempusfrangit commented Mar 12, 2024

Implement stream decompression for tar files. This means that -x becomes smart enough to handle the lzw, bzip2, gzip, and xz formats (the common tar compression formats) automatically. This removes a sharp edge in pget and handles compressed tar files elegantly.

It is recommended to use uncompressed TAR as a weights source as decompression is significantly slower for very minimal space savings.

This code is intended to make pget slightly more forgiving but does not improve performance.

Closes: #1

@tempusfrangit tempusfrangit marked this pull request as draft March 12, 2024 23:16
@tempusfrangit tempusfrangit force-pushed the streaming-decompress-tar branch 2 times, most recently from b38cb8e to 8f69551 Compare March 13, 2024 21:21
@tempusfrangit tempusfrangit marked this pull request as ready for review March 13, 2024 21:22
@tempusfrangit tempusfrangit self-assigned this Mar 13, 2024
@tempusfrangit tempusfrangit requested a review from a team March 13, 2024 22:55
@tempusfrangit tempusfrangit force-pushed the streaming-decompress-tar branch 2 times, most recently from cda1292 to 9c938c1 Compare March 20, 2024 22:54
@tempusfrangit
Member Author

We should probably do something where we log a warning about compressed streams being a poor experience.

Implement stream decompression for tar files. This means that -x becomes
smart enough to handle the lzw, bzip2, gzip, and xz formats (the common
tar compression formats) automatically. This removes a sharp edge in
pget and handles compressed tar files elegantly.
Contributor

@philandstuff philandstuff left a comment

I think this could be simplified quite a bit but it's good

pkg/extract/compression.go (outdated, resolved)
pkg/extract/tar.go (outdated, resolved)
Instead of implementing the peek reader, wrap everything in bufio.Reader
and lean on its Peek() capabilities. This also allows for a simpler
bytes.HasPrefix check for the magic numbers instead of needing to deal
with the endianness of the magic bytes. Critically, this eliminates
the padding for the 48-bit magic bytes header for some compression
types.
Global PAX headers are meta-headers that apply to subsequent files.
In most cases these values can be ignored entirely, as the underlying
archive/tar handles merging things. However, the global header state is
not persisted across headers (as per the spec); this is largely
"OK" as most values within the global header are not commonly used or
relevant (e.g. size is unlikely to be relevant, as that is file by file
and not global).

We can always add further PAX header support if needed, but the reality
is pget is highly optimized for its use case and doesn't go too far out
of its way for cases that aren't relevant.
@tempusfrangit tempusfrangit merged commit 14a5144 into main Apr 27, 2024
5 checks passed
@tempusfrangit tempusfrangit deleted the streaming-decompress-tar branch April 27, 2024 20:21
Successfully merging this pull request may close these issues.

Enhancement Request: GZIP support (tar mode)
3 participants