4mc codecs should implement SplittableCompressionCodec #24
Comments
Hello there. Is this a new interface coming with a new Hadoop version or something like that?
Nope, it's been around for a while. Take a look at BZip2Codec for an example of how it's intended to be used.
You say you would like to change the compression algo inside 4mc, but that's currently not supported.
Great to hear that zstd is working on a splittable compression format. I'll probably just wait for that. In the meantime, I'm not proposing to change the compression algo inside 4mc, just a refactor of the code to move where the splits are being adjusted. Currently the splits are being adjusted in the ... I'm working on a patch to implement this; it should be out soon.
OK, perfect. Let me know.
The implementation of `Codec` and `InputFormat` seems to follow the pattern from Elephantbird. However, this isn't a good pattern in my opinion. In the spirit of Hadoop, the concepts of compression and file format should be decoupled. We should be able to change compression formats without needing to change the way those files are read.
Currently, if we change the compression from e.g. gz to 4mc, we need to change the `InputFormat` that is used to read the files, and we wouldn't be able to change the compression again. To do this gracefully, we would need to code defensively and dynamically change the InputFormats based on what files are in the input location. I don't think this strategy would work if you have a directory containing files that have been compressed with different formats.
In order to support this type of flexibility, the 4mc codecs should implement the `SplittableCompressionCodec` interface. This gives existing input formats the ability to handle the new compression formats gracefully.
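To illustrate the decoupling being asked for, here is a minimal, self-contained sketch of the check an input format can make: look up the codec from the file extension, then split the file only if that codec is marked splittable. The interfaces and classes below are simplified stand-ins written for this example, not Hadoop's real types (the real `SplittableCompressionCodec` also declares a `createInputStream` overload that takes start and end offsets), and `FourMcLikeCodec` with its `.4mc` extension is an assumption about what the proposed change might look like.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of codec-driven split decisions; all types here are illustrative stand-ins. */
final class SplitabilitySketch {

    /** Stand-in for Hadoop's CompressionCodec (only the extension lookup is modelled). */
    interface CompressionCodec {
        String getDefaultExtension();
    }

    /** Stand-in for Hadoop's SplittableCompressionCodec, used here purely as a marker. */
    interface SplittableCompressionCodec extends CompressionCodec {}

    /** Gzip-like codec: not splittable, so a file must be read by a single task. */
    static final class GzipLikeCodec implements CompressionCodec {
        public String getDefaultExtension() { return ".gz"; }
    }

    /** BZip2-like codec: splittable. */
    static final class BZip2LikeCodec implements SplittableCompressionCodec {
        public String getDefaultExtension() { return ".bz2"; }
    }

    /** Hypothetical 4mc codec after the proposed change (extension assumed). */
    static final class FourMcLikeCodec implements SplittableCompressionCodec {
        public String getDefaultExtension() { return ".4mc"; }
    }

    private static final Map<String, CompressionCodec> CODECS = new LinkedHashMap<>();
    static {
        CompressionCodec[] known = {
            new GzipLikeCodec(), new BZip2LikeCodec(), new FourMcLikeCodec()
        };
        for (CompressionCodec codec : known) {
            CODECS.put(codec.getDefaultExtension(), codec);
        }
    }

    /** Returns the codec registered for the file's extension, or null for uncompressed files. */
    static CompressionCodec codecFor(String fileName) {
        for (Map.Entry<String, CompressionCodec> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** The input format never names a specific codec; it only asks whether the codec is splittable. */
    static boolean isSplitable(String fileName) {
        CompressionCodec codec = codecFor(fileName);
        if (codec == null) {
            return true; // uncompressed files can always be split
        }
        return codec instanceof SplittableCompressionCodec;
    }
}
```

With this shape, a directory mixing `.gz`, `.bz2` and `.4mc` files can be read by one input format: each file is resolved to its own codec, and only the split decision differs per file.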