4mc codecs should implement SplittableCompressionCodec #24

Open
pradeepg26 opened this issue Apr 14, 2017 · 5 comments

Comments

@pradeepg26

The implementation of Codec and InputFormat seems to follow the pattern from Elephantbird. However, this isn't a good pattern in my opinion. In the spirit of Hadoop, the concepts of compression and file format should be decoupled. We should be able to change compression formats without needing to change the way those files are read.

Currently, if we change the compression from, e.g., gz to 4mc, we need to change the InputFormat that is used to read the files, and we wouldn't be able to change the compression again. To do this gracefully, we would need to code defensively and dynamically change the InputFormats based on what files are in the input location. I don't think this strategy would work for a directory containing files that have been compressed with different formats.

To support this type of flexibility, the 4mc codecs should implement the SplittableCompressionCodec interface. This gives existing input formats the ability to handle the new compression formats gracefully.
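To illustrate the decoupling being asked for: Hadoop's stock text input format resolves a codec from the file extension and treats the file as splittable when that codec is a SplittableCompressionCodec. The following is a minimal sketch of that decision only. The types `Codec`, `SplittableCodec`, `GzLikeCodec`, `FourMcLikeCodec`, and `SplitDecision` are hypothetical stand-ins so the example runs without Hadoop on the classpath; they are not the real Hadoop or 4mc classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in types for illustration only (assumption). The real interfaces
// live in org.apache.hadoop.io.compress and declare more methods.
interface Codec { String extension(); }

// Marker for codecs that can start reading at an arbitrary split.
interface SplittableCodec extends Codec { }

class GzLikeCodec implements Codec {
    public String extension() { return ".gz"; }
}

class FourMcLikeCodec implements SplittableCodec {
    public String extension() { return ".4mc"; }
}

class SplitDecision {
    private final Map<String, Codec> byExtension = new LinkedHashMap<>();

    void register(Codec codec) {
        byExtension.put(codec.extension(), codec);
    }

    // Resolve a codec purely from the file name; null means uncompressed.
    Codec codecFor(String fileName) {
        for (Map.Entry<String, Codec> e : byExtension.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // The input format never needs to know which codec it is dealing with,
    // only whether the codec supports splitting.
    boolean isSplittable(String fileName) {
        Codec codec = codecFor(fileName);
        return codec == null || codec instanceof SplittableCodec;
    }
}
```

With this shape, a directory mixing `.gz`, `.4mc`, and uncompressed files can be read by one input format, because the decision is made per file.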

@carlomedas
Collaborator

Hello there.

Is this a new interface coming with a new Hadoop version or something like that?

@pradeepg26
Author

Nope, it's been around for a while. Take a look at BZip2Codec for an example of how it's intended to be used.

@carlomedas
Collaborator

You say you would like to change the compression algo inside 4mc, but that's currently not supported.
As a matter of fact, to provide both lz4 and zstd I created both 4mc and 4mz, one dedicated to each of them.
The good news is that a splittable compression format is now being discussed in zstandard itself, so it should be available at the source very soon.

@pradeepg26
Author

Great to hear that zstd is working on a splittable compression format. I'll probably just wait for that.

In the meantime, I'm not proposing to change the compression algo inside 4mc, just a refactor of the code to move where the splits are being adjusted. Currently the splits are adjusted in the getSplits method of FourMcInputFormat and FourMzInputFormat. If we adjusted the split boundaries inside the SplitCompressionInputStream instead, we wouldn't need the specialized input formats.
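The boundary adjustment itself can be sketched independently of where it lives. The example below assumes the codec can obtain a sorted array of compressed block start offsets for a file; that array, the class name `BlockAlignedSplit`, and both method names are hypothetical and not taken from the 4mc code. It snaps a requested byte range to block boundaries, which is the kind of result a SplitCompressionInputStream reports through getAdjustedStart() and getAdjustedEnd().

```java
import java.util.Arrays;

class BlockAlignedSplit {

    // Returns the first block start at or after pos, or fileLength when
    // pos lies past the start of the last block.
    static long alignUp(long[] blockStarts, long fileLength, long pos) {
        int i = Arrays.binarySearch(blockStarts, pos);
        if (i < 0) {
            i = -i - 1; // insertion point: index of first block start > pos
        }
        return i < blockStarts.length ? blockStarts[i] : fileLength;
    }

    // Snaps [start, end) to block boundaries. A block is owned by the split
    // whose requested range contains the block's first byte, so contiguous
    // requested ranges yield adjusted ranges with no gaps or overlaps.
    static long[] adjust(long[] blockStarts, long fileLength,
                         long start, long end) {
        return new long[] {
            alignUp(blockStarts, fileLength, start),
            alignUp(blockStarts, fileLength, end)
        };
    }
}
```

A requested range that contains no block start collapses to an empty range (adjusted start equals adjusted end), so a reader built on it would simply produce no records for that split.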

I'm working on a patch to implement this, should be out soon.

@carlomedas
Collaborator

OK, perfect. Let me know.
