
hdfs: use 4mc instead of bzip2/gzip #52

Open
robmaz opened this issue Feb 15, 2018 · 3 comments

Comments

@robmaz (Owner) commented Feb 15, 2018

Compression is one of the bigger bottlenecks of the pipeline right now. 4mc is nearly 20x faster than bzip2 or bgzip and may offer a reasonable trade-off between compression delay and transfer speed.

@magicDGS (Collaborator) commented:

If we are going to use ReadTools, there may not be such an improvement. As we discussed in the ReadTools repository, on-the-fly compression might not be a bottleneck, and it should be profiled properly (there are Java tools for that, such as https://www.ej-technologies.com/products/jprofiler/overview.html, which can help locate the slow hotspots).

Another option is to profile some uploads/downloads using ReadTools with different compression settings (several runs, taking the average, maximum and minimum). I am still not sure that compression is the major bottleneck: before on-the-fly upload existed, the pipeline took even longer, because it compressed locally (adding I/O overhead and disk usage on the local filesystem) and then uploaded with hdfs (network bottleneck plus I/O in HDFS). The improvement was huge, but it might be that compression is the limiting factor now (there is going to be a limit to the improvement at some point).
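The repeated-runs idea above could be sketched roughly like this (the `readtools-upload` command and its flags are placeholders, not real ReadTools entry points; the actual invocation would be substituted in):

```shell
# Hypothetical benchmark sketch: time N repeated uploads with one codec
# and report min/avg/max wall-clock seconds. The commented-out command
# is a placeholder for the real ReadTools invocation.
runs=3
times=""
for i in $(seq "$runs"); do
  start=$(date +%s)
  # readtools-upload --codec "$1" input.fastq hdfs:///dest/   # placeholder
  end=$(date +%s)
  times="$times $((end - start))"
done
# Sort the samples numerically, then let awk pick min/max and the mean.
result=$(printf '%s\n' $times | sort -n |
  awk '{a[++n]=$1; s+=$1} END {printf "min=%d avg=%.1f max=%d", a[1], s/n, a[n]}')
echo "$result"
```

Running the same loop once per codec (4mc, bzip2, gzip) would give comparable min/avg/max figures instead of a single anecdotal timing.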

If people are complaining about speed, they should have been at the institute 3 years ago! That's one of the reasons ReadTools has Distmap support! Hahaha

@magicDGS (Collaborator) commented:

I added the ReadTools label, because it is kind of related (unless you remove its dependency on upload/download).

@magicDGS magicDGS added this to the Long term changes milestone Feb 27, 2018
@magicDGS (Collaborator) commented:

ReadTools already supports Hadoop compression plugins on the classpath, so this should be ready to test.
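For reference, wiring in the 4mc codec would presumably look something like the following (the jar path/version and the `com.hadoop.compression.fourmc.FourMcCodec` class name are taken from the hadoop-4mc project and should be double-checked against the version actually deployed):

```shell
# Put the hadoop-4mc jar on the Hadoop classpath (path and version are
# illustrative, not the ones installed on the cluster).
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/lib/hadoop-4mc-2.0.0.jar"

# Then register the codec in core-site.xml (or pass it with -D), by
# appending it to the comma-separated io.compression.codecs list:
#   io.compression.codecs=...,com.hadoop.compression.fourmc.FourMcCodec
```

With the codec on the classpath and registered, any tool that resolves codecs by file extension (e.g. `.4mc`) should pick it up without code changes.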
