-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel bgzip cram #13
base: master
Are you sure you want to change the base?
Conversation
I did test with the test data included in Eagle tables and example folders. I got the same output from Minimac4 |
@Santy-8128 @jonathonl Can you take a look at this pull request, and commit if passes muster. |
This will likely conflict with sav-support-2 branch, which is being tested by Imputation Server team. Once sav-support-2 branch is merged into master, I can potentially review this for inclusion. |
@daheise What is the value add here? We don't anticipate a great gain. Much is dependent on |
We're running imputation on data from 400K+ individuals. Minimac does an extremely long running single threaded compressed write to disk. These changes reduce the run time of our longest running chunk from 111 hours to 70 hours. For reference, this is also related to resolving issue #11. I should have mentioned that in the PR. |
We are working on a different solution for the problem you are experiencing and this PR is in conflict with that vision. If you are satisfied with what you've implemented, then I recommend you continue using your fork. You could also try the sav-support-2 branch and add --sav to your command. This reduces run time by using the SAV storage format for temp and output files. The other alternative is to spread the samples you are imputing across multiple minimac4 processes. |
I will see if I can accomplish a test run on our data using SAV, can you provide additional information regarding that format? Do you provide a conversion to m3vcf when using that format? We have tried spreading the samples across many processes -- the 111 hour number is from one chunk out of 154. |
SAV is only for output from minimac4, so you would continue using m3vcf for the reference panel input. You can get the sav CLI from https://github.com/statgen/savvy/releases, which allows you to export the imputed output to vcf. |
Is SAV already the output format of Minimac4 on master, or are you saying that will be the only output file once the I did pull
|
Is it correct that the |
The master branch only outputs vcf files. The sav-support-2 branch is a first step towards better file output, which adds support for sav. Eventually, the savvy library will be used to output sav, vcf and bcf files. But for now there is a hybrid of savvy and libStatGen.
The --sav option produces the equivalent of HDS. Storing the other 3 format fields is redundant because they can all be calculated from HDS. The sav cli use the |
eb4f3b1
to
3c97961
Compare
This pull request may belong on a branch other than
master
since it depends on a PR being accepted in another project, namely this one.If
libStatGen
accepts the above PR, then this PR modifies Minimac to use the parallel bgzip functionality, and adds additional parallelism in the final append.I do not have a way to perform regression tests for correctness tests at this time. Let me know if there are any changes to the PR needed to aid acceptance.
Known Issue
The cget requirements file will need to be tweaked to point to the correct
libStatGen
prior to acceptance, if libStatGen accepts the dependent PR.