Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel bgzip cram #13

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

Conversation

daheise
Copy link

@daheise daheise commented Oct 11, 2018

This pull request may belong on a branch other than master since it depends on a PR being accepted in another project, namely this one.

If libStatGen accepts the above PR, then this PR modifies Minimac to use the parallel bgzip functionality, and adds additional parallelism in the final append.

I do not have a way to perform regression tests for correctness tests at this time. Let me know if there are any changes to the PR needed to aid acceptance.

Known Issue
The cget requirements file will need to be tweaked to point to the correct libStatGen prior to acceptance, if libStatGen accepts the dependent PR.

@daheise
Copy link
Author

daheise commented Oct 11, 2018

I did test with the test data included in Eagle tables and example folders. I got the same output from Minimac4 master and this PR.

@avsmith
Copy link

avsmith commented Mar 21, 2019

@Santy-8128 @jonathonl Can you take a look at this pull request, and commit if passes muster.

@jonathonl
Copy link
Contributor

This will likely conflict with sav-support-2 branch, which is being tested by Imputation Server team. Once sav-support-2 branch is merged into master, I can potentially review this for inclusion.

@avsmith
Copy link

avsmith commented Mar 21, 2019

@daheise What is the value add here? We don't anticipate a great gain. Much is dependent on libStatGen and we are concerned with this making breaking changes in other tools.

@daheise
Copy link
Author

daheise commented Mar 21, 2019

We're running imputation on data from 400K+ individuals. Minimac does an extremely long running single threaded compressed write to disk. These changes reduce the run time of our longest running chunk from 111 hours to 70 hours.

For reference, this is also related to resolving issue #11. I should have mentioned that in the PR.

@jonathonl
Copy link
Contributor

We are working on a different solution for the problem you are experiencing and this PR is in conflict with that vision. If you are satisfied with what you've implemented, then I recommend you continue using your fork. You could also try the sav-support-2 branch and add --sav to your command. This reduces run time by using the SAV storage format for temp and output files. The other alternative is to spread the samples you are imputing across multiple minimac4 processes.

@daheise
Copy link
Author

daheise commented Mar 21, 2019

I will see if I can accomplish a test run on our data using SAV, can you provide additional information regarding that format? Do you provide a conversion to m3vcf when using that format?

We have tried spreading the samples across many processes -- the 111 hour number is from one chunk out of 154.

@jonathonl
Copy link
Contributor

SAV is only for output from minimac4, so you would continue using m3vcf for the reference panel input. You can get the sav CLI from https://github.com/statgen/savvy/releases, which allows you to export the imputed output to vcf.

@daheise
Copy link
Author

daheise commented Mar 26, 2019

Is SAV already the output format of Minimac4 on master, or are you saying that will be the only output file once the sav-support-2 branch in merged?

I did pullsav-support-2 into my parallel-bgzip-cram, and got one significant merge conflict:

++<<<<<<< HEAD
 +    IFILE vcfdosepartial = ifopen(PartialVcfFileName.c_str(), bgzf_mode, InputFile::BGZF);
 +    IFILE vcfLoodosepartial = NULL;
++=======
+     IFILE vcfdosepartial = nullptr;
+     IFILE vcfLoodosepartial = nullptr;
+     std::unique_ptr<savvy::sav::writer> temp_sav_file = nullptr;
+ 
+ 
+     savvy::site_info sav_variant_anno;
+     if (MyAllVariables->myOutFormat.savOutput)
+     {
+         std::vector<std::pair<std::string,std::string>> empty_vec;
+         savDoseBuf.resize(2 * BuffernumSamples);
+         temp_sav_file = savvy::detail::make_unique<savvy::sav::writer>(PartialVcfFileName,
+           individualName.begin() + ((NovcfParts - 1) * BuffernumSamples), individualName.begin() + std::min(individualName.size(), std::size_t(NovcfParts * 
+           empty_vec.begin(), empty_vec.end(),
+           savvy::fmt::hds);
+     }
+     else
+     {
+         vcfdosepartial = ifopen(PartialVcfFileName.c_str(), "wb", InputFile::BGZF);
+     }
+ 
++>>>>>>> 38f4f1695598fd262f5a8b1585adce8a70fbbcf8

@daheise
Copy link
Author

daheise commented Mar 28, 2019

Is it correct that the --sav option produces the equivalent of --format GT? In our use case we're using --format GT,DS,HDS,GP. Will SAV support additional format options?

@jonathonl
Copy link
Contributor

Is SAV already the output format of Minimac4 on master, or are you saying that will be the only output file once the sav-support-2 branch in merged?

The master branch only outputs vcf files. The sav-support-2 branch is a first step towards better file output, which adds support for sav. Eventually, the savvy library will be used to output sav, vcf and bcf files. But for now there is a hybrid of savvy and libStatGen.

Is it correct that the --sav option produces the equivalent of --format GT? In our use case we're using --format GT,DS,HDS,GP. Will SAV support additional format options?

The --sav option produces the equivalent of HDS. Storing the other 3 format fields is redundant because they can all be calculated from HDS. The sav cli use the -d, --data-format option to specify how to view the data (e.g., sav export -d HDS file.sav, sav export -d DS file.sav, etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants