Non-adjacent blocks from --compress-reference #69
Comments
Minimac v4.0.x expected the first variant in a block to be a duplicate of the last variant in the previous block. Minimac v4.1.x does not have this expectation and will filter out the duplicates if they are encountered (https://github.com/statgen/Minimac4/blob/master/src/unique_haplotype.cpp#L522-L524). What command did you run to generate the b38 M3VCF? I suspect that, if you look at the non-block records in the M3VCF file, you will see ERR and RECOM INFO fields. These fields will not exist in the "--compress-reference" version. They are parameter estimates that improve the accuracy of imputation when using 1KG as a reference panel. Otherwise, can you elaborate on the discordance you are seeing? Small differences between v4.0.x and v4.1.x are expected.
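To make the block-boundary behaviour described above concrete, here is a minimal Python sketch. It is not the logic from unique_haplotype.cpp; the function name and the (pos, ref, alt) tuple layout are assumptions made purely for illustration. It drops the first variant of a block when it repeats the last variant kept from the previous block, and leaves blocks without such a duplicate unchanged.

```python
def drop_boundary_duplicates(blocks):
    """Remove a block's first variant if it repeats the previous block's last one.

    `blocks` is a list of blocks; each block is a list of (pos, ref, alt)
    tuples. This layout is a hypothetical stand-in for real M3VCF/MSAV records.
    """
    cleaned = []
    last_variant = None  # last variant kept from the preceding non-empty block
    for block in blocks:
        variants = list(block)
        if last_variant is not None and variants and variants[0] == last_variant:
            variants = variants[1:]  # v4.0.x-style overlap: skip the repeat
        if variants:
            last_variant = variants[-1]
        cleaned.append(variants)
    return cleaned


# v4.0.x-style input: the second block starts with the first block's last variant.
overlapping = [
    [(10416, "CCCTAA", "C"), (16103, "T", "G")],
    [(16103, "T", "G"), (17000, "A", "T")],  # position 17000 is made up
]
print(drop_boundary_duplicates(overlapping))
```

Blocks that do not share a boundary variant pass through untouched, so the same routine would accept either layout.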
Thanks for the quick reply, Jonathan. Glad to hear this format is not unexpected for MSAV. I created the M3VCF using this command:
Minimac3 --processReference --refHaps ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.vcf.gz --prefix ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons
And I do see the parameter estimates in the M3VCF-sourced MSAV:
$ zgrep -v '^#' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAV-from-M3VCF.vcf.gz | cut -f 1-9 | head -n 3
1 10416 1:10416 CCCTAA <BLOCK> . . END=62157;VARIANTS=27;REPS=178 UHM
1 10416 1:10416 CCCTAA C . . AC=240;AN=5096;ERR=0.0054688;RECOM=0.00050835;UHA=0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1 16103 1:16103 T G . . AC=118;AN=5096;ERR=0.0071291;RECOM=0.00050835;UHA=0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Is there a reason these are not calculated/included in the MSAV created by
The first instance of discordance I found was with a TYPED variant where one of the assayed genotypes was overwritten by 4.1.6 using the
I'll plan to use the M3VCF with parameter estimates for our imputation jobs.
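For anyone who wants to run the same check on their own export, the sketch below tests whether non-block records carry both ERR and RECOM in the INFO column. It assumes a tab-separated VCF text export with INFO in the eighth column and block records marked by an ALT of <BLOCK>, as in the output above; the helper names are hypothetical, and the sample line shortens the UHA list for brevity.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_block_record(cols):
    """Block records in the exported VCF use <BLOCK> as the ALT allele."""
    return cols[4] == "<BLOCK>"


def has_parameter_estimates(vcf_line):
    """True if a non-block record carries both ERR and RECOM INFO fields."""
    cols = vcf_line.rstrip("\n").split("\t")
    if is_block_record(cols):
        return False
    info = parse_info(cols[7])
    return "ERR" in info and "RECOM" in info


# Sample record based on the output above, with UHA truncated for brevity.
sample = "\t".join([
    "1", "16103", "1:16103", "T", "G", ".", ".",
    "AC=118;AN=5096;ERR=0.0071291;RECOM=0.00050835;UHA=0,0,1",
])
print(has_parameter_estimates(sample))
```

Feeding it the lines from the zgrep command above (without the cut step dropping columns it needs) would show which records include the estimates.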
Minimac4 was designed for large reference panels (>100,000 samples), and the parameter estimation is less beneficial at this scale and not tractable. By the way, you can look at the ER2 INFO field for the TYPED sites to see how correlated the imputation dosages are with the assayed genotypes.
I'm attempting to replicate the Michigan Imputation Server locally. Since we will be focusing on a few loci in small sample sizes, it seems best to just run Eagle and Minimac directly instead of through Cloudgene. I downloaded the latest Minimac4 release (4.1.6) but wanted to sanity check that it produced the same results as the version being reported by the Imputation Server (4-1.0.2).
We're using GRCh38 and I could not find any existing M3VCF/MSAV files, so I created them from the 1000 Genomes release, using Minimac3 to create the M3VCF and Minimac4.1.6 to create the MSAV (--compress-reference). I found some discordance in the genotypes imputed by the two Minimac4 versions and, out of curiosity, created a new MSAV from the M3VCF via --update-m3vcf in Minimac4.1.6. Using this M3VCF-sourced MSAV, the imputation results between 4.1.6 and 4-1.0.2 are much more similar. I used savvy to export the 2 MSAV files to VCF and found the difference seems to be in the block structure. The M3VCF-sourced MSAV has adjacent blocks (the END INFO field is the POS value of the next block) but the MSAV created directly from the genotype VCF does not. And since the blocks are not adjacent, there is no common overlapping variant.
Perhaps the MSAV format does not require the same block adjacency as the M3VCF format does? But the differing imputation results seem to indicate this differing format has an effect. Is this expected?
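The adjacency comparison described above can be sketched in Python. This is only an illustration under stated assumptions: the input is the tab-separated VCF text exported from the MSAV, block records have <BLOCK> as ALT and an END key in INFO, and "adjacent" means a block's END equals the next block's POS. The function names are hypothetical.

```python
def block_spans(vcf_lines):
    """Collect (POS, END) for every <BLOCK> record in exported VCF text."""
    spans = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[4] != "<BLOCK>":
            continue  # only block records carry the END of the block
        # partition("=")[::2] yields (key, value) for each INFO item
        info = dict(item.partition("=")[::2] for item in cols[7].split(";"))
        spans.append((int(cols[1]), int(info["END"])))
    return spans


def non_adjacent_pairs(spans):
    """Return consecutive block pairs where END of one != POS of the next."""
    return [(a, b) for a, b in zip(spans, spans[1:]) if a[1] != b[0]]
```

An empty result from non_adjacent_pairs would correspond to the M3VCF-sourced layout, where every block ends on the first position of the next one; any returned pair marks a boundary with no shared variant.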