Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-adjacent blocks from --compress-reference #69

Open
andrew-slater opened this issue Apr 8, 2024 · 3 comments
Open

Non-adjacent blocks from --compress-reference #69

andrew-slater opened this issue Apr 8, 2024 · 3 comments

Comments

@andrew-slater
Copy link

I'm attempting to replicate the Michigan Imputation Server locally. Since we will be focusing on a few loci in small sample sizes, seems best to just run Eagle and Minimac directly instead of through Cloudgene. I downloaded the latest Minimac4 release (4.1.6) but wanted to sanity check that it produced the same results as the version being reported by the Imputation Server (4-1.0.2).

We're using GRCh38 and I could not find any existing M3VCF/MSAV files so created from the 1000 Genomes release using Minimac3 to create the M3VCF and Minimac4.1.6 to create the MSAV (--compress-reference). I found some discordance in the genotypes imputed from the two Minimac4 versions and, out of curiosity, created a new MSAV from the M3VCF via --update-m3vcf in Minimac4.1.6. Using this M3VCF-sourced MSAV, the imputation results between 4.1.6 and 4-1.0.2 are much more similar. I used savvy to export the 2 MSAV files to VCF and found the difference seems to be in the block structure:

$ zgrep $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAV-from-M3VCF.vcf.gz | cut -f 1-10 | head
1       10416   1:10416 CCCTAA  <BLOCK> .       .       END=62157;VARIANTS=27;REPS=178  UHM     106
1       62157   1:62157 G       <BLOCK> .       .       END=80454;VARIANTS=34;REPS=200  UHM     12
1       80454   1:80454 G       <BLOCK> .       .       END=84139;VARIANTS=25;REPS=180  UHM     0
1       84139   1:84139 A       <BLOCK> .       .       END=89744;VARIANTS=44;REPS=165  UHM     0
1       89744   1:89744 A       <BLOCK> .       .       END=494542;VARIANTS=22;REPS=138 UHM     22
1       494542  1:494542        G       <BLOCK> .       .       END=600536;VARIANTS=26;REPS=125 UHM     0
1       600536  1:600536        A       <BLOCK> .       .       END=638772;VARIANTS=37;REPS=145 UHM     0
1       638772  1:638772        A       <BLOCK> .       .       END=771717;VARIANTS=17;REPS=92  UHM     0
1       771717  1:771717        C       <BLOCK> .       .       END=785674;VARIANTS=58;REPS=103 UHM     11
1       785674  1:785674        T       <BLOCK> .       .       END=790100;VARIANTS=51;REPS=122 UHM     6
$ zgrep $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAVvia--compress-reference.vcf.gz | cut -f 1-10 | head
1       10416   .       CCCTAA  <BLOCK> .       .       END=66264;VARIANTS=30;REPS=206  UHM     0,5
1       66269   .       A       <BLOCK> .       .       END=80454;VARIANTS=30;REPS=183  UHM     0,4
1       82163   .       G       <BLOCK> .       .       END=86028;VARIANTS=30;REPS=274  UHM     0,131
1       86065   .       G       <BLOCK> .       .       END=101550;VARIANTS=50;REPS=204 UHM     0,0
1       120983  .       C       <BLOCK> .       .       END=597799;VARIANTS=30;REPS=239 UHM     0,1
1       597818  .       C       <BLOCK> .       .       END=666203;VARIANTS=40;REPS=203 UHM     0,0
1       668135  .       C       <BLOCK> .       .       END=785203;VARIANTS=70;REPS=243 UHM     0,6
1       785417  .       G       <BLOCK> .       .       END=790082;VARIANTS=50;REPS=124 UHM     0,109
1       790099  .       A       <BLOCK> .       .       END=798782;VARIANTS=50;REPS=115 UHM     0,23
1       798941  .       G       <BLOCK> .       .       END=809630;VARIANTS=50;REPS=118 UHM     0,5

The M3VCF-sourced MSAV has adjacent blocks (the END INFO field is the POS value of the next block) but the MSAV created directly from the genotype VCF does not. And since the blocks are not adjacent, there is no common overlapping variant:

zgrep -C 1 $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAVvia--compress-reference.vcf.gz | cut -f 1-10 | tail -n +5 | head -n 11
1       66264   .       A       T       .       .       AC=29;AN=5096;UHA=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1
1       66269   .       A       <BLOCK> .       .       END=80454;VARIANTS=30;REPS=183  UHM     0,4
1       66269   .       A       T       .       .       AC=25;AN=5096;UHA=0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
--
1       80454   .       G       C       .       .       AC=6;AN=5096;UHA=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
1       82163   .       G       <BLOCK> .       .       END=86028;VARIANTS=30;REPS=274  UHM     0,131
1       82163   .       G       A       .       .       AC=170;AN=5096;UHA=0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
--
1       86028   .       T       C       .       .       AC=195;AN=5096;UHA=0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1
1       86065   .       G       <BLOCK> .       .       END=101550;VARIANTS=50;REPS=204 UHM     0,0
1       86065   .       G       C       .       .       AC=196;AN=5096;UHA=0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0

Perhaps the MSAV format does not require the same block adjacency as the M3VCF format does? But the differing imputation results seem to indicate this differing format has an effect. Is this expected?

@jonathonl
Copy link
Contributor

Minimac v4.0.x expected first variant in a block to be a duplicate of the last variant in the previous block. Minimac v4.1.x does not have this expectation and will filter out the duplicates if they are encountered (https://github.com/statgen/Minimac4/blob/master/src/unique_haplotype.cpp#L522-L524).

What command did you run to generate the b38 M3VCF? I suspect that, if you looked at the non-block records in the M3VCF file, you will see ERR and RECOM INFO fields. These fields will not exist in the "--compress-reference" version. These are parameter estimates that will improve the accuracy of imputation when using 1KG as a reference panel.
zgrep -v $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAV-from-M3VCF.vcf.gz | cut -f 1-9 | head

Otherwise, can you elaborate on the discordance you are seeing? There are expected to be small differences between v4.0.x and v4.1.x.

@andrew-slater
Copy link
Author

Thanks for the quick reply Jonathan. Glad to hear this format is not unexpected for MSAV. I created the M3VCF using this command:

Minimac3 --processReference --refHaps ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.vcf.gz --prefix ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons

And I do see the parameter estimates in the M3VCF-sourced MSAV:

$ zgrep -v '^#' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAV-from-M3VCF.vcf.gz  | cut -f 1-9 | head -n 3
1       10416   1:10416 CCCTAA  <BLOCK> .       .       END=62157;VARIANTS=27;REPS=178  UHM
1       10416   1:10416 CCCTAA  C       .       .       AC=240;AN=5096;ERR=0.0054688;RECOM=0.00050835;UHA=0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1       16103   1:16103 T       G       .       .       AC=118;AN=5096;ERR=0.0071291;RECOM=0.00050835;UHA=0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Is there a reason these are not calculated/included in the MSAV created by --compress-reference?

The first instance of discordance I found was with a TYPED variant where one of the assayed genotypes was overwritten by 4.1.6 using the --compress-reference MSAV but maintained by 4-1.0.2 using the M3VCF and 4.1.6 using the M3VCF-derived MSAV. Since the assayed data is from deep NGS and thus likely accurate (I haven't dug into the underlying read data to inspect further), seems this supports your hypothesis of the improved accuracy from the M3VCF parameter estimates. It is a rarish variant (AF 0.005) which perhaps benefits more from this accuracy improvement?

I'll plan to use the M3VCF with parameter estimates for our imputation jobs.

@jonathonl
Copy link
Contributor

Is there a reason these are not calculated/included in the MSAV created by --compress-reference?

Minimac4 was designed for large reference panels (>100,000 samples) and the parameter estimation is less beneficial at this scale and not tractable.

By the way, you can look at ER2 INFO field for the TYPED sites to see how correlated the imputation dosages are with the assayed genotypes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants