Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New minimac4 release v4.1.0 produces vcf.gz which can't be read by DosageConvertor #53

Open
sereeena opened this issue Oct 19, 2022 · 14 comments

Comments

@sereeena
Copy link

Previously, we were able to use minimac dosage vcf files with DosageConvertor, to convert output files to a plink dosage format. However, when using the new minimac4 version to produce outputs in vcf.gz format, DosageConvertor can no longer read these files properly and reports the following error : Tag DS does not exist for . (See full trace below)

Here is the example of how I call minimac4.1 and DosageConvertor now (using DosageConvertor built from the latest codebase):
minimac4
chr18.1000g.Phase3.v5.With.Parameter.Estimates.msav
23andme_chr18_eagle.vcf.gz
-o 23andme_chr18_minimac4.dose.vcf.gz
--output-format vcf.gz

DosageConvertor --vcfDos 23andme_chr18_minimac4.dose.vcf.gz
--prefix 23andme_chr18
--type plink
--format 1
--tag DS

Previously, I called minimac like this:
minimac4
--refHaps chr18.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz
--haps 23andme_chr18_eagle.vcf.gz
--prefix 23andme_chr18_minimac4

And here is the full log excerpt from the DosageConvertor part:

Version: 1.0.0;
 Built: Mon Oct 17 04:33:09 UTC 2022 by root

 Command Line Options: 
        Input Dosage : --vcfDose [23andme_chr18_minimac4.dose.vcf.gz],
                       --info [], --tag [DS]
   Output Parameters : --prefix [23andme_chr18], --format [1], --type [plink],
                       --nobgzip
    Other Parameters : --buffer [10000], --allDiploid, --trimNames,
                       --trimLength [100], --SexFile [], --idDelimiter [],
                       --help, --params
           PhoneHome : --noPhoneHome, --phoneHomeThinning [50]


 URL = http://genome.sph.umich.edu/wiki/DosageConvertor
 GIT = https://github.com/Santy-8128/DosageConvertor
 WARNING !!! Parameter --buffer will be ignored when be used with --type "plink" !

 ------------------------------------------------------------------------------
                             INPUT VCF DOSAGE FILE                             
 ------------------------------------------------------------------------------

 Detecting Dosage File Type ... 

 Format = VCF (Variant Call Format) 

 Calculating number of Samples in VCF File ...
 Calculating number Markers in VCF File ...


 Number of Samples read from VCF Dosage File              : 39
 Number of Markers read from VCF Dosage File              : 1333625



 Number of Imputed Variants                               : 1333624
 Number of Genotyped and Imputed Variants                 : 1
 Number of Genotyped Only Variants                        : 0

 Loading Dosage Data from VCF File       : 23andme_chr18_minimac4.dose.vcf.gz

 Reading and Importing Data from VCF File ...

    Writing to PLINK output file ...


 ------------------------------------------------------------------------------
                                    ERROR !!!                                  
 ------------------------------------------------------------------------------

 Tag DS does NOT exist for .
 Please verify the input file(s) ! 
 Type --help for usage details 
 Contact author [[email protected]] if you still need help ... 
 Program Aborting ... 

Is there a problem with the dosage file that the new minimac version creates? I also created an issue in the DosageConvertor codebase in case it is something which needs to be changed there. Thanks for your assistance.

@jonathonl
Copy link
Contributor

I'm surprised that DosageConverter doesn't support HDS fields (which is now the default format field for Minimac v4.1), but you can get around this by adding -f DS to your minimac4 command.

@sereeena
Copy link
Author

Great, adding this flag does let DosageConvertor read the file now!

I have another problem though, the dosage file that minimac4 produces now (with that flag -f DS) does not export the ID field, they are all set as '.' which causes problems for me later in the analysis. Is there a way to force that to be exported - it used to set the ID to chr:pos:alt:ref etc?

Thanks again

@jonathonl
Copy link
Contributor

Can you elaborate on the use case for your downstream analysis. This information is redundant with what is already in the VCF records. Note: Minimac v4.1 uses the ID column from the reference panel in the imputed results.

@sereeena
Copy link
Author

Yes of course.
Previously, the dosage file produced from minimac 1.0.3:

head -n 20 23andme_chr5_minimac4.dose.vcf 
##fileformat=VCFv4.1
##filedate=2022.10.21
##source=Minimac4.v1.0.3
...
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##minimac4_Command=minimac4 --cpus 2 --refHaps chr5.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz --haps 23andme_chr5_eagle.vcf.gz --minRatio 0.01 --prefix 23andme_chr5_minimac4
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	1_01A	2_01B	3_02A	4_02B	5_03A	6_03B	7_04A	8_04B	9_05A	10_05B	12_06B	13_07A	14_07B	15_08A	16_08B	17_09A	18_09B	19_10A	20_10B	21_11A	22_11B	23_12A	24_12B25_13A	26_13B	27_14A	28_14B	29_15A	30_15B	31_16A	32_16B	33_17A	34_17B	35_18A	36_18B	37_19A	38_19B	39_20A	40_20B
5	10043	5:10043:T:A	T	A	.	PASS	AF=0.00242;MAF=0.00242;R2=0.01670;IMPUTED	GT:DS	0|0:0.010	0|0:0.010	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.040	0|0:0.040	0|0:0.001	0|0:0.004	0|0:0.004	0|0:0.002	0|0:0.002	0|0:0.007	0|0:0.007	0|0:0.010	0|0:0.010	0|0:0.002	0|0:0.002	0|0:0.001	0|0:0.001	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.001	0|0:0.001	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.001	0|0:0.001	0|0:0	0|0:0	0|0:0.003	0|0:0.003
5	10055	5:10055:T:A	T	A	.	PASS	AF=0.00112;MAF=0.00112;R2=0.00050;IMPUTED	GT:DS	0|0:0.004	0|0:0.004	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.003	0|0:0.003	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.004	0|0:0.004	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.003	0|0:0.003	0|0:0.001	0|0:0.001	0|0:0.003	0|0:0.003	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.002	0|0:0.003	0|0:0.003	0|0:0.002	0|0:0.002	0|0:0.001	0|0:0.001	0|0:0	0|0:0	0|0:0.003	0|0:0.0

Now v4.1.0:

head -n 20 23andme_chr9_minimac4.dose.vcf
##fileformat=VCFv4.2
##filedate=20221020
##source=Minimac v4.1.0
...
##INFO=<ID=IMPUTED,Number=0,Type=Flag,Description="Marker was imputed">
##INFO=<ID=TYPED,Number=0,Type=Flag,Description="Marker was genotyped">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	1_01A	2_01B	3_02A	4_02B	5_03A	6_03B	7_04A	8_04B	9_05A	10_05B	12_06B	13_07A	14_07B	15_08A	16_08B	17_09A	18_09B	19_10A	20_10B	21_11A	22_11B23_12A	24_12B	25_13A	26_13B	27_14A	28_14B	29_15A	30_15B	31_16A	32_16B	33_17A	34_17B	35_18A	36_18B	37_19A	38_19B	39_20A	40_20B
9	10163	.	CT	G	.	.	AF=0;MAF=0;AVG_CS=1;R2=0;IMPUTED	DS	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	00	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
9	10273	.	AAC	G	.	.	AF=0;MAF=0;AVG_CS=1;R2=0;IMPUTED	DS	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	00	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
9	10327	.	G	G	.	.	AF=0;MAF=0;AVG_CS=1;R2=0;IMPUTED	DS	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	00	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Looking here you can see that minimac previously used to fill in the ID column in the vcf. This is needed for me because when I use DosageConvertor to convert the dosage file to plink format, it uses the ID column from the vcf to create the plink map file with the marker IDs, and so now all the IDs in the map file (2nd column below) are empty ('.') (DosageConvertor produces a .plink.dosage, .plink.fam and .plink.map file)

head 23andme_chr9.plink.map 
9	.	0	10163
9	.	0	10273
9	.	0	10327
9	.	0	10329
9	.	0	10362
9	.	0	10469

I am unable to use the plink files due to not having unique variant identifiers.
Hope that helps to clarify, and I appreciate your help, thank you.

@jonathonl
Copy link
Contributor

I forked DosageCoverter and made a fix for plink conversions. Please try https://github.com/jonathonl/DosageConvertor. If this works for you, I will create a pull request in the main repo.

@sereeena
Copy link
Author

Thanks for responding so quickly - I can confirm the changes you made to DosageConvertor work and I am able to create valid plink files, so creating a pull request ought to be useful to others as well.

I've been looking into the results that minimac v4.1 is generating compared to the previous version though (with the same input data), and I can't understand it but the new version is producing a dosage vcf where many snp alleles are inconsistent with the reference data. The majority of alleles in the dosage file are different, such that in later stages of my analysis when I filter snps which match the reference, hardly anything matches and my results are thus very different/incorrect.

I can't say for sure without spending a lot more time on this (which I don't have at the moment) but often one or both of ref and alt are flipped/complementary in the new minimac version. So at the moment I am going to have to stay with the previous minimac version unless you have any good ideas why this is happening/what I should look for?

Thanks again

@jonathonl
Copy link
Contributor

I'm very skeptical that Minimac4.1 is changing the reference allele. If this is happening, it is likely happening somewhere else in your pipeline. But if you can provide an example, I'll look into it.

@sereeena
Copy link
Author

Yes I agree it seems unlikely, but I've put together some files here for you to reproduce it, using the same input file and references. My pipeline runs on Centos7, so I've included 2 Dockerfiles which show how I build minimac v1.0.3 and v4.1.

It should be shared here: https://drive.google.com/file/d/1Zy01T1f4DzcaQVDjq4wan80rrXRPN4Y3/view?usp=sharing

Create a directory and extract the files, then build the docker images:

docker build -t test_103 -f Dockerfile_103 .
docker build -t test_4_1 -f Dockerfile_4_1 .

Run the docker container, bind mounting the current directory with the extracted files, and call minimac v1.0.3:

docker run --rm -it --mount type=bind,source=$(pwd),target=/home/root test_103 /bin/bash
cd /home/root
minimac4 --cpus 8 \
   --refHaps chr13.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz \
   --haps 23andme_chr13_eagle.vcf.gz \
   --minRatio 0.01 \
   --prefix 23andme_chr13_minimac1_0_3

Exit and run the other docker container, bind mounting the current directory with the extracted files, and call minimac v4.1:

docker run --rm -it --mount type=bind,source=$(pwd),target=/home/root test_4_1 /bin/bash
cd /home/root
minimac4 chr13.1000g.Phase3.v5.With.Parameter.Estimates.msav \
   23andme_chr13_eagle.vcf.gz \
   -o 23andme_chr13_minimac4_1.dose.vcf.gz \
   --output-format vcf.gz \
   --threads 8

If you now exit and compare 23andme_chr13_minimac1_0_3.dose.vcf.gz and 23andme_chr13_minimac4_1.dose.vcf.gz in the current directory, you can see the differences in ref/alt.

I generated the .msav files as instructed, using

minimac4 --update-m3vcf chr13.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz > chr13.1000g.Phase3.v5.With.Parameter.Estimates.msav

If you could take a look and tell me if there is something wrong with how I have built minimac/libraries or the msav reference that would be great.

@jonathonl
Copy link
Contributor

Ok, I will check this out sometime this week.

@Santy-8128
Copy link
Collaborator

Santy-8128 commented Oct 26, 2022 via email

@sereeena
Copy link
Author

The .msav reference files are not text files, so I'm unable to check whether the .m3vcf.gz reference files are converted correctly?

@Santy-8128
Copy link
Collaborator

Santy-8128 commented Oct 26, 2022 via email

@jonathonl
Copy link
Contributor

Your input target genotypes have the alleles switched when comparing against all 3 reference panels (see below). I'm guessing that v4.0.3 is allowing for this (maybe intentionally) (@Santy-8128 might be able to comment) whereas v4.1.0 is expecting the alleles to exactly match what is in the reference.

$ bcftools view ALL.chr13.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf | grep -v "^#" | head | grep 19020095 | cut -f1-5
13	19020095	rs140871821	C	T
$ zcat chr13.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz | grep -v "^#" | head | grep 19020095 | cut -f1-5
13	19020095-19021250	<BLOCK:0-26>	.	.
13	19020095	13:19020095	C	T
$ cget/bin/sav export chr13.1000g.Phase3.v5.With.Parameter.Estimates.msav | grep -v "^#" | head | grep 19020095 | cut -f1-5
13	19020095	.	C	<BLOCK>
13	19020095	.	C	T
$ zcat 23andme_chr13_eagle.vcf.gz | grep -v "^#" | head | grep 19020095 | cut -f1-5
13	19020095	rs140871821	T	C

@sereeena
Copy link
Author

Thanks for helping with this! So it looks like there was always something wrong with our pipeline and that minimac 1.0.3 may have taken the alleles from the references while minimac 4.1 is taking it from the eagle input file, where alleles have been flipped?

I suspect it has something to do with plink, I have a bed/bim/fam where everything looks correct, then I convert to vcf with plink (which then goes on to eagle as a bcf, then eagle output goes to minimac). It is in the conversion to vcf where although I use --keep-allele-order with plink, it is still flipping them in the output vcf. So that will take time to investigate, but as this is not a minimac issue you can go ahead and close it, thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants