Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent ploidy and segmentation fault when creating 1000GP chrX.PAR[1-2].msav #74

Open
youngchanpark opened this issue Jul 1, 2024 · 1 comment

Comments

@youngchanpark
Copy link

Hi,

I split the 1000GP chrX VCF into nonPAR, PAR1, and PAR2 regions then tried creating .msav files with these but I encountered some issues. Below is the code I used.

#/usr/bin/env bash
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz.tbi

# Borders for nonPAR checked manually
# PAR1 - nonPAR border variant
# chrX    2781514
# nonPAR PAR2 border variant
# chrX    155700882
# The above two variants are nonPAR variants. So the nonPAR vcf should start and end with these two variants included.

echo -e "chrX\t2531036\t2531036\tHGSV_248479\nchrX\t2777488\t2777488\tHGSV_249395" > exclude_ids.bed
# these two <DUP> and <DEL> variants in PAR1 sneaks into the nonPAR if we don't specifically exclude
bcftools view -r chrX:2781514-155700882 -T ^exclude_ids.bed 1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz -Oz -o chrX.nonPAR.vcf.gz
tabix 1kGP_high_coverage_Illumina.chrX.nonPAR.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

bcftools view -r chrX:1-2781513 1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz -Oz -o chrX.PAR1.vcf.gz
tabix chrX.PAR1.vcf.gz

bcftools view -r chrX:155700883- 1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz -Oz -o chrX.PAR2.vcf.gz
tabix chrX.PAR2.vcf.gz

# chrX nonPAR (OK)
minimac4 --compress-reference chrX.nonPAR.vcf.gz > chrX.nonPAR.msav

# chrX PAR1 (Inconsistent ploidy)
minimac4 --compress-reference chrX.PAR1.vcf.gz > chrX.PAR1.msav

# chrX PAR2 (Segfault)
minimac4 --compress-reference chrX.PAR2.vcf.gz > chrX.PAR2.msav

The outputs of minimac4 were:

minimac v4.1.6

Error: sample ploidy is not consistent
Error: compressing variant failed
minimac v4.1.6

Mean Compression Ratio: 0.0399954
Min Compression Ratio: 0.0240718
Max Compression Ratio: 0.0915209
minimac v4.1.6

./create_xchr_msavs.sh: line 32: 3385400 Segmentation fault      (core dumped) 

msav for the nonPAR region was created as expected.

All samples in the PAR1 and PAR2 vcfs are either haploid or diploid for all variants. I assumed this would qualify as consistent ploidy, but PAR1 failed with inconsistent ploidy message. Is my understanding wrong? Does all samples within the VCF need to be either haploid or diploid?

I'm also getting a Segmentation fault for PAR2.

@jonathonl
Copy link
Contributor

Thanks for this thorough report. When you extract nonPAR and PAR2 from this VCF, you are getting some structural variants that overlap the PAR boundaries. This is causing mixed ploidy. The PAR1 compression seems to be running fine. You will need to add -i 'POS>=2781514 to the nonPAR bcftools command and -i 'POS>=155700883' to the PAR2 bcftools command.

I will need to fix the segfault so that it instead prints an error message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants