Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allelic analysis HiC-Pro G2 genome not mapping #652

Open
luckysardar opened this issue Oct 19, 2024 · 4 comments
Open

Allelic analysis HiC-Pro G2 genome not mapping #652

luckysardar opened this issue Oct 19, 2024 · 4 comments

Comments

@luckysardar
Copy link

luckysardar commented Oct 19, 2024

Dear
I am trying to do allelic analysis Hic data getting same error G2 is mapping reads are zero and iced normalisation step getting error. I used bellow command to run analysis. non allelic analysis running completely good . I tried many publically available data with different strain cross like 129S vs CAST and B6 vs PWK got same error I am unable to find solution where its going wrong please help me

bowtie2 allespe.stat file looks like below
HiC-Pro_3.1.0/scripts/markAllelicStatus.py
bam=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs.bam
snpFile=mm10_snps_C57b6_PWK_PhJ.vcf
tag=XA
output=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs_allspe.bam
verbose=True

Total number of snps loaded 17046154.0

Total number of reads 353850 100
Number of reads with at least one 'N' 368870 104.245
Number of reads assigned to ref genome 91967 25.99
Number of reads assigned to alt genome 0 0.0
Number of conflicting reads 0 0.0
Number of unassigned reads 261883 74.01

code used generate VCF and masked genome

`extract_snps.py -i mgp.v5.merged.snps_all.dbSNP142.vcf -a PWK_PhJ> mm10_snps_C57b6_PWK_PhJ.vcf

bedtools maskfasta -fi mm10.fa -bed mm10_snps_C57b6_PWK_PhJ.vcf -fo masked_mm10.fa

bowtie2-build masked_mm10.fa masked_mm10 --threads 6

HiC-Pro -c config_test_as.txt -i 01_Fastq/ -o 03_output2`

please suggest me where I am going wrong

Thank you

@nservant
Copy link
Owner

Hi,
Can I see the first few lines of your VCF ?
As well as the chromosome name in your fasta masked file ?
Thanks

@luckysardar
Copy link
Author

luckysardar commented Oct 23, 2024

Thank you for response, here is the masked genome chromosome name and vcf file

chr names

chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chrX
chrY
chrMT

vcf file(https://drive.google.com/file/d/1-18ej_EoZpTL4KEvkiukEv-exqz6Mz3G/view?usp=sharing)

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.13+htslib-1.13
##bcftoolsCommand=mpileup -f Mus_musculus.GRCm39.dna.toplevel.fa.gz -b samples -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500
##reference= mgp_REL2021_snps.vcf [ CAST_EiJ ]
##contig=<ID=1,length=195154279>
##contig=<ID=2,length=181755017>
##contig=<ID=3,length=159745316>
##contig=<ID=4,length=156860686>
##contig=<ID=5,length=151758149>
##contig=<ID=6,length=149588044>
##contig=<ID=7,length=144995196>
##contig=<ID=8,length=130127694>
##contig=<ID=9,length=124359700>
##contig=<ID=10,length=130530862>
##contig=<ID=11,length=121973369>
##contig=<ID=12,length=120092757>
##contig=<ID=13,length=120883175>
##contig=<ID=14,length=125139656>
##contig=<ID=15,length=104073951>
##contig=<ID=16,length=98008968>
##contig=<ID=17,length=95294699>
##contig=<ID=18,length=90720763>
##contig=<ID=19,length=61420004>
##contig=<ID=X,length=169476592>
##ALT=<ID=,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths (high-quality bases)">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callCommand=call -mAv -f GQ,GP -p 0.99; Date=Wed Aug 11 21:20:03 2021
##bcftools_normCommand=norm --fasta-ref Mus_musculus.GRCm39.dna.toplevel.fa.gz -m +indels; Date=Fri Aug 13 11:11:49 2021
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="High confidence (1) or low confidence (0) based on soft filtering values">
##FILTER=<ID=LowQual,Description="Low quality variants">
##VEP="v104" time="2021-08-30 23:27:00" cache="mus_musculus/104_GRCm39" ensembl-funcgen=104.59ae779 ensembl-variation=104.6154f8b ensembl=104.1af1dce ensembl-io=104.1d3bb6e assembly="GRCm39" dbSNP="150" gencode="GENCODE M27" regbuild="1.0" sift="sift"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|SIFT|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS">
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view -i 'FORMAT/FI[
] = 1' mgp_REL2021_snps.vcf.gz; Date=Sat Dec 18 19:08:09 2021
##bcftools_annotateVersion=1.13+htslib-1.13
##bcftools_annotateCommand=annotate -x INFO/VDB,INFO/SGB,INFO/RPBZ,INFO/MQBZ,INFO/MQBZ,INFO/MQSBZ,INFO/BQBZ,INFO/SCBZ,INFO/FS,INFO/MQOF,INFO/AC,INFO/AN,FORMAT/SP,FORMAT/ADF,FORMAT/ADR,FORMAT/GP; Date=Sat Dec 18 19:08:09 2021
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewCommand=view -a -Oz -o final_mgp_REL2021_snps.vcf.gz; Date=Sat Dec 18 19:08:09 2021
##bcftools_annotateCommand=annotate -x INFO/MQ0F -Oz -o final_mgp_REL2021_snps.vcf.gz mgp_REL2021_snps.vcf.gz; Date=Mon Dec 20 07:12:23 2021
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CAST_EiJ-129S1_SvImJ-F1
chr1 3055820 . T A 5754.96 PASS DP=5080;AD=3968,1110;DP4=2007,1961,556,556;MQ=59;CSQ=A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,C|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=48;AN=104 GT 0|1
chr1 3055997 . C A 5108.92 PASS DP=5097;AD=3980,1106;DP4=2151,1829,569,548;MQ=59;CSQ=A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=41;AN=104 GT 0|1
chr1 3056933 . G T 4964.34 PASS DP=4922;AD=3793,1126;DP4=2018,1775,579,550;MQ=59;CSQ=T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=41;AN=102 GT 0|1
chr1 3057907 . G T 6104.57 PASS DP=5627;AD=4404,1214;DP4=2300,2104,556,667;MQ=59;CSQ=T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=48;AN=104 GT 0|1

Thank you

@nservant
Copy link
Owner

Looking at your VCF file header, it seems that the VCF was built using mm39 genome, while you used mm10 to build your N-mask genome.

##bcftoolsCommand=mpileup -f Mus_musculus.GRCm39.dna.toplevel.fa.gz -b samples -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500

and it makes sense with the stats, because all your reads are unassigned

Number of unassigned reads 261883 74.01

It means that for a given SNP, HiC-Pro is exacting the nucleotide at that position in the BAM file, but that the latter is not corresponding to the expected allele ... which match the fact that you are not using the same genome version

@luckysardar
Copy link
Author

luckysardar commented Oct 23, 2024

Thanks, I think one mistake I have made is when I raised issue initially worked on mm10 later I switch to mm39 based on issue. downloaded latest vcf from mouse genome project and downloaded reference genome with same version from ensembl (Mus_musculus.GRCm39.dna.toplevel.fa (104 version)). Changing genome version and vcf did not resolve my issue

allelic stat based on Hi-pro subset_masked_GRCm39.bwt2pairs_allspe.allelstat

/usr/local/bin//HiC-Pro_3.1.0/scripts/markAllelicStatus.py
ibam=bowtie_results/bwt2/Sample/subset_masked_GRCm39.bwt2pairs.bam
snpFile=/mnt/HI_data/VCF/snps_CASTEiJ_129S1_.vcf
tag=XA
output=bowtie_results/bwt2/Sample/subset_masked_GRCm39.bwt2pairs_allspe.bam
verbose=True

Total number of snps loaded 21312339.0

Total number of reads 56614 100
Number of reads with at least one 'N' 21965 38.798
Number of reads assigned to ref genome 8023 14.171
Number of reads assigned to alt genome 0 0.0
Number of conflicting reads 0 0.0
Number of unassigned reads 48591 85.829

Used subset_masked_GRCm39.bwt2pairs.bam file input for SNPsplit tool

Allele-specific paired-end sorting report

Read pairs/singletons processed in total: 28307
thereof were read pairs: 28307
thereof were singletons: 0
Reads were unassignable (not overlapping SNPs): 14537 (51.35%)
thereof were read pairs: 14537
thereof were singletons: 0
Reads were specific for genome 1: 6464 (22.84%)
thereof were read pairs: 6464
thereof were singletons: 0
Reads were specific for genome 2: 6868 (24.26%)
thereof were read pairs: 6868
thereof were singletons: 0
Reads contained conflicting SNP information: 438 (1.55%)
thereof were read pairs: 438
thereof were singletons: 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants