Allelic analysis HiC-Pro G2 genome not mapping #652

luckysardar · 2024-10-19T06:50:00Z

Dear
I am trying to do allelic analysis Hic data getting same error G2 is mapping reads are zero and iced normalisation step getting error. I used bellow command to run analysis. non allelic analysis running completely good . I tried many publically available data with different strain cross like 129S vs CAST and B6 vs PWK got same error I am unable to find solution where its going wrong please help me

bowtie2 allespe.stat file looks like below
HiC-Pro_3.1.0/scripts/markAllelicStatus.py
bam=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs.bam
snpFile=mm10_snps_C57b6_PWK_PhJ.vcf
tag=XA
output=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs_allspe.bam
verbose=True

Total number of snps loaded 17046154.0

Total number of reads 353850 100
Number of reads with at least one 'N' 368870 104.245
Number of reads assigned to ref genome 91967 25.99
Number of reads assigned to alt genome 0 0.0
Number of conflicting reads 0 0.0
Number of unassigned reads 261883 74.01

code used generate VCF and masked genome

`extract_snps.py -i mgp.v5.merged.snps_all.dbSNP142.vcf -a PWK_PhJ> mm10_snps_C57b6_PWK_PhJ.vcf

bedtools maskfasta -fi mm10.fa -bed mm10_snps_C57b6_PWK_PhJ.vcf -fo masked_mm10.fa

bowtie2-build masked_mm10.fa masked_mm10 --threads 6

HiC-Pro -c config_test_as.txt -i 01_Fastq/ -o 03_output2`

please suggest me where I am going wrong

Thank you

nservant · 2024-10-23T14:11:07Z

Hi,
Can I see the first few lines of your VCF ?
As well as the chromosome name in your fasta masked file ?
Thanks

luckysardar · 2024-10-23T15:43:39Z

Thank you for response, here is the masked genome chromosome name and vcf file

chr names

chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chrX
chrY
chrMT

vcf file(https://drive.google.com/file/d/1-18ej_EoZpTL4KEvkiukEv-exqz6Mz3G/view?usp=sharing)

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.13+htslib-1.13
##bcftoolsCommand=mpileup -f Mus_musculus.GRCm39.dna.toplevel.fa.gz -b samples -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500
##reference= mgp_REL2021_snps.vcf [ CAST_EiJ ]
##contig=<ID=1,length=195154279>
##contig=<ID=2,length=181755017>
##contig=<ID=3,length=159745316>
##contig=<ID=4,length=156860686>
##contig=<ID=5,length=151758149>
##contig=<ID=6,length=149588044>
##contig=<ID=7,length=144995196>
##contig=<ID=8,length=130127694>
##contig=<ID=9,length=124359700>
##contig=<ID=10,length=130530862>
##contig=<ID=11,length=121973369>
##contig=<ID=12,length=120092757>
##contig=<ID=13,length=120883175>
##contig=<ID=14,length=125139656>
##contig=<ID=15,length=104073951>
##contig=<ID=16,length=98008968>
##contig=<ID=17,length=95294699>
##contig=<ID=18,length=90720763>
##contig=<ID=19,length=61420004>
##contig=<ID=X,length=169476592>
##ALT=<ID=,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths (high-quality bases)">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callCommand=call -mAv -f GQ,GP -p 0.99; Date=Wed Aug 11 21:20:03 2021
##bcftools_normCommand=norm --fasta-ref Mus_musculus.GRCm39.dna.toplevel.fa.gz -m +indels; Date=Fri Aug 13 11:11:49 2021
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="High confidence (1) or low confidence (0) based on soft filtering values">
##FILTER=<ID=LowQual,Description="Low quality variants">
##VEP="v104" time="2021-08-30 23:27:00" cache="mus_musculus/104_GRCm39" ensembl-funcgen=104.59ae779 ensembl-variation=104.6154f8b ensembl=104.1af1dce ensembl-io=104.1d3bb6e assembly="GRCm39" dbSNP="150" gencode="GENCODE M27" regbuild="1.0" sift="sift"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|SIFT|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS">
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view -i 'FORMAT/FI[] = 1' mgp_REL2021_snps.vcf.gz; Date=Sat Dec 18 19:08:09 2021
##bcftools_annotateVersion=1.13+htslib-1.13
##bcftools_annotateCommand=annotate -x INFO/VDB,INFO/SGB,INFO/RPBZ,INFO/MQBZ,INFO/MQBZ,INFO/MQSBZ,INFO/BQBZ,INFO/SCBZ,INFO/FS,INFO/MQOF,INFO/AC,INFO/AN,FORMAT/SP,FORMAT/ADF,FORMAT/ADR,FORMAT/GP; Date=Sat Dec 18 19:08:09 2021
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewCommand=view -a -Oz -o final_mgp_REL2021_snps.vcf.gz; Date=Sat Dec 18 19:08:09 2021
##bcftools_annotateCommand=annotate -x INFO/MQ0F -Oz -o final_mgp_REL2021_snps.vcf.gz mgp_REL2021_snps.vcf.gz; Date=Mon Dec 20 07:12:23 2021
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CAST_EiJ-129S1_SvImJ-F1
chr1 3055820 . T A 5754.96 PASS DP=5080;AD=3968,1110;DP4=2007,1961,556,556;MQ=59;CSQ=A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,C|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=48;AN=104 GT 0|1
chr1 3055997 . C A 5108.92 PASS DP=5097;AD=3980,1106;DP4=2151,1829,569,548;MQ=59;CSQ=A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=41;AN=104 GT 0|1
chr1 3056933 . G T 4964.34 PASS DP=4922;AD=3793,1126;DP4=2018,1775,579,550;MQ=59;CSQ=T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=41;AN=102 GT 0|1
chr1 3057907 . G T 6104.57 PASS DP=5627;AD=4404,1214;DP4=2300,2104,556,667;MQ=59;CSQ=T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=48;AN=104 GT 0|1

Thank you

nservant · 2024-10-23T17:28:24Z

Looking at your VCF file header, it seems that the VCF was built using mm39 genome, while you used mm10 to build your N-mask genome.

##bcftoolsCommand=mpileup -f Mus_musculus.GRCm39.dna.toplevel.fa.gz -b samples -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500

and it makes sense with the stats, because all your reads are unassigned

Number of unassigned reads 261883 74.01

It means that for a given SNP, HiC-Pro is exacting the nucleotide at that position in the BAM file, but that the latter is not corresponding to the expected allele ... which match the fact that you are not using the same genome version

luckysardar · 2024-10-23T19:19:51Z

Thanks, I think one mistake I have made is when I raised issue initially worked on mm10 later I switch to mm39 based on issue. downloaded latest vcf from mouse genome project and downloaded reference genome with same version from ensembl (Mus_musculus.GRCm39.dna.toplevel.fa (104 version)). Changing genome version and vcf did not resolve my issue

allelic stat based on Hi-pro subset_masked_GRCm39.bwt2pairs_allspe.allelstat

/usr/local/bin//HiC-Pro_3.1.0/scripts/markAllelicStatus.py
ibam=bowtie_results/bwt2/Sample/subset_masked_GRCm39.bwt2pairs.bam
snpFile=/mnt/HI_data/VCF/snps_CASTEiJ_129S1_.vcf
tag=XA
output=bowtie_results/bwt2/Sample/subset_masked_GRCm39.bwt2pairs_allspe.bam
verbose=True

Total number of snps loaded 21312339.0

Total number of reads 56614 100
Number of reads with at least one 'N' 21965 38.798
Number of reads assigned to ref genome 8023 14.171
Number of reads assigned to alt genome 0 0.0
Number of conflicting reads 0 0.0
Number of unassigned reads 48591 85.829

Used subset_masked_GRCm39.bwt2pairs.bam file input for SNPsplit tool

Allele-specific paired-end sorting report

Read pairs/singletons processed in total: 28307
thereof were read pairs: 28307
thereof were singletons: 0
Reads were unassignable (not overlapping SNPs): 14537 (51.35%)
thereof were read pairs: 14537
thereof were singletons: 0
Reads were specific for genome 1: 6464 (22.84%)
thereof were read pairs: 6464
thereof were singletons: 0
Reads were specific for genome 2: 6868 (24.26%)
thereof were read pairs: 6868
thereof were singletons: 0
Reads contained conflicting SNP information: 438 (1.55%)
thereof were read pairs: 438
thereof were singletons: 0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allelic analysis HiC-Pro G2 genome not mapping #652

Allelic analysis HiC-Pro G2 genome not mapping #652

luckysardar commented Oct 19, 2024 •

edited

Loading

nservant commented Oct 23, 2024

luckysardar commented Oct 23, 2024 •

edited

Loading

nservant commented Oct 23, 2024

luckysardar commented Oct 23, 2024 •

edited

Loading

Allelic analysis HiC-Pro G2 genome not mapping #652

Allelic analysis HiC-Pro G2 genome not mapping #652

Comments

luckysardar commented Oct 19, 2024 • edited Loading

bowtie2 allespe.stat file looks like below HiC-Pro_3.1.0/scripts/markAllelicStatus.py bam=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs.bam snpFile=mm10_snps_C57b6_PWK_PhJ.vcf tag=XA output=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs_allspe.bam verbose=True

Total number of snps loaded 17046154.0

nservant commented Oct 23, 2024

luckysardar commented Oct 23, 2024 • edited Loading

nservant commented Oct 23, 2024

luckysardar commented Oct 23, 2024 • edited Loading

allelic stat based on Hi-pro subset_masked_GRCm39.bwt2pairs_allspe.allelstat

/usr/local/bin//HiC-Pro_3.1.0/scripts/markAllelicStatus.py ibam=bowtie_results/bwt2/Sample/subset_masked_GRCm39.bwt2pairs.bam snpFile=/mnt/HI_data/VCF/snps_CASTEiJ_129S1_.vcf tag=XA output=bowtie_results/bwt2/Sample/subset_masked_GRCm39.bwt2pairs_allspe.bam verbose=True

Total number of snps loaded 21312339.0

Used subset_masked_GRCm39.bwt2pairs.bam file input for SNPsplit tool

Allele-specific paired-end sorting report

luckysardar commented Oct 19, 2024 •

edited

Loading

bowtie2 allespe.stat file looks like below
HiC-Pro_3.1.0/scripts/markAllelicStatus.py
bam=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs.bam
snpFile=mm10_snps_C57b6_PWK_PhJ.vcf
tag=XA
output=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs_allspe.bam
verbose=True

luckysardar commented Oct 23, 2024 •

edited

Loading

luckysardar commented Oct 23, 2024 •

edited

Loading

/usr/local/bin//HiC-Pro_3.1.0/scripts/markAllelicStatus.py
ibam=bowtie_results/bwt2/Sample/subset_masked_GRCm39.bwt2pairs.bam
snpFile=/mnt/HI_data/VCF/snps_CASTEiJ_129S1_.vcf
tag=XA
output=bowtie_results/bwt2/Sample/subset_masked_GRCm39.bwt2pairs_allspe.bam
verbose=True