A pipeline for optimizing ultra-large genome analysis by removing transposons or other repetitive elements

pk-zhu/APMG

A pipeline for masking a genome, starting either from a fasta file alone or from a fasta file plus a gff repeat annotation file.

By Pengkai Zhu

Institution: Fujian Agriculture and Forestry University

Email: [email protected]

Cite: Zhu, P., He, T., Zheng, Y., and Chen, L. (2023). The need for masked genomes in gymnosperms. Frontiers in Plant Science 14. doi: 10.3389/fpls.2023.1309744.


Ultra-large genomes often strain computational resources during alignment or indexing, which causes problems in downstream analysis. Many analyses, however, focus on specific genome regions such as exons, introns, UTRs, and key loci, which may represent only 50% or less of the total genome size, so aligning against the entire genome wastes resources. I therefore propose masking repetitive regions to shrink the effective reference genome, making the analysis more efficient and lowering resource demands for large-genome alignments.

1. Software

  1. Red
  2. BEDOPS
  3. bedtools2

2. Workflow (begin with a fasta file)

1. Creating a Directory to Store Output

mkdir -p OUTPUT

2. Predicting Repetitive Sequences from the Genome

Red -gnm /path/to/genome/dir/ -msk ./OUTPUT -rpt ./OUTPUT
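Note that -gnm takes a directory, not a single file. To my knowledge Red only picks up files in that directory whose names end in .fa, so a genome stored as .fasta or .fna may need to be staged first; please confirm against Red's own help output. The sketch below is one way to do that with a symlink, which avoids copying a very large file. The function name stage_genome_for_red and all paths are hypothetical, and the sketch assumes the target directory is separate from the one holding the original file.

```shell
# Hypothetical helper: expose one genome FASTA under a .fa name
# inside a directory that can be passed to Red via -gnm.
stage_genome_for_red() {
    src=$1        # path to the genome FASTA (any extension)
    dest_dir=$2   # directory to hand to Red; assumed distinct from src's directory
    mkdir -p "$dest_dir" || return 1
    base=$(basename "$src")
    # Resolve an absolute path so the symlink works from any working directory.
    src_abs="$(cd "$(dirname "$src")" && pwd)/$base"
    # Strip the last extension and link the file as <name>.fa.
    ln -sf "$src_abs" "$dest_dir/${base%.*}.fa"
}

# Example (placeholder paths):
# stage_genome_for_red /data/genome.fasta ./genome_dir
# Red -gnm ./genome_dir/ -msk ./OUTPUT -rpt ./OUTPUT
```

With a single input named genome.fa, Red writes genome.msk and genome.rpt into the directories given by -msk and -rpt.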

3. Converting Soft-Masked Genome to Hard-Masked Genome

awk '!/>/ {gsub(/[atcg]/,"N")} 1' ./OUTPUT/genome.msk > ./OUTPUT/genome.hardmasked.fa
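To make the effect of this step concrete, the same awk program can be tried on a toy soft-masked record. This is only an illustration: the wrapper name hard_mask and the sequence are made up for the example.

```shell
# Same awk program as in the step above, wrapped so it reads FASTA from stdin.
hard_mask() {
    # Lines containing ">" (headers) are printed untouched;
    # on all other lines, lowercase a/t/c/g become N.
    awk '!/>/ {gsub(/[atcg]/,"N")} 1'
}

printf '>seq1 toy example\nACGTacgtACGT\nttgaNNAC\n' | hard_mask
# >seq1 toy example
# ACGTNNNNACGT
# NNNNNNAC
```

The header keeps its lowercase text, uppercase bases are preserved, and sequence length is unchanged. Lowercase characters other than a, t, c, and g (for example a lowercase n) pass through as they are.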

3. Workflow (begin with a fasta file and a repeats annotation file)

1. Converting the gff File to a bed File

gff2bed < LTR.gff3 > LTR.bed
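The conversion matters because the two formats count positions differently: GFF intervals are 1-based and inclusive, while BED intervals are 0-based and half-open, so the start is reduced by one and the end stays the same. If BEDOPS is not at hand, a three-column BED that is enough for masking can be sketched with awk. This is a stand-in, not a reproduction of gff2bed, which writes additional columns; the function name gff_to_bed3 is hypothetical.

```shell
# Hypothetical stand-in for gff2bed: emit chrom, start, end only.
gff_to_bed3() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        /^#/ || NF < 5 { next }      # skip comments, directives, and non-feature lines
        { print $1, $4 - 1, $5 }     # 1-based inclusive -> 0-based half-open
    ' "$@"
}

# Example (placeholder file names):
# gff_to_bed3 LTR.gff3 > LTR.bed
```

A feature spanning positions 11 to 20 in the GFF file becomes the BED interval 10 to 20, which still covers the same ten bases.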

2. Masking genome.fa

bedtools maskfasta -fi genome.fa -bed LTR.bed -fo genome.hardmasked.fasta
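After masking, it can be useful to check how much of the reference is now N. The sketch below reports that percentage for any FASTA file. The function name masked_fraction is hypothetical, and the figure also includes any N runs that were already present as assembly gaps, so it is an upper bound on the repeat content that was masked.

```shell
# Hypothetical helper: percentage of sequence characters that are N (or n).
masked_fraction() {
    awk '!/^>/ {
             total  += length($0)          # count bases before touching the line
             masked += gsub(/[Nn]/, "N")   # gsub returns the number of matches
         }
         END {
             if (total > 0) printf "%.2f\n", 100 * masked / total
             else           print "0.00"
         }' "$1"
}

# Example (placeholder file name):
# masked_fraction genome.hardmasked.fasta
```

Running it on both the original and the masked file gives a quick estimate of how many bases the masking step replaced.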
