-
Notifications
You must be signed in to change notification settings - Fork 1
array blast
-
Purpose: find protospacers of mobile elements integrated into a genome. Direct repeats can be blasted also in order to screen out spacer blast hits to CRISPR arrays in the subject genome (determined by adjacent direct repeat blast hits to the spacer blast hit).
-
blastn-short is used by default.
-
Warning! Spacers should be oriented (potentially reverse-complemented) by identified leaders before spacer blasting! This prevents hitting the complement of the actual protospacer, which leads to PAMs oriented in the wrong direction and other havoc.
-
It may be best to blast all DRs at once for subsequent spacer array screening (spacers that have adjacent DR hits are considered to be located in CRISPR arrays and are thus not protospacers).
$ CLdb_array2fasta.pl -d CLdb.sqlite -g -l > spacer_groups.fasta
- The '-l' flag orients spacers based on the identified leader for the array (you must have identified the leaders prior to this step!).
$ CLdb_spacerBlast.pl -da CLdb.sqlite -q spacer_groups.fasta > spacer_groups_blast.txt
-
This command will blast all of the spacers against every genome in CLdb. You can refine which genomes to blast with various optional flags.
-
The blast table will have comment lines (starting with "#"). They are needed for downstream analyses. Don't remove them! But if you don't want to listen to me, then can be removed with: "grep -v '#' spacer_groups_blast.txt"
$ makeblastdb -parse_seqids -dbtype nucl -in YOUR_GENOME.fasta
-
MAKE SURE that there are not spacers in any of the sequence names! If there are, type
perl -pi -e 's/ /_/g' YOUR_GENOME.fasta
$ blastn -task 'blastn-short' -query spacer_groups.fasta -db YOUR_GENOME -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen btop' >YOUR_GENOME_blast.txt
-
You can change the evalue or other cutoffs if desired. Just make sure to use the specified output format!
-
You can also blast nt using the same basic command
Blast hits to CRISPR arrays can be filtered out by BLASTing the direct repeat sequences and seeing if any of those hits fall next to spacer blast hits. If yes, then that spacer blast hit is probably hitting a CRISPR array.
So, you will need to blast the direct repeats using the same basic steps that you used for blasting the spacers.
$ CLdb_array2fasta.pl -d CLdb.sqlite -g -l -r > DR_groups.fasta
BLAST option 1) If you want to blast genomes already in CLdb: ###
$ CLdb_spacerBlast.pl -da CLdb.sqlite -q DR_groups.fasta > DR_groups_blast.txt
BLAST option 2) If you want to blast another genome not in CLdb: ###
$ makeblastdb -dbtype nucl -in YOUR_GENOME.fasta
`$ blastn -task 'blastn-short' -query DR_groups.fasta -db YOUR_GENOME -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen btop' > YOUR_GENOME_blast.txt
Filtering spacer blasts:
$ CLdb_spacerBlastDRFilter.pl spacer_groups_blast.txt DR_groups_blastn.txt > spacer_groups_filtered_blast.txt