Skip to content
This repository has been archived by the owner on Aug 10, 2022. It is now read-only.

array blast

Nick Youngblut edited this page Jun 4, 2014 · 1 revision

This is a workflow for conducting BLASTs of spacers against a blast database

  • Purpose: find protospacers of mobile elements integrated into a genome. Direct repeats can be blasted also in order to screen out spacer blast hits to CRISPR arrays in the subject genome (determined by adjacent direct repeat blast hits to the spacer blast hit).

  • blastn-short is used by default.

  • Warning! Spacers should be oriented (potentially reverse-complemented) by identified leaders before spacer blasting! This prevents hitting the complement of the actual protospacer, which leads to PAMs oriented in the wrong direction and other havoc.

  • It may be best to blast all DRs at once for subsequent spacer array screening (spacers that have adjacent DR hits are considered to be located in CRISPR arrays and are thus not protospacers).

Get all unique spacers (spacer groups) from CLdb

$ CLdb_array2fasta.pl -d CLdb.sqlite -g -l > spacer_groups.fasta

  • The '-l' flag orients spacers based on the identified leader for the array (you must have identified the leaders prior to this step!).

BLAST option 1) If you want to blast genomes already in CLdb:

$ CLdb_spacerBlast.pl -da CLdb.sqlite -q spacer_groups.fasta > spacer_groups_blast.txt

  • This command will blast all of the spacers against every genome in CLdb. You can refine which genomes to blast with various optional flags.

  • The blast table will have comment lines (starting with "#"). They are needed for downstream analyses. Don't remove them! But if you don't want to listen to me, then can be removed with: "grep -v '#' spacer_groups_blast.txt"

BLAST option 2) If you want to blast another genome not in CLdb:

$ makeblastdb -parse_seqids -dbtype nucl -in YOUR_GENOME.fasta

  • MAKE SURE that there are not spacers in any of the sequence names! If there are, type perl -pi -e 's/ /_/g' YOUR_GENOME.fasta

$ blastn -task 'blastn-short' -query spacer_groups.fasta -db YOUR_GENOME -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen btop' >YOUR_GENOME_blast.txt

  • You can change the evalue or other cutoffs if desired. Just make sure to use the specified output format!

  • You can also blast nt using the same basic command

Many blast hits will probably be to other CRISPR arrays. You need to filter these out!

Blast hits to CRISPR arrays can be filtered out by BLASTing the direct repeat sequences and seeing if any of those hits fall next to spacer blast hits. If yes, then that spacer blast hit is probably hitting a CRISPR array.

So, you will need to blast the direct repeats using the same basic steps that you used for blasting the spacers.

$ CLdb_array2fasta.pl -d CLdb.sqlite -g -l -r > DR_groups.fasta

BLAST option 1) If you want to blast genomes already in CLdb: ###

$ CLdb_spacerBlast.pl -da CLdb.sqlite -q DR_groups.fasta > DR_groups_blast.txt

BLAST option 2) If you want to blast another genome not in CLdb: ###

$ makeblastdb -dbtype nucl -in YOUR_GENOME.fasta

`$ blastn -task 'blastn-short' -query DR_groups.fasta -db YOUR_GENOME -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen btop' > YOUR_GENOME_blast.txt

Filtering spacer blasts:

$ CLdb_spacerBlastDRFilter.pl spacer_groups_blast.txt DR_groups_blastn.txt > spacer_groups_filtered_blast.txt

What do now?