Initial data loading
making the sqlite tables & loading loci table
spacers and direct repeats
- loading arrays and direct repeats to their respective tables
- grouping spacers and direct repeats (groups with same sequence)
$ CLdb_makeDB.pl
By default, the database file is named 'CLdb.sqlite'.
The 'loci' table has all of the basic info on each CRISPR locus (for formatting, see Database setup).
$ CLdb_loadLoci.pl -d CLdb.sqlite < loci.txt
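After the tables are made and the loci are loaded, it can be useful to confirm that the database file actually contains rows. The following is a minimal sketch using Python's standard `sqlite3` module, not part of CLdb itself. It reads table names from `sqlite_master`, so it makes no assumption about the CLdb schema; the function name `table_row_counts` is hypothetical.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every table in an open SQLite connection."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Names come straight from sqlite_master, so double-quoting them is sufficient.
        counts[name] = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    return counts
```

For example, `table_row_counts(sqlite3.connect("CLdb.sqlite"))` should report a non-zero count for the 'loci' table once `CLdb_loadLoci.pl` has run.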
This step is only needed if the loci table contains genomes with more than one sequence (multiple scaffolds, chromosomes, etc.).
$ CLdb_addScaffoldCount.pl -d CLdb.sqlite
$ CLdb_loadArrays.pl -d CLdb.sqlite
$ CLdb_groupArrayElements.pl -d CLdb.sqlite -s -r
$ CLdb_hclusterArrays.pl -d CLdb.sqlite -s -r
$ CLdb_loadDRConsensus.pl -d CLdb.sqlite
$ CLdb_spacerPairwiseBlast.pl -d CLdb.sqlite
- used for plotting & checking for partial overlap of spacers
$ CLdb_getGenesInLoci.pl -d CLdb.sqlite > gene_table.txt
- manually curate the 'gene_alias' column values
$ CLdb_loadGenes.pl -d CLdb.sqlite < gene_table.txt
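Before running `CLdb_loadGenes.pl`, a quick check for rows where 'gene_alias' was left empty during manual curation may save a reload. The sketch below assumes that gene_table.txt is tab-delimited and has a header row containing a 'gene_alias' column; that layout is an assumption, not something confirmed here. The function name `rows_missing_alias` and the demo columns are hypothetical.

```python
import csv
import io


def rows_missing_alias(handle, column="gene_alias"):
    """Return 1-based data row numbers whose `column` value is empty or blank."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in header")
    return [
        i
        for i, row in enumerate(reader, start=1)
        if not (row[column] or "").strip()
    ]


# Hypothetical two-column table, for illustration only.
demo = io.StringIO("gene_id\tgene_alias\ng1\tcas1\ng2\t\n")
print(rows_missing_alias(demo))  # [2]
```

On the real file this could be called as `rows_missing_alias(open("gene_table.txt", newline=""))`.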
$ CLdb_getLeaderRegions.pl -d CLdb.sqlite > possible_leaders.fna
$ CLdb_getLeaderRegions.pl -d CLdb.sqlite -q "AND subtype='I-B'" > leaders_IB.fna
$ mafft --adjustdirection leaders_IB.fna > leaders_IB_aln.fna
- if 2 leaders are written for a locus, remove the one that does not align
- determine where leader conservation ends
  - for example: conservation ends 50 bp from the end of the alignment
  - this will be trimmed off of the leader region when it is added to the database
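Where conservation ends is normally judged by eye in an alignment viewer, but a rough number can also be estimated from the alignment itself. The sketch below is one possible approach, not the method CLdb uses. It scores each alignment column by the fraction of sequences that share the most common base, then counts the trailing columns that fall below a threshold. The 0.8 threshold, the function names, and the choice to scan from the end of the alignment are all assumptions; a single conserved column by chance will stop the scan early, so the result should be checked against the alignment.

```python
from collections import Counter


def read_fasta(lines):
    """Parse FASTA lines into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def column_conservation(seqs):
    """Per column: fraction of sequences sharing the most common non-gap base."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for col in zip(*seqs):
        bases = Counter(b.upper() for b in col if b != "-")
        top = bases.most_common(1)[0][1] if bases else 0
        scores.append(top / len(col))
    return scores


def unconserved_tail(scores, threshold=0.8):
    """Count trailing alignment columns whose score is below `threshold`."""
    n = 0
    for score in reversed(scores):
        if score >= threshold:
            break
        n += 1
    return n
```

With the alignment from the mafft step, `unconserved_tail(column_conservation([s for _, s in read_fasta(open("leaders_IB_aln.fna"))]))` gives a candidate number of columns to trim.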
$ CLdb_loadLeaders.pl -d CLdb.sqlite -t 50 test_leader_Ib.fna test_leader_Ib_aln.fna
- '-t 50' = trim off the last 50 bp of unconserved sequence in the alignment
  - the 50 bp are trimmed from the side farthest from the array
- both the aligned and unaligned sequences are needed because mafft can alter orientation during alignment (--adjustdirection)
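To see which leaders mafft reoriented, the sequence names in the alignment can be inspected: to the best of my knowledge, `--adjustdirection` prepends `_R_` to the name of every sequence it reverse-complements. The sketch below lists those sequences; the function names are hypothetical and the `_R_` convention should be confirmed against the mafft version in use.

```python
def fasta_names(lines):
    """Return the sequence names (header lines without '>') from FASTA lines."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def reversed_by_mafft(aligned_names, prefix="_R_"):
    """Return the original names of sequences carrying mafft's reversal prefix."""
    return [n[len(prefix):] for n in aligned_names if n.startswith(prefix)]
```

For example, `reversed_by_mafft(fasta_names(open("leaders_IB_aln.fna")))` returns the names of the leaders whose orientation differs between the unaligned and aligned files.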
$ CLdb_groupLeaders.pl -da CLdb.sqlite