Initial data loading

Initial DB construction

Workflow

making the sqlite tables & loading loci table

making the tables
load the loci table
adding scaffold count

spacers and direct repeats

loading arrays and direct repeats to their respective tables
grouping spacers and direct repeats (groups with same sequence)

making the database tables

$ CLdb_makeDB.pl

By default, the database file is named 'CLdb.sqlite'.
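To confirm that the tables were created, the database file can be opened with any SQLite client. The following is a minimal sketch using Python's standard `sqlite3` module; the helper name `list_tables` is illustrative and is not part of CLdb.

```python
import sqlite3


def list_tables(con):
    """Return the names of all tables in an open SQLite connection, sorted."""
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Example use (assumes CLdb.sqlite already exists in the working directory;
# sqlite3.connect() would otherwise create an empty file with that name):
#   con = sqlite3.connect("CLdb.sqlite")
#   print(list_tables(con))
#   con.close()
```

The exact set of table names depends on the CLdb version, so none are listed here.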

loading the loci table

The 'loci' table has all of the basic info on each CRISPR locus (for formatting, see Database setup).

$ CLdb_loadLoci.pl -d CLdb.sqlite < loci.txt

adding number of scaffolds to the loci table

This step is only needed if the loci table includes genomes containing more than one sequence (multiple scaffolds, chromosomes, etc.).

$ CLdb_addScaffoldCount.pl -d CLdb.sqlite
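To check whether a genome falls into this category, count the sequences in its FASTA file. The sketch below is a standalone check, not a description of how CLdb_addScaffoldCount.pl works internally; the function name and the file name `genome.fna` are hypothetical.

```python
def count_fasta_records(lines):
    """Count sequences in FASTA text: one per header line starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Example use (genome.fna is a placeholder file name):
#   with open("genome.fna") as handle:
#       if count_fasta_records(handle) > 1:
#           print("genome has multiple scaffolds or chromosomes")
```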

Spacers and direct repeats

loading arrays and direct repeats to their respective tables

$ CLdb_loadArrays.pl -d CLdb.sqlite

grouping spacers and direct repeats (groups with same sequence)

$ CLdb_groupArrayElements.pl -d CLdb.sqlite -s -r
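The idea of grouping by identical sequence can be illustrated with a short Python sketch. This shows the concept only and is not CLdb's implementation: it assumes case-insensitive exact string identity and does not treat reverse complements as equal, which CLdb may or may not do. The function and element names are hypothetical.

```python
def group_identical(seqs):
    """Assign a group ID to each named sequence; identical sequences share an ID.

    seqs: dict mapping element name -> sequence.
    Group IDs start at 1 and follow the order in which sequences first appear.
    """
    group_of_seq = {}
    assignment = {}
    for name, seq in seqs.items():
        key = seq.upper()  # compare case-insensitively
        if key not in group_of_seq:
            group_of_seq[key] = len(group_of_seq) + 1
        assignment[name] = group_of_seq[key]
    return assignment


# Example use with made-up spacers:
#   group_identical({"sp1": "ACGT", "sp2": "acgt", "sp3": "GGCC"})
```

In this example, `sp1` and `sp2` land in the same group because they differ only in case, while `sp3` gets its own group.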

pseudo-hierarchical clustering of spacers & DRs (good for plotting loci)

$ CLdb_hclusterArrays.pl -d CLdb.sqlite -s -r

calculating direct repeat consensus sequences

$ CLdb_loadDRConsensus.pl -d CLdb.sqlite

pairwise blast of all spacers

$ CLdb_spacerPairwiseBlast.pl -d CLdb.sqlite

used for plotting & checking for partial overlap of spacers

CRISPR-associated genes

getting genes in CRISPR locus region (defined in Loci table)

$ CLdb_getGenesInLoci.pl -d CLdb.sqlite > gene_table.txt

manually curate the 'gene_alias' column values

loading genes into the Genes table

$ CLdb_loadGenes.pl -d CLdb.sqlite < gene_table.txt

Leader region

getting potential leader regions

$ CLdb_getLeaderRegions.pl -d CLdb.sqlite > possible_leaders.fna

getting potential leader regions for just 1 subtype

$ CLdb_getLeaderRegions.pl -d CLdb.sqlite -q "AND subtype='I-B'" > leaders_IB.fna

identifying leaders

$ mafft --adjustdirection leaders_IB.fna > leaders_IB_aln.fna

if 2 leaders were written for a locus, remove the one that does not align
determine where leader conservation ends
- for example: conservation ends 50bp from the end of the alignment
- this unconserved sequence will be trimmed off of the leader region when it is added to the database
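Conservation is usually judged by eye in an alignment viewer, but a rough per-column score can help locate where it ends. The sketch below is one possible approach, not part of CLdb: it scores each alignment column by the fraction of sequences sharing the most common non-gap base, then counts the trailing columns after the last column that meets a threshold. The 0.8 threshold and all function names are assumptions. Alignment columns include gaps, so the column count is only an approximation of a length in bp.

```python
from collections import Counter


def read_fasta(lines):
    """Parse FASTA lines into a dict {name: sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def column_conservation(aligned):
    """Per column: fraction of sequences that carry the most common non-gap base."""
    seqs = [s.upper() for s in aligned]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    scores = []
    for column in zip(*seqs):
        counts = Counter(base for base in column if base != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


def unconserved_tail(scores, threshold=0.8):
    """Number of trailing columns after the last column scoring >= threshold."""
    for i in range(len(scores) - 1, -1, -1):
        if scores[i] >= threshold:
            return len(scores) - 1 - i
    return len(scores)


# Example use (leaders_IB_aln.fna is the mafft output from the step above):
#   with open("leaders_IB_aln.fna") as handle:
#       aln = read_fasta(handle)
#   print(unconserved_tail(column_conservation(list(aln.values()))))
```

The result is only a starting point for choosing a trim length; check it against the alignment itself.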

loading identified leader regions

$ CLdb_loadLeaders.pl -d CLdb.sqlite -t 50 test_leader_Ib.fna test_leader_Ib_aln.fna

'-t 50' = trim off the last 50bp of unconserved sequence in the alignment
- the 50bp are trimmed from the side farthest from the array
both the aligned and unaligned sequences are needed because mafft can alter sequence orientation during alignment (--adjustdirection)
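The orientation issue can be checked directly by comparing each aligned sequence, with gaps removed, against its unaligned original. The sketch below is a standalone illustration and not CLdb's own logic; the function names are hypothetical. It compares case-insensitively because mafft may change the letter case of nucleotide output.

```python
# Complement table for unambiguous bases; other characters (e.g. N) pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orientation(unaligned, aligned):
    """Classify an aligned sequence relative to its unaligned original.

    Returns 'forward' if the ungapped alignment row equals the original,
    'reversed' if it equals the reverse complement, and 'mismatch' otherwise.
    """
    ungapped = aligned.replace("-", "").upper()
    original = unaligned.upper()
    if ungapped == original:
        return "forward"
    if ungapped == reverse_complement(original):
        return "reversed"
    return "mismatch"


# Example use with made-up sequences:
#   orientation("AACG", "cg-tt")
```

In this example, the aligned row is the reverse complement of the original, which is the case that makes the unaligned file necessary.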

grouping leaders (100% sequence identity)

$ CLdb_groupLeaders.pl -da CLdb.sqlite
