This repository has been archived by the owner on Aug 10, 2022. It is now read-only.

Initial data loading

nyoungb2 edited this page Sep 24, 2013 · 13 revisions

Initial DB construction

Workflow

making the sqlite tables & loading loci table

spacers and direct repeats

  • loading arrays and direct repeats to their respective tables
  • grouping spacers and direct repeats (groups with same sequence)

making the database tables

$ CLdb_makeDB.pl

By default, the database file is named 'CLdb.sqlite'.

loading the loci table

The 'loci' table has all of the basic info on each CRISPR locus (for formatting, see Database setup).

$ CLdb_loadLoci.pl -d CLdb.sqlite < loci.txt

adding number of scaffolds to the loci table

This step is only needed if the loci table includes genomes that contain more than one sequence (multiple scaffolds, chromosomes, etc.).

$ CLdb_addScaffoldCount.pl -d CLdb.sqlite
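One quick way to tell whether a genome has more than one sequence is to count its FASTA records. The following is a minimal sketch, assuming the genome is available as a FASTA file; `count_seqs` and the file name in the usage note are hypothetical, not part of CLdb.

```shell
# Sketch: count sequence records (header lines) in a FASTA file.
# Assumes FASTA input; count_seqs is a hypothetical helper, not a CLdb script.
count_seqs() {
    grep -c '^>' "$1"
}
```

For example, `count_seqs my_genome.fasta` printing a value above 1 would suggest the scaffold-count step applies to that genome.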


Spacers and direct repeats

loading arrays and direct repeats to their respective tables

$ CLdb_loadArrays.pl -d CLdb.sqlite

grouping spacers and direct repeats (groups with same sequence)

$ CLdb_groupArrayElements.pl -d CLdb.sqlite -s -r 

pseudo-hierarchical clustering of spacers & DRs (good for plotting loci)

$ CLdb_hclusterArrays.pl -d CLdb.sqlite -s -r 

calculating direct repeat consensus sequences

$ CLdb_loadDRConsensus.pl -d CLdb.sqlite

pairwise blast of all spacers

$ CLdb_spacerPairwiseBlast.pl -d CLdb.sqlite
  • used for plotting & checking for partial overlap of spacers
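The steps up to this point need no manual editing, so they could be chained in one script. The following is a minimal sketch, not part of CLdb: the commands and flags are copied from this page, while `run`, `load_all`, `DRY_RUN`, `DB` and `LOCI` are hypothetical names. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: chain the non-interactive loading steps listed on this page.
# DB matches the default file name written by CLdb_makeDB.pl.
DB="CLdb.sqlite"
LOCI="loci.txt"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print only (default here), 0 = execute

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $1"
    if [ "$DRY_RUN" = "0" ]; then
        sh -c "$1" || { echo "step failed: $1" >&2; exit 1; }
    fi
}

load_all() {
    run "CLdb_makeDB.pl"
    run "CLdb_loadLoci.pl -d $DB < $LOCI"
    run "CLdb_addScaffoldCount.pl -d $DB"   # only needed for multi-sequence genomes
    run "CLdb_loadArrays.pl -d $DB"
    run "CLdb_groupArrayElements.pl -d $DB -s -r"
    run "CLdb_hclusterArrays.pl -d $DB -s -r"
    run "CLdb_loadDRConsensus.pl -d $DB"
    run "CLdb_spacerPairwiseBlast.pl -d $DB"
}

load_all
```

Setting `DRY_RUN=0` in the environment would execute each step and stop at the first one that fails.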

CRISPR-associated genes

getting genes in CRISPR locus region (defined in Loci table)

$ CLdb_getGenesInLoci.pl -d CLdb.sqlite > gene_table.txt
  • manually curate the 'gene_alias' column values
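Hand-editing a table can accidentally add or drop a field, so a column-count check before loading may be worthwhile. The following is a minimal sketch, assuming gene_table.txt is tab-delimited (an assumption, not confirmed on this page); `check_columns` is a hypothetical helper, not a CLdb script.

```shell
# Sketch: report rows whose field count differs from the first row.
# Assumes a tab-delimited file; check_columns is a hypothetical helper.
check_columns() {
    awk -F'\t' '
        NR == 1 { n = NF; next }
        NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
        END     { exit bad + 0 }   # non-zero exit status if any row was off
    ' "$1"
}
```

If `check_columns gene_table.txt` prints nothing and exits with status 0, every row has the same number of fields as the first.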

loading genes into the Genes table

$ CLdb_loadGenes.pl -d CLdb.sqlite < gene_table.txt

Leader region

getting potential leader regions

$ CLdb_getLeaderRegions.pl -d CLdb.sqlite > possible_leaders.fna

getting potential leader regions for just 1 subtype

$ CLdb_getLeaderRegions.pl -d CLdb.sqlite -q "AND subtype='I-B'" > leaders_IB.fna

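The '-q' value nests single quotes inside double quotes, which is easy to get wrong when repeating the command for several subtypes. The following is a minimal sketch that only prints one command per subtype. `leader_cmd` is a hypothetical helper, and the subtype list is a placeholder: only 'I-B' appears on this page.

```shell
# Sketch: print one CLdb_getLeaderRegions.pl command per subtype.
# leader_cmd is hypothetical; the command and flags are copied from this page.
leader_cmd() {
    subtype="$1"
    # 'I-B' -> 'IB' for the output file name, matching leaders_IB.fna
    tag=$(printf '%s' "$subtype" | sed 's/-//g')
    printf '%s\n' "CLdb_getLeaderRegions.pl -d CLdb.sqlite -q \"AND subtype='${subtype}'\" > leaders_${tag}.fna"
}

# Placeholder subtype list; replace with the subtypes in your loci table.
for s in I-A I-B I-C; do
    leader_cmd "$s"
done
```

The printed lines can be reviewed and then pasted into the shell, or piped to `sh` to run them.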
identifying leaders

$ mafft --adjustdirection leaders_IB.fna > leaders_IB_aln.fna
  • if 2 leaders written for a locus, remove the 1 that does not align

  • determine where leader conservation ends

    • for example: conservation ends 50bp from end of alignment

    • this will be trimmed off of the leader region when added to the database
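Where conservation ends is usually judged by eye in an alignment viewer, but a per-column score can help narrow it down. The following is a minimal sketch, not part of CLdb: `column_conservation` is a hypothetical helper that prints, for each alignment column, the fraction of sequences sharing the most common non-gap character.

```shell
# Sketch: per-column conservation of an aligned FASTA file.
# Output: column number, then fraction of sequences with the most common base.
column_conservation() {
    awk '
        /^>/ { if (seq != "") seqs[++n] = seq; seq = ""; next }
             { seq = seq toupper($0) }      # join wrapped sequence lines
        END {
            if (seq != "") seqs[++n] = seq
            len = length(seqs[1])
            for (i = 1; i <= len; i++) {
                split("", count)            # reset the per-column tallies
                best = 0
                for (j = 1; j <= n; j++) {
                    c = substr(seqs[j], i, 1)
                    if (c == "-" || c == "") continue   # gaps never count as conserved
                    count[c]++
                    if (count[c] > best) best = count[c]
                }
                printf "%d\t%.2f\n", i, best / n
            }
        }
    ' "$1"
}
```

Running `column_conservation leaders_IB_aln.fna` and looking for where the second column drops off gives a candidate trim length, counted from that column to the end of the alignment. The cutoff itself remains a judgment call.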

loading identified leader regions

$ CLdb_loadLeaders.pl -d CLdb.sqlite -t 50 test_leader_Ib.fna test_leader_Ib_aln.fna
  • '-t 50' = trim off the last 50bp of unconserved sequence in the alignment

    • 50bp trimmed from side farthest from the array
  • both the aligned and unaligned sequences are needed because mafft can alter orientation during alignment (--adjustdirection)
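To see which leaders mafft actually reoriented, the alignment headers can be inspected: with --adjustdirection, mafft prefixes the names of reverse-complemented sequences with "_R_". The following is a minimal sketch; `reversed_ids` is a hypothetical helper, not a CLdb script.

```shell
# Sketch: list sequences that mafft reverse-complemented.
# Relies on the "_R_" name prefix that mafft --adjustdirection adds.
reversed_ids() {
    grep '^>_R_' "$1" | sed 's/^>_R_//'
}
```

For example, `reversed_ids leaders_IB_aln.fna` prints the original names of any reoriented sequences, one per line, and nothing if none were flipped.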

grouping leaders (100% sequence identity)

$ CLdb_groupLeaders.pl -da CLdb.sqlite