
Database Setup

nyoungb2 edited this page Jan 11, 2014 · 21 revisions

Files required for initial loading of CRISPR arrays

1) Loci.txt table (tab-delimited)

The loci table designates specific metadata for each CRISPR array and/or CAS gene operon. This table is filled out after identifying CRISPRs with CRISPRFinder and optionally identifying the subtype.

This table can easily be made in Excel and then saved as a tab-delimited file (example).

columns needed

  • Locus_ID
  • Taxon_ID
  • Taxon_Name
  • Subtype*
  • Locus_Start
  • Locus_End
  • CAS_Start*
  • CAS_End*
  • Array_Start*
  • Array_End*
  • Array_status**
  • CAS_status***
  • Genbank_file
  • Array_File*
  • Author
  • File_Creation_Date*

* blank values allowed

** Possible values: "present", "absent"

*** Possible values: "intact", "absent", "broken", "shuffled"

"broken" = some genes missing "shuffled" = gene order

"Taxon_ID" = FIG_ID or any other unique identifier for the genome

"locus_ID" field must have unique locus identifiers (e.g. locus1, locus2, locus3, etc.)

optional columns

  • fasta_file
  • scaffold_name
  • leader_start
  • leader_end
  • leader_sequence
  • PAM_start
  • PAM_end

"fasta_file" only needed if genbank files do not have genome nuleotide sequence

"scaffold_name" specifies the scaffold that the locus is on. Without this field value, 'CLDB_ONE_CHROMOSOME' will be used.

"leader_*" fields are for specifying known leader regions for loci

"PAM_*" fields are for specifying known PAM regions for loci

2) Array.txt table files

The array files designate the spacers and direct repeats in each CRISPR array, as identified by CRISPRFinder. To make the tables, copy and paste the array tables from the CRISPRFinder output (example).

columns needed

  • Start position
  • Direct repeat sequence
  • Spacer sequence
  • End position

3) Genbank files for each genome containing the CRISPR loci loaded into CLdb

These are needed to identify the locations of CAS genes (determined by the 'operon_start' and 'operon_end' fields in the loci.txt table, which presumably correspond to the CAS_Start and CAS_End columns listed above).

4) Fasta files of each genome (only needed if genome sequence information is not included in the genbank files).

The genome sequence is needed to extract sequence information on the entire CRISPR array, leader regions, or spacer blast hits. If the fasta files are not provided, scripts that require them will try to make them automatically from the genbank files.
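
To decide which genomes need a separate fasta file, it can help to test whether a genbank file carries sequence data. The sketch below is not a CLdb script; `genbank_has_sequence` is a hypothetical name, and it relies only on the standard GenBank flat-file layout, where sequence lines follow an `ORIGIN` line.

```shell
# Hypothetical helper, not part of CLdb: succeed if a GenBank flat file
# contains at least one sequence line after an ORIGIN line.
# Usage: genbank_has_sequence genbank/genome1.gbk || echo "fasta file needed"
genbank_has_sequence() {
    awk '
        /^ORIGIN/ { in_origin = 1; next }   # sequence block starts here
        /^\/\//   { in_origin = 0 }         # record terminator
        in_origin && /^[ ]*[0-9]+[ ]+[A-Za-z]/ { found = 1; exit }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```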


Directory setup for a new CRISPR database

The CLdb directory name for this example will be 'CLdb_test' in your home directory. Use the following commands:

$ CLdb_HOME="$HOME/CLdb_test/"
$ mkdir "$CLdb_HOME"
$ cd "$CLdb_HOME"
$ mkdir genbank

  • move/copy/symlink genbank files into the $CLdb_HOME/genbank/ directory

$ mkdir array

  • move/copy/symlink array files into the $CLdb_HOME/array/ directory

(optional) $ mkdir fasta

  • (optional) move/copy/symlink genome fasta files into the $CLdb_HOME/fasta/ directory
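
The setup steps above can also be wrapped into two small shell functions. This is only a convenience sketch: the function names `setup_cldb_dirs` and `link_into` are hypothetical and are not CLdb commands.

```shell
# Hypothetical helpers, not part of CLdb.

# Create the database directory and its genbank/, array/, and fasta/ subdirectories.
# Usage: setup_cldb_dirs "$HOME/CLdb_test"
setup_cldb_dirs() {
    cldb_home=$1
    mkdir -p "$cldb_home/genbank" "$cldb_home/array" "$cldb_home/fasta"
}

# Symlink one or more files into a destination directory using absolute paths,
# so the links stay valid regardless of the current working directory.
# Usage: link_into "$HOME/CLdb_test/genbank" /path/to/*.gbk
link_into() {
    dest=$1
    shift
    for f in "$@"; do
        src_dir=$(cd "$(dirname "$f")" && pwd) || return 1
        ln -s "$src_dir/$(basename "$f")" "$dest/" || return 1
    done
}
```

Symlinks keep the original genome files in place; use `cp` or `mv` instead if the database directory should be self-contained.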

Warnings

  • If no scaffold names are provided in the loci.txt table, 'CLDB_ONE_CHROMOSOME' is used for the 'Scaffold' field.

  • Scaffold names in genbank LOCUS lines (e.g., 'LOCUS scaffold72_1_size14107-refined') and in genome fasta files must match the scaffold names in the loci.txt table!
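
One way to check the scaffold names before loading is to list the names each file actually contains and compare them with the loci.txt values. The helpers below are a sketch with hypothetical names, not CLdb scripts; they assume the scaffold name is the second field of each genbank `LOCUS` line and the first word after `>` in each fasta header.

```shell
# Hypothetical helpers, not part of CLdb.

# Print the scaffold name from every LOCUS line of a GenBank file.
genbank_scaffolds() {
    awk '/^LOCUS/ { print $2 }' "$1"
}

# Print the first word of every fasta header line.
fasta_scaffolds() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Succeed only if the scaffold name matches a LOCUS name exactly.
# Usage: scaffold_in_genbank NAME GENBANK_FILE
scaffold_in_genbank() {
    genbank_scaffolds "$2" | grep -Fxq -- "$1"
}
```

For example, `scaffold_in_genbank scaffold72_1_size14107-refined genbank/genome1.gbk || echo "scaffold not found"` reports a name that would fail to match.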
