Database Setup

Files required for initial loading of CRISPR arrays

1) Loci.txt table (tab-delimited)

The loci table designates specific metadata for each CRISPR array and/or CAS gene operon. This table is filled out after identifying CRISPRs with CRISPRFinder and optionally identifying the subtype.

This table can easily be made in Excel and then saved as a tab-delimited file (example).

columns needed

Locus_ID
Taxon_ID
Taxon_Name
Subtype*
Locus_Start
Locus_End
CAS_Start*
CAS_End*
Array_Start*
Array_End*
Array_status**
CAS_status***
Genbank_file
Array_File*
Author
File_Creation_Date*

* blank values allowed

** Possible values: "present", "absent"

*** Possible values: "intact", "absent", "broken", "shuffled"

"broken" = some genes are missing

"shuffled" = gene order is rearranged

"Taxon_ID" = FIG_ID or any other unique identifier for the genome

"Locus_ID" field must contain a unique identifier for each locus (e.g. locus1, locus2, locus3, etc.)
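Before loading, it can help to confirm that the header row of the loci table contains every required column listed above. The following is a minimal sketch, not part of CLdb: the function name check_loci_header is hypothetical, and matching column names without regard to case is an assumption made here for convenience.

```shell
# Report required loci-table columns that are missing from the header line.
# Usage: check_loci_header loci.txt
check_loci_header() {
    required="Locus_ID Taxon_ID Taxon_Name Subtype Locus_Start Locus_End CAS_Start CAS_End Array_Start Array_End Array_status CAS_status Genbank_file Array_File Author File_Creation_Date"
    # Take the first line, drop any Windows carriage return,
    # and put each tab-delimited column name on its own line.
    header=$(head -n 1 "$1" | tr -d '\r' | tr '\t' '\n')
    missing=0
    for col in $required; do
        # -x: whole-line match, -F: fixed string, -i: ignore case (assumption).
        if ! printf '%s\n' "$header" | grep -qixF "$col"; then
            echo "missing column: $col" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

The function returns a non-zero status and names each missing column on standard error, so it can be used as a guard before running any loading script.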

optional columns

fasta_file
scaffold_name
leader_start
leader_end
leader_sequence
PAM_start
PAM_end

"fasta_file" is only needed if the genbank files do not contain the genome nucleotide sequence

"scaffold_name" specifies the scaffold that the locus is on. Without this field value, 'CLDB_ONE_CHROMOSOME' will be used.

"leader_*" fields are for specifying known leader regions for loci

"PAM_*" fields are for specifying known PAM regions for loci

2) Array.txt table files

The array files designate the spacers and direct repeats in each CRISPR array, as identified by CRISPRFinder. To make the tables, copy and paste the array tables from the CRISPRFinder output (example).

columns needed

Start position
Direct repeat sequence
Spacer sequence
End position
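Because array tables are pasted by hand, a quick sanity check on the position columns can catch copy errors. The sketch below is not part of CLdb; the function name check_array_file is hypothetical, and it assumes the file is tab-delimited with no header row, the start position in the first field, and the end position in the last field.

```shell
# Flag rows of an array table whose first or last field is not an integer.
# Usage: check_array_file array/locus1.txt   (hypothetical file name)
check_array_file() {
    awk -F '\t' '
        { sub(/\r$/, "") }      # tolerate Windows line endings
        NF == 0 { next }        # skip blank lines
        $1 !~ /^[0-9]+$/ || $NF !~ /^[0-9]+$/ {
            printf "line %d: start or end position is not an integer\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

If your array files do carry a header row, the first line will be reported; in that case skip it by adding an `NR == 1 { next }` rule.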

3) Genbank files for each genome containing the CRISPR loci loaded into CLdb

These are needed to identify the locations of CAS genes (determined by the 'CAS_Start' and 'CAS_End' fields in the loci.txt table).

4) Fasta files of each genome (only needed if genome sequence information is not included in the genbank files).

The genome sequence is needed to extract sequence information on the entire CRISPR array, leader regions, or spacer blast hits. If the fasta files are not provided, scripts that require them will try to make them automatically from the genbank files.

Directory setup for a new CRISPR database

The CLdb directory name for this example will be 'CLdb_test' in your home directory. Use the following commands:

$ CLdb_HOME="$HOME/CLdb_test/"
$ mkdir $CLdb_HOME
$ cd $CLdb_HOME
$ mkdir genbank

move/copy/symlink genbank files into the $CLdb_HOME/genbank/ directory

$ mkdir array

move/copy/symlink array files into the $CLdb_HOME/array/ directory

(optional) $ mkdir fasta

(optional) move/copy/symlink genome fasta files into the $CLdb_HOME/fasta/ directory
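The directory commands above can also be wrapped in one small function so the layout can be recreated for other databases. This is only a sketch: the function name make_cldb_dirs and the "with-fasta" argument are hypothetical and not part of CLdb.

```shell
# Create the CLdb directory layout: genbank/ and array/, plus an optional fasta/.
# Usage: make_cldb_dirs "$HOME/CLdb_test" [with-fasta]
make_cldb_dirs() {
    # -p creates missing parent directories and does not fail if they already exist.
    mkdir -p "$1/genbank" "$1/array" || return 1
    # The fasta directory is only created when explicitly requested.
    if [ "${2:-}" = "with-fasta" ]; then
        mkdir -p "$1/fasta"
    fi
}
```

For the example in this section, the call would be make_cldb_dirs "$HOME/CLdb_test" with-fasta, after which the genbank, array, and fasta files are moved, copied, or symlinked into place as described.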

WARNINGs

If no scaffold names are provided in the loci.txt table, 'CLDB_ONE_CHROMOSOME' is used for the 'Scaffold' field.
Scaffold names in genbank LOCUS IDs (e.g. 'LOCUS scaffold72_1_size14107-refined') and genome fasta files should match the scaffold names in the loci.txt table!
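One way to catch a scaffold-name mismatch before loading is to compare the names in the loci table with the names on the genbank LOCUS lines. The sketch below is not part of CLdb: the three function names are hypothetical, the scaffold name is assumed to be the second whitespace-separated field of each LOCUS line, and scaffold names are assumed to contain no whitespace. Genome fasta headers are not checked here.

```shell
# Print the scaffold name from each LOCUS line of one or more genbank files.
genbank_scaffolds() {
    awk '/^LOCUS[ \t]/ { print $2 }' "$@" | sort -u
}

# Print the distinct non-blank values of the scaffold_name column of a loci table.
loci_scaffolds() {
    awk -F '\t' '
        { sub(/\r$/, "") }      # tolerate Windows line endings
        NR == 1 {               # locate the column by its header name
            for (i = 1; i <= NF; i++) if (tolower($i) == "scaffold_name") col = i
            next
        }
        col && $col != "" { print $col }
    ' "$1" | sort -u
}

# Report loci-table scaffold names that have no matching LOCUS line.
# Usage: check_scaffolds loci.txt genbank/*
check_scaffolds() {
    loci=$1; shift
    gb_names=$(genbank_scaffolds "$@")
    status=0
    for name in $(loci_scaffolds "$loci"); do
        if ! printf '%s\n' "$gb_names" | grep -qxF -- "$name"; then
            echo "no LOCUS line for scaffold: $name" >&2
            status=1
        fi
    done
    return "$status"
}
```

A loci table without a scaffold_name column passes this check, which is consistent with the 'CLDB_ONE_CHROMOSOME' default described above.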
