Database Setup

Files required for initial loading of CRISPR arrays

1) Loci.txt table (tab-delimited)

The loci table designates specific metadata for each CRISPR array and/or CAS gene operon. This table is filled out after identifying CRISPRs with CRISPRFinder and optionally identifying the subtype.

columns needed

Taxon_ID
Taxon_Name
Subtype*
Locus_Start
Locus_End
Operon_Start*
Operon_End*
Array_Start*
Array_End*
Array_status**
Operon_status***
Genbank_file
Array_File*
Fasta_File*
Author
File_Creation_Date*

* blank values allowed

** Possible values: "present", "absent"

*** Possible values: "intact", "absent", "broken", "shuffled"

"broken" = some genes are missing; "shuffled" = gene order is rearranged

This table can easily be made in Excel and then saved as a tab-delimited file (example).
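Before loading, it can help to check that the saved file really contains the columns and status values listed above. The following is a minimal sketch and not a CLdb script. It assumes that the first row of loci.txt is a header with the column names spelled exactly as in the list (without the asterisks), and it writes a made-up one-row table to a temporary directory so that it can be run as-is.

```shell
#!/usr/bin/env bash
# Sketch only: not part of CLdb. Assumes loci.txt has a header row whose
# names match the column list above exactly (case-sensitive).
set -eu

workdir=$(mktemp -d)
loci="$workdir/loci.txt"

# Hypothetical example table (tab-delimited); all values are made up.
printf 'Taxon_ID\tTaxon_Name\tSubtype\tLocus_Start\tLocus_End\tOperon_Start\tOperon_End\tArray_Start\tArray_End\tArray_status\tOperon_status\tGenbank_file\tArray_File\tFasta_File\tAuthor\tFile_Creation_Date\n' > "$loci"
printf 'taxon1\tExample taxon\t\t100\t5000\t100\t3000\t3100\t5000\tpresent\tintact\texample.gbk\tarray1.txt\t\tsomeone\t\n' >> "$loci"

required="Taxon_ID Taxon_Name Subtype Locus_Start Locus_End Operon_Start Operon_End Array_Start Array_End Array_status Operon_status Genbank_file Array_File Fasta_File Author File_Creation_Date"

# Collect every required column name that is absent from the header row.
header=$(head -n 1 "$loci")
missing=""
for col in $required; do
  if ! printf '%s\n' "$header" | tr '\t' '\n' | grep -qxF "$col"; then
    missing="$missing $col"
  fi
done

# Count data rows whose status fields fall outside the documented values.
bad_rows=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }
  $idx["Array_status"]  !~ /^(present|absent)$/ ||
  $idx["Operon_status"] !~ /^(intact|absent|broken|shuffled)$/ { n++ }
  END { print n + 0 }
' "$loci")

echo "missing columns:${missing:- none}"
echo "rows with invalid status values: $bad_rows"
```

To check a real table, point the `loci` variable at that file and drop the two `printf` lines that create the example.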

2) Array.txt table files

The array files designate the spacers and direct repeats in each CRISPR array, as identified by CRISPRFinder. To make the tables, copy and paste the array tables from the CRISPRFinder output (example).

columns needed

Start position
Direct repeat sequence
Spacer sequence
End position

3) Genbank files for each genome containing the CRISPR loci loaded into CLdb

These are needed to identify the locations of CAS genes (determined by the 'operon_start' and 'operon_end' fields in the loci.txt table).

4) Fasta files of each genome (only needed if genome sequence information is not included in the genbank files).

The genome sequence is needed to extract sequence information on the entire CRISPR array, leader regions, or spacer blast hits. If the fasta files are not provided, scripts that require them will try to make them automatically from the genbank files.

Directory setup

The directory name for this example: './CLdb/'

The example loci table: 'loci.txt'

$ mkdir CLdb
$ cd CLdb

I will refer to this directory as '$CLdb_home'

$ mkdir genbank

place/symlink genbank files in this directory

$ mkdir array

place/symlink array files in this directory

$ mkdir fasta

place/symlink genome fasta files in this directory (optional)
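The steps above can also be run in one go. This is a minimal sketch: the `downloads` folder and the three file names are hypothetical stand-ins, created in a temporary location so the commands run as written. With real data, set `CLdb_home` and `src` to your own paths and skip the `touch` line.

```shell
#!/usr/bin/env bash
# Sketch only: paths and file names below are hypothetical.
set -eu

base=$(mktemp -d)          # stand-in for the parent directory
CLdb_home="$base/CLdb"
src="$base/downloads"      # hypothetical folder holding the input files

# Create empty stand-in input files so the sketch runs as-is.
mkdir -p "$src"
touch "$src/example.gbk" "$src/array1.txt" "$src/example.fasta"

# Create $CLdb_home and its three subdirectories in one step.
mkdir -p "$CLdb_home/genbank" "$CLdb_home/array" "$CLdb_home/fasta"

# Symlink instead of copying; absolute targets stay valid from any directory.
ln -s "$src/example.gbk"   "$CLdb_home/genbank/"
ln -s "$src/array1.txt"    "$CLdb_home/array/"
ln -s "$src/example.fasta" "$CLdb_home/fasta/"

ls -R "$CLdb_home"
```

Symlinks with relative targets are resolved relative to the directory holding the link, which is why the sketch uses absolute paths.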

WARNINGs

If no scaffold names are provided in the loci.txt table, 'CLDB_ONE_CHROMOSOME' is used for the 'Scaffold' field.
Scaffold names in the genome fasta files should match the scaffold names in the loci.txt table.
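A mismatch of this kind can be spotted before loading by comparing the two sets of names. The sketch below is not a CLdb script and rests on two assumptions: that the loci table carries a column whose header is exactly 'Scaffold', and that a scaffold name in a fasta file is the text between '>' and the first whitespace. It uses a single made-up genome and loci table for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: not part of CLdb. Inputs below are made up.
set -eu

workdir=$(mktemp -d)

# Hypothetical genome fasta with two scaffolds.
printf '>scaffold_1 example description\nACGTACGT\n>scaffold_2\nGGCCGGCC\n' > "$workdir/genome.fasta"

# Hypothetical loci table, reduced to two columns for brevity.
printf 'Taxon_ID\tScaffold\n'    >  "$workdir/loci.txt"
printf 'taxon1\tscaffold_1\n'    >> "$workdir/loci.txt"
printf 'taxon1\tscaffold_9\n'    >> "$workdir/loci.txt"

# Scaffold names in the fasta: first token of each header line (assumption).
awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$workdir/genome.fasta" \
  | sort -u > "$workdir/fasta_names.txt"

# Non-empty values of the 'Scaffold' column in the loci table.
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Scaffold") col = i; next }
  col && $col != "" { print $col }
' "$workdir/loci.txt" | sort -u > "$workdir/loci_names.txt"

# comm -23 keeps lines that appear only in the first (sorted) file.
unmatched=$(comm -23 "$workdir/loci_names.txt" "$workdir/fasta_names.txt")
echo "scaffold names in loci.txt with no fasta match: ${unmatched:-none}"
```

Here 'scaffold_9' is reported because it appears in the loci table but not in the fasta file.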
