This repository has been archived by the owner on Aug 10, 2022. It is now read-only.

Database Setup

nyoungb2 edited this page Oct 1, 2013 · 21 revisions

Files required for initial loading of CRISPR arrays

1) Loci.txt table (tab-delimited)

The loci table designates specific metadata for each CRISPR array and/or CAS gene operon. This table is filled out after identifying CRISPRs with CRISPRFinder and optionally identifying the subtype.

columns needed

  • Taxon_ID
  • Taxon_Name
  • Subtype*
  • Locus_Start
  • Locus_End
  • Operon_Start*
  • Operon_End*
  • Array_Start*
  • Array_End*
  • Array_status**
  • Operon_status***
  • Genbank_file
  • Array_File*
  • Fasta_File*
  • Author
  • File_Creation_Date*

* blank values allowed

** Possible values: "present", "absent"

*** Possible values: "intact", "absent", "broken", "shuffled"

"broken" = some genes missing

"shuffled" = gene order changed

This table can easily be made in Excel and then saved as a tab-delimited file (example).
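Because the two status columns accept only the fixed values listed above, a quick check before loading can catch typos introduced in Excel. The following is a minimal sketch, not part of CLdb: `check_loci_status` is a hypothetical helper, and it assumes the first line of the file is a header row that uses the column names exactly as listed above.

```shell
# Hypothetical helper (not part of CLdb): validate the status columns of a loci table.
# Assumption: line 1 is a tab-delimited header row containing "Array_status" and "Operon_status".
check_loci_status() {
  awk -F'\t' '
    NR == 1 {
      # Locate the two status columns by header name.
      for (i = 1; i <= NF; i++) {
        if ($i == "Array_status")  a = i
        if ($i == "Operon_status") o = i
      }
      if (!a || !o) { print "missing status column in header"; bad = 1; exit }
      next
    }
    # Report any value outside the allowed sets.
    $a !~ /^(present|absent)$/ { print "line " NR ": bad Array_status: " $a; bad = 1 }
    $o !~ /^(intact|absent|broken|shuffled)$/ { print "line " NR ": bad Operon_status: " $o; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```

Calling `check_loci_status loci.txt` prints one message per offending line and returns a non-zero exit status if any value is not allowed.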

2) Array.txt table files

The array files designate the spacers and direct repeats in each CRISPR array, as identified by CRISPRFinder. To make the tables, just copy and paste the array tables from the CRISPRFinder output (example).

columns needed

  • Start position
  • Direct repeat sequence
  • Spacer sequence
  • End position
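Copying and pasting tables by hand can drop or merge columns, so a column count check may be worthwhile. The sketch below is not part of CLdb: `check_array_columns` is a hypothetical helper, and it assumes the array files are tab-delimited like the loci table, with exactly the four columns listed above on every non-empty line.

```shell
# Hypothetical helper (not part of CLdb): confirm that an array file has four columns per line.
# Assumption: the file is tab-delimited, with one row per repeat/spacer pair.
check_array_columns() {
  awk -F'\t' '
    NF == 0 { next }   # skip empty lines
    NF != 4 { print FILENAME ": line " NR " has " NF " fields, expected 4"; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```

Running `check_array_columns array/some_array.txt` (a placeholder file name) reports each line with the wrong number of fields and returns a non-zero exit status if any are found.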

3) Genbank files for each genome containing the CRISPR loci loaded into CLdb

These are needed to identify locations of CAS genes (determined by 'operon_start' and 'operon_end' fields in loci.txt table).

4) Fasta files of each genome (only needed if genome sequence information is not included in the genbank files).

The genome sequence is needed to extract sequence information on the entire CRISPR array, leader regions, or spacer blast hits. If the fasta files are not provided, scripts that require them will try to make them automatically from the genbank files.


Directory setup

The directory name for this example: './CLdb/'

The example loci table: 'loci.txt'

$ mkdir CLdb
$ cd CLdb

I will refer to this directory as '$CLdb_home'

$ mkdir genbank

  • place/symlink genbank files in this directory

$ mkdir array

  • place/symlink array files in this directory

$ mkdir fasta

  • place/symlink genome fasta files in this directory (optional)
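The steps above can also be collected into a small script. This is only a sketch under stated assumptions: `setup_cldb_dirs` and `link_into` are hypothetical helper names, not CLdb commands, and the source directories passed to `link_into` are wherever your genbank, array, or fasta files currently live.

```shell
# Hypothetical helper: create the CLdb home directory with its three subdirectories.
# The argument is the path to use as $CLdb_home (for example ./CLdb).
setup_cldb_dirs() {
  mkdir -p "$1/genbank" "$1/array" "$1/fasta"
}

# Hypothetical helper: symlink every regular file in a source directory into a destination directory.
link_into() {
  link_src=$(cd "$1" && pwd) || return 1   # absolute path, so the links resolve from anywhere
  link_dest=$2
  for f in "$link_src"/*; do
    [ -f "$f" ] || continue                # skip subdirectories and unmatched globs
    ln -sf "$f" "$link_dest/"
  done
}
```

For example, `setup_cldb_dirs ./CLdb` followed by `link_into /path/to/genbank_files ./CLdb/genbank` (the source path is a placeholder) reproduces the layout described above. Symlinking keeps one copy of each genome file on disk.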

Warnings

  • If no scaffold names are provided in the loci.txt table, 'CLDB_ONE_CHROMOSOME' is used for the 'Scaffold' field.

  • Scaffold names in the genome fasta files should match the scaffold names in the loci.txt table.
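To compare scaffold names by eye, it can help to list the sequence headers of a genome fasta file. The sketch below uses a hypothetical helper name, `list_fasta_scaffolds`, and prints each header line in full with the leading '>' removed. Whether CLdb matches on the full header or only on its first word is not stated here, so treat the output as a starting point for the comparison.

```shell
# Hypothetical helper (not part of CLdb): print every sequence header in a fasta file.
# The leading ">" is stripped; the rest of the header line is printed unchanged.
list_fasta_scaffolds() {
  sed -n 's/^>//p' "$1"
}
```

Running `list_fasta_scaffolds fasta/some_genome.fasta` (a placeholder file name) gives one name per line, which can be checked against the scaffold names used in loci.txt.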
