JASPAR UCSC tracks

This repository contains the data and code used to generate the JASPAR UCSC Genome Browser track data hub.
For more information visit the JASPAR website.

News

01/07/2018 To speed up TFBS predictions, we switched from MEME and the Perl TFBS package to PWMScan.

Content

  • The genomes folder contains scripts to download and process different genome assemblies
  • The profiles folder contains the output from the script get-profiles.py, which downloads the JASPAR CORE profiles for different taxa
  • The file environment.yml, within the conda folder, contains the conda environment used to generate the genomic tracks for JASPAR 2022 (see installation)
  • The script install-pwmscan.sh downloads and installs PWMScan and places its binaries in the bin folder
  • The script scan-sequence.py takes as its input the profiles folder and a nucleotide sequence in FASTA format
    (e.g. a genome), and outputs TFBS predictions
  • The script scans2bigBed creates a bigBed track file from TFBS predictions

The original scripts used for the publication of JASPAR 2018 have been placed in the folder version-1.0.

Dependencies

Note that running scan-sequence.py requires only the Python dependencies and PWMScan.

Installation

To install PWMScan, execute the script install-pwmscan.sh.

The remaining dependencies can be installed through the conda package manager:

conda env create -f ./conda/environment.yml

Availability

Genomic tracks and TFBS predictions for human and seven other model organisms, covering 11 genome assemblies, are available online:

Usage

To illustrate how the genomic tracks are generated, we provide an example for the baker's yeast genome:

  • Scan the genome sequence for TFBSs

./scan-sequence.py --fasta-file ./genomes/sacCer3/sacCer3.fa --profiles-dir ./profiles/ \
    --output-dir ./tracks/sacCer3/ --threads 4 --latest --taxon fungi

For this example, the scanning step should take no longer than a minute. For human and similarly sized genomes, this step usually finishes within a few hours, depending on the number of --threads specified.

  • Create the genomic track
./scans2bigBed -c ./genomes/sacCer3/sacCer3.fa.sizes -i ./tracks/sacCer3/ -o ./tracks/sacCer3.bb -t 4

TFBS predictions from the previous step are merged into a bigBed track file. Column five holds the p-values from PWMScan as scores, scaled to the range 0-1000, where 0 corresponds to p-value = 1 and 1000 to p-value ≤ 10⁻¹⁰. This allows prediction confidence to be compared across TFBSs. Again, for this example, this step should complete within a few minutes, while for larger genomes it can take a few hours.
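
The score scaling described above can be sketched in Python. This is only an illustration, not the code used by scans2bigBed: the text gives the two endpoints of the scale, and the sketch assumes the values in between are linear in -log10(p-value). The function name `pvalue_to_score` is hypothetical.

```python
import math


def pvalue_to_score(p_value: float) -> int:
    """Map a p-value in (0, 1] to an integer score in [0, 1000].

    Assumption (not stated in this README): the mapping is linear in
    -log10(p-value), so each order of magnitude adds 100 points.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    # -log10(p) is 0 at p = 1 and 10 at p = 1e-10; scale it to 0-1000.
    score = round(-100.0 * math.log10(p_value))
    # Clamp so that any p-value at or below 1e-10 gets the maximum score.
    return max(0, min(1000, score))
```

Under this assumption, a p-value of 1 maps to 0, and any p-value at or below 10⁻¹⁰ maps to 1000.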

Important note: disk space requirements for large genomes (i.e. danRer11, hg19, hg38, mm10, and mm39) are substantial. In these cases, we highly recommend allocating at least 1 TB of disk space.
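
Before starting a scan of a large genome, the free space on the output volume can be checked with a few lines of Python. This is a minimal sketch and not part of the repository: the helper name `has_enough_space` is hypothetical, and the threshold assumes the recommendation above means a decimal terabyte (10^12 bytes).

```python
import shutil

# Assumed threshold: 1 TB taken as 10**12 bytes.
REQUIRED_BYTES = 10**12


def has_enough_space(path: str, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    # shutil.disk_usage reports total, used and free bytes for the volume.
    return shutil.disk_usage(path).free >= required_bytes
```

For example, `has_enough_space("./tracks")` checks the volume that holds the output directory, which must already exist.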