JASPAR UCSC tracks

This repository contains the data and code used to generate the JASPAR UCSC Genome Browser track data hub.
For more information visit the JASPAR website.

News

01/07/2018 To speed up TFBS predictions, we switched from MEME and the Perl TFBS package to PWMScan.

Content

The genomes folder contains scripts to download and process different genome assemblies.
The profiles folder contains the output from the script get-profiles.py, which downloads the JASPAR CORE profiles for different taxa.
The file environment.yml, within the conda folder, contains the conda environment used to generate the genomic tracks for JASPAR 2022 (see Installation).
The script install-pwmscan.sh downloads and installs PWMScan and places its binaries in the bin folder.
The script scan-sequence.py takes as its input the profiles folder and a nucleotide sequence in FASTA format (e.g. a genome), and outputs TFBS predictions.
The script scans2bigBed creates a bigBed track file from TFBS predictions.

The original scripts used for the publication of JASPAR 2018 have been placed in the folder version-1.0.

Dependencies

Python 3.7 with the following libraries: Biopython (<1.74), NumPy, pyfaidx and tqdm
PWMScan
UCSC binaries for standalone command-line use

Note that running scan-sequence.py requires only the Python dependencies and PWMScan.
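Before scanning, it can help to confirm that the dependencies listed above are visible from the active environment. The following is a minimal sketch, not part of this repository: the helper names are hypothetical, and the import names (Bio for Biopython, numpy for NumPy) are the packages' standard ones. Executable names are left to the caller, since the exact binaries each script calls are not listed here.

```python
import importlib.util
import shutil

# Import names of the Python libraries listed under Dependencies.
# Biopython is imported as "Bio" and NumPy as "numpy".
PYTHON_MODULES = ["Bio", "numpy", "pyfaidx", "tqdm"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_executables(names):
    """Return the executable names that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(PYTHON_MODULES))
```

The same check works for command-line tools, for example `missing_executables(["bedToBigBed"])`. The assumption here is that bedToBigBed is among the UCSC binaries needed to build a bigBed file.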

Installation

To install PWMScan, execute the script install-pwmscan.sh.

The remaining dependencies can be installed through the conda package manager:

conda env create -f ./conda/environment.yml

Availability

Genomic tracks and TFBS predictions for human and seven other model organisms, covering 11 genome assemblies, are available online:

http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/

Usage

To illustrate how the genomic tracks are generated, we provide an example for the baker's yeast genome:

Download the genome sequence and chromosome sizes (automated in this script)
Scan the genome sequence using all fungi profiles from the JASPAR CORE

./scan-sequence.py --fasta-file ./genomes/sacCer3/sacCer3.fa --profiles-dir ./profiles/ \
    --output-dir ./tracks/sacCer3/ --threads 4 --latest --taxon fungi

For this example, the scanning step should take no longer than a minute. For human and other similar genomes, this step is usually finished within a few hours (the final amount of time will depend on the number of --threads specified).

Create the genomic track

./scans2bigBed -c ./genomes/sacCer3/sacCer3.fa.sizes -i ./tracks/sacCer3/ -o ./tracks/sacCer3.bb -t 4

TFBS predictions from the previous step are merged into a bigBed track file. In column five, we use as scores the p-values from PWMScan, scaled from 0 to 1000, where 0 corresponds to p-value = 1 and 1000 to p-value ≤ 10^-10. This allows for comparison of prediction confidence across TFBSs. For this example, this step should be completed within a few minutes, while for larger genomes it can take a few hours.
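The score scaling can be illustrated with a short sketch. It assumes the mapping is linear in -log10(p-value), which matches the two endpoints stated above (p-value = 1 gives 0, p-value ≤ 10^-10 gives 1000). The function name is hypothetical, and the actual computation in scans2bigBed may differ.

```python
import math

MAX_SCORE = 1000       # upper bound of the BED score column
P_VALUE_FLOOR = 1e-10  # p-values at or below this map to MAX_SCORE


def p_value_to_score(p_value: float) -> int:
    """Map a p-value in (0, 1] to an integer score between 0 and 1000.

    Assumption: the score is 100 * -log10(p-value), capped at 1000.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    clamped = max(p_value, P_VALUE_FLOOR)
    return min(MAX_SCORE, round(-math.log10(clamped) * 100))
```

Under this assumption, a p-value of 10^-5 maps to a score of 500, so a score difference of 100 corresponds to one order of magnitude in p-value.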

Important note: disk space requirements for large genomes (i.e. danRer11, hg19, hg38, mm10, and mm39) are substantial. In these cases, we highly recommend allocating at least 1 TB of disk space.
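When processing several assemblies, the two commands from the example can be assembled programmatically. The sketch below is an illustration rather than part of this repository: build_commands and run_pipeline are hypothetical helpers, and the paths assume the same folder layout as the sacCer3 example. Only the options shown in the commands above are used.

```python
import subprocess


def build_commands(assembly, taxon, threads=4):
    """Return the scan and bigBed commands for one assembly as argument lists."""
    scan = [
        "./scan-sequence.py",
        "--fasta-file", f"./genomes/{assembly}/{assembly}.fa",
        "--profiles-dir", "./profiles/",
        "--output-dir", f"./tracks/{assembly}/",
        "--threads", str(threads),
        "--latest",
        "--taxon", taxon,
    ]
    to_bigbed = [
        "./scans2bigBed",
        "-c", f"./genomes/{assembly}/{assembly}.fa.sizes",
        "-i", f"./tracks/{assembly}/",
        "-o", f"./tracks/{assembly}.bb",
        "-t", str(threads),
    ]
    return [scan, to_bigbed]


def run_pipeline(assembly, taxon, threads=4):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(assembly, taxon, threads):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("sacCer3", "fungi")` from the repository root would then reproduce the yeast example, provided the genome has already been downloaded.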
