An input JSON file must include all the information needed to run this pipeline: the absolute paths to all control and experimental FASTQ files, the paths to all genomic data files, and the parameters and metadata for the run. If a parameter is not specified in the input JSON file, its default value is used. We provide two template JSON files: minimum and full. We recommend using the minimum template rather than the full one. The full template includes all parameters of the pipeline with their default values defined.
IMPORTANT: ALWAYS USE ABSOLUTE PATHS.
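The absolute-path rule can be checked before submitting a run. The following is a minimal sketch, not part of the pipeline: the helper name `find_relative_paths` and the key fragments in `PATH_KEY_HINTS` are assumptions made for illustration, and URLs are skipped because they are not local paths.

```python
import json
import os

# Assumption: parameter names containing these fragments hold file paths.
PATH_KEY_HINTS = ("fastqs", "bams", "tas", "genome_tsv")


def find_relative_paths(input_json_text):
    """Return (parameter, value) pairs whose path values are not absolute."""
    params = json.loads(input_json_text)
    offenders = []
    for key, value in params.items():
        if not any(hint in key for hint in PATH_KEY_HINTS):
            continue
        # Treat scalars and arrays uniformly.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if not isinstance(item, str):
                continue  # skip null placeholders and non-string values
            if "://" in item:
                continue  # URLs are not local paths
            if not os.path.isabs(item):
                offenders.append((key, item))
    return offenders


example = """
{
    "chip.paired_end": false,
    "chip.bams": ["/data/rep1.bam", "rep2.bam"]
}
"""
for key, path in find_relative_paths(example):
    print(f"{key}: '{path}' is not an absolute path")
```

An empty result means every path-like value found by this heuristic is absolute; it does not check that the files exist.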
Mandatory parameters.
- Pipeline type
  - `chip.pipeline_type`: `tf` for TF ChIP-seq or `histone` for histone ChIP-seq. One major difference between the two types is that `tf` uses the `spp` peak caller with controls but `histone` uses the `macs2` peak caller without controls.
- Experiment title/description
  - `chip.title`: experiment title for the final HTML report.
  - `chip.description`: experiment description for the final HTML report.
- Read endedness
  - `chip.paired_end`: `true` if ALL replicates are paired-ended.
  - (Optional) `chip.paired_ends`: For samples with mixed read ends, you can define read endedness for each biological replicate (e.g. `[true, false]` means paired-ended biorep-1 and single-ended biorep-2).
  - `chip.ctl_paired_end`: `true` if ALL controls are paired-ended. If not defined, `chip.paired_end` will be used.
  - (Optional) `chip.ctl_paired_ends`: For controls with mixed read ends, you can define read endedness for each biological replicate (e.g. `[true, false]` means paired-ended biorep-1 and single-ended biorep-2). If not defined, `chip.paired_ends` will be used.
- Reference genome
  - `chip.genome_tsv`: Use `https://storage.googleapis.com/encode-pipeline-genome-data/genome_tsv/v1/[GENOME]_caper.tsv`.
  - Supported `GENOME`s are hg38, mm10, hg19 and mm9.
  - We provide a genome TSV file that defines all genome-specific parameters and reference data files. Caper will automatically download big reference data files from our ENCODE repository.
  - However, we also have reference data mirrors for some platforms (GCP, AWS, Sherlock, SCG, ...). On these platforms, you can use a different TSV file to avoid downloading such big reference data.
  - To build a new TSV file from your own FASTA (`.fa` and `.2bit`), see this.
- Input files and adapters
- Important parameters
  - `chip.always_use_pooled_ctl`: (For TF ChIP-seq only) Always use a pooled control to compare with each replicate. If a single control is given, then use it. It is disabled by default.
  - `chip.ctl_depth_ratio`: (For TF ChIP-seq only) If the ratio of depth between controls is higher than this, then always use a pooled control for all replicates. It's 1.2 by default.
-
- If your FASTQs/BAMs are big (>10GB), try higher resource settings, especially for memory (`chip.[TASK_NAME]_mem_mb`).
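The mandatory parameters above can be assembled programmatically. This is a hedged sketch only: the function `minimal_input` and its validation are hypothetical conveniences, while the URL pattern, the supported genomes and the parameter names are taken from the list above.

```python
import json

# URL pattern and supported genomes as listed in this section.
GENOME_TSV_URL = (
    "https://storage.googleapis.com/encode-pipeline-genome-data/"
    "genome_tsv/v1/{genome}_caper.tsv"
)
SUPPORTED_GENOMES = ("hg38", "mm10", "hg19", "mm9")
PIPELINE_TYPES = ("tf", "histone")


def minimal_input(pipeline_type, genome, title, description, paired_end):
    """Build a dict holding only the mandatory parameters (hypothetical helper)."""
    if pipeline_type not in PIPELINE_TYPES:
        raise ValueError(f"pipeline_type must be one of {PIPELINE_TYPES}")
    if genome not in SUPPORTED_GENOMES:
        raise ValueError(f"genome must be one of {SUPPORTED_GENOMES}")
    return {
        "chip.pipeline_type": pipeline_type,
        "chip.genome_tsv": GENOME_TSV_URL.format(genome=genome),
        "chip.title": title,
        "chip.description": description,
        "chip.paired_end": bool(paired_end),
    }


params = minimal_input("tf", "hg38", "Example title", "Example description", True)
print(json.dumps(params, indent=4))
```

Input data arrays (FASTQs, BAMs, ...) still have to be added to the resulting dict before it is a usable input JSON.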
Optional parameters.
- Useful parameters
  - `chip.subsample_reads`: Subsample experiment reads. This will affect all downstream analyses including peak-calling. It's 0 by default, which means no subsampling.
  - `chip.ctl_subsample_reads`: Subsample control reads. This will affect all downstream analyses including peak-calling. It's 0 by default, which means no subsampling.
  - `chip.fraglen`: Array of integers. Fragment length for each bio replicate. If you start from FASTQs, the pipeline automatically estimates it from the result of the cross-correlation analysis (task `xcor`), since such analysis requires a special treatment for FASTQs. It is possible that the fragment length is not estimated correctly (or the pipeline can fail due to a negative fraglen) if you start from different types (BAM/TAG-ALIGN). In such a case, you can manually define the fragment length for each bio rep (e.g. `[200, 150]` means 200 for rep1 and 150 for rep2).
- Flags
  - `chip.align_only`: Peak calling and its downstream analyses will be disabled. Useful if you just want to align your FASTQs into filtered BAMs/TAG-ALIGNs and don't want to call peaks on them.
  - `chip.true_rep_only`: Disable pseudo replicate generation and all related analyses.
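When fragment lengths are defined manually, a small check can catch a mismatched array before the run starts. This is a sketch under stated assumptions: `with_manual_fraglen` is a hypothetical helper, and the rule that every value must be a positive integer is inferred from the note about failures on negative fragment lengths.

```python
def with_manual_fraglen(params, fraglens, num_reps):
    """Return a copy of params with chip.fraglen set (hypothetical helper)."""
    if len(fraglens) != num_reps:
        raise ValueError("need exactly one fragment length per bio replicate")
    for value in fraglens:
        # Assumption: fragment lengths must be positive integers.
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"invalid fragment length: {value!r}")
    updated = dict(params)  # leave the caller's dict untouched
    updated["chip.fraglen"] = list(fraglens)
    return updated


base = {"chip.paired_end": False, "chip.bams": ["/data/rep1.bam", "/data/rep2.bam"]}
print(with_manual_fraglen(base, [200, 150], num_reps=2))
```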
IMPORTANT: Our pipeline considers a replicate (`rep`) as a biological replicate. You can still define technical replicates for each bio replicate. Tech replicates will be merged together to make a single FASTQ for each bio replicate. Controls can also have technical replicates.
IMPORTANT: Our pipeline supports up to 10 bio replicates and 10 controls.
IMPORTANT: Our pipeline has cross-validation analyses (IDR/overlap) comparing every pair of replicates. The number of tasks for such analyses grows like nC2 (n choose 2, i.e. n(n-1)/2 for n replicates). This number will be 45 for 10 bio replicates. It's recommended to keep the number of replicates <= 4.
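The pairwise growth can be made concrete with a few lines of Python. This only illustrates the nC2 arithmetic; the function name is made up for this example, and the real number of tasks per pair depends on the pipeline.

```python
from math import comb


def num_replicate_pairs(num_reps):
    """Number of unordered replicate pairs, i.e. nC2."""
    return comb(num_reps, 2)


for n in (2, 4, 10):
    print(f"{n} replicates -> {num_replicate_pairs(n)} pairs")
```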
Pipeline can start from any of the following data types (FASTQ, BAM, NODUP_BAM and TAG-ALIGN).
- Starting from FASTQs
  - Technical replicates for each bio-rep will be MERGED in the very early stage of the pipeline. Each read end (R1 and R2) has a separate array: `chip.fastqs_repX_R1` and `chip.fastqs_repX_R2`. Do not define the R2 array for single-ended replicates.
  - Example of 3 paired-ended biological replicates with 2 technical replicates for each bio rep. The two technical replicates `BIOREPX_TECHREP1.R1.fq.gz` and `BIOREPX_TECHREP2.R1.fq.gz` for each bio replicate will be merged.
    {
        "chip.paired_end" : true,
        "chip.fastqs_rep1_R1" : ["BIOREP1_TECHREP1.R1.fq.gz", "BIOREP1_TECHREP2.R1.fq.gz"],
        "chip.fastqs_rep1_R2" : ["BIOREP1_TECHREP1.R2.fq.gz", "BIOREP1_TECHREP2.R2.fq.gz"],
        "chip.fastqs_rep2_R1" : ["BIOREP2_TECHREP1.R1.fq.gz", "BIOREP2_TECHREP2.R1.fq.gz"],
        "chip.fastqs_rep2_R2" : ["BIOREP2_TECHREP1.R2.fq.gz", "BIOREP2_TECHREP2.R2.fq.gz"],
        "chip.fastqs_rep3_R1" : ["BIOREP3_TECHREP1.R1.fq.gz", "BIOREP3_TECHREP2.R1.fq.gz"],
        "chip.fastqs_rep3_R2" : ["BIOREP3_TECHREP1.R2.fq.gz", "BIOREP3_TECHREP2.R2.fq.gz"]
    }
- Starting from BAMs
  - Define a BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness.
  - Example of 3 single-ended replicates.
    { "chip.paired_end" : false, "chip.bams" : ["rep1.bam", "rep2.bam", "rep3.bam"] }
- Starting from filtered/deduped BAMs
  - Define a filtered/deduped BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness. These BAMs should not have unmapped reads or duplicates.
  - Example of 2 single-ended replicates.
    { "chip.paired_end" : false, "chip.nodup_bams" : ["rep1.nodup.bam", "rep2.nodup.bam"] }
- Starting from TAG-ALIGN BEDs
  - Define a TAG-ALIGN for each replicate. Our pipeline does not determine read endedness from a TAG-ALIGN file. You need to explicitly define read endedness.
  - Example of 4 paired-ended replicates.
    { "chip.paired_end" : true, "chip.tas" : ["rep1.tagAlign.gz", "rep2.tagAlign.gz", "rep3.tagAlign.gz", "rep4.tagAlign.gz"] }
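The per-replicate FASTQ arrays described above follow a regular naming scheme, so they can be generated from a nested list. The sketch below is illustrative only: `fastq_params` and its input layout (one list of `(R1, R2)` tuples per bio replicate, with `R2` set to `None` for single-ended data) are assumptions, not part of the pipeline.

```python
def fastq_params(bio_reps, prefix="chip.fastqs"):
    """Build fastqs_repX_R1/R2 arrays from per-replicate (R1, R2) tuples."""
    params = {}
    for rep_index, tech_reps in enumerate(bio_reps, start=1):
        # One array entry per technical replicate; the pipeline merges them.
        params[f"{prefix}_rep{rep_index}_R1"] = [r1 for r1, _ in tech_reps]
        r2_files = [r2 for _, r2 in tech_reps if r2 is not None]
        if r2_files:
            # R2 arrays are defined for paired-ended replicates only.
            params[f"{prefix}_rep{rep_index}_R2"] = r2_files
    return params


reps = [
    [("/data/BIOREP1_TECHREP1.R1.fq.gz", "/data/BIOREP1_TECHREP1.R2.fq.gz"),
     ("/data/BIOREP1_TECHREP2.R1.fq.gz", "/data/BIOREP1_TECHREP2.R2.fq.gz")],
    [("/data/BIOREP2_TECHREP1.R1.fq.gz", None)],  # single-ended replicate
]
for key, files in fastq_params(reps).items():
    print(key, files)
```

Because the second replicate is single-ended here, a real input would also need `chip.paired_ends` set per replicate, as described under read endedness.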
You need to define controls for the TF ChIP-seq pipeline. Skip this if you want to run the histone ChIP-seq pipeline. You can define controls similarly to experiment IP replicates: just add the `ctl_` prefix to parameter names (e.g. `chip.bams` becomes `chip.ctl_bams`).
- Control FASTQs
  - Technical replicates for each bio-rep will be MERGED in the very early stage of the pipeline. Each read end (R1 and R2) has a separate array: `chip.ctl_fastqs_repX_R1` and `chip.ctl_fastqs_repX_R2`. Do not define the R2 array for single-ended replicates.
  - Example of 2 paired-ended biological replicates with 2 technical replicates for each bio rep. The two technical replicates `BIOREPX_TECHREP1.R1.fq.gz` and `BIOREPX_TECHREP2.R1.fq.gz` for each bio replicate will be merged.
    {
        "chip.ctl_paired_end" : true,
        "chip.ctl_fastqs_rep1_R1" : ["BIOREP1_TECHREP1.R1.fq.gz", "BIOREP1_TECHREP2.R1.fq.gz"],
        "chip.ctl_fastqs_rep1_R2" : ["BIOREP1_TECHREP1.R2.fq.gz", "BIOREP1_TECHREP2.R2.fq.gz"],
        "chip.ctl_fastqs_rep2_R1" : ["BIOREP2_TECHREP1.R1.fq.gz", "BIOREP2_TECHREP2.R1.fq.gz"],
        "chip.ctl_fastqs_rep2_R2" : ["BIOREP2_TECHREP1.R2.fq.gz", "BIOREP2_TECHREP2.R2.fq.gz"]
    }
- Control BAMs
  - Define a BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness.
  - Example of 3 single-ended replicates.
    { "chip.ctl_paired_end" : false, "chip.ctl_bams" : ["ctl1.bam", "ctl2.bam", "ctl3.bam"] }
- Control filtered/deduped BAMs
  - Define a filtered/deduped BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness. These BAMs should not have unmapped reads or duplicates.
  - Example of 2 single-ended replicates.
    { "chip.ctl_paired_end" : false, "chip.ctl_nodup_bams" : ["ctl1.nodup.bam", "ctl2.nodup.bam"] }
- Control TAG-ALIGN BEDs
  - Define a TAG-ALIGN for each replicate. Our pipeline does not determine read endedness from a TAG-ALIGN file. You need to explicitly define read endedness.
  - Example of 4 paired-ended replicates.
    { "chip.ctl_paired_end" : true, "chip.ctl_tas" : ["ctl1.tagAlign.gz", "ctl2.tagAlign.gz", "ctl3.tagAlign.gz", "ctl4.tagAlign.gz"] }
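The `ctl_` naming rule used in the control examples can be expressed as a one-line key transformation. This is a small illustrative sketch; `to_control_key` is a hypothetical helper that inserts the prefix right after the `chip.` namespace, matching the parameter names shown above.

```python
def to_control_key(key):
    """Turn an experiment parameter name into its control counterpart."""
    namespace, dot, name = key.partition(".")
    if not dot:
        raise ValueError(f"expected a 'chip.'-style parameter name, got {key!r}")
    return f"{namespace}.ctl_{name}"


for key in ("chip.paired_end", "chip.fastqs_rep1_R1", "chip.bams",
            "chip.nodup_bams", "chip.tas"):
    print(key, "->", to_control_key(key))
```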
You can also mix different data types for individual bio replicates and controls. For example, the pipeline can start from FASTQs for rep1 (SE) and rep3 (PE), BAMs for rep2 (SE), NODUP_BAMs for rep4 (SE) and TAG-ALIGNs for rep5 (PE). This example has two controls (ctl1: SE BAM, ctl2: PE FASTQs).
{
"chip.paired_ends" : [false, false, true, false, true],
"chip.fastqs_rep1_R1" : ["rep1.fastq.gz"],
"chip.fastqs_rep3_R1" : ["rep3.R1.fastq.gz"],
"chip.fastqs_rep3_R2" : ["rep3.R2.fastq.gz"],
"chip.bams" : [null, "rep2.bam", null, null, null],
"chip.nodup_bams" : [null, null, null, "rep4.nodup.bam", null],
"chip.tas" : [null, null, null, null, "rep5.tagAlign.gz"],
"chip.ctl_paired_ends" : [false, true],
"chip.ctl_fastqs_rep2_R1" : ["ctl2.R1.fastq.gz"],
"chip.ctl_fastqs_rep2_R2" : ["ctl2.R2.fastq.gz"],
"chip.ctl_bams" : ["ctl1.bam", null]
}
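With mixed data types it is easy to misplace a `null`. The following sketch lists which data type each replicate starts from so the layout can be eyeballed before a run. It is not the pipeline's own logic: `sources_per_replicate` is a hypothetical helper, and the JSON is the mixed example from this section written as strictly valid JSON.

```python
import json

# Array-style parameters indexed by replicate position.
ARRAY_TYPES = ("bams", "nodup_bams", "tas")


def sources_per_replicate(params, num_reps, prefix="chip."):
    """Map each replicate number to the data types defined for it."""
    found = {}
    for rep in range(1, num_reps + 1):
        sources = []
        if params.get(f"{prefix}fastqs_rep{rep}_R1"):
            sources.append("fastqs")
        for name in ARRAY_TYPES:
            values = params.get(f"{prefix}{name}") or []
            # JSON null becomes None and marks "not defined for this replicate".
            if rep <= len(values) and values[rep - 1] is not None:
                sources.append(name)
        found[rep] = sources
    return found


mixed = json.loads("""
{
    "chip.paired_ends": [false, false, true, false, true],
    "chip.fastqs_rep1_R1": ["rep1.fastq.gz"],
    "chip.fastqs_rep3_R1": ["rep3.R1.fastq.gz"],
    "chip.fastqs_rep3_R2": ["rep3.R2.fastq.gz"],
    "chip.bams": [null, "rep2.bam", null, null, null],
    "chip.nodup_bams": [null, null, null, "rep4.nodup.bam", null],
    "chip.tas": [null, null, null, null, "rep5.tagAlign.gz"],
    "chip.ctl_paired_ends": [false, true],
    "chip.ctl_fastqs_rep2_R1": ["ctl2.R1.fastq.gz"],
    "chip.ctl_fastqs_rep2_R2": ["ctl2.R2.fastq.gz"],
    "chip.ctl_bams": ["ctl1.bam", null]
}
""")
print(sources_per_replicate(mixed, 5))
print(sources_per_replicate(mixed, 2, prefix="chip.ctl_"))
```

A replicate with an empty list has no input defined at all, which usually points to a misplaced `null` or a misnumbered `repX` key.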
WARNING: It is recommended not to change the following parameters unless you get resource-related errors for a certain task and want to increase resources for that task. The following parameters are provided for users who want to run our pipeline with Caper's `local` backend on HPCs.
Resources defined here are PER BIO REPLICATE. Therefore, the total number of cores will be approximately `chip.align_cpu` x `NUMBER_OF_BIO_REPLICATES`, because `align` is the bottleneck task of the pipeline. This total number of cores is useful ONLY when you use the `local` backend of Caper and manually `qsub` or `sbatch` your job. `disks` is used for Google Cloud and DNAnexus only.
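The core estimate above is a simple product. As a rough sketch (the function name is made up for this example, and the default of 4 is the `chip.align_cpu` default from the resource tables in this section):

```python
def approx_total_cores(num_bio_reps, align_cpu=4):
    """Approximate cores to request: chip.align_cpu x number of bio replicates."""
    if num_bio_reps < 1 or align_cpu < 1:
        raise ValueError("both values must be positive")
    return align_cpu * num_bio_reps


# Three bio replicates with the default chip.align_cpu.
print(approx_total_cores(3))
# Two bio replicates with chip.align_cpu raised to 8.
print(approx_total_cores(2, align_cpu=8))
```

This is only an approximation of peak usage, since it considers the `align` task alone.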
Parameter | Default
---|---
`chip.align_cpu` | 4
`chip.align_mem_mb` | 20000
`chip.align_time_hr` | 48
`chip.align_disks` | `local-disk 400 HDD`

Parameter | Default
---|---
`chip.filter_cpu` | 2
`chip.filter_mem_mb` | 20000
`chip.filter_time_hr` | 24
`chip.filter_disks` | `local-disk 400 HDD`

Parameter | Default
---|---
`chip.bam2ta_cpu` | 2
`chip.bam2ta_mem_mb` | 10000
`chip.bam2ta_time_hr` | 6
`chip.bam2ta_disks` | `local-disk 100 HDD`

Parameter | Default
---|---
`chip.spr_mem_mb` | 16000

Parameter | Default
---|---
`chip.jsd_cpu` | 2
`chip.jsd_mem_mb` | 12000
`chip.jsd_time_hr` | 6
`chip.jsd_disks` | `local-disk 200 HDD`

Parameter | Default
---|---
`chip.xcor_cpu` | 2
`chip.xcor_mem_mb` | 16000
`chip.xcor_time_hr` | 24
`chip.xcor_disks` | `local-disk 100 HDD`

Parameter | Default
---|---
`chip.call_peak_cpu` | 2
`chip.call_peak_mem_mb` | 16000
`chip.call_peak_time_hr` | 24
`chip.call_peak_disks` | `local-disk 200 HDD`

Parameter | Default
---|---
`chip.macs2_signal_track_mem_mb` | 16000
`chip.macs2_signal_track_time_hr` | 24
`chip.macs2_signal_track_disks` | `local-disk 400 HDD`
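If a task fails with a memory-related error, its `chip.[TASK_NAME]_mem_mb` value can be raised in the input JSON. The sketch below generates such overrides by scaling the defaults listed in the tables above; `scaled_memory_overrides` and the scaling approach are illustrative assumptions, not a recommendation from the pipeline authors.

```python
import json

# Default memory (MB) per task, copied from the tables in this section.
DEFAULT_MEM_MB = {
    "align": 20000,
    "filter": 20000,
    "bam2ta": 10000,
    "spr": 16000,
    "jsd": 12000,
    "xcor": 16000,
    "call_peak": 16000,
    "macs2_signal_track": 16000,
}


def scaled_memory_overrides(tasks, factor):
    """Return chip.[TASK_NAME]_mem_mb entries scaled from the defaults."""
    overrides = {}
    for task in tasks:
        if task not in DEFAULT_MEM_MB:
            raise KeyError(f"unknown task: {task}")
        overrides[f"chip.{task}_mem_mb"] = int(DEFAULT_MEM_MB[task] * factor)
    return overrides


# Raise memory by 50% for the two tasks that handle the largest files.
print(json.dumps(scaled_memory_overrides(["align", "filter"], 1.5), indent=4))
```

The resulting entries can be merged into an existing input JSON; all other resource parameters keep their defaults.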
IMPORTANT: If you see Java memory errors, check the following resource parameters.
There are special parameters to control the maximum Java heap memory (e.g. `java -Xmx4G`) for Picard tools. They are strings including size units. Such a string will be directly appended to Java's `-Xmx` parameter.
Parameter | Default
---|---
`chip.filter_picard_java_heap` | 4G
`chip.gc_bias_picard_java_heap` | 6G
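Because the heap string is appended directly to `-Xmx`, a malformed value only surfaces when Java starts. A minimal sketch of a pre-check follows; `xmx_flag` is a hypothetical helper, and the pattern (digits followed by a K, M or G unit) is an assumption that is deliberately stricter than what Java itself accepts.

```python
import re

# Assumption: a heap size is digits followed by one unit letter (K, M or G).
HEAP_PATTERN = re.compile(r"^\d+[KMG]$", re.IGNORECASE)


def xmx_flag(heap):
    """Build the Java flag that results from appending heap to -Xmx."""
    if not HEAP_PATTERN.match(heap):
        raise ValueError(f"heap size should look like '4G', got {heap!r}")
    return f"-Xmx{heap}"


print(xmx_flag("4G"))  # default of chip.filter_picard_java_heap
print(xmx_flag("6G"))  # default of chip.gc_bias_picard_java_heap
```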