You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
An input JSON file includes all genomic data files, parameters and metadata for running pipelines. Our pipeline will use default values if they are not defined in an input JSON file. We provide a set of template JSON files: minimum and full. We recommend to use a minimum template instead of full one. A full template includes all parameters of the pipeline with default values defined.
Please read through the following step-by-step instruction to compose a input JSON file.
IMPORTANT: ALWAYS USE ABSOLUTE PATHS.
Pipeline metadata
Parameter
Description
chip.title
Title for experiment which will be shown in a final HTML report
chip.description
Description for experiment which will be shown in a final HTML report
Pipeline parameters
Parameter
Default
Description
chip.pipeline_type
tf
tf for TF ChIP-seq or histone for Histone ChIP-seq.
chip.align_only
false
Peak calling and its downstream analyses will be disabled. Useful if you just want to map your FASTQs into filtered BAMs/TAG-ALIGNs and don't want to call peaks on them.
chip.true_rep_only
false
Disable pseudo replicate generation and all related analyses
Reference genome
All reference genome specific reference files/parameters can be defined in a single TSV file chip.genome_tsv. However, you can also individally define each file/parameter instead of a TSV file. If both a TSV file and individual parameters are defined, then individual parameters will override those defined in a TSV file. For example, if you define both chip.genome_tsv and chip.blacklist, then chip.blacklist will override that is defined in chip.genome_tsv. This is useful when you want to use your own for a specific parameter while keeping all the other parameters same as original.
Parameter
Type
Description
chip.genome_tsv
File
Choose one of the TSV files listed below or build your own
chip.genome_name
String
Name of genome (e.g. hg38, hg19, ...)
chip.ref_fa
File
Reference FASTA file
chip.ref_mito_fa
File
Mito-only reference FASTA file
chip.bowtie2_idx_tar
File
Bowtie2 index TAR file (uncompressed) built from FASTA file
chip.bowtie2_mito_idx_tar
File
Mito-only Bowtie2 index TAR file (uncompressed) built from FASTA file
chip.bwa_idx_tar
File
BWA index TAR file (uncompressed) built from FASTA file with bwa index
chip.bwa_mito_idx_tar
File
Mito-only BWA index TAR file (uncompressed) built from FASTA file with bwa index
2-col chromosome sizes file built from FASTA file with faidx
chip.blacklist
File
3-col BED file. Peaks overlapping these regions will be filtered out
chip.gensz
String
MACS2's genome sizes (hs for human, mm for mouse or sum of 2nd col in chrsz)
chip.mito_chr_name
String
Name of mitochondrial chromosome (e.g. chrM)
chip.regex_bfilt_peak_chr_name
String
Perl style reg-ex to keep peaks on selected chromosomes only matching with this pattern (default: chr[\dXY]+. This will keep chr1, chr2, ... chrX and chrY in .bfilt. peaks file. chrM is not included here)
Find a TSV file on the destination directory and use it for "chip.genome_tsv" in your input JSON.
Input genomic data
Choose endedness of your dataset first.
Parameter
Description
chip.paired_end
Boolean to define endedness for ALL IP replicates. This will override per-replicate definition in chip.paired_ends
chip.ctl_paired_end
Boolean to define endedness for ALL control replicates. This will override per-replicate definition in chip.ctl_paired_ends
chip.paired_ends
Array of Boolean to define endedness for each replicate
chip.ctl_paired_ends
Array of Boolean to define endedness for each control replicate
Define chip.paired_end and chip.ctl_paired_end if all replicates (or control replicates) in your dataset has the same endedness. You can also individually define endedness for each replicate and control replicate. For example, rep1, rep2 are PE and rep3 is SE. control rep1 is SE and control rep2 is PE.
Pipeline can start from any of the following data type (FASTQ, BAM, NODUP_BAM and TAG-ALIGN).
Parameter
Description
chip.fastqs_repX_R1
Array of R1 FASTQ files for replicate X. These files will be merged into one FASTQ file for rep X.
chip.fastqs_repX_R2
Array of R2 FASTQ files for replicate X. These files will be merged into one FASTQ file for rep X. Do not define for single ended dataset.
chip.bams
Array of BAM file for each replicate. (e.g. ["rep1.bam", "rep2.bam", ...])
chip.nodup_bams
Array of filtered/deduped BAM file for each replicate.
chip.tas
Array of TAG-ALIGN file for each replicate.
For controls:
Parameter
Description
chip.ctl_fastqs_repX_R1
Array of R1 FASTQ files for control replicate X. These files will be merged into one FASTQ file for rep X.
chip.ctl_fastqs_repX_R2
Array of R2 FASTQ files for control replicate X. These files will be merged into one FASTQ file for rep X. Do not define for single ended dataset.
chip.ctl_bams
Array of BAM file for each control replicate. (e.g. ["ctl_rep1.bam", "ctl_rep2.bam", ...])
chip.ctl_nodup_bams
Array of filtered/deduped BAM file for each control replicate.
chip.ctl_tas
Array of TAG-ALIGN file for each control replicate.
You can mix up different data types for individual replicate/control replicate. For example, pipeline can start from FASTQs for rep1 and rep3, BAMs for rep2, NODUP_BAMs for rep4 and TAG-ALIGNs for rep5. You can define similarly for control replicates.
Threshold for mapped reads quality (samtools view -q). If not defined, automatically determined according to aligner.
chip.dup_marker
picard
Choose a dup marker between picard and sambamba. picard is recommended, use sambamba only when picard fails.
chip.no_dup_removal
false
Skip dup removal in a BAM filtering stage.
Optional subsampling parameters
Parameter
Default
Description
chip.subsample_reads
0
Subsample reads (0: no subsampling). Subsampled reads will be used for all downsteam analyses including peak-calling
chip.ctl_subsample_reads
0
Subsample control reads.
chip.xcor_subsample_reads
15000000
Subsample reads for cross-corr. analysis only (0: no subsampling). Subsampled reads will be used for cross-corr. analysis only
Optional cross-correlation analysis parameters
Parameter
Default
Description
chip.xcor_pe_trim_bp
50
Trim R1 of paired ended fastqs for cross-correlation analysis only. Trimmed fastqs will not be used for any other analyses
chip.use_filt_pe_ta_for_xcor
false
Use filtered PE BAM/TAG-ALIGN for cross-correlation analysis ignoring the above trimmed R1 fastq
chip.xcor_exclusion_range_min
-500
Exclusion minimum for cross-corr. analysis. See description for -x=<min>:<max> for details. Make sure that it's consistent with default strand shift -s=-500:5:1500 in phantompeakqualtools.
chip.xcor_exclusion_range_max
Exclusion minimum for cross-corr. analysis. If not defined default value of max(read length + 10, 50) for TF and max(read_len + 10, 100) for histone are used.
Optional control parameters
Parameter
Default
Description
chip.always_use_pooled_ctl
false
Choosing an appropriate control for each replicate. Always use a pooled control to compare with each replicate. If a single control is given then use it.
chip.ctl_depth_ratio
1.2
If ratio of depth between controls is higher than this. then always use a pooled control for all replicates.
Optional peak-calling parameters
Parameter
Default
Description
chip.peak_caller
spp for tf type macs2 for histone type
spp or macs2. spp requires control macs2 can work without controls
chip.macs2_cap_num_peak
500000
Cap number of peaks called from a peak-caller (MACS2)
chip.pval_thresh
0.01
P-value threshold for MACS2 (macs2 callpeak -p)
chip.idr_thresh
0.05
Threshold for IDR (irreproducible discovery rate)
chip.spp_cap_num_peak
300000
Cap number of peaks called from a peak-caller (SPP)
Our pipeline automatically estimate fragment lengths (required for TF ChIP-Seq) from cross-correlation (task xcor) anaylses, but chip.fraglen will override those estimated ones. Use this if your pipeline fails due to invalid (negative) fragment length estimated from the cross-correlation analysis.
Parameter
Type
Description
chip.fraglen
Array[Int]
Fragment length for each replicate.
Other optional parameters
Parameter
Default
Description
chip.filter_chrs
[] (empty array of string)
Array of chromosome names to be filtered out from a final (filtered/nodup) BAM. No chromosomes are filtered out by default.
Resource parameters
WARNING: It is recommened not to change the following parameters unless you get resource-related errors for a certain task and you want to increase resources for such task. The following parameters are provided for users who want to run our pipeline with Caper's local on HPCs and 2).
Resources defined here are PER REPLICATE. Therefore, total number of cores will be MAX(chip.align_cpu x NUMBER_OF_REPLICATES, chip.call_peak_cpu x 2 x NUMBER_OF_REPLICATES) because align and call_peak (especially for spp) are bottlenecking tasks of the pipeline. Use this total number of cores if you manually qsub or sbatch your job (using local mode of Caper). disks is used for Google Cloud and DNAnexus only.
Parameter
Default
chip.align_cpu
4
chip.align_mem_mb
20000
chip.align_time_hr
48
chip.align_disks
local-disk 400 HDD
Parameter
Default
chip.filter_cpu
2
chip.filter_mem_mb
20000
chip.filter_time_hr
24
chip.filter_disks
local-disk 400 HDD
Parameter
Default
chip.bam2ta_cpu
2
chip.bam2ta_mem_mb
10000
chip.bam2ta_time_hr
6
chip.bam2ta_disks
local-disk 100 HDD
Parameter
Default
chip.spr_mem_mb
16000
Parameter
Default
chip.jsd_cpu
2
chip.jsd_mem_mb
12000
chip.jsd_time_hr
6
chip.jsd_disks
local-disk 200 HDD
Parameter
Default
chip.xcor_cpu
2
chip.xcor_mem_mb
16000
chip.xcor_time_hr
24
chip.xcor_disks
local-disk 100 HDD
Parameter
Default
chip.call_peak_cpu
2
chip.call_peak_mem_mb
16000
chip.call_peak_time_hr
24
chip.call_peak_disks
local-disk 200 HDD
Parameter
Default
chip.macs2_signal_track_mem_mb
16000
chip.macs2_signal_track_time_hr
24
chip.macs2_signal_track_disks
local-disk 400 HDD
IMPORTANT: If you see Java memory errors, check the following resource parameters.
There are special parameters to control maximum Java heap memory (e.g. java -Xmx4G) for Picard tools. They are strings including size units. Such string will be directly appended to Java's parameter -Xmx.
Parameter
Default
chip.filter_picard_java_heap
4G
chip.gc_bias_picard_java_heap
6G
How to use a custom aligner
ENCODE ChIP-Seq pipeline currently supports bwa and bowtie2. In order to use your own aligner you need to define the following parameters first. You can define custom_aligner_idx_tar either in your input JSON file or in your genome TSV file. Such index TAR file should be an uncompressed TAR file without any directory structured.
Parameter
Type
Description
chip.custom_aligner_idx_tar
File
Index TAR file (uncompressed) for your own aligner
chip.custom_align_py
File
Python script for your custom aligner
Here is a template for custom_align.py:
#!/usr/bin/env pythonimportosimportargparsedefparse_arguments():
parser=argparse.ArgumentParser(prog='ENCODE template aligner')
parser.add_argument('index_prefix_or_tar', type=str,
help='Path for prefix (or a tarball .tar) \ for reference aligner index. \ Tar ball must be packed without compression \ and directory by using command line \ "tar cvf [TAR] [TAR_PREFIX].*')
parser.add_argument('fastqs', nargs='+', type=str,
help='List of FASTQs (R1 and R2). \ FASTQs must be compressed with gzip (with .gz).')
parser.add_argument('--paired-end', action="store_true",
help='Paired-end FASTQs.')
parser.add_argument('--multimapping', default=4, type=int,
help='Multimapping reads')
parser.add_argument('--nth', type=int, default=1,
help='Number of threads to parallelize.')
parser.add_argument('--out-dir', default='', type=str,
help='Output directory.')
args=parser.parse_args()
# check if fastqs have correct dimensionifargs.paired_endandlen(args.fastqs)!=2:
raiseargparse.ArgumentTypeError('Need 2 fastqs for paired end.')
ifnotargs.paired_endandlen(args.fastqs)!=1:
raiseargparse.ArgumentTypeError('Need 1 fastq for single end.')
returnargsdefalign(fastq_R1, fastq_R2, ref_index_prefix, multimapping, nth, out_dir):
basename=os.path.basename(os.path.splitext(fastq_R1)[0])
prefix=os.path.join(out_dir, basename)
bam='{}.bam'.format(prefix)
# map your fastqs somehowos.system('touch {}'.format(bam))
returnbamdefmain():
# read paramsargs=parse_arguments()
# unpack index somehow on CWDos.system('tar xvf {}'.format(args.index_prefix_or_tar))
bam=align(args.fastqs[0],
args.fastqs[1] ifargs.paired_endelseNone,
args.index_prefix_or_tar,
args.multimapping,
args.nth,
args.out_dir)
if__name__=='__main__':
main()
IMPORTANT: Your custom python script should generate ONLY one *.bam file. For example, if there are two .bam files then pipeline will just pick the first one in an alphatical order.
How to use a custom peak caller
Parameter|Type|Default|Description
---------|-------|-----------
chip.peak_type | String | narrowPeak | Only ENCODE peak types are supported: narrowPeak, broadPeak and gappedPeakchip.custom_call_peak_py | File | | Python script for your custom peak caller
The file extension of your output peak file must be consitent with the peak_type you chose. For example, if you have chosen narrowPeak as peak_type then your output peak file should be *.narrowPeak.gz.
Here is a template for custom_call_peak.py:
#!/usr/bin/env pythonimportosimportargparsedefparse_arguments():
parser=argparse.ArgumentParser(prog='ENCODE template call_peak')
parser.add_argument('tas', type=str, nargs='+',
help='Path for TAGALIGN file (first) and control TAGALIGN file (second; optional).')
parser.add_argument('--fraglen', type=int, required=True,
help='Fragment length.')
parser.add_argument('--shift', type=int, default=0,
help='macs2 callpeak --shift.')
parser.add_argument('--chrsz', type=str,
help='2-col chromosome sizes file.')
parser.add_argument('--gensz', type=str,
help='Genome size (sum of entries in 2nd column of \ chr. sizes file, or hs for human, ms for mouse).')
parser.add_argument('--pval-thresh', default=0.01, type=float,
help='P-Value threshold.')
parser.add_argument('--cap-num-peak', default=500000, type=int,
help='Capping number of peaks by taking top N peaks.')
parser.add_argument('--out-dir', default='', type=str,
help='Output directory.')
args=parser.parse_args()
iflen(args.tas)==1:
args.tas.append('')
returnargsdefcall_peak(ta, ctl_ta, chrsz, gensz, pval_thresh, shift, fraglen, cap_num_peak, out_dir):
basename_ta=os.path.basename(os.path.splitext(ta)[0])
ifctl_ta:
basename_ctl_ta=os.path.basename(os.path.splitext(ctl_ta)[0])
basename_prefix='{}_x_{}'.format(basename_ta, basename_ctl_ta)
iflen(basename_prefix) >200: # UNIX cannot have len(filename) > 255basename_prefix='{}_x_control'.format(basename_ta)
else:
basename_prefix=basename_taprefix=os.path.join(out_dir, basename_prefix)
npeak='{}.narrowPeak.gz'.format(prefix)
os.system('touch {}'.format(npeak))
returnnpeakdefmain():
# read paramsargs=parse_arguments()
npeak=call_peak(
args.tas[0], args.tas[1], args.chrsz, args.gensz, args.pval_thresh,
args.shift, args.fraglen, args.cap_num_peak, args.out_dir)
if__name__=='__main__':
main()
IMPORTANT: Your custom python script should generate ONLY one *.*Peak.gz file. For example, if there are *.narrowPeak.gz and *.broadPeak.gz files pipeline will just pick the first one in an alphatical order.