## These are the installation instructions for the pinball pipeline.
## If you are running this on the pinball Virtual Machine, the
## installation should already be done for you, so you can skip ahead
## to the INSTALLATION READY section.
## For questions/comments, email [email protected]
## Requirements
## => a modern 64-bit Linux computer (or compute farm; tested on kernel 2.6.18+, Debian/Ubuntu)
## => a modern version of Perl (tested on 5.8+)
## => sqlite3 (tested on 3.3.8+) or a MySQL server (tested on 5.0+)
##
## Optional for compute farms:
##
## => a MySQL server (SQLite can only handle a handful of concurrent jobs)
## => a farm queueing system (eHive supports LSF, SGE* or the Amazon cloud; email me for details)
##
## * for SGE, use http://raw.github.com/avilella/ensembl-hive/master/modules/Bio/EnsEMBL/Hive/Meadow/SGE.pm
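## # Optional: quickly confirm the requirements above are met (a
## # sketch; the exact version strings vary by distribution):
## uname -m            # should print x86_64 on a 64-bit machine
## perl -v | head -2   # look for Perl 5.8 or newer
## sqlite3 --version   # look for SQLite 3.3.8 or newer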
##
## # NOTE: all instructions below were tested in a bash terminal
##
## # Extract the pinball package into your home directory:
## cd ~
## tar xzf pinball.tar.gz
## cd ~/pinball
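##
## # A quick listing should show the bundled components referenced in
## # these instructions (a sanity check):
## ls ~/pinball
## # expect to see, among others: bwa/ ensembl_main/ modules/ scripts/ sga/ timelimit/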
##
## The pinball package also contains other programs and scripts that
## are not integrated into the pipeline but are included for
## convenience, to help with development and with analysing your data
## in different ways.
########################################
# ENSEMBL-HIVE
# Although a simplified version of the Pinball pipeline is available
# in the pinball/scripts folder, the full version runs via
# ensembl-hive (eHive), which implements all the job-control
# functionality the pipeline needs. eHive is not bound to any
# particular 'farm' system and can run on a normal workstation
# or on a grid system.
########################################
# USING SQLITE
# If you don't have a MySQL server installed, you can run Pinball
# using ensembl-hive's SQLite3 support. To install SQLite and its
# Perl DBD module on your system, on a Debian-based machine:
sudo apt-get install build-essential sqlite3 libsqlite3-dev perl-doc
# Then install the DBD::SQLite Perl module into a local directory:
mkdir ~/pinball/sqlite
./cpanm -L ~/pinball/sqlite DBD::SQLite
# This will create the modules somewhere like $HOME/pinball/sqlite/lib/perl5/x86_64-linux-thread-multi
find ~/pinball/sqlite -name DBD.pm
# e.g.: /home/user/pinball/sqlite/lib/perl5/i686-linux-gnu-thread-multi/DBI/DBD.pm
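# If the bundled ./cpanm script is missing or out of date, a fresh
# standalone copy can be fetched (a sketch; cpanmin.us is the upstream
# App::cpanminus download URL):
curl -L https://cpanmin.us -o ~/pinball/cpanm
chmod +x ~/pinball/cpanm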
########################################
# SETTING UP PERL5LIB and PATH
# Point a variable at the directory containing the modules (the path up
# to where DBI/DBD.pm starts), adjusting it to match the output of the
# find command above:
PERL_SQLITE_MODULES_HERE=$HOME/pinball/sqlite/lib/perl5/i686-linux-gnu-thread-multi/
# Then export the necessary module directories in the PERL5LIB
# environment variable, as shown below:
echo PINBALLPERL5LIB && export PERL5LIB=$HOME/pinball/modules:$HOME/pinball/ensembl_main/ensembl/modules:$HOME/pinball/ensembl_main/ensembl-hive/modules:$HOME/pinball/ensembl_main/ensembl-compara/modules:$HOME/pinball/bioperl-1.2.3:$HOME/pinball/BioPerl-1.6.1:$PERL_SQLITE_MODULES_HERE
# Set up the PATH so that eHive's beekeeper.pl and runWorker.pl are found:
echo PINBALLPATH && export PATH=$PATH:$HOME/pinball/ensembl_main/ensembl-hive/scripts
# You should now be able to view the documentation for DBD::SQLite and
# the ensembl-hive modules, and run the beekeeper.pl script (press 'q'
# to quit the man page):
perldoc Bio::EnsEMBL::Hive::Worker
perldoc DBD::SQLite
beekeeper.pl
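# To avoid re-exporting these variables in every new shell, you can
# append them to your profile (a minimal sketch, assuming bash; this
# captures the values currently set above):
{
  echo "export PERL5LIB=$PERL5LIB"
  echo "export PATH=\$PATH:$HOME/pinball/ensembl_main/ensembl-hive/scripts"
} >> ~/.bashrc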
########################################
# CHECKING BINARIES
# Check that the bundled sga, bwa and timelimit binaries run correctly on your 64-bit computer:
# For sga, try:
export LD_LIBRARY_PATH=$HOME/pinball/sga/bamtools/lib
~/pinball/sga/src/sga
# To recompile sga, follow instructions here: https://raw.github.com/jts/sga/master/src/README
# For bwa, try:
~/pinball/bwa/bwa index
# To recompile 'bwa':
sudo apt-get install zlib1g-dev
cd ~/pinball/bwa/
make clean; make
cd ~/pinball
# For timelimit, try:
~/pinball/timelimit/timelimit
# To recompile 'timelimit':
cd ~/pinball/timelimit
make clean; make
cd ~/pinball
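# As a combined check, verify that all three bundled binaries are
# present and executable (a minimal sketch):
for bin in ~/pinball/sga/src/sga ~/pinball/bwa/bwa ~/pinball/timelimit/timelimit; do
  test -x "$bin" && echo "OK       $bin" || echo "PROBLEM  $bin"
done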
########################################
# INSTALLATION READY: the installation should be complete and ready to use from here onwards.
########################################
########################################
# EXAMPLE INPUT FILES:
# As example fastq files, you can download these small datasets, which
# will let you test the installation in just a few minutes:
# Create a separate directory where your data will reside:
mkdir -p ~/somewhere/with/disk/space
cd ~/somewhere/with/disk/space
# This is a 1% random read sampling of an example ENCODE dataset:
wget ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/pinball_example_data/example.encode.1pc.fastq.gz
# This is a 0.1% random read sampling of an example ENCODE dataset:
wget ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/pinball_example_data/example.encode.1pm.fastq.gz
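# To confirm the downloads are intact FASTQ, you can peek at the first
# record of each file (a quick sanity check; each record is 4 lines):
zcat example.encode.1pc.fastq.gz | head -4
zcat example.encode.1pm.fastq.gz | head -4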
cd ~/pinball
########################################
# EXAMPLE REFERENCE:
cd ~/somewhere/with/disk/space
# This is a small example reference fasta file containing only human chromosome 22
# (build v37, from the 1000 Genomes (g1k) website); it will let you test the
# searching/mapping part of the pipeline in just a few minutes:
wget ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/pinball_example_data/human_g1k_v37_chr22.fasta.gz
~/pinball/bwa/bwa index human_g1k_v37_chr22.fasta.gz
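# bwa index writes its index files next to the reference (typically
# .amb, .ann, .bwt, .pac and .sa); you can confirm they were created:
ls human_g1k_v37_chr22.fasta.gz.*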
cd ~/pinball
########################################
# INITIALIZE THE DATABASE:
# Give a name to this project using 'dbname':
dbname=my_example_encode
# For the datasets in this project, give them a tag name using 'tag':
tag=example_encode.1pc.1pm
# Run the initialization script like this (see below for an explanation of the different options):
perl ~/pinball/ensembl_main/ensembl-hive/scripts/init_pipeline.pl Pinball::PipeConfig::Pinball_conf -ensembl_cvs_root_dir $HOME/pinball/ensembl_main -pipeline_db -dbname=$dbname -hive_driver sqlite -password=foo -pipeline_name pinball -cluster_gigs 2 -cluster_threads 3 -cluster_queue foo -longest_n 1 -work_dir ~/somewhere/with/disk/space/$dbname -seq ~/somewhere/with/disk/space/example.encode.1pc.fastq.gz:~/somewhere/with/disk/space/example.encode.1pm.fastq.gz -tag $tag -reference ~/somewhere/with/disk/space/human_g1k_v37_chr22.fasta.gz
# For another dataset, it would look something like:
# dbname=my_whatever_name_I_want_to_give_to_the_project
# tag=my_datasets
# perl ~/pinball/ensembl_main/ensembl-hive/scripts/init_pipeline.pl Pinball::PipeConfig::Pinball_conf -ensembl_cvs_root_dir $HOME/pinball/ensembl_main -pipeline_db -dbname=$dbname -hive_driver sqlite -password=foo -pipeline_name pinball -cluster_gigs 32 -cluster_threads 22 -cluster_queue foo -work_dir /filesystem/somewhere/with/disk/space/$dbname -seq /colon/separated/list/of/fastq/or/fasta/files/compressed/or/not/file1.fa:/another/file2.fq:/and/another/file3.fq.gz -tag $tag -reference /closest/genome/reference/that/has/been/indexed/with/bwa/file.fa
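# Once init_pipeline.pl finishes, the SQLite database file (named after
# 'dbname') should exist in the current directory. A quick sanity check
# (a sketch, assuming the sqlite3 client is on your PATH):
sqlite3 $dbname ".tables"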
########################################
# START THE PINBALL PIPELINE
# Synchronize for the first time:
cd ~/pinball
beekeeper.pl -url sqlite:///my_example_encode -sync
# Loop through the process (here using local workstation and 2 cpus,
# LSF is default, use it if you have access to a compute farm with
# LSF queueing system):
cd ~/pinball
beekeeper.pl -url sqlite:///my_example_encode -loop -local -local_cpus 2
# This executes the beekeeper.pl script, which loops every two minutes
# (--sleep 2 is the default) and processes all the different analyses
# (preprocess, index, cluster, ...).
# You will see reports from beekeeper.pl showing each analysis going
# from READY to WORKING to DONE, and eventually a message like:
# 'The Beekeeper has stopped because there is nothing left to do'
# Use Ctrl+C to stop it at any point. You can run beekeeper.pl -sync
# at any point to synchronize the status of the pipeline:
beekeeper.pl -url sqlite:///my_example_encode -sync
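# Instead of looping, beekeeper.pl can also perform a single iteration
# per invocation via its -run mode (a standard eHive beekeeper option),
# which is handy for stepping through the pipeline manually; a sketch:
beekeeper.pl -url sqlite:///my_example_encode -run -local -local_cpus 2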
########################################
# PIPELINE INPUT
# The most common input for the Pinball pipeline is Illumina ChIP-seq
# data from narrow-peak datasets, such as transcription-factor binding
# or narrow-peak histone modifications like H3K4me3 or H3K27ac.
# The pipeline has also been shown to work for broad-peak ChIP-seq
# datasets but, depending on the level of sequence coverage, the
# reconstructed regions will often be split into multiple fragments.
# Pinball has also been shown to cluster data from Illumina small-RNA
# sequencing datasets. For example, it can find clusters of reads
# corresponding to miRNAs, although such very short regions won't be
# easily mappable to the reference genomes of distant species.
########################################
# PIPELINE RESULTS
# The pipeline produces two kinds of output files:
# -> Reconstructed regions for each clustered set of reads. Under your
#    work_dir there will be an 'output' directory, e.g.:
cd ~/somewhere/with/disk/space/my_example_encode/output
# containing a file named: example_encode.1pc.1pm.clusters.fa
# This file can be used, for example, for de novo motif discovery.
# -> BAM file with the read pileups for each clustered set of reads.
#    The shape of each reconstructed region is as given by the sga walk
#    procedure. For example:
#    output/example_encode.1pc.1pm.clusters.bam (and *.bam.bai)
# If you provided an Input/IgG control file alongside your sequences:
# -> BAM file with the alignment of the control reads against the
#    reconstructed regions. For example:
#    output/example_encode.1pc.1pm.control.bam (and *.bam.bai)
# The *.clusters.bam file and the *.control.bam file can be used in
# combination, for example, to correct for false peaks that are also
# present in the control reads.
# If you provided a reference against which to align the clusters:
# -> BAM files with the alignment of the reconstructed regions against
#    the reference sequences. If your dataset is from a species with no
#    assembled genome available, you can try using the reference genome
#    of the closest related species. For example:
#    output/example_encode.1pc.1pm.human_g1k_v37_chr22.fasta.1000.bam (and *.bam.bai)
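# To take a first look at the results, you can use the bundled samtools
# and standard tools (a sketch using the example dataset names above):
cd ~/somewhere/with/disk/space/my_example_encode/output
grep -c '>' example_encode.1pc.1pm.clusters.fa   # number of reconstructed regions
~/pinball/samtools/samtools view example_encode.1pc.1pm.clusters.bam | head   # first few alignments
cd ~/pinball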
########################################
# Pinball OPTIONS
-tag string Name of the dataset to be analysed, e.g.: GM12864_CTCF
-work_dir /file/path Path to be created where the data will be processed
and the results stored.
(10 times the size of your fastq files should be enough)
-reference /path/file.fasta Fasta file of the reference sequences to be used
-seq file1.fq:file2.fq List of fasta or fastq input files to be analysed.
In the '-seq' option, if several files are to be combined in a single
dataset, the files can be colon-separated (':') like this:
/file/path/file1.fq:/file/path/file2.fq
Compressed files can be used as input (they take less space and will
be copied elsewhere), e.g.:
/file/path/file1.fq.gz:/file/path/file2.fq.gz
The reads in the files can be 'labelled' using 'at' signs ('@label@')
so that their origin can be traced back after the clustering
analysis. This is useful when analysing datasets with multiple
replicates, or when analysing co-occurrence for combinations of
multiple TFs, e.g.:
@TF1_repl1@/file/path/file1.fq.gz:@TF1_repl2@/file/path/file2.fq.gz
@TF1@/file/path/file1.fq.gz:@TF2@/file/path/file2.fq.gz
########################################
# sga related options:
-minreadlen NUM sga preprocess: discard sequences that are shorter than NUM (default: 34)
-overlap NUM sga: minimum overlap required between two reads (default: 25)
-csize NUM sga cluster: minimum number of reads for each cluster (default: 10)
-minclustersize NUM minimum number of reads for each cluster to be parsed after clustering (default: 10)
-maxclustersize NUM maximum number of reads for each cluster to be parsed after clustering (default: 100000)
-dust FLOAT sga preprocess: filter out low-complexity reads that have a dust score higher
than FLOAT (default: 1.0)
-longest_n NUM only output the longest NUM of walks per cluster (default: 1)
-erate NUM sga: maximum error rate allowed to consider two sequences aligned (default: 0)
    e.g. 1: two reads of 101bp each with an offset of 1bp and one mismatch at
    position 50 (and 51) overlap over 100bp with one error, i.e. an error rate
    of 0.01, so erate=0.01 would consider the overlap significant.
    e.g. 2: two reads of 101bp each with an offset of 1bp and two mismatches at
    positions 33 and 66 (and 34/67) overlap over 100bp with two errors, i.e. an
    error rate of 0.02, so erate=0.005 would reject the overlap while
    erate=0.02 or higher would consider it significant.
-rmdupfiltering sga: use 'sga rmdup' instead of 'sga filter' in the indexing step.
    rmdup is a more stringent filter that removes duplicate reads, including
    non-identical reads that would still be considered the same under a
    non-zero error rate. If this option is not set, 'sga filter' is used (default: null)
-disk NUM sga index: batches of NUM reads to be used at a time (default: 1000000)
-phred64 1/0 sga preprocess: convert the input quality values from phred-64 to phred-33.
-sample FLOAT sga preprocess: Randomly sample reads or pairs with acceptance probability FLOAT.
########################################
# Other options:
-sga_executable /full/path/sga Path to alternative sga executable (default: $HOME/pinball/sga/src/sga)
-search_executable /f/path/bwa Path to alternative bwa executable (default: $HOME/pinball/bwa/bwa)
-samtools_executable /f/p/samtools Path to alternative samtools executable
(default: $HOME/pinball/samtools/samtools)
-timelimit_executable /p/timelimit Path to alternative timelimit executable
(default: $HOME/pinball/timelimit/timelimit)
########################################
# Resource management options:
# Workstation: if you are running pinball locally on a workstation,
# the resources are limited to the amount of memory and the number of
# cpus in the system. If you see memory errors during the indexing or
# clustering steps, your datasets are too big to be analysed on your
# hardware.
# Farm: if you are running the pipeline on a cpu farm with a queueing
# system, you can define how much memory and how many cpus to allocate
# for the clustering step of the pipeline using the following options:
-cluster_gigs N Number of GBs to allocate for clustering step
-cluster_threads N Number of CPU processor threads to be used by the clustering step
-cluster_queue string Name of the LSF queue to be used by the clustering step
-email [email protected] Email of the user of the pipeline in the cpu farm
-timelimit MIN If the 'timelimit' package is available,
    this option limits the number of minutes spent trying to calculate
    the walks for each cluster of reads. This can speed up the walk
    analysis when some of the clusters contain a large number of
    reads (default: 5).