CuttleFlow is a tool that rewrites Data Analysis Workflows (DAWs) according to their execution context (the input data and the computer or cluster the DAW is running on). It can replace tools with similar ones where appropriate, while maintaining the workflow structure and outputs. It can parallelize some tasks across multiple nodes of a cluster when the execution context allows it, and it will set the appropriate number of threads on the DAW tasks. To run our program on the rnasplice workflow, run the following command:

```
bash cuttleflow.sh --descriptionFolder test/rnasplice-description --output ../generated_daw_rnasplice_test
```
By changing the `--descriptionFolder` argument, you can load a different workflow, input, and infrastructure description. You can change the output folder where the DAW is generated by changing the path of the `--output` argument.
For more information on the rnasplice workflow, visit the nf-core website.
The following folder is generated from the above command:

```
generated_daw_rnasplice_test
|-- input.csv
|-- main.nf
|-- nextflow.config
```
- `main.nf` contains the DAW's Nextflow code and the paths to the modules.
- `nextflow.config` contains the parameters needed to run the DAW. By default, we include the generation of trace files containing metrics on the DAW execution.
- `input.csv` contains the paths and metadata of the input data required to run the DAW.
You can run the DAW with this command:

```
nextflow run main.nf -c nextflow.config
```

You need to edit the `INPUT.json` and `INFRA.json` files to be able to run the DAW on your own data and infrastructure.
For more information about Nextflow, visit their website.
The `docker` folder contains the `Dockerfile` to create a container to run CuttleFlow. The file `requirements.txt` contains a list of all Python packages needed to execute CuttleFlow. To create the Docker image, run

```
docker build -t cuttleflow ./docker
```

(Please note: the `requirements.txt` file needs to be in the same folder as the `Dockerfile`.)

Then, to start a container using this image, run

```
docker run -it cuttleflow
```

This command starts a command-line interface in a virtual Ubuntu environment. The CuttleFlow code can be found in the folder `/home/ubuntu/CuttleFlow`.
The description files consist of

```
description_folder
|-- DAW.json
|-- INFRA.json
|-- INPUT.json
|-- SPLIT_MERGE_TASKS.json
```
- `DAW.json` contains the DAW declaration, including the inputs and outputs of the various tasks, and the paths to the Nextflow modules to be used.
- `INFRA.json` contains the description of the machine or cluster you want to run your workflow on.
- `INPUT.json` contains information about your input data and its paths.
- `SPLIT_MERGE_TASKS.json` is a file that you will need to edit to add your own parallelization logic.
The `DAW.json` file contains all the DAW tasks in a list called `tasks`. Each task has multiple properties that need to be specified in the corresponding list entry:

- `name`: task name, only used internally during workflow rewriting, e.g. `"star_align"`
- `toolname`: name of the tool performing the task, e.g. `"STAR"`
- `operation`: operation performed by the task, e.g. `"align"`
- `inputs`: list of task inputs, e.g. `["TRIMGALORE.out_channel.preprocessed_reads", "STAR_GENOMEGENERATE.out_channel.index", "annotation_gtf"]`
- `outputs`: list of task outputs, e.g. `["sam"]`
- `parameters` (optional): can be used to specify task parameters
- `module_name`: name the module is imported with in the final Nextflow script, e.g. `"STAR_ALIGN"`
- `module_path`: path to the Nextflow task module, e.g. `"./modules/STAR_ALIGN.nf"`
- `include_from` (optional): if the `module_name` variable is different from the Nextflow process name, this variable should contain the name of the Nextflow process in the Nextflow file specified by `module_path`; this allows reusing Nextflow modules in the DAW by importing the same module multiple times with different module names
- `channel_operators` (optional): list of Nextflow channel operators like `".collect()"`, one for each task input; this list needs to have the same length as the `inputs` list
The corresponding entry in the `tasks` list should look like this:

```
{
    "name": "star_align",
    "toolname": "STAR",
    "operation": "align",
    "inputs": ["TRIMGALORE.out_channel.preprocessed_reads", "STAR_GENOMEGENERATE.out_channel.index", "annotation_gtf"],
    "outputs": ["sam"],
    "parameters": [""],
    "module_name": "STAR_ALIGN",
    "module_path": "./modules/STAR_ALIGN.nf",
    "channel_operators": [".collect()", "", ""]
}
```
The inputs of each task must be specified in the same order in the DAW description as in the corresponding Nextflow process. Task inputs that have not been previously processed by other tasks are specified by the same name as in the description in the `INPUT.json`. Task inputs that are outputs of previous tasks need to be specified in the following way: `{PARENT.module_name}.out_channel.{PARENT.output_name}`. For example, if one input is the output `preprocessed_reads` from the task `TRIMGALORE`, the specified input should be `TRIMGALORE.out_channel.preprocessed_reads`.
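These conventions lend themselves to a quick sanity check before generating a workflow. The following Python sketch is a hypothetical helper, not part of CuttleFlow: it verifies that every task input either matches a name declared in `INPUT.json` or follows the `{PARENT.module_name}.out_channel.{PARENT.output_name}` pattern, and that `channel_operators`, when present, has the same length as `inputs`.

```python
import re

# Pattern for inputs produced by a parent task:
# {PARENT.module_name}.out_channel.{PARENT.output_name}
PARENT_INPUT = re.compile(r"^\w+\.out_channel\.\w+$")

def check_task(task, known_input_names):
    """Return a list of problems found in one DAW.json task entry.

    `known_input_names` is the set of names declared in INPUT.json
    (samples, references, and parameters).
    """
    problems = []
    for inp in task["inputs"]:
        # An input must either come from INPUT.json or from a parent task.
        if inp not in known_input_names and not PARENT_INPUT.match(inp):
            problems.append(f"unrecognized input {inp!r}")
    ops = task.get("channel_operators")
    if ops is not None and len(ops) != len(task["inputs"]):
        problems.append("channel_operators and inputs differ in length")
    return problems
```

Running `check_task` over every entry of the `tasks` list catches misspelled parent references before Nextflow ever sees the generated script.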
The infrastructure description is provided in the file `INFRA.json`. This JSON file should contain information about the nodes of the infrastructure and, optionally, the infrastructure's bandwidth. The `nodes` variable in the JSON file should be a list of elements describing the name, RAM, cores, and CPU frequency of the individual nodes. For example, the infrastructure description of a system with one node that has 128 GB of RAM, 8 cores, and a CPU frequency of 2.4 GHz could look like this:

```
{
    "bandwidth": "X",
    "nodes": [
        {
            "name": "node",
            "RAM": "128G",
            "cores": "8",
            "CPU": "2400m"
        }
    ]
}
```
The input description consists of three lists: `samples`, `references`, and `parameters`. The entries of the `samples` list each contain the following information:

- `name`: sample name, e.g. `"sample01"`
- `type`: input sample type, usually `"reads"`
- `path_r1`: path to the sample, e.g. `"./samples/sample01.fq.gz"`
- `path_r2` (optional): path to the second sample file (if paired-end)
- `strand`: specifies the strandedness of the sample, e.g. `"forward"`
- `uncompressed_size`: uncompressed sample size in GB, e.g. `"16"`
- `condition` (optional): can be used to specify sample conditions like `isolated`
The corresponding entry in the `samples` list should look like this:

```
{
    "name": "sample01",
    "type": "reads",
    "path_r1": "./samples/sample01.fastq.gz",
    "strand": "forward",
    "uncompressed_size": "16"
}
```
The entries in the `references` list contain the following information:

- `name`: reference name, e.g. `"annotation_gtf"`
- `path`: path to the reference, e.g. `"./references/annotation.gtf"`
- `type`: reference type, usually one of `"genome"`, `"annotation"`, or `"transcriptome"`
- `uncompressed_size`: uncompressed reference size in GB, e.g. `"1.5"`
The corresponding entry in the `references` list should look like this:

```
{
    "name": "annotation_gtf",
    "path": "./references/annotation.gtf",
    "reference_type": "annotation",
    "uncompressed_size": "1.5"
}
```
The entries in the `parameters` list contain workflow parameters included in the Nextflow config file. For each parameter, simply add an entry like

```
{
    "name": "outdir",
    "value": "./results"
}
```

where `name` specifies the parameter name and `value` the corresponding value.
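Each such entry presumably ends up as an ordinary parameter assignment in the generated `nextflow.config`. Assuming standard Nextflow configuration syntax, the `outdir` example above would correspond to something like:

```
// nextflow.config (illustrative, not generated output verbatim)
params.outdir = "./results"
```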
The description of the split and merge tasks contains one entry per split-merge task pair that can be added to the workflow during rewriting by CuttleFlow. Each of these entries has a name, e.g. `"fastq_reads"`. Then, each of these entries needs the following variables:

- `operation`: the operation of the task(s) to be split, e.g. `align`
- `align_operation`: `"True"` if the operation is align-like, else `"False"`
- `reference_type`: type of the input that should be split, e.g. `"sample"`
- `split`: task description of the split task; it needs to comply with the same task description rules as the tasks in the DAW description
- `merge`: task description of the merge task; it needs to comply with the same task description rules as the tasks in the DAW description
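Putting these rules together, a purely illustrative `SPLIT_MERGE_TASKS.json` entry could look like the following. The exact nesting, tool names, task names, and module paths here are assumptions for the sake of the example, not files shipped with CuttleFlow:

```
{
    "fastq_reads": {
        "operation": "align",
        "align_operation": "True",
        "reference_type": "sample",
        "split": {
            "name": "split_fastq",
            "toolname": "seqkit",
            "operation": "split",
            "inputs": ["reads"],
            "outputs": ["split_reads"],
            "module_name": "SPLIT_FASTQ",
            "module_path": "./modules/SPLIT_FASTQ.nf"
        },
        "merge": {
            "name": "merge_sam",
            "toolname": "samtools",
            "operation": "merge",
            "inputs": ["SPLIT_FASTQ.out_channel.split_reads"],
            "outputs": ["sam"],
            "module_name": "MERGE_SAM",
            "module_path": "./modules/MERGE_SAM.nf"
        }
    }
}
```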
If you want to add annotations for some bio-tools so that they can be replaced or split by CuttleFlow, you should edit the following files:

- `source/annotation_files/infrastructure_runtime/{CLUSTER}.json`: add a JSON description of the infrastructure you ran your tool on. This JSON should have the same format as the infrastructure description used for rewriting.
- `source/annotation_files/{TOOLNAME}.json`: add the description of your new tool here. Your tool description should contain the following information:
  - `toolname`: name of the tool, e.g. `"Hisat2"`
  - `operation`: operation performed by the tool, e.g. `"align"`
  - `domain_specific_features`: specific features of the tool, e.g. `"splice_junctions"`
  - `is_splittable`: `"True"` if the tool can be parallelized, else `"False"`
  - `mandatory_input_list`: list of inputs that are required by the tool, e.g. `["reads", "ref_genome"]`
  - `output_list`: list of outputs produced by the tool, e.g. `["sam", "log"]`
  - `module_path`: path to the corresponding Nextflow module, e.g. `"./modules/hisat2.nf"`
  - `module_name`: name of the corresponding Nextflow process, e.g. `"HISAT2_ALIGN"`
Additionally, the RAM requirements of alignment tools for different species can be specified. This information is entered in a list called `resource_requirements_RAM`, with each entry containing the following information:

- `organism_name`: name of the reference species, e.g. `"drosophila"`
- `reference_size`: size of the reference genome, e.g. `"0.137G"`
- `RAM`: RAM required by the tool to align against the described reference genome, e.g. `"1G"`
In total, the exemplary annotation of the tool `Hisat2` should look like this:

```
{
    "toolname": "Hisat2",
    "operation": "align",
    "domain_specific_features": "splice_junctions",
    "is_splittable": "True",
    "mandatory_input_list": ["reads", "ref_genome"],
    "output_list": ["sam", "log"],
    "module_path": "./modules/hisat2.nf",
    "module_name": "HISAT2_ALIGN",
    "resource_requirements_RAM": [
        {
            "organism_name": "drosophila",
            "reference_size": "0.137G",
            "RAM": "1G"
        }
    ]
}
```
- `source/annotation_files/runtime_aligners_with_CPU_RAM.csv`: add the runtime of your tool, as well as information about the executed alignment on your cluster, here. Each line in the `csv` file contains the following information:
  - `infrastructure`: name of the infrastructure the tool was run on, as specified in the JSON description of the infrastructure
  - `dataset_size`: size of the fastq file that was aligned, in GB
  - `RAM`: (average) RAM of the cluster nodes
  - `CPUMHz`: (average) CPU frequency of the cluster nodes in MHz
  - `ref_genome_size`: size of the reference genome in GB
  - `reference_species`: name of the reference species

In addition, the file contains one column per alignment tool. If you ran your experiments with a different tool, or with only a subset of the tools for which columns exist, add `NaN` values for every experiment you do not have runtime data for.
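A row in this CSV could then look like the following. The header assumes two tool columns (`Hisat2` and `STAR`), and all concrete values are illustrative, not measured data:

```
infrastructure,dataset_size,RAM,CPUMHz,ref_genome_size,reference_species,Hisat2,STAR
node,16,128,2400,0.137,drosophila,842.0,NaN
```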