
CuttleFlow

CuttleFlow is a tool that rewrites Data Analysis Workflows (DAWs) according to their execution context, i.e. the input data and the computer or cluster the DAW runs on. It can replace tools with similar ones where appropriate while maintaining the workflow structure and outputs, parallelize suitable tasks across multiple nodes of a cluster, and set an appropriate number of threads for the DAW tasks. To run CuttleFlow on the rnasplice workflow, run the following command:

bash cuttleflow.sh --descriptionFolder test/rnasplice-description --output ../generated_daw_rnasplice_test

By changing the --descriptionFolder argument, you can load a different workflow, input, and infrastructure description. You can change the folder where the DAW is generated by changing the path of the --output argument.
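For example, assuming you have created your own description folder at ./my-description (a hypothetical path), the invocation could look like this:

bash cuttleflow.sh --descriptionFolder ./my-description --output ../generated_daw_my_workflow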

For more information on the rnasplice workflow, visit the nf-core website.

Output

The following folder is generated from the above command:

generated_daw_rnasplice_test
|-- input.csv
|-- main.nf
|-- nextflow.config

main.nf contains the DAW's Nextflow code and the paths to its modules. nextflow.config contains the parameters needed to run the DAW. By default, we include the generation of trace files containing metrics on the DAW execution. input.csv contains the paths and metadata of the input data required to run the DAW.
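Since trace generation is enabled by default, the generated nextflow.config presumably contains a standard Nextflow trace scope along the lines of the following sketch (the exact generated content may differ):

trace {
    enabled = true
    file = 'trace.txt'
}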

Executing the output DAW

You can run the DAW with this command: nextflow run main.nf -c nextflow.config.

To run the DAW on your own data and infrastructure, you need to edit the INPUT.json and INFRA.json files accordingly.

For more information about Nextflow, visit their website.

Containerization using Docker

The docker folder contains the Dockerfile used to build a container image for running CuttleFlow. The file requirements.txt lists all Python packages needed to execute CuttleFlow. To create the Docker image, run

docker build -t cuttleflow ./docker

(Please note: the requirements.txt file needs to be in the same folder as the Dockerfile.) Then, to start a container using this image, run

docker run -it cuttleflow

This command starts a command line interface in a virtual Ubuntu environment. The CuttleFlow code can be found in the folder /home/ubuntu/CuttleFlow.
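From inside the container, CuttleFlow can then be run as described above, e.g.:

cd /home/ubuntu/CuttleFlow
bash cuttleflow.sh --descriptionFolder test/rnasplice-description --output ../generated_daw_rnasplice_test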

Creating your own description files

A description folder consists of the following files:

description_folder
|-- DAW.json
|-- INFRA.json
|-- INPUT.json
|-- SPLIT_MERGE_TASKS.json
  • 'DAW.json' contains the DAW declaration, including the inputs and outputs of the various tasks, and the path to the Nextflow modules to be used.
  • 'INFRA.json' contains the description of the machine or cluster you want to run your workflow on.
  • 'INPUT.json' contains information about your input data and its paths.
  • 'SPLIT_MERGE_TASKS.json' is the file you need to edit to add your own parallelization logic.

DAW description

The file DAW.json contains all DAW tasks in a list called tasks. Each task has multiple properties that need to be specified in the corresponding list entry:

  • name: task name, only used internally during workflow rewriting, e.g. "star_align"
  • toolname: name of the tool performing the task, e.g. "STAR"
  • operation: operation performed by the task, e.g. "align"
  • inputs: list of task inputs, e.g. ["TRIMGALORE.out_channel.preprocessed_reads", "STAR_GENOMEGENERATE.out_channel.index", "annotation_gtf"]
  • outputs: list of task outputs, e.g. ["sam"]
  • parameters (optional): can be used to specify task parameters
  • module_name: name the module is imported with in the final Nextflow script, e.g. "STAR_ALIGN"
  • module_path: path to the Nextflow task module, e.g. "./modules/STAR_ALIGN.nf"
  • include_from (optional): if the module_name differs from the Nextflow process name, this variable should contain the name of the Nextflow process in the file specified by module_path. This allows reusing Nextflow modules in the DAW by importing the same module multiple times under different module names.
  • channel_operators (optional): list of Nextflow channel operators like ".collect()", one per task input; the list must have the same length as the inputs list

The corresponding entry in the tasks list should look like this:

{
    "name": "star_align",
    "toolname": "STAR",
    "operation": "align",
    "inputs": ["TRIMGALORE.out_channel.preprocessed_reads", "STAR_GENOMEGENERATE.out_channel.index", "annotation_gtf"],
    "outputs": ["sam"],
    "parameters": [""],
    "module_name": "STAR_ALIGN",
    "module_path": "./modules/STAR_ALIGN.nf",
    "channel_operators": [".collect()", "", ""]
}

The inputs of each task must be specified in the same order in the DAW description as in the corresponding Nextflow process. Task inputs that have not been produced by a previous task are referenced by the same name as in the INPUT.json description. Task inputs that are outputs of a previous task need to be specified as {PARENT.module_name}.out_channel.{PARENT.output_name}. For example, if one input is the output preprocessed_reads of the task TRIMGALORE, the specified input should be TRIMGALORE.out_channel.preprocessed_reads.

Infrastructure description

The infrastructure description is provided in the file INFRA.json. This JSON should contain information about the nodes of the infrastructure and, optionally, the infrastructure's bandwidth. The nodes variable should be a list of elements describing the name, RAM, number of cores, and CPU frequency of the individual nodes. For example, the infrastructure description of a system with one node that has 128 GB of RAM, 8 cores, and a CPU frequency of 2.4 GHz could look like this:

{
    "bandwidth": "X",
    "nodes": [
        {
            "name": "node",
            "RAM": "128G",
            "cores": "8",
            "CPU": "2400m"
        }
    ]
}
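For a cluster, the nodes list simply contains one entry per node. A minimal sketch, assuming two identical nodes (the node names and the bandwidth placeholder "X" are illustrative):

{
    "bandwidth": "X",
    "nodes": [
        {
            "name": "node01",
            "RAM": "128G",
            "cores": "8",
            "CPU": "2400m"
        },
        {
            "name": "node02",
            "RAM": "128G",
            "cores": "8",
            "CPU": "2400m"
        }
    ]
}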

Input description

The input description consists of three lists: samples, references and parameters. The samples list entries each contain the following information:

  • name: sample name, e.g. "sample01"
  • type: input sample type, usually "reads"
  • path_r1: path to the sample, e.g. "./samples/sample01.fastq.gz"
  • path_r2 (optional): path to the second sample (if paired end)
  • strand: specifies strandedness of the sample, e.g. "forward"
  • uncompressed_size: uncompressed sample size in GB, e.g. "16"
  • condition (optional): can be used to specify sample conditions like isolated

The corresponding entry in the samples list should look like this:

{
        "name": "sample01",
        "type": "reads",
        "path_r1": "./samples/sample01.fastq.gz",
        "strand": "forward",
        "uncompressed_size": "16"
}
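For a paired-end sample, path_r2 is added as well; a sketch (the sample name and file paths are illustrative):

{
        "name": "sample02",
        "type": "reads",
        "path_r1": "./samples/sample02_R1.fastq.gz",
        "path_r2": "./samples/sample02_R2.fastq.gz",
        "strand": "forward",
        "uncompressed_size": "16"
}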

The entries in the references list contain the following information:

  • name: reference name, e.g. "annotation_gtf"
  • path: path to reference, e.g. "./references/annotation.gtf"
  • reference_type: reference type, usually one of "genome", "annotation" or "transcriptome"
  • uncompressed_size: uncompressed reference size in GB, e.g. "1.5"

The corresponding entry in the references list should look like this:

{
       "name": "annotation_gtf",
       "path": "./references/annotation.gtf",
       "reference_type": "annotation",
       "uncompressed_size": "1.5"
}

The parameters list contains workflow parameters that are included in the Nextflow config file. For each parameter, simply add an entry like

   {
     "name": "outdir",
     "value": "./results"
   }

Here, name specifies the parameter name and value the corresponding value.
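In the generated nextflow.config, such an entry presumably becomes a standard Nextflow parameter; a sketch (the exact generated form may differ):

params {
    outdir = './results'
}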

Split and merge tasks description

The description of the split and merge tasks contains one entry per split-merge task pair that CuttleFlow can add to the workflow during rewriting. Each of these entries has a name, e.g. "fastq_reads", and needs the following variables (a structural sketch follows the list):

  • operation: the operation of the task(s) to be split, e.g. "align"
  • align_operation: "True" if the operation is align-like, else "False"
  • reference_type: type of the input that should be split, e.g. "sample"
  • split: task description of the split task; must follow the same task description rules as the tasks in the DAW description
  • merge: task description of the merge task; must follow the same task description rules as the tasks in the DAW description
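The exact file layout is best taken from the shipped test/rnasplice-description example. Purely as an illustration of the fields above, an entry could be structured as follows; all task names, module names, module paths, and input/output names (fastq_split, FASTQ_SPLIT.nf, read_chunks, ...) are hypothetical:

{
    "fastq_reads": {
        "operation": "align",
        "align_operation": "True",
        "reference_type": "sample",
        "split": {
            "name": "fastq_split",
            "toolname": "fastq_split",
            "operation": "split",
            "inputs": ["reads"],
            "outputs": ["read_chunks"],
            "module_name": "FASTQ_SPLIT",
            "module_path": "./modules/FASTQ_SPLIT.nf"
        },
        "merge": {
            "name": "fastq_merge",
            "toolname": "fastq_merge",
            "operation": "merge",
            "inputs": ["sam_chunks"],
            "outputs": ["sam"],
            "module_name": "FASTQ_MERGE",
            "module_path": "./modules/FASTQ_MERGE.nf"
        }
    }
}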

Adding tool annotations

If you want to add annotations for some bio-tools so that they can be replaced or split by CuttleFlow, you should edit the following files:

  1. source/annotation_files/infrastructure_runtime/{CLUSTER}.json: add a JSON description of the infrastructure you ran your tool on. This JSON should have the same format as the infrastructure description used for rewriting.
  2. source/annotation_files/{TOOLNAME}.json: add the description of your new tool here. Your tool description should contain the following information:
  • toolname: name of the tool, e.g. "Hisat2"
  • operation: operation performed by the tool, e.g. "align"
  • domain_specific_features: specific features of the tool, e.g. "splice_junctions"
  • is_splittable: "True" if the tool can be parallelized, else "False"
  • mandatory_input_list: list of inputs that are required by the tool, e.g. ["reads", "ref_genome"]
  • output_list: list of outputs produced by the tool, e.g. ["sam", "log"]
  • module_path: path to the corresponding Nextflow module, e.g. "./modules/hisat2.nf"
  • module_name: name of the corresponding Nextflow process, e.g. "HISAT2_ALIGN"

Additionally, the RAM requirements of alignment tools for different species can be specified. This information is entered in a list called resource_requirements_RAM, with each entry containing the following information:

  • organism_name: name of the reference species, e.g. "drosophila"
  • reference_size: size of the reference genome, e.g. "0.137G"
  • RAM: RAM required by the tool to align against the described reference genome, e.g. "1G"

In total, the exemplary annotation of the tool Hisat2 should look like this:
{
    "toolname": "Hisat2",
    "operation": "align",
    "domain_specific_features": "splice_junctions",
    "is_splittable": "True",
    "mendatory_input_list": ["reads", "ref_genome"],
    "output_list": ["sam", "log"],
    "module_path": "./modules/hisat2.nf",
    "module_name":"HISAT2_ALIGN",

    "resource_requirements_RAM": [
        {
            "organism_name": "drosophila",
            "reference_size": "0.137G",
            "RAM": "1G"
        }
    ]
} 
  3. source/annotation_files/runtime_aligners_with_CPU_RAM.csv: add the runtime of your tool as well as information about the executed alignment on your cluster here. Each line in the CSV file contains the following information:
  • infrastructure: name of the infrastructure the tool was run on as specified in the json description of the infrastructure
  • dataset_size: size in GB of the fastq file that was aligned
  • RAM: (average) RAM of the cluster nodes
  • CPUMHz: (average) CPU frequency of the cluster nodes in MHz
  • ref_genome_size: size of the reference genome in GB
  • reference_species: name of the reference species

In addition, the file contains one column per alignment tool, holding the measured runtimes. If you ran your experiments with a different tool, or with only a subset of the tools for which columns exist, add NaN values for every experiment you do not have runtime data for.
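Purely as an illustration, assuming the file has runtime columns for Hisat2 and STAR (the actual column names and the runtime unit in your copy of the file may differ), a header and one data line could look like this; the infrastructure name and runtime value are placeholders:

infrastructure,dataset_size,RAM,CPUMHz,ref_genome_size,reference_species,Hisat2,STAR
my_cluster,16,128,2400,0.137,drosophila,1250,NaN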
