Tutorial for Stanford Sherlock 2.0 cluster

All test samples and genome data are shared on the Stanford Sherlock cluster, so you don't have to download any data to test our pipeline there.

  1. SSH to Sherlock's login node.

      $ ssh login.sherlock.stanford.edu
    
  2. Git clone this pipeline and move into it.

      $ git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2
      $ cd chip-seq-pipeline2
    
  3. Download cromwell.

      $ wget https://github.com/broadinstitute/cromwell/releases/download/34/cromwell-34.jar
      $ chmod +rx cromwell-34.jar
    
  4. Set your partition in workflow_opts/sherlock.json. THE PIPELINE WILL NOT WORK WITHOUT A PAID SLURM PARTITION DUE TO LIMITED RESOURCE SETTINGS FOR FREE USERS. Ignore the other runtime attributes (they are for Singularity).

      {
        "default_runtime_attributes" : {
          "slurm_partition": "YOUR_SLURM_PARTITON"
        }
      }
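
    If you are not sure which SLURM partitions you can submit to, a generic SLURM query (not specific to this pipeline) lists them along with their time limits:

      $ sinfo --format="%P %a %l"    # partition, availability, time limit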
    

Our pipeline supports both Conda and Singularity.

For Conda users

  1. Install Conda
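
    A minimal sketch of a typical Miniconda installation, assuming the standard installer from repo.anaconda.com and an install prefix of $HOME/miniconda3 (adjust both to your preference):

      $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      $ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
      $ source $HOME/miniconda3/etc/profile.d/conda.sh    # put conda on your PATH for this session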

  2. Install Conda dependencies.

      $ bash conda/uninstall_dependencies.sh  # to remove any existing pipeline env
      $ bash conda/install_dependencies.sh
    
  3. Run a pipeline for a SUBSAMPLED (1/400) paired-end sample of ENCSR936XTK.

      $ source activate encode-chip-seq-pipeline # IMPORTANT!
      $ INPUT=examples/sherlock/ENCSR936XTK_subsampled_sherlock.json
      $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-34.jar run chip.wdl -i ${INPUT} -o workflow_opts/sherlock.json
    
  4. It will take about an hour. You will be able to find all outputs in cromwell-executions/chip/[RANDOM_HASH_STRING]/. See the output directory structure section for details.
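
    For example, to find the most recent workflow directory and browse its per-task outputs (generic shell commands, not pipeline commands):

      $ ls -t cromwell-executions/chip/ | head -n 1           # most recent [RANDOM_HASH_STRING]
      $ find cromwell-executions/chip/ -maxdepth 2 -type d    # per-workflow and per-task directories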

  5. See the full specification for the input JSON file.
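
    The subsampled example used above is a convenient reference for the expected format:

      $ cat examples/sherlock/ENCSR936XTK_subsampled_sherlock.json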

For Singularity users

  1. Add the following line to your bash startup script (~/.bashrc or ~/.bash_profile).

      module load system singularity
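
    After reloading your shell (or running the same module load command in the current session), you can confirm Singularity is available; this is a generic check, not part of the pipeline:

      $ singularity --version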
    
  2. Pull a Singularity container for the pipeline. This will pull the pipeline's Docker container first and then build a Singularity image in ~/.singularity. Stanford Sherlock does not allow building containers on login nodes, so use sdev to get an interactive node and wait until you get a command prompt.

      $ sdev    # sherlock cluster does not allow building a container on login node
      $ SINGULARITY_PULLFOLDER=~/.singularity singularity pull docker://quay.io/encode-dcc/chip-seq-pipeline:v1.1
      $ exit    # exit from an interactive node
    
  3. Run a pipeline for a SUBSAMPLED (1/400) paired-end sample of ENCSR936XTK.

      $ INPUT=examples/sherlock/ENCSR936XTK_subsampled_sherlock.json
      $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity cromwell-34.jar run chip.wdl -i ${INPUT} -o workflow_opts/sherlock.json
    
  4. It will take about an hour. You will be able to find all outputs in cromwell-executions/chip/[RANDOM_HASH_STRING]/. See the output directory structure section for details.

  5. See the full specification for the input JSON file.

  6. IF YOU WANT TO RUN PIPELINES WITH YOUR OWN INPUT DATA/GENOME DATABASE, PLEASE ADD THEIR DIRECTORIES TO workflow_opts/sherlock.json. For example, if you have input FASTQs in /your/input/fastqs/ and a genome database installed in /your/genome/database/, add /your/ to --bind in singularity_command_options. You can define multiple directories there, comma-separated.

      {
          "default_runtime_attributes" : {
              "singularity_container" : "~/.singularity/atac-seq-pipeline-v1.1.simg",
              "singularity_command_options" : "--bind /scratch,/oak/stanford,/your/,YOUR_OWN_DATA_DIR1,YOUR_OWN_DATA_DIR1,..."
          }
      }
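
    A quick way to verify that a directory will be visible inside the container is a sketch like the following, run where the singularity module is loaded; the image path matches the singularity_container entry above and /your/input/fastqs/ is the example directory from step 6:

      $ singularity exec --bind /your/ ~/.singularity/chip-seq-pipeline-v1.1.simg ls /your/input/fastqs/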
    

Running multiple pipelines with cromwell server mode

  1. If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using screen or tmux to keep your session alive; note that all running pipelines will be killed when the walltime expires. Start the Cromwell server with the following commands.

      $ srun -n 2 --mem 5G -t 3-0 --qos normal -p [YOUR_SLURM_PARTITION] --pty /bin/bash -i -l    # 2 CPU, 5 GB RAM and 3 day walltime
      $ hostname -f    # to get [CROMWELL_SVR_IP]
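
    If you use screen as suggested above, a sketch of a typical session (the session name cromwell is arbitrary): start the session on the login node, run srun and the server command inside it, detach with Ctrl-A d, and reattach later.

      $ screen -S cromwell    # start a named session before running srun
      $ screen -r cromwell    # reattach to it later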
    

    For Conda users,

      $ source activate encode-chip-seq-pipeline
      $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-34.jar server
    

    For Singularity users,

      $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity cromwell-34.jar server
    
  2. You can modify backend.providers.slurm.concurrent-job-limit or backend.providers.slurm_singularity.concurrent-job-limit in backends/backend.conf to increase the maximum number of concurrent jobs. This limit is not per sample; it applies to all sub-tasks of all submitted samples.
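
    For example, to locate the current limits in the backend file (a plain grep over the existing configuration, not a pipeline command):

      $ grep -n "concurrent-job-limit" backends/backend.conf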

  3. On a login node, submit jobs to the cromwell server. You will get [WORKFLOW_ID] as a return value. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later.

      $ INPUT=YOUR_INPUT.json
      $ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
        -F workflowSource=@chip.wdl \
        -F workflowInputs=@${INPUT} \
        -F workflowOptions=@workflow_opts/sherlock.json
    

To monitor pipelines, see the Cromwell server REST API description for more details. squeue will not give you enough information to monitor jobs per sample. To check the status of a submitted workflow:

    $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
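
Once a workflow has finished, the same REST API can list its output file paths; the /outputs endpoint below is part of Cromwell's standard API:

    $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/outputs"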