Explore streaming sequence to Fastq and/or trimFastq steps #36

Open
chrisamiller opened this issue May 4, 2022 · 2 comments
Labels
performance optimizations of run time and/or cost

Comments

@chrisamiller
Member

These two steps account for a lot of the disk cost:
sequenceToFastq: 0.139237127 local-disk 552 SSD (two of those)
trimFastq: 0.792021773 local-disk 1148 SSD (two of those)

  1. sequenceToFastq runs even when fastqs are given, and makes a copy - this is wasteful. Can we set conditional execution of that step (does WDL support that?)

  2. trimFastq could probably be piped directly into bwa, sacrificing some composability for speed. There is already optional adapter trimming in sequence_align_and_tag.wdl. It seems like we could add other trimming there as well (or in the HISAT or STAR steps for RNA).
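
To make the second idea concrete, here is a rough sketch of a single task that streams trimmed reads straight into the aligner, so no trimmed FASTQ is ever written to disk. It is only an illustration under stated assumptions: cutadapt stands in for whatever trimmer trimFastq actually runs, the task name `trim_and_align`, its inputs, and the output file name are all hypothetical, and it is not the contents of sequence_align_and_tag.wdl.

```wdl
version 1.0

# Hypothetical task: trim and align in one stream, with no intermediate FASTQ on disk.
# cutadapt is a stand-in trimmer here, not necessarily what trimFastq uses.
task trim_and_align {
  input {
    File reads1
    File reads2
    File reference          # reference FASTA
    # bwa index files, declared so the engine localizes them;
    # assumed to end up next to the FASTA (engine-dependent)
    File reference_amb
    File reference_ann
    File reference_bwt
    File reference_pac
    File reference_sa
    String adapter          # adapter sequence to trim from both mates
    Int threads = 4
  }

  command <<<
    set -o errexit -o pipefail -o nounset
    # cutadapt writes interleaved trimmed pairs to stdout (its report goes to stderr),
    # bwa mem -p reads interleaved pairs from stdin ("-"),
    # and samtools sort writes the only file that lands on disk.
    cutadapt --interleaved -a "~{adapter}" -A "~{adapter}" "~{reads1}" "~{reads2}" \
      | bwa mem -p -t ~{threads} "~{reference}" - \
      | samtools sort -o aligned.bam -
  >>>

  output {
    File aligned_bam = "aligned.bam"
  }
}
```

The trade-off is the one noted above: the trimmed reads are no longer available as a separate output, so anything downstream that wants them would have to re-run the trimming.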

@malachig malachig added the performance optimizations of run time and/or cost label Jul 8, 2022
@Layth17
Member

Layth17 commented Mar 28, 2023

Referencing this video: https://www.youtube.com/watch?v=13YfaNPv088

WDL can indeed do conditional execution; however, this seems to be implemented in WDL version 1.1 (and not our current 1.0).

1.1 https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#conditional-if-block
1.0 https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md

Boolean flag # some flag for existence or non-existence

if ( flag ) {  
   call some_task { ... }
}

Edit: the conditional if-block section does not have the ✨ symbol (which the spec uses to mark a new feature) next to it, so it may actually be available in version 1.0 as well.

@Layth17
Member

Layth17 commented Mar 28, 2023

Ok, so I tested the following (syntax-wise, version 1.0) and it works:

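workflow sequenceToTrimmedFastq {
  input {
    ...
    # Assumes a File? coerces to Boolean (true when the FASTQ is provided); unconfirmed.
    Boolean bfastq1 = unaligned.sequence.fastq1
    Boolean bfastq2 = unaligned.sequence.fastq2
  }

  # Run the conversion only when the FASTQs were NOT both supplied.
  if (!(bfastq1 && bfastq2)) {
    call stf.sequenceToFastq as sequenceToFastq {
      input:
      ....
    }
  }

  call tf.trimFastq {
    input:
    # Prefer the converted FASTQs when the call ran; otherwise use the supplied ones.
    reads1=select_first([sequenceToFastq.read1_fastq, unaligned.sequence.fastq1]),
    reads2=select_first([sequenceToFastq.read2_fastq, unaligned.sequence.fastq2]),
    ...
  }
}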

We would need to use select_first([]) to choose between values that are only optionally generated, since outputs of a call inside an if-block are optional.

I can only imagine that setting a Boolean to a File would return false if the file does not exist, but I cannot confirm that yet.
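
Since the File-to-Boolean coercion above is unconfirmed, one way to avoid relying on it is the standard-library function `defined()`, which returns true when an optional value has been supplied. The following is a minimal, self-contained sketch in WDL 1.0. The struct `SequenceInput`, the task `bam_to_fastq`, and the workflow name are hypothetical stand-ins, not the pipeline's real definitions, and the `samtools fastq` call is only a placeholder for the real conversion step.

```wdl
version 1.0

# Hypothetical stand-in for the pipeline's sequence struct; field names are assumptions.
struct SequenceInput {
  File? bam
  File? fastq1
  File? fastq2
}

# Placeholder conversion task (not the real sequenceToFastq).
task bam_to_fastq {
  input {
    File bam
  }
  command <<<
    samtools fastq -1 read1.fastq -2 read2.fastq "~{bam}"
  >>>
  output {
    File read1_fastq = "read1.fastq"
    File read2_fastq = "read2.fastq"
  }
}

workflow fastq_or_convert {
  input {
    SequenceInput sequence
  }

  # defined() tests whether an optional value was supplied; no File-to-Boolean coercion needed.
  Boolean have_fastqs = defined(sequence.fastq1) && defined(sequence.fastq2)

  # Convert only when the FASTQ pair was not supplied.
  if (!have_fastqs) {
    call bam_to_fastq {
      input:
        bam = select_first([sequence.bam])
    }
  }

  # Outputs of a call inside an if-block are optional (File?), so select_first
  # picks the converted files when the call ran and the supplied ones otherwise.
  File reads1 = select_first([bam_to_fastq.read1_fastq, sequence.fastq1])
  File reads2 = select_first([bam_to_fastq.read2_fastq, sequence.fastq2])

  output {
    File out_reads1 = reads1
    File out_reads2 = reads2
  }
}
```

Note that `defined()` only tells you whether a value was passed in the inputs, not whether the file exists on disk; a missing file would still fail at localization time.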

I think this is what you are looking for here @chrisamiller ?
