-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: canonical transcript mapped read extraction #77
fix: canonical transcript mapped read extraction #77
Conversation
…erse meta-package
… the same samtools_index rule to avoid multiple wrapper version being loaded in the future
… in QC plot scripts
Just to document what still needs to be done, here: Currently, the |
…n `results/logs/`
…am should handle this)
… that we will need all of the transcripts for proper maximum 3-prime position filtering
… the samtools view rule in the future, once we have a proper pysam rule for the read filtering in one step
LGTM, but there are conflicts with the master branch that need to be fixed before merging. |
…-kallisto-sleuth into fix-canonical-transcript-mapped-read-extraction
Thanks for looking through this. Will merge and create a new release as soon as it passes with the conflicts resolution... |
…sleuth bioconda recipe implicitly
…esting data to get QuantSeq tests to pass (#86) The underlying problem that we identified with the work on this `debug-vroom` branch was a malformatted `custom` file for specifying canonical transcript to use in `sleuth-diffexp.R`. The take-away message here were: 1. The `sleuth-diffexp.R` script with its large `write_results()` function was hard to debug, and to ease the burden a bit in the future, we should probably keep the extra logging statements we included. 2. The `datavzrd/diffexp-template.yaml` does not seem to play nice with `custom` canonical transcript files. The `canonical` column does not make sense in the `genes_aggregated` results table (as this should only contain gene names / identifiers, and no transcript identifiers, and only the latter could be canonical or not), so we remove it there. However, how to have a `canonical` column in the `genes_representative` case with a `custom` canonical transcript file still needs to be solved before merging this PR.
…+ awk instead of pysam
… transcript reads
🤖 I have created a release *beep* *boop* --- ## [2.5.3](v2.5.2...v2.5.3) (2024-01-30) ### Bug Fixes * canonical transcript mapped read extraction ([#77](#77)) ([52b56b0](52b56b0)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: David Laehnemann <[email protected]>
The main aim of this PR is to get the extraction of reads mapped to the canonical transcripts in the
rule get_mapped_canonical_transcripts
down from about 44GB of memory usage regardless of the input BAM file size and hours of grepping, down to seconds / minutes of extraction time with almost no memory footprint. This happens here:https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/compare/fix-canonical-transcript-mapped-read-extraction?expand=1#diff-6562a38fb77f8839a8731b8a882bf0d0b683d6268cdffb433dcbe1f360ccedc4R103
This required using a BED file, and thus I switched to generating a valid BED file in the
rule get_canonical_ids
and switched to using that BED file for keeping track of transcript strand information instead of hacking this into the contig names of the reference fasta file. This lead to some cleanup in the workflow.Other things that happened along the way are:
r-dplyr
dependency, as this is pulled in byr-tidyverse
anywaysuse
ing thesamtools index
rule -- thus, we only have to update the wrapper version in one place and all instances should always stay in syncFor now, this is not yet tested. So I'll mark this as a draft to start with. But I wanted it up here to be able to test it on different setups by checking out the branch.