Welcome to the APAeval GitHub repository.
Quick links
- Use a benchmarked method on your own RNA-seq data
- Benchmark a new method
- Extend APAeval's benchmarks
APAeval is a community effort that was born as the APAeval hackathon at the RNA 2021 Conference. We are aiming to evaluate computational methods for the detection and quantification of poly(A) sites from RNA-seq samples in an open, reproducible and extensible manner.
- Overview of APAeval benchmarking
- What can you do?
- Some technical stuff
- Code of Conduct
- Open Science, licenses & attribution
- Get in touch
- Contributors ✨
APAeval currently consists of three benchmarking events, each consisting of a set of challenges for bioinformatics methods (=participants) that use RNA-seq data to:
- Identify polyadenylation sites
- Report poly(A) site expression as absolute quantification in TPM
- Report relative expression of poly(A) sites within transcripts
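To make the difference between absolute and relative quantification concrete, here is a minimal Python sketch that converts per-site TPM values into relative usage within a transcript. The site IDs, transcript IDs and TPM values are made up for illustration, and the function name `relative_usage` is hypothetical; APAeval's own specifications define the exact grouping and output format.

```python
from collections import defaultdict


def relative_usage(site_tpm, site_to_transcript):
    """Return each site's share of the summed expression of its transcript."""
    # Sum the expression of all sites that belong to the same transcript.
    totals = defaultdict(float)
    for site, tpm in site_tpm.items():
        totals[site_to_transcript[site]] += tpm

    # Divide each site's expression by its transcript total.
    usage = {}
    for site, tpm in site_tpm.items():
        total = totals[site_to_transcript[site]]
        usage[site] = tpm / total if total > 0 else 0.0
    return usage


# Hypothetical example values (not taken from any APAeval dataset).
site_tpm = {"pas1": 30.0, "pas2": 10.0, "pas3": 5.0}
site_to_transcript = {"pas1": "txA", "pas2": "txA", "pas3": "txB"}
print(relative_usage(site_tpm, site_to_transcript))
```

Sites on an unexpressed transcript get a usage of 0.0 here; whether such sites should instead be dropped is a design choice this sketch does not settle.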
We'd still like to set up a fourth event to evaluate tools that calculate differential usage of polyadenylation sites. If you'd like to contribute, continue reading below.
1. As described above, APAeval consists of three benchmarking events that evaluate how well the methods of interest (=participants) perform different tasks: PAS identification, absolute quantification, and relative quantification. A method can participate in one, two or all three events, depending on its functions.
2. Raw data: For challenges within the benchmarking events, APAeval uses data from several selected publications. Generally, one dataset (consisting of one or more samples) corresponds to one challenge (here, datasets for challenges x and y are depicted). All raw RNA-seq data is processed with nf-core/rnaseq for quality control and mapping. For each dataset we provide a matching ground truth file, created from 3' end sequencing data from the same publications as the raw RNA-seq data, that will be used in the challenges to assess the performance of participants. You can find an overview of RNA-seq and matching ground truth samples in the APAeval Zenodo snapshot.
3. Sanctioned input files: The processed input data is made available in .bam format. Additionally, for each dataset we provide a GENCODE annotation in .gtf format, as well as a reference PAS atlas in .bed format for participants that depend on pre-defined PAS (not shown).
4. In order to evaluate each participant in different challenges, a re-usable “method workflow” has to be written in either Snakemake or Nextflow. This workflow has to perform all pre- and post-processing steps needed to get from the input formats provided by APAeval (see 3.) to the output specified by APAeval in its metrics specifications (see 5.).
5. To ensure compatibility with the workflows of the benchmarking events, APAeval provides specifications for file formats (output of method workflows = input for benchmarking workflows).
6. Within a benchmarking event, one or more challenges will be performed. A challenge is primarily defined by the input dataset used for performance assessment. Results of a challenge (metrics) are computed for each participant within a "benchmarking workflow".
7. In order to compare the performance of participants, results for each participant are uploaded to the OpenEBench (OEB) database, where metrics for all participants are visualized per challenge.
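As a rough illustration of what a benchmarking workflow might compute for the identification event, the following Python sketch matches predicted poly(A) sites to ground truth sites and derives precision and recall. This is not APAeval's actual metric implementation: the function name `score_identification`, the 50 nt matching window, the greedy one-to-one matching and the example coordinates are all assumptions made for this example.

```python
def score_identification(predicted, truth, window=50):
    """Greedily match predicted to ground truth sites and compute metrics.

    Sites are (chrom, position, strand) tuples. A prediction counts as a
    true positive if an unclaimed ground truth site lies on the same
    chromosome and strand within `window` nucleotides.
    """
    unmatched = set(truth)  # ground truth sites not yet claimed
    true_pos = 0
    for chrom, pos, strand in predicted:
        candidates = [
            t for t in unmatched
            if t[0] == chrom and t[2] == strand and abs(t[1] - pos) <= window
        ]
        if candidates:
            # Claim the closest ground truth site so it is matched only once.
            best = min(candidates, key=lambda t: abs(t[1] - pos))
            unmatched.remove(best)
            true_pos += 1

    false_pos = len(predicted) - true_pos
    false_neg = len(unmatched)
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return {"TP": true_pos, "FP": false_pos, "FN": false_neg,
            "precision": precision, "recall": recall}


# Hypothetical sites, for illustration only.
predicted = [("chr1", 1000, "+"), ("chr1", 5000, "+"), ("chr2", 300, "-")]
truth = [("chr1", 1020, "+"), ("chr2", 900, "-")]
print(score_identification(predicted, truth))
```

Greedy matching keeps the sketch short; a real benchmark may resolve ambiguous matches differently, so treat the numbers as illustrative only.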
Firstly, you might want to check our manuscript or our OpenEBench site to find the method that would perform best for your use case. If you have decided on a method to use, head over to the method workflows section in this repo and follow the instructions in the README.md of the method of your choice. All our method workflows are built in either Snakemake or Nextflow, and use containers for individual steps to ensure reproducibility and reusability. For instructions on how to set up a conda environment for running APAeval workflows, see the conda environment instructions further down in this README.
You'll need to have your RNA-seq data ready in .bam format. No idea how to get there? You could check out the nf-core RNA-Seq analysis pipeline or other tools such as ZARP.
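Before starting a workflow, it can help to confirm that an input file really is in BAM format. The Python sketch below is a quick sanity check only: it relies on BAM files being BGZF-compressed (which the standard `gzip` module can read) and starting with the magic bytes `BAM\1` after decompression. The function name `looks_like_bam` is made up for this example, and the check does not validate the rest of the file.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file decompresses and starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip/BGZF-compressed, truncated, unreadable or missing.
        return False
```

You could call it as `looks_like_bam("sample.bam")`, where the file name is a placeholder for your own data.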
Have you developed a new computational method for investigating APA from RNA-seq data? Or are you interested in one of the tools we haven't managed to include in APAeval yet? We'd be very happy if you decided to contribute to APAeval!
In order to ensure reproducibility of the benchmarks, as well as reusability and shareability of the benchmarked method, you'd start by writing an APAeval style method workflow. That workflow will take .bam files as input and create .bed files compatible with the specification for the respective APAeval benchmarking event. Create a PR (pull request; please ask in our GitHub discussions board to be added to APAeval as a collaborator, or create the PR from a fork) in this repo and wait for your request to be approved. You can then run the workflow on the data for all APAeval challenges and use the resulting .bed files in the corresponding APAeval benchmarking workflow in order to compare the performance of your tool to the APAeval ground truths. Finally, you can submit your metrics .json files to us and we'll take care of including them in our OEB site.
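The post-processing step of a method workflow typically ends by writing poly(A) sites to a .bed file. The Python sketch below shows one way to do this, assuming a standard six-column BED layout (chrom, start, end, name, score, strand) with single-nucleotide sites. The helper name `write_bed`, the dictionary keys and the example values are hypothetical, and the meaning of the score column is an assumption; the authoritative column definitions are in the APAeval specifications for each benchmarking event.

```python
import csv
import io


def write_bed(sites, handle):
    """Write single-nucleotide poly(A) sites as six-column BED lines.

    Each site is a dict with a 1-based `pos`. BED uses 0-based, half-open
    intervals, so a site at position p is written as start=p-1, end=p.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for site in sites:
        writer.writerow([
            site["chrom"],
            site["pos"] - 1,
            site["pos"],
            site["name"],
            site["score"],  # assumed here to hold an expression value
            site["strand"],
        ])


# Hypothetical sites, for illustration only.
example_sites = [
    {"chrom": "chr1", "pos": 1000, "name": "pas1", "score": 12.5, "strand": "+"},
    {"chrom": "chr2", "pos": 300, "name": "pas2", "score": 3.0, "strand": "-"},
]
buffer = io.StringIO()
write_bed(example_sites, buffer)
print(buffer.getvalue(), end="")
```

Writing to a file handle instead of a path keeps the function easy to test; in a workflow you would pass an open file rather than an `io.StringIO` buffer.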
One of the main goals of APAeval is to provide extensible benchmarking, such that new tools, new challenges or new metrics can be added at any time. Therefore we warmly welcome any contribution to the project. A good starting point would be to visit our issue and discussion boards. The latter one is also the place where you can reach out to us and request we add you to the repo as a collaborator (alternatively, create your PRs from a fork). You can then take on an existing task, suggest a new one, or start a discussion.
We are partnering with OpenEBench, a benchmarking and technical monitoring platform for bioinformatics tools. OpenEBench development, maintenance and operation is coordinated by Barcelona Supercomputing Center (BSC) together with partners from the European Life Science infrastructure initiative ELIXIR.
OpenEBench tooling will facilitate the computation and visualization of benchmarking results and store the results of all benchmarking events and challenges in their databases, making it easy for others to explore results. This should also make it easy to add additional participants to existing benchmarking events later on. OpenEBench developers are also advising us on creating benchmarks that are compatible with good practices in the wider community of bioinformatics challenges.
For reproducible execution of our workflows (both method and benchmarking workflows) we're using a conda environment with fixed versions of Snakemake, Nextflow, some Python packages, and Singularity. Make sure you have conda installed, then create the APAeval environment from the root directory of this repo with:
conda env create -f apaeval_env.yaml
You can then activate it with:
conda activate apaeval
NOTE: Singularity runs natively only on Linux. If you're working on Windows or macOS, you will probably need to set up a Linux virtual machine to run it.
ANOTHER NOTE: If you run into problems regarding root access & Singularity with the described setup, try removing Singularity from apaeval_env.yaml and installing it independently.
You can now execute the workflows!
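If a workflow fails to start, a first thing to check is whether the tools from the environment are actually on your PATH. The following Python sketch does that with the standard library only. The executable names `snakemake`, `nextflow` and `singularity` are assumptions about how the tools are installed on your system, and the helper name `find_tools` is made up for this example.

```python
import shutil


def find_tools(names=("snakemake", "nextflow", "singularity")):
    """Map each tool name to its location on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

Run it after activating the apaeval environment; any tool reported as NOT FOUND points to an incomplete environment or an independent installation that is not on your PATH.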
Here are some pointers and tutorials for the main software tools that we are using at APAeval:
Conda: tutorial
Docker: tutorial
Git: tutorial
GitHub: general tutorial / GitHub flow tutorial
Nextflow: tutorial
Singularity: tutorial
Snakemake: tutorial
Please be kind to one another and mind the Contributor Covenant's Code of Conduct for all interactions with the community. A copy of the Code of Conduct is also shipped with this repository. Please report any violations of the Code of Conduct to [email protected].
Following best practices for writing software and sharing data and code is important to us, and therefore we want to apply, as much as possible, FAIR Principles to data and software alike. This includes publishing all code open source, under permissive licenses approved by the Open Source Initiative, and all data under a permissive Creative Commons license.
In particular, we publish all code under the MIT license and all data under the CC0 license. The exception is the benchmarking workflows, which are published under the GPLv3 license, as the provided template is derived from an OpenEBench example workflow that is itself licensed under GPLv3. A copy of the MIT license is also shipped with this repository.
We also believe that attribution, provenance and transparency are crucial for an open and fair work environment in the sciences, especially in a community effort like APAeval. Therefore, we would like to make clear from the beginning that in all publications deriving from APAeval (journal manuscript, data and code repositories), any non-trivial contributions will be acknowledged by authorship.
We expect that all contributors accept the license and attribution policies outlined above.
If you would like to contribute to APAeval or have any questions, we'd be happy to hear from you via our GitHub Discussions board. If you already have a specific issue in mind, feel free to add it to our issues board. You can also reach out to [email protected].
If APAeval was useful for you in your work, please cite our manuscript:
Extensible benchmarking of methods that identify and quantify polyadenylation sites from RNA-seq data
Sam Bryce-Smith, Dominik Burri, Matthew R. Gazzara, Christina J. Herrmann, Weronika Danecka, Christina M. Fitzsimmons, Yuk Kei Wan, Farica Zhuang, Mervin M. Fansler, José M. Fernández, Meritxell Ferret, Asier Gonzalez-Uriarte, Samuel Haynes, Chelsea Herdman, Alexander Kanitz, Maria Katsantoni, Federico Marini, Euan McDonnel, Ben Nicolet, Chi-Lam Poon, Gregor Rot, Leonard Schärfen, Pin-Jou Wu, Yoseop Yoon, Yoseph Barash, Mihaela Zavolan
bioRxiv 2023.06.23.546284; doi: https://doi.org/10.1101/2023.06.23.546284
Thanks go to these wonderful people (emoji key):
Chelsea Herdman 📆 📋 🤔 👀 📢 📖
ninsch3000 💻 🔣 📖 🎨 📋 🧑🏫 📆 💬 👀 📢 🤔 🐛 ✅
Euan McDonnell 💻 🤔 🧑🏫
Alex Kanitz 🐛 💻 📖 💡 📋 🤔 🚇 🚧 🧑🏫 📆 💬 👀 📢
Yuk Kei Wan 🐛 📝 💻 🔣 📖 💡 📋 🤔 🧑🏫 📆 💬
Ben 🔣 🤔 📆
pjewell-biociphers 🚧
mzavolan 🔣 📖 📋 💵 🤔 🧑🏫 📆 💬 👀 📢
Mervin Fansler 🐛 💻 📖 📋 🤔 🧑🏫 📆 💬 👀
Maria Katsantoni 💻 🤔 🧑🏫 💬
daneckaw 💻 🔣 📋 🤔 📆 ✅
Dominik Burri 🐛 💻 🔣 📖 💡 📋 🤔 🚇 🧑🏫 📆 💬
mrgazzara 💻 📖 🔣 📋 🤔 🚇 🚧 📆 🧑🏫 📢
Christina Fitzsimmons 📖 📋 🤔 📆 📢
Leo Schärfen 💻 🤔 📢
poonchilam 💻 🤔 💬
dseyres 💻 📖 🤔
Pierre-Luc 🔣 📖 📋 🤔 📆
SamBryce-Smith 💻 🤔 🐛 📖 🚧 🧑🏫 📆 💬 👀 ✅ 📢
Pin-Jou Wu 💻 🤔
yoseopyoon 💻 🤔
Farica Zhuang 🐛 💻 📖 🤔 🚧 📆 💬 👀
Asier Gonzalez 🐛 💻 💡 🤔 🚇 🧑🏫 📆 💬
txellferret 💻 💡 🤔 🚇 🧑🏫 💬
Gregor Rot 🐛 💻 🤔 🚧 👀
José María Fernández 🤔 🚇 🧑🏫
This project follows the all-contributors specification. Contributions of any kind welcome!