From a98c3fc4489044718cab4fd615ef322642ba3464 Mon Sep 17 00:00:00 2001 From: nsheff Date: Thu, 13 Jun 2024 08:35:40 -0400 Subject: [PATCH] Revamp manuscripts documentation --- docs/{geniml => }/manuscripts/gharavi2021.md | 2 +- docs/{geniml => }/manuscripts/gharavi2024.md | 2 +- docs/manuscripts/gu2021.md | 14 ++++++++++++++ docs/{geniml => }/manuscripts/leroy2024.md | 8 ++++++++ docs/{geniml => }/manuscripts/rymuza2024.md | 6 +++--- docs/{geniml => }/manuscripts/zheng2024.md | 4 +++- mkdocs.yml | 11 ++++++----- 7 files changed, 36 insertions(+), 11 deletions(-) rename docs/{geniml => }/manuscripts/gharavi2021.md (78%) rename docs/{geniml => }/manuscripts/gharavi2024.md (98%) create mode 100644 docs/manuscripts/gu2021.md rename docs/{geniml => }/manuscripts/leroy2024.md (82%) rename docs/{geniml => }/manuscripts/rymuza2024.md (87%) rename docs/{geniml => }/manuscripts/zheng2024.md (93%) diff --git a/docs/geniml/manuscripts/gharavi2021.md b/docs/manuscripts/gharavi2021.md similarity index 78% rename from docs/geniml/manuscripts/gharavi2021.md rename to docs/manuscripts/gharavi2021.md index 7aea002..1ff8fc6 100644 --- a/docs/geniml/manuscripts/gharavi2021.md +++ b/docs/manuscripts/gharavi2021.md @@ -4,4 +4,4 @@ This paper was our first publication showing how to build and evaluate region set embeddings using region-set2vec, based on word2vec. 
-See: [train Region2Vec embeddings](../tutorials/region2vec.md) \ No newline at end of file +See: [train Region2Vec embeddings](../geniml/tutorials/region2vec.md) \ No newline at end of file diff --git a/docs/geniml/manuscripts/gharavi2024.md b/docs/manuscripts/gharavi2024.md similarity index 98% rename from docs/geniml/manuscripts/gharavi2024.md rename to docs/manuscripts/gharavi2024.md index bf8ee11..2a2912b 100644 --- a/docs/geniml/manuscripts/gharavi2024.md +++ b/docs/manuscripts/gharavi2024.md @@ -10,5 +10,5 @@ As available genomic interval data increase in scale, we require fast systems to This paper trained BEDspace models (using StarSpace with BED files). See these tutorials: -- [How to use BEDSpace to jointly embed regions and metadata](../tutorials/bedspace.md) +- [How to use BEDSpace to jointly embed regions and metadata](../geniml/tutorials/bedspace.md) diff --git a/docs/manuscripts/gu2021.md b/docs/manuscripts/gu2021.md new file mode 100644 index 0000000..83be004 --- /dev/null +++ b/docs/manuscripts/gu2021.md @@ -0,0 +1,14 @@ +# Bedshift: perturbation of genomic interval sets + +Paper: [Manuscript at Genome Biology](https://doi.org/10.1186/s13059-021-02440-w) + + +## Abstract + +Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics, such as that the Jaccard score is most sensitive to added or dropped regions, while coverage score is most sensitive to shifted regions. 
+ +## Relevant tutorials + +Analysis from the paper is described in these tutorials: + +- [Randomizing BED files with BEDshift](../geniml/tutorials/bedshift.md) diff --git a/docs/geniml/manuscripts/leroy2024.md b/docs/manuscripts/leroy2024.md similarity index 82% rename from docs/geniml/manuscripts/leroy2024.md rename to docs/manuscripts/leroy2024.md index bf28438..cb80274 100644 --- a/docs/geniml/manuscripts/leroy2024.md +++ b/docs/manuscripts/leroy2024.md @@ -8,3 +8,11 @@ Paper: [Manuscript at bioRxiv](http://dx.doi.org/10.1101/2023.08.01.551452) **Motivation** Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) is now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower-dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. **Results** We implemented our approach in scEmbed, an unsupervised machine learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, pre-trained models on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python and it is available to download from GitHub. We also make our pre-trained models available on huggingface for public use. 
+ +## Relevant tutorials + +Analysis from the paper is described in these tutorials: + +- [Train single-cell embeddings](../geniml/tutorials/train-scembed-model.md) +- [Populate a vector store](../geniml/tutorials/load-qdrant-with-cell-embeddings.md) +- [Predict cell-types using KNN](../geniml/tutorials/cell-type-annotation-with-knn.md) \ No newline at end of file diff --git a/docs/geniml/manuscripts/rymuza2024.md b/docs/manuscripts/rymuza2024.md similarity index 87% rename from docs/geniml/manuscripts/rymuza2024.md rename to docs/manuscripts/rymuza2024.md index 6d2a571..c9b1df7 100644 --- a/docs/geniml/manuscripts/rymuza2024.md +++ b/docs/manuscripts/rymuza2024.md @@ -17,11 +17,11 @@ This paper published 2 types of method: 1. Methods to *construct* a universe, an You can construct a universe either on the command line, or using geniml as a library: -- [Create consensus peaks with CLI](../tutorials/create-consensus-peaks.md) -- [Create consensus peaks with Python](../code/create-consensus-peaks-python.md) +- [Create consensus peaks with CLI](../geniml/tutorials/create-consensus-peaks.md) +- [Create consensus peaks with Python](../geniml/code/create-consensus-peaks-python.md) ### 2. Evaluating a universe The main methods are implemented in the `assess-universe` model with tutorial: -- [Assess universe fit tutorial](../tutorials/assess-universe.md) \ No newline at end of file +- [Assess universe fit tutorial](../geniml/tutorials/assess-universe.md) \ No newline at end of file diff --git a/docs/geniml/manuscripts/zheng2024.md b/docs/manuscripts/zheng2024.md similarity index 93% rename from docs/geniml/manuscripts/zheng2024.md rename to docs/manuscripts/zheng2024.md index 427aa7f..1c9b8d7 100644 --- a/docs/geniml/manuscripts/zheng2024.md +++ b/docs/manuscripts/zheng2024.md @@ -9,4 +9,6 @@ Representation learning models have become a mainstay of modern genomics. 
These ## Relevant tutorials -To evaluate, refer to this tutorial: https://github.com/databio/region2vec_eval \ No newline at end of file +Analysis from the paper is described in these tutorials: + +- [How to evaluate embeddings](../geniml/tutorials/evaluation.md) diff --git a/mkdocs.yml b/mkdocs.yml index dd316ab..8596591 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -119,11 +119,12 @@ nav: - How to cite: - How to cite: citations.md - Published manuscripts: - - Gharavi et al. 2021: geniml/manuscripts/gharavi2021.md - - Rymuza et al. 2024: geniml/manuscripts/rymuza2024.md - - Gharavi et al. 2024: geniml/manuscripts/gharavi2024.md - - LeRoy et al. 2024: geniml/manuscripts/leroy2024.md - - Zheng et al. 2024: geniml/manuscripts/zheng2024.md + - Gharavi et al. 2021: manuscripts/gharavi2021.md + - Gu et al. 2021: manuscripts/gu2021.md + - Rymuza et al. 2024: manuscripts/rymuza2024.md + - Gharavi et al. 2024: manuscripts/gharavi2024.md + - LeRoy et al. 2024: manuscripts/leroy2024.md + - Zheng et al. 2024: manuscripts/zheng2024.md autodoc: jupyter: