update citations
nsheff committed Jun 13, 2024
1 parent a984334 commit d2518b9
Showing 4 changed files with 25 additions and 1 deletion.
2 changes: 1 addition & 1 deletion docs/citations.md
@@ -27,7 +27,7 @@ Thanks for citing us! If you use BEDbase, geniml, or their components in your re
<li><b>Gharavi et al. (2024). </b><i>Joint representation learning for retrieval and annotation of genomic interval sets</i>
<br><i>Bioengineering</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.3390/bioengineering11030263">10.3390/bioengineering11030263</a></span></li>
<li><b>Zheng et al. (2023). </b><i>Methods for evaluating unsupervised vector representations of genomic regions</i>
- <br><i>bioRxiv</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.1101/2023.08.03.551899">10.1101/2023.08.03.551899</a></span></li>
+ <br><i>bioRxiv</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.1101/2023.08.28.555137">10.1101/2023.08.28.555137</a></span></li>
<li><b>Xue et al. (2023). </b><i>Opportunities and challenges in sharing and reusing genomic interval data</i>
<br><i>Frontiers in Genetics</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.3389/fgene.2023.1155809">10.3389/fgene.2023.1155809</a></span></li>
<li><b>Rymuza et al. (2023). </b><i>Methods for constructing and evaluating consensus genomic interval sets</i>
10 changes: 10 additions & 0 deletions docs/geniml/manuscripts/leroy2024.md
@@ -0,0 +1,10 @@
# Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings

Paper: [Manuscript at bioRxiv](http://dx.doi.org/10.1101/2023.08.01.551452)


## Abstract

**Motivation** Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) is now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower-dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning.

**Results** We implemented our approach in scEmbed, an unsupervised machine learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, pre-trained models on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python and it is available to download from GitHub. We also make our pre-trained models available on huggingface for public use.
12 changes: 12 additions & 0 deletions docs/geniml/manuscripts/zheng2024.md
@@ -0,0 +1,12 @@
# Methods for evaluating unsupervised vector representations of genomic regions

Paper: [Manuscript at bioRxiv](http://dx.doi.org/10.1101/2023.08.28.555137)


## Abstract

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.

## Relevant tutorials

To evaluate, refer to this tutorial: https://github.com/databio/region2vec_eval
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -113,6 +113,8 @@ nav:
- Gharavi 2021: geniml/manuscripts/gharavi2021.md
- Rymuza 2024: geniml/manuscripts/rymuza2024.md
- Gharavi 2024: geniml/manuscripts/gharavi2024.md
+ - LeRoy 2024: geniml/manuscripts/leroy2024.md
+ - Zheng 2024: geniml/manuscripts/zheng2024.md
- How to cite: citations.md
- API documentation: geniml/autodoc_build/geniml.md
- Support: geniml/support.md