Merge pull request #57 from databio/update_geniml_docs

Copy over fresh docs
databio · Feb 20, 2024 · 2d3be3d · 2d3be3d
2 parents e09be64 + d0dce8b
commit 2d3be3d
Show file tree

Hide file tree

Showing 12 changed files with 2,609 additions and 80 deletions.
diff --git a/docs/geniml/datasets.md b/docs/geniml/datasets.md
@@ -0,0 +1,61 @@
+# Genomic datasets with `geniml`
+
+## Overview
+Genomic datasets are well known for their incredible size. Therefore, using these for machine learning pipelines requires clever strategies and considerations to effectively handle the large volumes of data. This is particularly problematic when training new models. To that end, `geniml` has two separate workflows for handling genomic data: one for model training and a second for model inference (using a pre-trained model). These workflows have big differences on *when* and *where* genomic datasets are tokenized and stored.
+
+## Model training
+Model training, especially pre-training, usually requires large datasets with billions of genomic tokens. These datasets are way too large to fit in memory and therefore must be streamed from disk during training. Because of this, the data must be tokenized "on the fly" for each epoch. This is wildly inefficient and for particularly large datasets, results in the majority of training time being dedicated to tokenization alone. Therefore, the data need be *pre-tokenized* into an intermediate form which can then be streamed in for each epoch. This removes tokenization entirely from the training procedure and therefore increase efficiency (Fig. 1A).
+
+##  Model inference
+Model inference is the process of utilizing a pre-trained model to analyze some new, unseen data. While the output of the model might vary (embeddings, label, new data), the input is always the same: genomic tokens. Except under rare circumstances, it's not typical that model inference involves large volumes of data. Therefore, pre-tokenization is *not necessary*. Because of this, data is to be tokenized **in-memory** and directly to the format required to pass through the model (Fig. 1B).
+
+## Tokenization forms
+Given the above requirements, tokenizers need to be able to output tokenized genomic data into different forms and locations. For model training: tokenizers should take either bed files or `.h5ad` single-cell datasets and convert them into an intermediary `.gtok` file format. These `.gtok` files will be directly consumed during model training. For model inference: tokenizers should take either bed files or `.h5ad` single-cell datasets and output an in-memory representation of these tokens; typically in the form of a `torch.Tensor` or python list. The following table summarizes the format, location, and scenario in which data is tokenized:
+
+<div align="center">
+
+|                 | Where  | What        | When              |
+| --------------- | --------- | ------------- | ----------------- |
+| Model training  | On disk   | `.gtok` files | Prior to training |
+| Model inference | In memory | `torch.Tensors`              | On the fly                  |
+
+</div>
+
+<div align="center">
+<img width="700" src="./img/geniml_tokenization_strategy.png" alt="Tokenization strategies for model training and inference" />
+</div>
+
+## Datasets in `geniml`
+`geniml` uses `pytorch` + `lightning` to train models. This ecosystem encourages the use of `torch`s built-in `Dataset` class to parallelize and batch the loading of data. Because training and fine-tuning models requires pre-tokenized data (`.gtok` files), `geniml` needs datasets to handle this. It most likely will look like:
+```python
+from typing import List
+
+from torch.data.utils import IterableDataset
+
+class PretokenizedDataset(IterableDataset):
+  def __init__(self, data: Union[str, List[str]):
+    self.data = data
+      if isinstance(data, str):
+        self._is_folder = True
+      elif isinstance(data, list) and isinstance(data[0], str):
+        self._is_folder = False
+      else:
+        raise ValueError("`data` must be a path to a folder or a list of `.gtok` files")
+
+  def __iter__(self, indx: int):
+    if self._is_folder:
+      for file in os.listdir(self.data):
+        with open(file, 'r') as f:
+          for line in file.readlines():
+            yield line.split(",")
+    else:
+      for file in self.data:
+        with open(file, 'r') as f:
+          for line in file.readlines():
+            yield line.split(",")
+
+```
+Here, we are no longer tokenizing each epoch, rather just streaming in data that has already been pre-tokenized. I still need to think about this in the context of fine-tuning and datasets that require targets and labels.
+
+## `.gtok` file format
+The `.gtok` file format is a binary file where each token is stored as a 32-bit integer. This allows tokens to be stored in a very compact format with the ability to represent up to 4 billion unique tokens. Using our companion package `genimtools`, we can convert `.bed` files into `.gtok` files after tokenization.
diff --git a/docs/geniml/img/geniml_tokenization_strategy.png b/docs/geniml/img/geniml_tokenization_strategy.png
diff --git a/docs/geniml/img/geniml_tokenization_strategy.svg b/docs/geniml/img/geniml_tokenization_strategy.svg
diff --git a/docs/geniml/manuscripts.md b/docs/geniml/manuscripts.md
@@ -0,0 +1,17 @@
+# Published manuscripts describing geniml components
+
+If you find `geniml` useful for your research, please cite us! It helps us convince others that our work is useful. Here is a list of published papers that describe different modules or components in the `geniml` package.
+
+---
+
+N. J. LeRoy et al., “Fast clustering and cell-type annotation of scATACdata with pre-trained embeddings,” bioRxiv, 2023, doi: 10.1101/2023.08.01.551452.
+
+J. Rymuza et al., “Methods for constructing and evaluating consensus genomic interval sets,” bioRxiv, 2023, doi: 10.1101/2023.08.03.551899.
+
+E. Gharavi, N. J. LeRoy, G. Zheng, A. Zhang, D. E. Brown, and N. C. Sheffield, “Joint representation learning for retrieval and annotation of genomic interval sets,” bioRxiv, 2023, doi: 10.1101/2023.08.21.554131.
+
+G. Zheng et al., “Methods for evaluating unsupervised vector representations of genomic regions,” bioRxiv, 2023, doi: 10.1101/2023.08.03.551899.
+
+B. Xue, O. Khoroshevskyi, R. A. Gomez, and N. C. Sheffield, “Opportunities and challenges in sharing and reusing genomic interval data,” Frontiers in Genetics, vol. 14, 2023-03, doi: 10.3389/fgene.2023.1155809.
+
+E. Gharavi et al., “Embeddings of genomic region sets capture rich biological associations in low dimensions,” Bioinformatics, 2021-03, doi: 10.1093/bioinformatics/btab439.
diff --git a/docs/geniml/tutorials/cell-type-annotation-with-knn.md b/docs/geniml/tutorials/cell-type-annotation-with-knn.md
@@ -0,0 +1,58 @@
+# How to annotate cell-types with KNN
+In the [previous tutorial](./load-qdrant-with-cell-embeddings.md), we loaded a vector database with cell embeddings. In this tutorial, we will show how to use this vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.
+
+If you have **not** completed the previous tutorial, you should ensure you have a vector database with cell embeddings.
+
+## What is K-nearest-neighbors (KNN) classification?
+According to [IBM](https://www.ibm.com/topics/knn), K-nearest-neighbors classification is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. Point more simply: KNN is a classification algorithm that uses the distance between an **unlabeled** data point and its **labed** neighbors to classify the new data point.
+
+Assuming we have a vector-space of well-annotated cell embeddings, we can use KNN to classify new cell embeddings based on their proximity to the labeled cell embeddings.
+
+## Querying the vector database
+First, we need to generate new cell embeddings for the cells we want to annotate. **Note: it is imperative that the new cell embeddings are generated using the same model as the cell embeddings in the vector database.** The previous tutorial used `databio/r2v-luecken2021-hg38-v2` to generate cell embeddings. We will use the same model to generate new cell embeddings.
+
+```python
+import scanpy as sc
+
+from geniml.scembed import ScEmbed
+
+adata = sc.read_h5ad("path/to/adata_unlabeled.h5ad")
+
+model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
+```
+
+We can get embeddings of the dataset using the pre-trained model:
+
+```python
+embeddings = model.encode(adata)
+
+adata.obsm['scembed_X'] = np.array(embeddings)
+```
+
+Now that we have the new cell embeddings, we can query the vector database to find the K-nearest-neighbors of each cell embedding.
+
+```python
+from collections import Counter
+from qdrant_client import QdrantClient
+
+client = QdrantClient("localhost", port=6333)
+
+# Query the vector database
+k = 5 # set to whatever value you want, this is a hyperparameter
+
+for i, embedding in enumerate(embeddings):
+    neighbors = client.search(
+        collection_name="luecken2021", 
+        query_vector=embedding.tolist(), 
+        limit=k, 
+        with_payload=True
+    )
+    cell_types = [neighbor.payload["cell_type"] for neighbor in neighbors]
+
+    # get majority
+    cell_type = Counter(cell_types).most_common(1)[0][0]
+    adata.obs['cell_type'][i] = cell_type
+```
+
+And just like that, we've annotated our cells with cell-type labels using KNN classification. We can improve this methodology by first clustering the unlabeled cells and then using the cluster centroids to query the vector database. This will reduce the number of queries and improve the speed of the annotation process. Another approach would be to do a secondary consensus vote on each cluster and assign one label per cluster.
+
diff --git a/docs/geniml/tutorials/integrate-with-snapatac2.md b/docs/geniml/tutorials/integrate-with-snapatac2.md
@@ -0,0 +1,93 @@
+# Using our models with SnapATAC2
+## Overview
+
+[SnapATAC2](https://github.com/kaizhang/SnapATAC2) is a flexible, versatile, and scalable single-cell omics analysis framework. It is designed to process and analyze single-cell ATAC-seq data. SnapATAC2 is written in Rust with Python bindings. It seemlessly integrates with `scanpy` and `anndata` objects. Therefore, it is extremely easy to use `geniml` models with SnapATAC2. Here's how you can do it:
+
+## Install tools
+
+Ensure that you have `geniml` and `SnapATAC2` installed. You can install both using `pip`:
+```bash
+pip install geniml snapatac2
+```
+
+## Download some data
+
+To get started, let's download some single-cell ATAC-seq data. We will use the [10x Genomics PBMC 10k dataset](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard). The dataset contains 10,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor.
+
+You can easily grab the fragment files like so:
+```bash
+wget "https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz" -O pbmc_fragments.tsv.gz
+```
+
+## Pre-process with SnapATAC2
+
+Lets start by pre-processing the data with SnapATAC2. We will closely follow the [SnapATAC2 tutorial](https://kzhang.org/SnapATAC2/tutorials/pbmc.html) to get the data into an `anndata` object.
+
+### Import the data
+
+Lets import the data into `snapatac2`:
+```python
+from pathlib import Path
+import snapatac2 as snap
+
+fragment_file = Path("pbmc_fragments.tsv.gz")
+data = snap.pp.import_data(
+    fragment_file,
+    chrom_sizes=snap.genome.hg38,
+    file="pbmc.h5ad",  # Optional
+    sorted_by_barcode=False,
+)
+```
+### Run some basic quality control
+
+Using the `snapatac2` quality control functions, we can quickly assess the quality of the data:
+
+```python
+snap.pl.frag_size_distr(data, interactive=False)
+fig = snap.pl.frag_size_distr(data, show=False)
+fig.update_yaxes(type="log")
+fig.show()
+
+snap.metrics.tsse(data, snap.genome.hg38)
+snap.pl.tsse(data, interactive=False)
+
+snap.pp.filter_cells(data, min_counts=5000, min_tsse=10, max_counts=100000)
+```
+
+Next, we can add a tile matrix to the data, select features, and run `scrublet` which is a doublet detection algorithm:
+```python
+snap.pp.add_tile_matrix(data)
+snap.pp.select_features(data, n_features=250000)
+snap.pp.scrublet(data)
+
+# actually filter the cells
+snap.pp.filter_doublets(data)
+```
+
+With this, we have a clean `anndata` object that we can use with `geniml`.
+
+### Analyze with geniml
+
+We will use a Region2Vec model to cluster the cells by generating embeddings. This PBMC data comes from peripheral blood mononuclear cells (PBMCs) from a healthy donor. As such. we will use the `databio/r2v-luecken2021-hg38-v2` model to generate embeddings because it contains embeddings for the Luecken2021 dataset, a first-of-its-kind multimodal benchmark dataset of 120,000 single cells from the human bone marrow of 10 diverse donors measured with two commercially-available multi-modal technologies: nuclear GEX with joint ATAC, and cellular GEX with joint ADT profiles.
+
+```python
+import numpy as np
+import scanpy as sc
+from geniml.scembed import ScEmbed
+
+adata = sc.read_h5ad("pbmc.h5ad")
+model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
+
+adata.obsm['scembed_X'] = np.array(model.encode(adata))
+```
+
+With the embeddings, we can run a usual workflow like UMAP, clustering, and visualization:
+```python
+sc.pp.neighbors(adata, use_rep="scembed_X")
+sc.tl.umap(adata)
+
+sc.tl.leiden(adata)
+sc.pl.umap(adata, color="leiden")
+```
+
+And that's it! You've now used `geniml` with SnapATAC2. You can use the embeddings to annotate cell types, or perform other analyses. If you want to learn more about this, check out the [cell-type annotation](./cell-type-annotation-with-knn.md) tutorial.
diff --git a/docs/geniml/tutorials/load-qdrant-with-cell-embeddings.md b/docs/geniml/tutorials/load-qdrant-with-cell-embeddings.md
@@ -0,0 +1,82 @@
+# How to load a vector database with cell embeddings
+## Overview
+
+In this tutorial, we will show how to load a vector database with cell embeddings. There are many benefits to storing cell-embeddings in a vector database:
+1. **Speed**: Loading a vector database is much faster than re-encoding cells.
+2. **Reproducibility**: You can share your cell embeddings with others.
+3. **Flexibility**: You can use the same cell embeddings for many different analyses.
+4. **Interoperability**: You can use the same cell embeddings with many different tools.
+
+In a subsequent tutorial, we will show how to use a vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.
+
+## Preqrequisites
+There are two core components to this tutorial: 1) the pre-trained model, and 2) the vector database.
+
+**Pre-trained model:**
+I will be using the `databio/luecken2021` model. It was trained on the [Luecken2021](https://openreview.net/forum?id=gN35BGa1Rt) dataset, a first-of-its-kind multimodal benchmark dataset of 120,000 single cells from the human bone marrow of 10 diverse donors measured with two commercially-available multi-modal technologies: nuclear GEX with joint ATAC, and cellular GEX with joint ADT profiles.
+
+**Vector database:**
+Vector databases are a new and exciting technology that allow you to store and query high-dimensional vectors very quickly. This tutorial will use the `qdrant` vector database. As a lab, we really like `qdrant` because it is fast, easy to use, and has a great API. You can learn more about `qdrant` [here](https://qdrant.com/). For `qdrant` setup, please refer to the [qdrant documentation](https://qdrant.com/docs/). In the end, you should have a running `qdrant` instance at `http://localhost:6333`.
+
+## Data preparation
+
+Grab a fresh copy of the Luecken2021 data from the [geo accession](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122). We want the `multiome` data. This dataset contains the binary accessibility matrix, the peaks, and the barcodes. It also conveniently contains the cell-type labels. Pre-trained models also requires that the data be in a `scanpy.AnnData` format and the `.var` attribute contain `chr`, `start`, and `end` values.
+
+```python
+import scanpy as sc
+
+adata = sc.read_h5ad("path/to/adata.h5ad")
+adata = adata[:, adata.var['feature_types'] == 'ATAC']
+```
+
+## Getting embeddings
+We can easily get embeddings of the dataset using the pre-trained model:
+
+```python
+import scanpy as sc
+
+from geniml.scembed import ScEmbed
+
+adata = sc.read_h5ad("path/to/adata.h5ad")
+
+model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
+embeddings = model.encode(adata)
+
+adata.obsm['scembed_X'] = np.array(embeddings)
+```
+
+## Loading the vector database
+With the embeddings, we can now upsert them to `qdrant`. Ensure you have `qdrant_client` installed:
+
+```bash
+pip install qdrant-client
+```
+
+```python
+from qdrant_client import QdrantClient
+from qdrant_client.http.models import Distance, VectorParams, PointStruct
+
+client = QdrantClient("localhost", port=6333)
+
+client.create_collection(
+    collection_name="luecken2021",
+    vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.DOT),
+)
+
+embeddings, cell_types = adata.obsm['scembed_X'], adata.obs['cell_type']
+
+points = []
+for embedding, cell_type, i in zip(embeddings, cell_types, range(len(embeddings)):
+    points.append(
+        PointStruct(
+            id=adata.obs.index[i],
+            vector=embedding.tolist(),
+            payload={"cell_type": cell_type}
+
+    ))
+
+
+client.upsert(collection_name="luecken2021", points=points, wait=True)
+```
+
+You should now have a vector database with cell embeddings. In the next tutorial, we will show how to use this vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.
diff --git a/docs/geniml/tutorials/pre-tokenization.md b/docs/geniml/tutorials/pre-tokenization.md
@@ -0,0 +1,44 @@
+# Pre-tokening training data
+## Overview
+
+Before we train a model, we must do what is called *pre-tokenization*. Pre-tokenziation is the process of converting raw genomic region data into lists of tokens and saved into a special file format we call `.gtok` (genomic token) files. These are binary files that contain the tokenized data in the form of integers. The `.gtok` files are used directly to train the model. There are several benefits to this, including:
+1. **Speed**: The tokenization process can be slow, especially when you have a lot of data. By pre-tokenizing the data, you only have to do this once. Then, you can use the `.gtok` files to train the model as many times as you want.
+2. **Memory**: The `.gtok` files are much smaller than the original data. This means that you can store more data in memory and train larger models. Moreover, this enables streaming the data from disk, which is useful when you have a lot of data.
+3. **Reproducibility**: By saving the tokenized data, you can ensure that the same data is used to train the model every time. This is important for reproducibility.
+
+## How to pretokenize data
+Pretokenizing data is easy. You can use the built-in tokenizers and utilities in `geniml` to do this. Here is an example of how to pretokenize a bed file:
+
+```python
+from genimtools.utils import write_tokens_to_gtok
+from geniml.tokenization import ITTokenizer
+
+# instantiate a tokenizer
+tokenizer = ITTokenizer("path/to/universe.bed")
+
+# get tokens
+tokens = tokenizer.tokenize("path/to/bedfile.bed")
+write_tokens_to_gtok(tokens.ids, "path/to/bedfile.gtok")
+```
+
+Thats it! Now you can use the `.gtok` file to train a model.
+
+## How to use the `.gtok` files
+
+To facilitate working with `.gtok` files, we have some helper-classes that can be used to train a model directly from `.gtok` files. For example, you can use teh `Region2VecDataset` class to load the `.gtok` files and train a model. See the [training documentation](./train-region2vec.md) for more information.
+
+```python
+from geniml.region2vec.utils import Region2VecDataset
+
+tokens_dir = "path/to/tokens"
+dataset = Region2VecDataset(tokens_dir)
+
+for tokens in dataset:
+    # train the model
+    print(tokens) # [42, 101, 99, ...]
+```
+
+## Caveats and considerations
+When pretokenizing data, you should consider the following:
+1. Tokens are specific to the universe that the tokenizer was trained on. If you use a different universe, you will get different tokens. This means that you should use the same universe to pretokenize the data as you will use to train the model.
+2. The `.gtok` files are binary files. This means that they are not human-readable. You should keep the original bed files as well as the `.gtok` files. This is important for reproducibility and for debugging.