Skip to content

Commit

Permalink
Merge pull request #57 from databio/update_geniml_docs
Browse files Browse the repository at this point in the history
Copy over fresh docs
  • Loading branch information
nsheff authored Feb 20, 2024
2 parents e09be64 + d0dce8b commit 2d3be3d
Show file tree
Hide file tree
Showing 12 changed files with 2,609 additions and 80 deletions.
61 changes: 61 additions & 0 deletions docs/geniml/datasets.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# Genomic datasets with `geniml`

## Overview
Genomic datasets are well known for their incredible size. Therefore, using these for machine learning pipelines requires clever strategies and considerations to effectively handle the large volumes of data. This is particularly problematic when training new models. To that end, `geniml` has two separate workflows for handling genomic data: one for model training and a second for model inference (using a pre-trained model). These workflows have big differences on *when* and *where* genomic datasets are tokenized and stored.

## Model training
Model training, especially pre-training, usually requires large datasets with billions of genomic tokens. These datasets are way too large to fit in memory and therefore must be streamed from disk during training. Because of this, the data must be tokenized "on the fly" for each epoch. This is wildly inefficient and for particularly large datasets, results in the majority of training time being dedicated to tokenization alone. Therefore, the data need be *pre-tokenized* into an intermediate form which can then be streamed in for each epoch. This removes tokenization entirely from the training procedure and therefore increase efficiency (Fig. 1A).

## Model inference
Model inference is the process of utilizing a pre-trained model to analyze some new, unseen data. While the output of the model might vary (embeddings, label, new data), the input is always the same: genomic tokens. Except under rare circumstances, it's not typical that model inference involves large volumes of data. Therefore, pre-tokenization is *not necessary*. Because of this, data is to be tokenized **in-memory** and directly to the format required to pass through the model (Fig. 1B).

## Tokenization forms
Given the above requirements, tokenizers need to be able to output tokenized genomic data into different forms and locations. For model training: tokenizers should take either bed files or `.h5ad` single-cell datasets and convert them into an intermediary `.gtok` file format. These `.gtok` files will be directly consumed during model training. For model inference: tokenizers should take either bed files or `.h5ad` single-cell datasets and output an in-memory representation of these tokens; typically in the form of a `torch.Tensor` or python list. The following table summarizes the format, location, and scenario in which data is tokenized:

<div align="center">

| | Where | What | When |
| --------------- | --------- | ------------- | ----------------- |
| Model training | On disk | `.gtok` files | Prior to training |
| Model inference | In memory | `torch.Tensors` | On the fly |

</div>

<div align="center">
<img width="700" src="./img/geniml_tokenization_strategy.png" alt="Tokenization strategies for model training and inference" />
</div>

## Datasets in `geniml`
`geniml` uses `pytorch` + `lightning` to train models. This ecosystem encourages the use of `torch`s built-in `Dataset` class to parallelize and batch the loading of data. Because training and fine-tuning models requires pre-tokenized data (`.gtok` files), `geniml` needs datasets to handle this. It most likely will look like:
```python
from typing import List

from torch.data.utils import IterableDataset

class PretokenizedDataset(IterableDataset):
def __init__(self, data: Union[str, List[str]):
self.data = data
if isinstance(data, str):
self._is_folder = True
elif isinstance(data, list) and isinstance(data[0], str):
self._is_folder = False
else:
raise ValueError("`data` must be a path to a folder or a list of `.gtok` files")

def __iter__(self, indx: int):
if self._is_folder:
for file in os.listdir(self.data):
with open(file, 'r') as f:
for line in file.readlines():
yield line.split(",")
else:
for file in self.data:
with open(file, 'r') as f:
for line in file.readlines():
yield line.split(",")

```
Here, we are no longer tokenizing each epoch, rather just streaming in data that has already been pre-tokenized. I still need to think about this in the context of fine-tuning and datasets that require targets and labels.

## `.gtok` file format
The `.gtok` file format is a binary file where each token is stored as a 32-bit integer. This allows tokens to be stored in a very compact format with the ability to represent up to 4 billion unique tokens. Using our companion package `genimtools`, we can convert `.bed` files into `.gtok` files after tokenization.
Binary file added docs/geniml/img/geniml_tokenization_strategy.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2,187 changes: 2,187 additions & 0 deletions docs/geniml/img/geniml_tokenization_strategy.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
17 changes: 17 additions & 0 deletions docs/geniml/manuscripts.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Published manuscripts describing geniml components

If you find `geniml` useful for your research, please cite us! It helps us convince others that our work is useful. Here is a list of published papers that describe different modules or components in the `geniml` package.

---

N. J. LeRoy et al., “Fast clustering and cell-type annotation of scATACdata with pre-trained embeddings,” bioRxiv, 2023, doi: 10.1101/2023.08.01.551452.

J. Rymuza et al., “Methods for constructing and evaluating consensus genomic interval sets,” bioRxiv, 2023, doi: 10.1101/2023.08.03.551899.

E. Gharavi, N. J. LeRoy, G. Zheng, A. Zhang, D. E. Brown, and N. C. Sheffield, “Joint representation learning for retrieval and annotation of genomic interval sets,” bioRxiv, 2023, doi: 10.1101/2023.08.21.554131.

G. Zheng et al., “Methods for evaluating unsupervised vector representations of genomic regions,” bioRxiv, 2023, doi: 10.1101/2023.08.03.551899.

B. Xue, O. Khoroshevskyi, R. A. Gomez, and N. C. Sheffield, “Opportunities and challenges in sharing and reusing genomic interval data,” Frontiers in Genetics, vol. 14, 2023-03, doi: 10.3389/fgene.2023.1155809.

E. Gharavi et al., “Embeddings of genomic region sets capture rich biological associations in low dimensions,” Bioinformatics, 2021-03, doi: 10.1093/bioinformatics/btab439.
58 changes: 58 additions & 0 deletions docs/geniml/tutorials/cell-type-annotation-with-knn.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# How to annotate cell-types with KNN
In the [previous tutorial](./load-qdrant-with-cell-embeddings.md), we loaded a vector database with cell embeddings. In this tutorial, we will show how to use this vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.

If you have **not** completed the previous tutorial, you should ensure you have a vector database with cell embeddings.

## What is K-nearest-neighbors (KNN) classification?
According to [IBM](https://www.ibm.com/topics/knn), K-nearest-neighbors classification is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. Point more simply: KNN is a classification algorithm that uses the distance between an **unlabeled** data point and its **labed** neighbors to classify the new data point.

Assuming we have a vector-space of well-annotated cell embeddings, we can use KNN to classify new cell embeddings based on their proximity to the labeled cell embeddings.

## Querying the vector database
First, we need to generate new cell embeddings for the cells we want to annotate. **Note: it is imperative that the new cell embeddings are generated using the same model as the cell embeddings in the vector database.** The previous tutorial used `databio/r2v-luecken2021-hg38-v2` to generate cell embeddings. We will use the same model to generate new cell embeddings.

```python
import scanpy as sc

from geniml.scembed import ScEmbed

adata = sc.read_h5ad("path/to/adata_unlabeled.h5ad")

model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
```

We can get embeddings of the dataset using the pre-trained model:

```python
embeddings = model.encode(adata)

adata.obsm['scembed_X'] = np.array(embeddings)
```

Now that we have the new cell embeddings, we can query the vector database to find the K-nearest-neighbors of each cell embedding.

```python
from collections import Counter
from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)

# Query the vector database
k = 5 # set to whatever value you want, this is a hyperparameter

for i, embedding in enumerate(embeddings):
neighbors = client.search(
collection_name="luecken2021",
query_vector=embedding.tolist(),
limit=k,
with_payload=True
)
cell_types = [neighbor.payload["cell_type"] for neighbor in neighbors]

# get majority
cell_type = Counter(cell_types).most_common(1)[0][0]
adata.obs['cell_type'][i] = cell_type
```

And just like that, we've annotated our cells with cell-type labels using KNN classification. We can improve this methodology by first clustering the unlabeled cells and then using the cluster centroids to query the vector database. This will reduce the number of queries and improve the speed of the annotation process. Another approach would be to do a secondary consensus vote on each cluster and assign one label per cluster.

93 changes: 93 additions & 0 deletions docs/geniml/tutorials/integrate-with-snapatac2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# Using our models with SnapATAC2
## Overview

[SnapATAC2](https://github.com/kaizhang/SnapATAC2) is a flexible, versatile, and scalable single-cell omics analysis framework. It is designed to process and analyze single-cell ATAC-seq data. SnapATAC2 is written in Rust with Python bindings. It seemlessly integrates with `scanpy` and `anndata` objects. Therefore, it is extremely easy to use `geniml` models with SnapATAC2. Here's how you can do it:

## Install tools

Ensure that you have `geniml` and `SnapATAC2` installed. You can install both using `pip`:
```bash
pip install geniml snapatac2
```

## Download some data

To get started, let's download some single-cell ATAC-seq data. We will use the [10x Genomics PBMC 10k dataset](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard). The dataset contains 10,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor.

You can easily grab the fragment files like so:
```bash
wget "https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz" -O pbmc_fragments.tsv.gz
```

## Pre-process with SnapATAC2

Lets start by pre-processing the data with SnapATAC2. We will closely follow the [SnapATAC2 tutorial](https://kzhang.org/SnapATAC2/tutorials/pbmc.html) to get the data into an `anndata` object.

### Import the data

Lets import the data into `snapatac2`:
```python
from pathlib import Path
import snapatac2 as snap

fragment_file = Path("pbmc_fragments.tsv.gz")
data = snap.pp.import_data(
fragment_file,
chrom_sizes=snap.genome.hg38,
file="pbmc.h5ad", # Optional
sorted_by_barcode=False,
)
```
### Run some basic quality control

Using the `snapatac2` quality control functions, we can quickly assess the quality of the data:

```python
snap.pl.frag_size_distr(data, interactive=False)
fig = snap.pl.frag_size_distr(data, show=False)
fig.update_yaxes(type="log")
fig.show()

snap.metrics.tsse(data, snap.genome.hg38)
snap.pl.tsse(data, interactive=False)

snap.pp.filter_cells(data, min_counts=5000, min_tsse=10, max_counts=100000)
```

Next, we can add a tile matrix to the data, select features, and run `scrublet` which is a doublet detection algorithm:
```python
snap.pp.add_tile_matrix(data)
snap.pp.select_features(data, n_features=250000)
snap.pp.scrublet(data)

# actually filter the cells
snap.pp.filter_doublets(data)
```

With this, we have a clean `anndata` object that we can use with `geniml`.

### Analyze with geniml

We will use a Region2Vec model to cluster the cells by generating embeddings. This PBMC data comes from peripheral blood mononuclear cells (PBMCs) from a healthy donor. As such. we will use the `databio/r2v-luecken2021-hg38-v2` model to generate embeddings because it contains embeddings for the Luecken2021 dataset, a first-of-its-kind multimodal benchmark dataset of 120,000 single cells from the human bone marrow of 10 diverse donors measured with two commercially-available multi-modal technologies: nuclear GEX with joint ATAC, and cellular GEX with joint ADT profiles.

```python
import numpy as np
import scanpy as sc
from geniml.scembed import ScEmbed

adata = sc.read_h5ad("pbmc.h5ad")
model = ScEmbed("databio/r2v-luecken2021-hg38-v2")

adata.obsm['scembed_X'] = np.array(model.encode(adata))
```

With the embeddings, we can run a usual workflow like UMAP, clustering, and visualization:
```python
sc.pp.neighbors(adata, use_rep="scembed_X")
sc.tl.umap(adata)

sc.tl.leiden(adata)
sc.pl.umap(adata, color="leiden")
```

And that's it! You've now used `geniml` with SnapATAC2. You can use the embeddings to annotate cell types, or perform other analyses. If you want to learn more about this, check out the [cell-type annotation](./cell-type-annotation-with-knn.md) tutorial.
82 changes: 82 additions & 0 deletions docs/geniml/tutorials/load-qdrant-with-cell-embeddings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
# How to load a vector database with cell embeddings
## Overview

In this tutorial, we will show how to load a vector database with cell embeddings. There are many benefits to storing cell-embeddings in a vector database:
1. **Speed**: Loading a vector database is much faster than re-encoding cells.
2. **Reproducibility**: You can share your cell embeddings with others.
3. **Flexibility**: You can use the same cell embeddings for many different analyses.
4. **Interoperability**: You can use the same cell embeddings with many different tools.

In a subsequent tutorial, we will show how to use a vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.

## Preqrequisites
There are two core components to this tutorial: 1) the pre-trained model, and 2) the vector database.

**Pre-trained model:**
I will be using the `databio/luecken2021` model. It was trained on the [Luecken2021](https://openreview.net/forum?id=gN35BGa1Rt) dataset, a first-of-its-kind multimodal benchmark dataset of 120,000 single cells from the human bone marrow of 10 diverse donors measured with two commercially-available multi-modal technologies: nuclear GEX with joint ATAC, and cellular GEX with joint ADT profiles.

**Vector database:**
Vector databases are a new and exciting technology that allow you to store and query high-dimensional vectors very quickly. This tutorial will use the `qdrant` vector database. As a lab, we really like `qdrant` because it is fast, easy to use, and has a great API. You can learn more about `qdrant` [here](https://qdrant.com/). For `qdrant` setup, please refer to the [qdrant documentation](https://qdrant.com/docs/). In the end, you should have a running `qdrant` instance at `http://localhost:6333`.

## Data preparation

Grab a fresh copy of the Luecken2021 data from the [geo accession](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122). We want the `multiome` data. This dataset contains the binary accessibility matrix, the peaks, and the barcodes. It also conveniently contains the cell-type labels. Pre-trained models also requires that the data be in a `scanpy.AnnData` format and the `.var` attribute contain `chr`, `start`, and `end` values.

```python
import scanpy as sc

adata = sc.read_h5ad("path/to/adata.h5ad")
adata = adata[:, adata.var['feature_types'] == 'ATAC']
```

## Getting embeddings
We can easily get embeddings of the dataset using the pre-trained model:

```python
import scanpy as sc

from geniml.scembed import ScEmbed

adata = sc.read_h5ad("path/to/adata.h5ad")

model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
embeddings = model.encode(adata)

adata.obsm['scembed_X'] = np.array(embeddings)
```

## Loading the vector database
With the embeddings, we can now upsert them to `qdrant`. Ensure you have `qdrant_client` installed:

```bash
pip install qdrant-client
```

```python
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct

client = QdrantClient("localhost", port=6333)

client.create_collection(
collection_name="luecken2021",
vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.DOT),
)

embeddings, cell_types = adata.obsm['scembed_X'], adata.obs['cell_type']

points = []
for embedding, cell_type, i in zip(embeddings, cell_types, range(len(embeddings)):
points.append(
PointStruct(
id=adata.obs.index[i],
vector=embedding.tolist(),
payload={"cell_type": cell_type}

))


client.upsert(collection_name="luecken2021", points=points, wait=True)
```

You should now have a vector database with cell embeddings. In the next tutorial, we will show how to use this vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.
44 changes: 44 additions & 0 deletions docs/geniml/tutorials/pre-tokenization.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# Pre-tokening training data
## Overview

Before we train a model, we must do what is called *pre-tokenization*. Pre-tokenziation is the process of converting raw genomic region data into lists of tokens and saved into a special file format we call `.gtok` (genomic token) files. These are binary files that contain the tokenized data in the form of integers. The `.gtok` files are used directly to train the model. There are several benefits to this, including:
1. **Speed**: The tokenization process can be slow, especially when you have a lot of data. By pre-tokenizing the data, you only have to do this once. Then, you can use the `.gtok` files to train the model as many times as you want.
2. **Memory**: The `.gtok` files are much smaller than the original data. This means that you can store more data in memory and train larger models. Moreover, this enables streaming the data from disk, which is useful when you have a lot of data.
3. **Reproducibility**: By saving the tokenized data, you can ensure that the same data is used to train the model every time. This is important for reproducibility.

## How to pretokenize data
Pretokenizing data is easy. You can use the built-in tokenizers and utilities in `geniml` to do this. Here is an example of how to pretokenize a bed file:

```python
from genimtools.utils import write_tokens_to_gtok
from geniml.tokenization import ITTokenizer

# instantiate a tokenizer
tokenizer = ITTokenizer("path/to/universe.bed")

# get tokens
tokens = tokenizer.tokenize("path/to/bedfile.bed")
write_tokens_to_gtok(tokens.ids, "path/to/bedfile.gtok")
```

Thats it! Now you can use the `.gtok` file to train a model.

## How to use the `.gtok` files

To facilitate working with `.gtok` files, we have some helper-classes that can be used to train a model directly from `.gtok` files. For example, you can use teh `Region2VecDataset` class to load the `.gtok` files and train a model. See the [training documentation](./train-region2vec.md) for more information.

```python
from geniml.region2vec.utils import Region2VecDataset

tokens_dir = "path/to/tokens"
dataset = Region2VecDataset(tokens_dir)

for tokens in dataset:
# train the model
print(tokens) # [42, 101, 99, ...]
```

## Caveats and considerations
When pretokenizing data, you should consider the following:
1. Tokens are specific to the universe that the tokenizer was trained on. If you use a different universe, you will get different tokens. This means that you should use the same universe to pretokenize the data as you will use to train the model.
2. The `.gtok` files are binary files. This means that they are not human-readable. You should keep the original bed files as well as the `.gtok` files. This is important for reproducibility and for debugging.
Loading

0 comments on commit 2d3be3d

Please sign in to comment.