diff --git a/README.md b/README.md
index 1c96cd2..4f71864 100644
--- a/README.md
+++ b/README.md
@@ -351,6 +351,42 @@ Sample Output:
 ]
 ```
 
+## Rerank
+
+If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the `rank.rerank` function and pass it the queries and documents to rerank:
+
+```python
+from pylate import rank
+
+queries = [
+    "query A",
+    "query B",
+]
+documents = [
+    ["document A", "document B"],
+    ["document 1", "document C", "document B"],
+]
+documents_ids = [
+    [1, 2],
+    [1, 3, 2],
+]
+
+queries_embeddings = model.encode(
+    queries,
+    is_query=True,
+)
+documents_embeddings = model.encode(
+    documents,
+    is_query=False,
+)
+
+reranked_documents = rank.rerank(
+    documents_ids=documents_ids,
+    queries_embeddings=queries_embeddings,
+    documents_embeddings=documents_embeddings,
+)
+```
+
 ## Contributing
 
 We welcome contributions! To get started:
diff --git a/docs/api/losses/Contrastive.md b/docs/api/losses/Contrastive.md
index 6ce6e55..ea2922f 100644
--- a/docs/api/losses/Contrastive.md
+++ b/docs/api/losses/Contrastive.md
@@ -10,7 +10,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th
 
     ColBERT model.
 
-- **score_metric** – defaults to ``
+- **score_metric** – defaults to ``
 
     ColBERT scoring function. Defaults to colbert_scores.
 
diff --git a/docs/api/losses/Distillation.md b/docs/api/losses/Distillation.md
index a710d52..a172dc1 100644
--- a/docs/api/losses/Distillation.md
+++ b/docs/api/losses/Distillation.md
@@ -10,7 +10,7 @@ Distillation loss for ColBERT model. The loss is computed with respect to the fo
 
     SentenceTransformer model.
 
-- **score_metric** (*Callable*) – defaults to ``
+- **score_metric** (*Callable*) – defaults to ``
 
     Function that returns a score between two sequences of embeddings.
diff --git a/docs/benchmarks/.pages b/docs/benchmarks/.pages
index 05a507a..f45f484 100644
--- a/docs/benchmarks/.pages
+++ b/docs/benchmarks/.pages
@@ -1,3 +1,3 @@
-title: Benchmarks
+title: Models
 nav:
-  - Benchmarks: benchmarks.md
+  - Models: models.md
diff --git a/docs/benchmarks/benchmarks.md b/docs/benchmarks/benchmarks.md
deleted file mode 100644
index c66d2d3..0000000
--- a/docs/benchmarks/benchmarks.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# ColBERT Benchmarks
-
-=== "Table"
-
-    | Model                                   | Dataset   | Language   | NDCG@10  | NDCG@100 | RECALL@10 | RECALL@100 |
-    |:----------------------------------------|:----------|-----------:|---------:|---------:|----------:|-----------:|
-    | sentence-transformers/all-mpnet-base-v2 | dataset_x | English    | 0.677864 | 0.645041 | 0.453154  | 876.714    |
-    | sentence-transformers/all-mpnet-base-v2 | dataset_y | English    | 0.880581 | 0.858687 | 13.5424   | 10153.7    |
-    | sentence-transformers/all-mpnet-base-v2 | dataset_z | English    | 0.878303 | 0.863555 | 0.873312  | 552.609    |
-    | sentence-transformers/all-mpnet-base-v2 | dataset_a | English    | 0.999443 | 0.404494 | 1.33633   | 6617.5     |
-
diff --git a/docs/benchmarks/models.md b/docs/benchmarks/models.md
new file mode 100644
index 0000000..5a2cb23
--- /dev/null
+++ b/docs/benchmarks/models.md
@@ -0,0 +1,15 @@
+# Available models
+
+Here is a list of the pre-trained ColBERT models available in PyLate, along with their results on BEIR:
+
+=== "Table"
+
+
+    | Model                                 | BEIR AVG | NFCorpus | SciFact | SCIDOCS | FiQA2018 | TRECCOVID | HotpotQA | Touche2020 | ArguAna | ClimateFEVER | FEVER | QuoraRetrieval | NQ   | DBPedia |
+    |---------------------------------------|----------|----------|---------|---------|----------|-----------|----------|------------|---------|--------------|-------|----------------|------|---------|
+    | answerdotai/answerai-colbert-small-v1 | 53.79    | 37.3     | 74.77   | 18.42   | 41.15    | 84.59     | 76.11    | 25.69      | 50.09   | 33.07        | 90.96 | 87.72          | 59.1 | 45.58   |
+    | lightonai/colbertv2.0                 | 50.02    | 33.8     | 69.3    | 15.4    | 35.6     | 73.3      | 66.7     | 26.3       | 46.3    | 17.6         | 78.5  | 85.2           | 56.2 | 44.6    |
+
+Please note that `lightonai/colbertv2.0` is simply a translation of the original [ColBERTv2 model](https://huggingface.co/colbert-ir/colbertv2.0/tree/main) so that it works with PyLate; we thank Omar Khattab for allowing us to share the model on PyLate.
+
+We plan to release various strong models in the near future, but feel free to contact us if you want to make your existing ColBERT models compatible with PyLate!
\ No newline at end of file
diff --git a/docs/documentation/retrieval.md b/docs/documentation/retrieval.md
index 82c3b2f..4f813bc 100644
--- a/docs/documentation/retrieval.md
+++ b/docs/documentation/retrieval.md
@@ -133,10 +133,46 @@ scores = retriever.retrieve(
 )
 ```
 
-## Remove documents from the index
+### Remove documents from the index
 
 To remove documents from the index, use the `remove_documents` method. Provide the document IDs you want to remove from the index:
 
 ```python
 index.remove_documents(["1", "2"])
-```
\ No newline at end of file
+```
+
+## ColBERT reranking
+
+If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the `rank.rerank` function and pass it the queries and documents to rerank:
+
+```python
+from pylate import rank
+
+queries = [
+    "query A",
+    "query B",
+]
+documents = [
+    ["document A", "document B"],
+    ["document 1", "document C", "document B"],
+]
+documents_ids = [
+    [1, 2],
+    [1, 3, 2],
+]
+
+queries_embeddings = model.encode(
+    queries,
+    is_query=True,
+)
+documents_embeddings = model.encode(
+    documents,
+    is_query=False,
+)
+
+reranked_documents = rank.rerank(
+    documents_ids=documents_ids,
+    queries_embeddings=queries_embeddings,
+    documents_embeddings=documents_embeddings,
+)
+```
diff --git a/docs/index.md b/docs/index.md
index 6230386..f8bdb74 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -351,6 +351,42 @@ Sample Output:
 ]
 ```
 
+## Rerank
+
+If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the `rank.rerank` function and pass it the queries and documents to rerank:
+
+```python
+from pylate import rank
+
+queries = [
+    "query A",
+    "query B",
+]
+documents = [
+    ["document A", "document B"],
+    ["document 1", "document C", "document B"],
+]
+documents_ids = [
+    [1, 2],
+    [1, 3, 2],
+]
+
+queries_embeddings = model.encode(
+    queries,
+    is_query=True,
+)
+documents_embeddings = model.encode(
+    documents,
+    is_query=False,
+)
+
+reranked_documents = rank.rerank(
+    documents_ids=documents_ids,
+    queries_embeddings=queries_embeddings,
+    documents_embeddings=documents_embeddings,
+)
+```
+
 ## Contributing
 
 We welcome contributions! To get started:
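
Note: the rerank snippets above reference a `model` defined in earlier sections of each page. For reviewers, here is a minimal, self-contained sketch of the same flow; it assumes PyLate's `models.ColBERT` class and uses the `lightonai/colbertv2.0` checkpoint listed in `docs/benchmarks/models.md`, but any PyLate-compatible ColBERT checkpoint should work in its place:

```python
from pylate import models, rank

# Load a ColBERT model; the checkpoint name here is only an example.
model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")

queries = [
    "query A",
    "query B",
]
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Queries and documents are processed differently (is_query flag),
# so the two sides must be encoded separately.
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# Score each candidate list against its query with late interaction
# and return the documents ordered by relevance; no index is needed.
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked_documents)
```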