Merge pull request #45 from lightonai/small_fixes

Small fixes
lightonai · Aug 29, 2024 · dfcf708
2 parents b52880c + 62a5b5b
commit dfcf708

Showing 14 changed files with 38 additions and 9 deletions.
diff --git a/README.md b/README.md
@@ -413,4 +413,15 @@ make ruff
 
 ```bash
 make livedoc
+```
+
+## Citation
+If you use PyLate for research, you can refer to the library with this BibTeX:
+```bibtex
+@misc{PyLate,
+  title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
+  author={Chaffin, Antoine and Sourty, Raphaël},
+  url={https://github.com/lightonai/pylate},
+  year={2024}
+}
 ```
diff --git a/docs/.pages b/docs/.pages
@@ -1,4 +1,4 @@
 nav:
     - documentation
-    - benchmarks
+    - models
     - api
diff --git a/docs/api/losses/Contrastive.md b/docs/api/losses/Contrastive.md
@@ -10,7 +10,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th
 
     ColBERT model.
 
-- **score_metric** – defaults to `<function colbert_scores at 0x17af08fe0>`
+- **score_metric** – defaults to `<function colbert_scores at 0x7f7533b9dbc0>`
 
     ColBERT scoring function. Defaults to colbert_scores.
 

diff --git a/docs/api/losses/Distillation.md b/docs/api/losses/Distillation.md
@@ -10,7 +10,7 @@ Distillation loss for ColBERT model. The loss is computed with respect to the fo
 
     SentenceTransformer model.
 
-- **score_metric** (*Callable*) – defaults to `<function colbert_kd_scores at 0x17af70360>`
+- **score_metric** (*Callable*) – defaults to `<function colbert_kd_scores at 0x7f7533b9f380>`
 
     Function that returns a score between two sequences of embeddings.
 

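A note on the two `score_metric` changes above: only the memory address in the rendered default differs. That address comes from Python's built-in `repr` of a function, so the generated text changes on every documentation build. Below is a minimal sketch of how a docs generator could render a stable name instead. `stable_default_repr` is a hypothetical helper, not part of PyLate or its docs tooling, and the `colbert_scores` defined here is only a placeholder for the real function.

```python
def colbert_scores():
    """Placeholder standing in for the real scoring function (assumption)."""


def stable_default_repr(value: object) -> str:
    """Hypothetical helper: render a default value without a memory address."""
    name = getattr(value, "__qualname__", None)
    if callable(value) and name is not None:
        # Functions and classes: use the dotted import path.
        return f"{value.__module__}.{name}"
    # Everything else (numbers, strings, ...) keeps its normal repr.
    return repr(value)


print(repr(colbert_scores))                 # address differs between runs
print(stable_default_repr(colbert_scores))  # same output on every run
```

With a helper like this, regenerating the API pages would not produce a diff unless the default itself changed.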
diff --git a/docs/documentation/datasets.md b/docs/documentation/datasets.md
@@ -136,7 +136,7 @@ Example entry:
 }
 ```
 ### Loading a pre-built knowledge distillation dataset
-You can directly download an existing knowledge distillation dataset from Hugging Face's hub, such as the [MS MARCO dataset with BGE M3 scores](https://huggingface.co/datasets/lightonai/ms-marco-en-bge).
+You can directly download an existing knowledge distillation dataset from Hugging Face's hub, such as the English [MS MARCO dataset with BGE M3 scores](https://huggingface.co/datasets/lightonai/ms-marco-en-bge) or the [French version](https://huggingface.co/datasets/lightonai/ms-marco-fr-bge).
 Simply load the different files by giving the respective names to the ```load_dataset``` function:
 
 ```python

diff --git a/docs/documentation/training.md b/docs/documentation/training.md
@@ -83,6 +83,10 @@ trainer.train()
 
 ```
 
+Please note that for multi-GPU training, running ``python training.py`` **will use Data Parallel (DP) by default**. We strongly suggest using Distributed Data Parallel (DDP) instead, via accelerate or torchrun: ``accelerate launch --num_processes num_gpu training.py``.
+
+Refer to this [documentation](https://sbert.net/docs/sentence_transformer/training/distributed.html) for more information.
+
 ## Knowledge Distillation Training
 
 Training late-interaction models has been shown to benefit from knowledge distillation compared to simpler contrastive learning.
@@ -162,6 +166,9 @@ trainer = SentenceTransformerTrainer(
 trainer.train()
 
 ```
+
+Once again, use [DDP](https://sbert.net/docs/sentence_transformer/training/distributed.html) if you want the best performance when training using multiple GPUs.
+
 ## ColBERT parameters
 All the parameters of the ColBERT modeling can be found [here](https://lightonai.github.io/pylate/api/models/ColBERT/#parameters). Important parameters to consider are:
 

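To make the DP-versus-DDP note added to the training documentation above concrete, a script can check how it was launched. This is a minimal sketch that assumes the launcher exports the `WORLD_SIZE` environment variable, as `torchrun` does; the same is assumed here for `accelerate launch`. `launched_distributed` is a hypothetical helper, not part of PyLate.

```python
import os
from collections.abc import Mapping


def launched_distributed(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: True when a launcher started more than one process."""
    # torchrun exports WORLD_SIZE for every worker; a plain
    # `python training.py` run leaves it unset.
    return int(env.get("WORLD_SIZE", "1")) > 1


if not launched_distributed():
    print("Single process: with several visible GPUs this run would fall back to DP.")
```

Placing such a check at the top of `training.py` makes an accidental DP run visible before training starts.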
diff --git a/docs/index.md b/docs/index.md
@@ -413,4 +413,15 @@ make ruff
 
 ```bash
 make livedoc
+```
+
+## Citation
+If you use PyLate for research, you can refer to the library with this BibTeX:
+```bibtex
+@misc{PyLate,
+  title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
+  author={Chaffin, Antoine and Sourty, Raphaël},
+  url={https://github.com/lightonai/pylate},
+  year={2024}
+}
 ```
diff --git a/docs/benchmarks/.pages → docs/models/.pages b/docs/benchmarks/.pages → docs/models/.pages
diff --git a/docs/benchmarks/models.md → docs/models/models.md b/docs/benchmarks/models.md → docs/models/models.md
@@ -7,8 +7,9 @@ Here is a list of the pre-trained ColBERT models available in PyLate along with
 
 | Model                                 | BEIR AVG | NFCorpus | SciFact | SCIDOCS | FiQA2018 | TRECCOVID | HotpotQA | Touche2020 | ArguAna | ClimateFEVER | FEVER | QuoraRetrieval | NQ   | DBPedia |
 |---------------------------------------|----------|----------|---------|---------|----------|-----------|----------|------------|---------|--------------|-------|----------------|------|---------|
-| answerdotai/answerai-colbert-small-v1 | 53.79    | 37.3     | 74.77   | 18.42   | 41.15    | 84.59     | 76.11    | 25.69      | 50.09   | 33.07        | 90.96 | 87.72          | 59.1 | 45.58   |
-| lightonai/colbertv2.0                 | 50.02    | 33.8     | 69.3    | 15.4    | 35.6     | 73.3      | 66.7     | 26.3       | 46.3    | 17.6         | 78.5  | 85.2           | 56.2 | 44.6    |
+| [lightonai/colbertv2.0](https://huggingface.co/lightonai/colbertv2.0)                 | 50.02    | 33.8     | 69.3    | 15.4    | 35.6     | 73.3      | 66.7     | 26.3       | 46.3    | 17.6         | 78.5  | 85.2           | 56.2 | 44.6    |
+| [answerdotai/answerai-colbert-small-v1](https://huggingface.co/answerdotai/answerai-colbert-small-v1) | 53.79    | 37.3     | 74.77   | 18.42   | 41.15    | 84.59     | 76.11    | 25.69      | 50.09   | 33.07        | 90.96 | 87.72          | 59.1 | 45.58   |
+
 
 Please note that the `lightonai/colbertv2.0` is simply a translation of the original [ColBERTv2 model](https://huggingface.co/colbert-ir/colbertv2.0/tree/main) to work with PyLate and we thank Omar Khattab for allowing us to share the model on PyLate.
 

diff --git a/evaluation/beir_dataset.py → examples/evaluation/beir_dataset.py b/evaluation/beir_dataset.py → examples/evaluation/beir_dataset.py
diff --git a/evaluation/custom_dataset.py → examples/evaluation/custom_dataset.py b/evaluation/custom_dataset.py → examples/evaluation/custom_dataset.py
diff --git a/train/triplet.py → examples/train/contrastive.py b/train/triplet.py → examples/train/contrastive.py
diff --git a/train/knowledge_distillation.py → examples/train/knowledge_distillation.py b/train/knowledge_distillation.py → examples/train/knowledge_distillation.py
diff --git a/pylate/evaluation/custom_dataset.py b/pylate/evaluation/custom_dataset.py
@@ -1,6 +1,3 @@
-from beir.datasets.data_loader import GenericDataLoader
-
-
 def load_custom_dataset(path: str, split: str = "test") -> tuple[list, list, dict]:
     """Load a custom dataset.
 
@@ -14,6 +11,8 @@ def load_custom_dataset(path: str, split: str = "test") -> tuple[list, list, dic
     Examples
     --------
     """
+    from beir.datasets.data_loader import GenericDataLoader
+
     documents, queries, qrels = GenericDataLoader(path).load(split=split)
 
     documents = [