Merge pull request #45 from lightonai/small_fixes
Small fixes
NohTow authored Aug 29, 2024
2 parents b52880c + 62a5b5b commit dfcf708
Showing 14 changed files with 38 additions and 9 deletions.
11 changes: 11 additions & 0 deletions README.md
@@ -413,4 +413,15 @@ make ruff

```bash
make livedoc
```

## Citation
If you use PyLate for research, you can refer to the library with this BibTeX:
```bibtex
@misc{PyLate,
title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
author={Chaffin, Antoine and Sourty, Raphaël},
url={https://github.com/lightonai/pylate},
year={2024}
}
```
2 changes: 1 addition & 1 deletion docs/.pages
@@ -1,4 +1,4 @@
nav:
- documentation
- benchmarks
- models
- api
2 changes: 1 addition & 1 deletion docs/api/losses/Contrastive.md
@@ -10,7 +10,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th

ColBERT model.

- **score_metric** – defaults to `<function colbert_scores at 0x17af08fe0>`
- **score_metric** – defaults to `<function colbert_scores at 0x7f7533b9dbc0>`

ColBERT scoring function. Defaults to colbert_scores.

2 changes: 1 addition & 1 deletion docs/api/losses/Distillation.md
@@ -10,7 +10,7 @@ Distillation loss for ColBERT model. The loss is computed with respect to the fo

SentenceTransformer model.

- **score_metric** (*Callable*) – defaults to `<function colbert_kd_scores at 0x17af70360>`
- **score_metric** (*Callable*) – defaults to `<function colbert_kd_scores at 0x7f7533b9f380>`

Function that returns a score between two sequences of embeddings.

2 changes: 1 addition & 1 deletion docs/documentation/datasets.md
@@ -136,7 +136,7 @@ Example entry:
}
```
### Loading a pre-built knowledge distillation dataset
You can directly download an existing knowledge distillation dataset from Hugging Face's hub, such as the [MS MARCO dataset with BGE M3 scores](https://huggingface.co/datasets/lightonai/ms-marco-en-bge).
You can directly download an existing knowledge distillation dataset from Hugging Face's hub, such as the English [MS MARCO dataset with BGE M3 scores](https://huggingface.co/datasets/lightonai/ms-marco-en-bge) or the [French version](https://huggingface.co/datasets/lightonai/ms-marco-fr-bge).
Simply load the different files by giving the respective names to the ```load_dataset``` function:

```python
7 changes: 7 additions & 0 deletions docs/documentation/training.md
@@ -83,6 +83,10 @@ trainer.train()

```

Please note that for multi-GPU training, running ``python training.py`` **will use Data Parallel (DP) by default**. We strongly suggest using Distributed Data Parallel (DDP) instead, via accelerate or torchrun: ``accelerate launch --num_processes num_gpu training.py``.

Refer to this [documentation](https://sbert.net/docs/sentence_transformer/training/distributed.html) for more information.
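
As a rough illustration of the launch commands mentioned above, the sketch below builds the `accelerate launch` or `torchrun` command line for a given number of GPUs. The helper `build_launch_command` is hypothetical (it is not part of PyLate), `training.py` is the script name used in the text, and the `torchrun` form assumes its standard `--nproc_per_node` option.

```python
import shlex


def build_launch_command(
    script: str, num_gpus: int, launcher: str = "accelerate"
) -> list[str]:
    """Return the argument list that starts one DDP process per GPU.

    Hypothetical helper for illustration only; not part of PyLate.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if launcher == "accelerate":
        # Same form as the command given in the text above.
        return ["accelerate", "launch", "--num_processes", str(num_gpus), script]
    if launcher == "torchrun":
        # Assumption: torchrun's standard per-node process count option.
        return ["torchrun", f"--nproc_per_node={num_gpus}", script]
    raise ValueError(f"unknown launcher: {launcher}")


command = build_launch_command("training.py", num_gpus=4)
print(shlex.join(command))  # accelerate launch --num_processes 4 training.py
```

The resulting list can be passed to `subprocess.run` or simply printed and pasted into a shell; in both cases the training script itself stays unchanged.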

## Knowledge Distillation Training

Training late-interaction models has been shown to benefit from knowledge distillation compared to simpler contrastive learning.
@@ -162,6 +166,9 @@ trainer = SentenceTransformerTrainer(
trainer.train()

```

Once again, use [DDP](https://sbert.net/docs/sentence_transformer/training/distributed.html) for the best performance when training on multiple GPUs.

## ColBERT parameters
All the parameters of the ColBERT model can be found [here](https://lightonai.github.io/pylate/api/models/ColBERT/#parameters). Important parameters to consider are:

11 changes: 11 additions & 0 deletions docs/index.md
@@ -413,4 +413,15 @@ make ruff

```bash
make livedoc
```

## Citation
If you use PyLate for research, you can refer to the library with this BibTeX:
```bibtex
@misc{PyLate,
title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
author={Chaffin, Antoine and Sourty, Raphaël},
url={https://github.com/lightonai/pylate},
year={2024}
}
```
File renamed without changes.
5 changes: 3 additions & 2 deletions docs/benchmarks/models.md → docs/models/models.md
@@ -7,8 +7,9 @@ Here is a list of the pre-trained ColBERT models available in PyLate along with

| Model | BEIR AVG | NFCorpus | SciFact | SCIDOCS | FiQA2018 | TRECCOVID | HotpotQA | Touche2020 | ArguAna | ClimateFEVER | FEVER | QuoraRetrieval | NQ | DBPedia |
|---------------------------------------|----------|----------|---------|---------|----------|-----------|----------|------------|---------|--------------|-------|----------------|------|---------|
| answerdotai/answerai-colbert-small-v1 | 53.79 | 37.3 | 74.77 | 18.42 | 41.15 | 84.59 | 76.11 | 25.69 | 50.09 | 33.07 | 90.96 | 87.72 | 59.1 | 45.58 |
| lightonai/colbertv2.0 | 50.02 | 33.8 | 69.3 | 15.4 | 35.6 | 73.3 | 66.7 | 26.3 | 46.3 | 17.6 | 78.5 | 85.2 | 56.2 | 44.6 |
| [lightonai/colbertv2.0](https://huggingface.co/lightonai/colbertv2.0) | 50.02 | 33.8 | 69.3 | 15.4 | 35.6 | 73.3 | 66.7 | 26.3 | 46.3 | 17.6 | 78.5 | 85.2 | 56.2 | 44.6 |
| [answerdotai/answerai-colbert-small-v1](https://huggingface.co/answerdotai/answerai-colbert-small-v1) | 53.79 | 37.3 | 74.77 | 18.42 | 41.15 | 84.59 | 76.11 | 25.69 | 50.09 | 33.07 | 90.96 | 87.72 | 59.1 | 45.58 |


Please note that `lightonai/colbertv2.0` is simply a conversion of the original [ColBERTv2 model](https://huggingface.co/colbert-ir/colbertv2.0/tree/main) to work with PyLate. We thank Omar Khattab for allowing us to share the model on PyLate.

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
5 changes: 2 additions & 3 deletions pylate/evaluation/custom_dataset.py
@@ -1,6 +1,3 @@
from beir.datasets.data_loader import GenericDataLoader


def load_custom_dataset(path: str, split: str = "test") -> tuple[list, list, dict]:
    """Load a custom dataset.
@@ -14,6 +11,8 @@ def load_custom_dataset(path: str, split: str = "test") -> tuple[list, list, dic
    Examples
    --------
    """
    from beir.datasets.data_loader import GenericDataLoader

    documents, queries, qrels = GenericDataLoader(path).load(split=split)

    documents = [
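
The change above moves the `beir` import from module level into the function body. A plausible motivation (an assumption; the commit does not state it) is that the package is then only needed when the function is called, not when the module is imported. The sketch below shows this deferred-import pattern in general terms: both functions are hypothetical, the standard-library `json` module stands in for an optional dependency, and `some_missing_package_xyz` is a made-up name that is assumed not to be installed.

```python
def dump_records(records: list[dict]) -> str:
    """Serialize records, resolving the dependency only when called."""
    # Deferred import: `json` stands in for an optional dependency such as
    # beir; nothing is imported until this function actually runs.
    import json

    return json.dumps(records)


def needs_missing_dependency() -> None:
    """Defining this function succeeds even though the package is absent."""
    import some_missing_package_xyz  # hypothetical name, assumed not installed


print(dump_records([{"id": 1}]))  # [{"id": 1}]

try:
    needs_missing_dependency()
    import_failed = False
except ImportError:
    # The failure surfaces only at call time, not at definition time.
    import_failed = True
```

The trade-off is that a missing dependency is reported later, on first use, rather than when the module is loaded.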
