Commit

update-documentation
raphaelsty committed Aug 29, 2024
1 parent dfcf708 commit 9351bc0
Showing 11 changed files with 144 additions and 83 deletions.
8 changes: 6 additions & 2 deletions README.md
@@ -12,7 +12,9 @@
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-blue.svg?style=flat-square" alt="license"></a>
</div>

PyLate is a library built on top of Sentence Transformers, designed to simplify and optimize fine-tuning, inference, and retrieval with state-of-the-art ColBERT models. It enables easy fine-tuning on both single and multiple GPUs, providing flexibility for various hardware setups. PyLate also streamlines document retrieval and allows you to load a wide range of models, enabling you to construct ColBERT models from most pre-trained language models.
<p align="justify">
PyLate is a library built on top of Sentence Transformers, designed to simplify and optimize fine-tuning, inference, and retrieval with state-of-the-art ColBERT models. It enables easy fine-tuning on both single and multiple GPUs, providing flexibility for various hardware setups. PyLate also streamlines document retrieval and allows you to load a wide range of models, enabling you to construct ColBERT models from most pre-trained language models.
</p>

## Installation

@@ -416,7 +418,9 @@ make livedoc
```

## Citation
If you use PyLate for research, you can refer to the library with this BibTeX:

You can refer to the library with this BibTeX:

```bibtex
@misc{PyLate,
title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
4 changes: 2 additions & 2 deletions docs/api/losses/Contrastive.md
@@ -10,7 +10,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th

ColBERT model.

- **score_metric** – defaults to `<function colbert_scores at 0x7f7533b9dbc0>`
- **score_metric** – defaults to `<function colbert_scores at 0x14073dcf0>`

ColBERT scoring function. Defaults to colbert_scores.

@@ -228,7 +228,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th

Copy parameters and buffers from :attr:`state_dict` into this module and its descendants.

If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.
If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.

**Parameters**

4 changes: 2 additions & 2 deletions docs/api/losses/Distillation.md
@@ -10,7 +10,7 @@ Distillation loss for ColBERT model. The loss is computed with respect to the fo

SentenceTransformer model.

- **score_metric** (*Callable*) – defaults to `<function colbert_kd_scores at 0x7f7533b9f380>`
- **score_metric** (*Callable*) – defaults to `<function colbert_kd_scores at 0x16ec65120>`

Function that returns a score between two sequences of embeddings.

@@ -232,7 +232,7 @@ Distillation loss for ColBERT model. The loss is computed with respect to the fo

Copy parameters and buffers from :attr:`state_dict` into this module and its descendants.

If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.
If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.

**Parameters**

4 changes: 2 additions & 2 deletions docs/api/models/ColBERT.md
@@ -494,7 +494,7 @@ Loads or creates a ColBERT model that can be used to map sentences / text to mul

Copy parameters and buffers from :attr:`state_dict` into this module and its descendants.

If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.
If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.

**Parameters**

@@ -602,7 +602,7 @@ Loads or creates a ColBERT model that can be used to map sentences / text to mul

**Returns**

*list[torch.Tensor]*: A list of pooled embeddings for each document.
*list*: A list of pooled embeddings for each document.

???- note "pop"

2 changes: 1 addition & 1 deletion docs/api/models/Dense.md
@@ -244,7 +244,7 @@ Performs linear projection on the token embeddings to a lower dimension.

Copy parameters and buffers from :attr:`state_dict` into this module and its descendants.

If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.
If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.

**Parameters**

30 changes: 20 additions & 10 deletions docs/documentation/datasets.md
@@ -1,9 +1,8 @@
PyLate is designed to be compatible with Hugging Face datasets, facilitating seamless integration for tasks such as knowledge distillation and contrastive model training. Below are examples illustrating how to load and prepare datasets for these specific training objectives.
PyLate is designed to be compatible with Hugging Face datasets, enabling seamless integration for tasks like knowledge distillation and contrastive model training. Below are examples of how to load and prepare datasets for these specific training objectives.

## Contrastive Dataset

## Constrastive dataset

Contrastive training involves datasets that contain a query, a positive document (relevant to the query), and a negative document (irrelevant to the query). This the standard triplet format used by Sentence Transformers and so PyLate contrastive training is **compatible with every existing triplet datasets.**
Contrastive training requires datasets that include a query, a positive document (relevant to the query), and a negative document (irrelevant to the query). This is the standard triplet format used by Sentence Transformers, making PyLate's contrastive training **compatible with all existing triplet datasets**.
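
To make the triplet layout concrete, here is a minimal sketch in plain Python. The column names `query`, `positive`, and `negative`, the sample texts, and the helper `is_valid_triplet` are illustrative assumptions for this sketch, not taken from a specific dataset or from PyLate itself; check the column names of the dataset you actually load.

```python
# Illustrative triplet rows. Column names and texts are assumptions for this sketch.
TRIPLET_COLUMNS = ("query", "positive", "negative")

triplets = [
    {
        "query": "what is late interaction retrieval?",
        "positive": "Late interaction models compare query and document token embeddings.",
        "negative": "The Eiffel Tower is located in Paris.",
    },
]


def is_valid_triplet(row: dict) -> bool:
    """Return True if the row has a non-empty string in every triplet column."""
    return all(
        isinstance(row.get(column), str) and bool(row[column].strip())
        for column in TRIPLET_COLUMNS
    )


# Keep only well-formed rows before handing them to a training pipeline.
clean_triplets = [row for row in triplets if is_valid_triplet(row)]
```

A check like this is cheap to run once over a custom dataset and catches rows with a missing or empty negative before training starts.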

### Loading a pre-built contrastive dataset

@@ -79,13 +78,18 @@ train_dataset, test_dataset = dataset.train_test_split(test_size=0.001)

## Knowledge distillation dataset

For fine-tuning a model using knowledge distillation loss, three distinct dataset files are required: train, queries, and documents. Each file contains unique and complementary information necessary for the distillation process:
For fine-tuning a model using knowledge distillation loss, three distinct dataset files are required: train, queries, and documents.

???+ info
Each file contains unique and complementary information necessary for the distillation process:


- `train`: Contains three columns: `['query_id', 'document_ids', 'scores']`
- `query_id` refers to the query identifier.
- `document_ids` is a list of document IDs relevant to the query.
- `scores` corresponds to the relevance scores between the query and each document.

- `train`: Contains three columns: `['query_id', 'document_ids', 'scores']`
- `query_id` refers to the query identifier.
- `document_ids` is a list of document IDs relevant to the query.
- `scores` corresponds to the relevance scores between the query and each document.
### Train

Example entry:

@@ -115,7 +119,10 @@ Example entry:
}
```

Note: Ensure that the length of `document_ids` matches the length of `scores`.
???+ warning
Ensure that the length of `document_ids` matches the length of `scores`.
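
The length requirement above can be checked with a few lines of plain Python. This is a minimal sketch: the helper name `check_train_row` and the sample IDs and scores are hypothetical, and only the three column names come from the documentation.

```python
def check_train_row(row: dict) -> None:
    """Raise ValueError if `document_ids` and `scores` differ in length."""
    n_docs = len(row["document_ids"])
    n_scores = len(row["scores"])
    if n_docs != n_scores:
        raise ValueError(
            f"query_id={row['query_id']}: "
            f"{n_docs} document_ids but {n_scores} scores"
        )


# Made-up values, for illustration only.
good_row = {
    "query_id": 749480,
    "document_ids": [11, 22, 33],
    "scores": [0.9, 0.4, 0.1],
}
check_train_row(good_row)  # passes silently because both lists have 3 items
```

Running the check over every row of a custom `train` file before training surfaces misaligned entries early, with the offending `query_id` in the error message.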

### Queries

- `queries`: Contains two columns: `['query_id', 'text']`

@@ -125,6 +132,8 @@ Example entry:
{"query_id": 749480, "text": "example query 1"}
```

### Documents

- `documents`: contains two columns: `['document_ids', 'text']`

Example entry:
@@ -136,6 +145,7 @@ Example entry:
}
```
### Loading a pre-built knowledge distillation dataset

You can directly download an existing knowledge distillation dataset from Hugging Face's hub, such as the English [MS MARCO dataset with BGE M3 scores](https://huggingface.co/datasets/lightonai/ms-marco-en-bge) or the [French version](https://huggingface.co/datasets/lightonai/ms-marco-fr-bge).
Simply load the different files by giving the respective names to the ```load_dataset``` function:
