From 9351bc043f1d30a21c65f02107da27dfb15a230e Mon Sep 17 00:00:00 2001 From: Raphael Sourty Date: Thu, 29 Aug 2024 13:38:31 +0200 Subject: [PATCH] update-documentation --- README.md | 8 ++- docs/api/losses/Contrastive.md | 4 +- docs/api/losses/Distillation.md | 4 +- docs/api/models/ColBERT.md | 4 +- docs/api/models/Dense.md | 2 +- docs/documentation/datasets.md | 30 ++++++--- docs/documentation/evaluation.md | 10 +-- docs/documentation/retrieval.md | 102 ++++++++++++++++++++----------- docs/documentation/training.md | 49 ++++++++------- docs/index.md | 8 ++- docs/models/models.md | 6 +- 11 files changed, 144 insertions(+), 83 deletions(-) diff --git a/README.md b/README.md index 89e4293..2671153 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,9 @@ license -PyLate is a library built on top of Sentence Transformers, designed to simplify and optimize fine-tuning, inference, and retrieval with state-of-the-art ColBERT models. It enables easy fine-tuning on both single and multiple GPUs, providing flexibility for various hardware setups. PyLate also streamlines document retrieval and allows you to load a wide range of models, enabling you to construct ColBERT models from most pre-trained language models. +

+PyLate is a library built on top of Sentence Transformers, designed to simplify and optimize fine-tuning, inference, and retrieval with state-of-the-art ColBERT models. It enables easy fine-tuning on both single and multiple GPUs, providing flexibility for various hardware setups. PyLate also streamlines document retrieval and allows you to load a wide range of models, enabling you to construct ColBERT models from most pre-trained language models. +

## Installation @@ -416,7 +418,9 @@ make livedoc ``` ## Citation -If you use PyLate for research, you can refer to the library with this BibTeX: + +You can refer to the library with this BibTeX: + ```bibtex @misc{PyLate, title={PyLate: Flexible Training and Retrieval for Late Interaction Models}, diff --git a/docs/api/losses/Contrastive.md b/docs/api/losses/Contrastive.md index 6579cfa..29a2bf0 100644 --- a/docs/api/losses/Contrastive.md +++ b/docs/api/losses/Contrastive.md @@ -10,7 +10,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th ColBERT model. -- **score_metric** – defaults to `` +- **score_metric** – defaults to `` ColBERT scoring function. Defaults to colbert_scores. @@ -228,7 +228,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th Copy parameters and buffers from :attr:`state_dict` into this module and its descendants. - If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``. + If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. 
Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``. **Parameters** diff --git a/docs/api/losses/Distillation.md b/docs/api/losses/Distillation.md index 604f74b..623b40e 100644 --- a/docs/api/losses/Distillation.md +++ b/docs/api/losses/Distillation.md @@ -10,7 +10,7 @@ Distillation loss for ColBERT model. The loss is computed with respect to the fo SentenceTransformer model. -- **score_metric** (*Callable*) – defaults to `` +- **score_metric** (*Callable*) – defaults to `` Function that returns a score between two sequences of embeddings. @@ -232,7 +232,7 @@ Distillation loss for ColBERT model. The loss is computed with respect to the fo Copy parameters and buffers from :attr:`state_dict` into this module and its descendants. - If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``. + If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. 
Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``. **Parameters** diff --git a/docs/api/models/ColBERT.md b/docs/api/models/ColBERT.md index 8fbba43..b2b7a19 100644 --- a/docs/api/models/ColBERT.md +++ b/docs/api/models/ColBERT.md @@ -494,7 +494,7 @@ Loads or creates a ColBERT model that can be used to map sentences / text to mul Copy parameters and buffers from :attr:`state_dict` into this module and its descendants. - If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``. + If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. 
Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``. **Parameters** @@ -602,7 +602,7 @@ Loads or creates a ColBERT model that can be used to map sentences / text to mul **Returns** - *list[torch.Tensor]*: A list of pooled embeddings for each document. + *list*: A list of pooled embeddings for each document. ???- note "pop" diff --git a/docs/api/models/Dense.md b/docs/api/models/Dense.md index 1deae34..b9007dd 100644 --- a/docs/api/models/Dense.md +++ b/docs/api/models/Dense.md @@ -244,7 +244,7 @@ Performs linear projection on the token embeddings to a lower dimension. Copy parameters and buffers from :attr:`state_dict` into this module and its descendants. - If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``. + If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. 
Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``. **Parameters** diff --git a/docs/documentation/datasets.md b/docs/documentation/datasets.md index 09adbdd..9488a4e 100644 --- a/docs/documentation/datasets.md +++ b/docs/documentation/datasets.md @@ -1,9 +1,8 @@ -PyLate is designed to be compatible with Hugging Face datasets, facilitating seamless integration for tasks such as knowledge distillation and contrastive model training. Below are examples illustrating how to load and prepare datasets for these specific training objectives. +PyLate is designed to be compatible with Hugging Face datasets, enabling seamless integration for tasks like knowledge distillation and contrastive model training. Below are examples of how to load and prepare datasets for these specific training objectives. +## Contrastive Dataset -## Constrastive dataset - -Contrastive training involves datasets that contain a query, a positive document (relevant to the query), and a negative document (irrelevant to the query). This the standard triplet format used by Sentence Transformers and so PyLate contrastive training is **compatible with every existing triplet datasets.** +Contrastive training requires datasets that include a query, a positive document (relevant to the query), and a negative document (irrelevant to the query). This is the standard triplet format used by Sentence Transformers, making PyLate's contrastive training **compatible with all existing triplet datasets**. ### Loading a pre-built contrastive dataset @@ -79,13 +78,18 @@ train_dataset, test_dataset = dataset.train_test_split(test_size=0.001) ## Knowledge distillation dataset -For fine-tuning a model using knowledge distillation loss, three distinct dataset files are required: train, queries, and documents. Each file contains unique and complementary information necessary for the distillation process: +For fine-tuning a model using knowledge distillation loss, three distinct dataset files are required: train, queries, and documents. + +???+ info + Each file contains unique and complementary information necessary for the distillation process: + + - `train`: Contains three columns: `['query_id', 'document_ids', 'scores']` + - `query_id` refers to the query identifier. + - `document_ids` is a list of document IDs relevant to the query. + - `scores` corresponds to the relevance scores between the query and each document. -- `train`: Contains three columns: `['query_id', 'document_ids', 'scores']` - - `query_id` refers to the query identifier. - - `document_ids` is a list of document IDs relevant to the query. - - `scores` corresponds to the relevance scores between the query and each document. +### Train Example entry: @@ -115,7 +119,10 @@ Example entry: } ``` -Note: Ensure that the length of `document_ids` matches the length of `scores`. +???+ warning + Ensure that the length of `document_ids` matches the length of `scores`. 
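+
+As a quick illustration (this snippet is not part of the dataset files; the IDs and scores are toy values), a `train` split with this structure can be built and checked with the `datasets` library:
+
+```python
+from datasets import Dataset
+
+# Toy `train` split: one row per query, with parallel lists of
+# document IDs and teacher relevance scores.
+train = Dataset.from_dict(
+    {
+        "query_id": [54528, 749480],
+        "document_ids": [[6862419, 335116], [6862419]],
+        "scores": [[0.4546215673141326, 0.0946923211098431], [0.7151]],
+    }
+)
+
+# Every row must have as many scores as document IDs.
+for row in train:
+    assert len(row["document_ids"]) == len(row["scores"])
+```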
+ +### Queries - `queries`: Contains two columns: `['query_id', 'text']` @@ -125,6 +132,8 @@ Example entry: {"query_id": 749480, "text": "example query 1"} ``` +### Documents + - `documents`: contains two columns: `['document_ids', 'text']` Example entry: @@ -136,6 +145,7 @@ Example entry: } ``` ### Loading a pre-built knowledge distillation dataset + You can directly download an existing knowledge distillation dataset from Hugging Face's hub, such as the English [MS MARCO dataset with BGE M3 scores](https://huggingface.co/datasets/lightonai/ms-marco-en-bge) or the [French version](https://huggingface.co/datasets/lightonai/ms-marco-fr-bge). Simply load the different files by giving the respective names to the ```load_dataset``` function: diff --git a/docs/documentation/evaluation.md b/docs/documentation/evaluation.md index fbf1747..3313a94 100644 --- a/docs/documentation/evaluation.md +++ b/docs/documentation/evaluation.md @@ -94,11 +94,13 @@ The output is a dictionary containing various evaluation metrics. Here’s a sam } ``` -Key Points: -1. is_query flag: Always set is_query=True when encoding queries and is_query=False when encoding documents. This ensures the model applies the correct prefixes for queries and documents. -2. Evaluation metrics: The pipeline supports a wide range of evaluation metrics, including NDCG, hits, MAP, recall, and precision, with different cutoff points. -3. Relevance judgments (qrels): The qrels are used to calculate how well the retrieved documents match the ground truth. +???+ info + 1. is_query flag: Always set is_query=True when encoding queries and is_query=False when encoding documents. This ensures the model applies the correct prefixes for queries and documents. + + 2. Evaluation metrics: The pipeline supports a wide range of evaluation metrics, including NDCG, hits, MAP, recall, and precision, with different cutoff points. + + 3. Relevance judgments (qrels): The qrels are used to calculate how well the retrieved documents match the ground truth. ### BEIR datasets diff --git a/docs/documentation/retrieval.md b/docs/documentation/retrieval.md index 4f813bc..d8d5eaf 100644 --- a/docs/documentation/retrieval.md +++ b/docs/documentation/retrieval.md @@ -40,6 +40,7 @@ index.add_documents( ``` Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it: + ```python # To load an index, simply instantiate it with the correct folder/name and without overriding it index = indexes.Voyager( @@ -47,20 +48,28 @@ index = indexes.Voyager( index_name="index", ) ``` -#### Pooling document embeddings -[In a recent study](https://www.answer.ai/posts/colbert-pooling.html), we showed that similar tokens in document embeddings can be pooled together to reduce the overall cost of ColBERT indexing without without losing much performance. You can use this feature by setting the `pool_factor` parameter when encoding the documents to only keep 1 / `pool_factor` tokens. The results show that using a `pool_factor` of 2 cut the memory requirement of the index in half with virtually 0 performance drop. Higher compression can be achieved at the cost of some performance, please refer to the blogpost for all the details and results. 
-This simple modification to the encoding call thus save a lot of space with a very contained impact on the performances:
+???+ tip

-```python
-documents_embeddings = model.encode(
-    documents,
-    batch_size=32,
-    is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
-    pool_factor=2,
-    show_progress_bar=True,
-)
-```
+
+    #### Pooling document embeddings
+
+    [In this blog post](https://www.answer.ai/posts/colbert-pooling.html), we showed that similar tokens in document embeddings can be pooled together to reduce the overall cost of ColBERT indexing without losing much performance.
+
+    You can use this feature by setting the `pool_factor` parameter when encoding the documents to only keep 1 / `pool_factor` tokens.
+
+    The results show that using a `pool_factor` of 2 cuts the memory requirement of the index in half with virtually no performance drop. Higher compression can be achieved at the cost of some performance; please refer to the blog post for all the details and results.
+
+    This simple modification to the encoding call thus saves a lot of space with a very limited impact on performance:
+
+    ```python
+    documents_embeddings = model.encode(
+        documents,
+        batch_size=32,
+        is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
+        pool_factor=2,
+        show_progress_bar=True,
+    )
+    ```

### Retrieving top-k documents for queries

@@ -106,33 +115,35 @@ Example output
 ```

 ### Parameters affecting the retrieval performance

-The retrieval is not an exact search, which mean that certain parameters can affect the quality of the approximate search. First, because we leverage a HNSW index, the usual parameters can be passed when creating the index:

-- `M`, the maximum number of connections of a node in the graph. Higher values will improve recall and reduce retrieval time but will increase memory usage and the creation time of the index.
-- `ef_construction` the maximum number of neighbors for a node during the creation of the index. Higher values increase the quality of the index but increase the creation time of the index.
-- `ef_search` the maximum number of neighbors for a node during the search. Higher values increase the quality of the search but also the search time.
+The retrieval is not an exact search, which means that certain parameters can affect the quality of the approximate search:

-Please refer to dedicated [HNSW documentation for more details](https://www.pinecone.io/learn/series/faiss/hnsw/).
+- `M`: the maximum number of connections of a node in the graph. Higher values will improve recall and reduce retrieval time but will increase memory usage and the creation time of the index.
+- `ef_construction`: the maximum number of neighbors for a node during the creation of the index. Higher values increase the quality of the index but increase the creation time of the index.
+- `ef_search`: the maximum number of neighbors for a node during the search. Higher values increase the quality of the search but also the search time.

-Another parameter not related to the index strongly influence the quality of the search, `k_token`. It corresponds to the number of neighbors retrieved for each of the query token and so the total number of candidates scored. Higher values will consider more candidates and so get better results but will slow the search.
+Refer to the [HNSW documentation](https://www.pinecone.io/learn/series/faiss/hnsw/) for more details. 
-```python -index = indexes.Voyager( - index_folder="pylate-index", - index_name="index", - override=True, # This overwrites the existing index if any - M=M, - ef_construction=ef_construction, - ef_search=ef_search, -) +???+ info + Another parameter that significantly influences search quality is **k_token**. This parameter determines the **number of neighbors retrieved for each query token**. Higher values of k_token will consider more candidates, leading to better results but at the cost of slower search performance. -scores = retriever.retrieve( - queries_embeddings=queries_embeddings, - k=10, # Retrieve the top 10 matches for each query - k_token=200 # retrieve 200 candidates per query token -) + ```python + index = indexes.Voyager( + index_folder="pylate-index", + index_name="index", + override=True, # This overwrites the existing index if any + M=M, + ef_construction=ef_construction, + ef_search=ef_search, + ) + + scores = retriever.retrieve( + queries_embeddings=queries_embeddings, + k=10, # Retrieve the top 10 matches for each query + k_token=200 # retrieve 200 candidates per query token + ) + ``` -``` ### Remove documents from the index To remove documents from the index, use the `remove_documents` method. Provide the document IDs you want to remove from the index: @@ -146,25 +157,32 @@ index.remove_documents(["1", "2"]) If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use rank function and pass the queries and documents to rerank: ```python -from pylate import rank +from pylate import rank, models queries = [ "query A", "query B", ] + documents = [ ["document A", "document B"], ["document 1", "document C", "document B"], ] + documents_ids = [ [1, 2], [1, 3, 2], ] +model = models.ColBERT( + model_name_or_path="lightonai/colbertv2.0", +) + queries_embeddings = model.encode( queries, is_query=True, ) + documents_embeddings = model.encode( documents, is_query=False, @@ -176,3 +194,19 @@ reranked_documents = rank.rerank( documents_embeddings=documents_embeddings, ) ``` + +Sample output: + +``` +[ + [ + {"id": 1, "score": 13.866096496582031}, + {"id": 2, "score": 7.363473415374756} + ], + [ + {"id": 2, "score": 16.025644302368164}, + {"id": 3, "score": 7.144075870513916}, + {"id": 1, "score": 4.203659534454346}, + ], +] +``` \ No newline at end of file diff --git a/docs/documentation/training.md b/docs/documentation/training.md index 8ac95b3..85954ab 100644 --- a/docs/documentation/training.md +++ b/docs/documentation/training.md @@ -1,18 +1,22 @@ # ColBERT Training -PyLate training is based on Sentence Transformer (and thus transformers) trainer, enabling a lot of functionnality such multi-GPU and FP16/BF16 training as well as logging to Weights & Biases out-of-the-box. This allows efficient, scalable and monitorable training. There are two primary ways to train ColBERT models using PyLate: +PyLate training is based on Sentence Transformer (and thus transformers) trainer, enabling a lot of functionality such multi-GPU and FP16/BF16 training as well as logging to Weights & Biases out-of-the-box. This allows efficient, and scalable training. -1. **Contrastive Loss (Simplest Method)**: The easiest way to train your model is by using contrastive loss, which only requires a dataset containing triplets—each consisting of a query, a positive document (relevant to the query), and a negative document (irrelevant to the query). 
This method trains the model to maximize the similarity between the query and the positive document, while minimizing it with the negative document.
+???+ info
+    There are two primary ways to train ColBERT models using PyLate:

-2. **Knowledge Distillation**: To train a ColBERT model using knowledge distillation, you need to provide a dataset with three components: queries, documents, and the relevance scores between them. This method compresses the knowledge of a larger model / more accurate model (cross-encoder) into a smaller one, using the relevance scores to guide the training process.
+    1. **Contrastive Loss**: The simplest method; it only requires a dataset containing triplets, each consisting of a query, a positive document (relevant to the query), and a negative document (irrelevant to the query). This method trains the model to maximize the similarity between the query and the positive document, while minimizing it with the negative document.
+
+    2. **Knowledge Distillation**: To train a ColBERT model using knowledge distillation, you need to provide a dataset with three components: queries, documents, and the relevance scores between them. This method compresses the knowledge of a larger, more accurate model (such as a cross-encoder) into a smaller one, using the relevance scores to guide the training process.

 ## Contrastive Training

-The original training of ColBERT was done using contrastive learning, that is, train the model to differentiate between relevant (positive) and irrelevant (negative) documents for a given query by maximizing the similarity between a query and a positive document while minimizing the similarity with irrelevant documents.
-The contrastive learning in PyLate is done using triplet dataset, that is, a query is associated to one positive and one negative. It is thus **compatible with any triplet datasets from the sentence-transformers library**.
+ColBERT was originally trained using contrastive learning. This approach involves teaching the model to distinguish between relevant (positive) and irrelevant (negative) documents for a given query. The model is trained to maximize the similarity between a query and its corresponding positive document while minimizing the similarity with irrelevant documents.
+
+PyLate uses contrastive learning with a triplet dataset, where each query is paired with one positive and one negative example. **This makes it fully compatible with any triplet dataset from the sentence-transformers library**.

-During training, the model is tasked to maximize the similarity of the query with its positive while minimizing the similarity with all the negatives as well as the positives of the other queries in the batch (thus also leveraging in-batch negatives).
+During training, the model is optimized to maximize the similarity between the query and its positive example while minimizing the similarity with all negative examples and the positives from other queries in the batch. This approach leverages in-batch negatives for more effective learning.

-Here is a example of code to run contrastive training using PyLate:
+Here is an example of code to run contrastive training with PyLate:

 ```python
 import torch
@@ -83,16 +87,16 @@ trainer.train()

 ```

-Please note that for multi-GPU training, running ``python training.py`` **will use Data Parallel (DP) by default**. We strongly suggest using using Distributed Data Parallelism (DDP) using accelerate or torchrun: ``accelerate launch --num_processes num_gpu training.py``. 
+???+ tip
+    Please note that for multi-GPU training, running ``python training.py`` **will use Data Parallel (DP) by default**. We strongly suggest using Distributed Data Parallelism (DDP) with accelerate or torchrun: ``accelerate launch --num_processes num_gpu training.py``.

-Refer to this [documentation](https://sbert.net/docs/sentence_transformer/training/distributed.html) for more information.
+    Refer to this [documentation](https://sbert.net/docs/sentence_transformer/training/distributed.html) for more information.

 ## Knowledge Distillation Training

-The training of late-interaction models have shown to benefit from knowledge distillation compared to a more simple contrastive learning.
-Knowledge distillation training aim at making ColBERT models learn to reproduce the outputs of a more capable (e.g, a cross-encoder) teacher model. This is done by using a dataset containing queries, documents and the scores attributed by the teacher to the different query/document pairs.
+Training late-interaction models, such as ColBERT, has been shown to benefit from knowledge distillation compared to simpler contrastive learning approaches. Knowledge distillation training focuses on teaching ColBERT models to replicate the outputs of a more capable teacher model (e.g., a cross-encoder). This is achieved using a dataset that includes queries, documents, and the scores assigned by the teacher model to each query/document pair.

-Here is a example of code to run knowledge distillation training using PyLate:
+Below is an example of code to run knowledge distillation training using PyLate:

 ```python
 import torch
@@ -167,23 +171,26 @@ trainer.train()

 ```

-Once again, use [DDP](https://sbert.net/docs/sentence_transformer/training/distributed.html) if you want the best performance when training using multiple GPUs.
+???+ tip
+    Please note that for multi-GPU training, running ``python training.py`` **will use Data Parallel (DP) by default**. We strongly suggest using Distributed Data Parallelism (DDP) with accelerate or torchrun: ``accelerate launch --num_processes num_gpu training.py``.
+
+    Refer to this [documentation](https://sbert.net/docs/sentence_transformer/training/distributed.html) for more information.

 ## ColBERT parameters

 All the parameters of the ColBERT modeling can be found [here](https://lightonai.github.io/pylate/api/models/ColBERT/#parameters). Important parameters to consider are:

-- `model_name_or_path` the name of the base encoder model or PyLate model to init from.
-- `embedding_size` the output size of the projection layer. Large values give more capacity to the model but are heavier to store.
-- `query_prefix` and `document_prefix` represents the strings that will be prepended to query and document respectively.
-- `query_length` and `document_length` set the maximum size of queries and documents. Queries will be padded/truncated to the maximum length while documents are only truncated.
-- `attend_to_expansion_tokens` define whether the model will attend to the query expansion tokens (padding of queries) or if only the expansion tokens will attend to the other tokens. In the original ColBERT, the tokens **do not attend** to expansion tokens.
-- `skiplist_words` is list of words to skip from the documents scoring (note that these tokens are used for encoding and are only skipped during the scoring), the default is the list of string.punctuation as in the original ColBERT.
+???+ info
+    - `model_name_or_path`: the name of the base encoder model or PyLate model to initialize from. 
+    - `embedding_size`: the output size of the projection layer. Large values give more capacity to the model but are heavier to store.
+    - `query_prefix` and `document_prefix`: the strings that will be prepended to queries and documents, respectively.
+    - `query_length` and `document_length`: the maximum size of queries and documents. Queries will be padded/truncated to the maximum length while documents are only truncated.
+    - `attend_to_expansion_tokens`: defines whether the model will attend to the query expansion tokens (padding of queries) or if only the expansion tokens will attend to the other tokens. In the original ColBERT, the tokens **do not attend** to expansion tokens.
+    - `skiplist_words`: the list of words to skip when scoring documents (note that these tokens are still used for encoding and are only skipped during scoring); the default is `string.punctuation`, as in the original ColBERT.

 ## Sentence Transformers Training Arguments

-PyLate is built on top of SentenceTransformer, you can thus use the same arguments you already are familiar with to control the training.
-The table below lists the arguments for the `SentenceTransformerTrainingArguments` class. Feel free to refer to the [SentenceTransformers](https://sbert.net/docs/sentence_transformer/training_overview.html#) library documentation for more information
+PyLate is built on top of SentenceTransformer, so you can use the same arguments you are already familiar with to control the training process. The table below lists the arguments available in the `SentenceTransformerTrainingArguments` class. For more details, please refer to the [SentenceTransformers documentation](https://sbert.net/docs/sentence_transformer/training_overview.html#).

 === "Table"
     | Parameter | Name | Definition | Training Performance | Observing Performance |
diff --git a/docs/index.md b/docs/index.md
index 9d3307e..b15f5d8 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -12,7 +12,9 @@ license

-PyLate is a library built on top of Sentence Transformers, designed to simplify and optimize fine-tuning, inference, and retrieval with state-of-the-art ColBERT models. It enables easy fine-tuning on both single and multiple GPUs, providing flexibility for various hardware setups. PyLate also streamlines document retrieval and allows you to load a wide range of models, enabling you to construct ColBERT models from most pre-trained language models.
+

+PyLate is a library built on top of Sentence Transformers, designed to simplify and optimize fine-tuning, inference, and retrieval with state-of-the-art ColBERT models. It enables easy fine-tuning on both single and multiple GPUs, providing flexibility for various hardware setups. PyLate also streamlines document retrieval and allows you to load a wide range of models, enabling you to construct ColBERT models from most pre-trained language models. +

## Installation

@@ -416,7 +418,9 @@ make livedoc
 ```

 ## Citation
-If you use PyLate for research, you can refer to the library with this BibTeX:
+
+You can refer to the library with this BibTeX:
+
 ```bibtex
 @misc{PyLate,
 title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
diff --git a/docs/models/models.md b/docs/models/models.md
index 0ea8766..102cc6e 100644
--- a/docs/models/models.md
+++ b/docs/models/models.md
@@ -4,13 +4,13 @@ Here is a list of the pre-trained ColBERT models available in PyLate along with

 === "Table"

-    | Model | BEIR AVG | NFCorpus | SciFact | SCIDOCS | FiQA2018 | TRECCOVID | HotpotQA | Touche2020 | ArguAna | ClimateFEVER | FEVER | QuoraRetrieval | NQ | DBPedia |
    |---------------------------------------|----------|----------|---------|---------|----------|-----------|----------|------------|---------|--------------|-------|----------------|------|---------|
    | [lightonai/colbertv2.0](https://huggingface.co/lightonai/colbertv2.0) | 50.02 | 33.8 | 69.3 | 15.4 | 35.6 | 73.3 | 66.7 | 26.3 | 46.3 | 17.6 | 78.5 | 85.2 | 56.2 | 44.6 |
    | [answerdotai/answerai-colbert-small-v1](https://huggingface.co/answerdotai/answerai-colbert-small-v1) | 53.79 | 37.3 | 74.77 | 18.42 | 41.15 | 84.59 | 76.11 | 25.69 | 50.09 | 33.07 | 90.96 | 87.72 | 59.1 | 45.58 |

-Please note that the `lightonai/colbertv2.0` is simply a translation of the original [ColBERTv2 model](https://huggingface.co/colbert-ir/colbertv2.0/tree/main) to work with PyLate and we thank Omar Khattab for allowing us to share the model on PyLate.
+???+ note
+    The `lightonai/colbertv2.0` is the original [ColBERTv2 model](https://huggingface.co/colbert-ir/colbertv2.0/tree/main) made compatible with PyLate. We thank Omar Khattab for allowing us to share the model on PyLate.

-We are planning to release various strong models in the near future, but feel free to contact us if you want to make your existing ColBERT compatible with PyLate!
\ No newline at end of file
+
+    We are planning to release various strong models in the near future, but feel free to contact us if you want to make your existing ColBERT compatible with PyLate.
\ No newline at end of file
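+
+As a brief, illustrative sketch (not part of the original page), any model from the table can be loaded and used for encoding, mirroring the calls shown in the retrieval documentation:
+
+```python
+from pylate import models
+
+# Load one of the pre-trained ColBERT models listed above.
+model = models.ColBERT(
+    model_name_or_path="lightonai/colbertv2.0",
+)
+
+# ColBERT produces one embedding per token (multi-vector representations).
+queries_embeddings = model.encode(
+    ["example query"],
+    is_query=True,  # queries and documents use different prefixes and settings
+)
+
+documents_embeddings = model.encode(
+    ["example document"],
+    is_query=False,
+)
+```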