Commit
[feat] Update mine_hard_negatives to using a full corpus and multiple positives (#2848)

* updated mine_hard_negatives method to include a separate corpus for mining hard negatives.
* Run 'make check'
* Update "corpus" to just a list of strings
* Prevent duplicate embeddings if no separate corpus
* Deduplicate corpus
  Add a positive to corpus indices mapping, useful to get non-deduplicated positives and to filter away positives taken from the corpus
* Skip rescoring positive pairs via pos_to_corpus_indices instead
* Add a mine_hard_negatives_from_corpus util
* Speedup pos_to_corpus_indices for large corpora
* Fix range_max by number of max_positives in dataset
* encode in chunks, ensure at least one positive per query always
* Hard_negative_mining with corpus and multiple positives is possible
* docstring
* Fix for random sampling
* fix for return_triplets=False
* Typo on list
* Fix bug with multiple positives. More efficient creation of some tensors.
* Fix offset of positives scoring with multiple chunks
* fix pytorch copy warning
* Only embed each text once; no need for chunking if convert_to_numpy=True
* Undo unintended changes
* Fix mismatch in anchor/positive and negatives if multiple positives per query
* Don't repeat positive_scores as it inflates the positive score counts
* Remove the "Count" for Difference as it's rather confusing

---------

Co-authored-by: Christian Geishauser <[email protected]>
Co-authored-by: Tom Aarsen <[email protected]>
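The ideas in the commit message (a deduplicated candidate corpus, a mapping from positives to corpus indices, and excluding every positive of a query from its negatives) can be illustrated with a toy sketch. This is not the library's implementation: `toy_score` and `mine_negatives_sketch` are hypothetical names, and token overlap is an assumed stand-in for embedding similarity.

```python
from collections import defaultdict


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in for embedding similarity: Jaccard overlap of tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mine_negatives_sketch(queries, positives, corpus=None, num_negatives=2):
    """Toy hard-negative mining; queries[i] is paired with positives[i].

    A query may appear several times, once per positive (multiple positives).
    """
    # Deduplicate the candidate pool while preserving order. Without a
    # separate corpus, the positives themselves form the pool.
    pool = list(dict.fromkeys(list(corpus or []) + list(positives)))
    index_of = {text: i for i, text in enumerate(pool)}

    # Map each unique query to the pool indices of all of its positives,
    # so none of them can be returned as a negative for that query.
    query_to_pos = defaultdict(set)
    for query, positive in zip(queries, positives):
        query_to_pos[query].add(index_of[positive])

    results = {}
    for query, pos_indices in query_to_pos.items():
        candidates = [i for i in range(len(pool)) if i not in pos_indices]
        # Highest-scoring non-positives are the "hardest" negatives.
        candidates.sort(key=lambda i: toy_score(query, pool[i]), reverse=True)
        results[query] = [pool[i] for i in candidates[:num_negatives]]
    return results


# Illustrative data (made up for this sketch).
queries = ["capital of france", "capital of france", "largest planet"]
positives = [
    "paris is the capital of france",
    "france has paris as its capital",
    "jupiter is the largest planet",
]
corpus = [
    "berlin is the capital of germany",
    "jupiter is the largest planet",
    "saturn is a large planet",
]
results = mine_negatives_sketch(queries, positives, corpus=corpus)
```

Grouping by unique query before ranking is one way to avoid the anchor/positive versus negatives mismatch that the commit message mentions for queries with several positives.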
1 parent 8af7c5d, commit a3f2236
Showing 1 changed file with 159 additions and 60 deletions.