[feat] Add option to mask prompts with left-padded tokenizer and corpus and query prompts to IREvaluator #2951

ArthurCamara · 2024-09-23T08:11:41Z

Most models, especially those based on Mistral (e.g., LLM2Vec), are trained with left-padding in the tokenizer instead of the more common right-padding, but this is not covered by the current prompt_length strategy.
Currently, if a batch with mismatched lengths is left-padded, the masking mostly masks out padding tokens rather than the prompt. This PR fixes that by adding a mask_prompt argument to the tokenize function. When this flag is set, the code tries to find the first non-padding token in each sentence and masks everything between that token and prompt_length, adding a prompt_mask entry to the output dictionary.
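To illustrate the idea, here is a minimal sketch in plain Python. It is not the code from this PR: the helper name `build_prompt_mask` is hypothetical, the attention mask is a nested list rather than a tensor, and it assumes that prompt_length counts the prompt tokens that start at the first non-padding position.

```python
from typing import List


def build_prompt_mask(attention_mask: List[List[int]], prompt_length: int) -> List[List[int]]:
    """Hypothetical helper: mark prompt positions in a left-padded batch.

    attention_mask holds 1 for real tokens and 0 for padding.
    Returns a mask of the same shape with 1 on the prompt tokens.
    """
    prompt_mask = []
    for row in attention_mask:
        # With left padding, the first real token sits after the pad tokens.
        start = row.index(1) if 1 in row else len(row)
        # Assumption: the prompt spans prompt_length tokens from that position.
        end = min(start + prompt_length, len(row))
        prompt_mask.append([1 if start <= i < end else 0 for i in range(len(row))])
    return prompt_mask


# Two sequences of different lengths, left-padded to length 6.
batch_attention = [
    [0, 0, 1, 1, 1, 1],  # two padding tokens, then four real tokens
    [1, 1, 1, 1, 1, 1],  # no padding
]
print(build_prompt_mask(batch_attention, prompt_length=2))
```

Masking the first prompt_length positions of each row, without the offset, would cover only padding in the first sequence, which is the failure described above.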

Perhaps a more "elegant" solution would be to replace prompt_length entirely, but this could break Instructor models.

ArthurCamara added 4 commits September 23, 2024 07:55

Added the possibility of masking the prompts if the tokenizer is left-padded.

7dc7990

Simplify code

8d7b88b

Remove unrelated changes

c92e334

Move prompt_mask into the Transformer model

005039f

ArthurCamara marked this pull request as draft September 23, 2024 09:01

ArthurCamara and others added 3 commits September 23, 2024 15:01

Merge branch 'UKPLab:master' into Prompting-on-evaluators

f95cb46

Added query and corpus prompts to Information Retrieval Evaluator

0effd4d

Merge branch 'Prompting-on-evaluators' of github.com:ArthurCamara/sentence-transformers into Prompting-on-evaluators

b653197

ArthurCamara changed the title ~~[feat] Add option to mask prompts with left-padded tokenizer~~ [feat] Add option to mask prompts with left-padded tokenizer and corpus and query prompts to IREvaluator Sep 23, 2024

ArthurCamara mentioned this pull request Sep 23, 2024

[feat] Update mine_hard_negatives to using a full corpus and multiple positives #2848

Merged

ArthurCamara added 3 commits September 23, 2024 13:29

Fix for failing test

d856b47

Fix for pooling when mask is not passed

ad21eb7

Fix device placement for prompt_mask

82b8c7e
