Merge pull request #12 from McGill-NLP/bi-models
Bi models
vaibhavad authored Apr 16, 2024
2 parents 6167aec + d2dad4f commit c442e71
Showing 10 changed files with 888 additions and 96 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,3 +1,4 @@
build/
dist/
*.egg-info
**/__pycache__
53 changes: 17 additions & 36 deletions README.md
@@ -29,57 +29,38 @@ pip install -e .
The LLM2Vec class is a wrapper on top of HuggingFace models that supports sequence encoding and pooling operations. The steps below show an example of how to use the library.

### Preparing the model
Initializing an LLM2Vec model using pretrained LLMs is straightforward. The `from_pretrained` method of LLM2Vec takes a base model identifier/path and an optional PEFT model identifier/path. All HuggingFace model loading arguments can be passed to the `from_pretrained` method (make sure the `llm2vec` package version is `>=0.1.3`).

Here, we first initialize the Mistral MNTP base model and load the unsupervised-trained LoRA weights (trained with the SimCSE objective on a wiki corpus).

```python
import torch
from llm2vec import LLM2Vec

# Loads the base MNTP model (with custom code that enables bidirectional
# attention in the decoder-only LLM) and applies the unsupervised-trained
# SimCSE LoRA weights on top. The final weights are:
# base model + MNTP (LoRA) + SimCSE (LoRA).
l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)
```

We can also load the model with supervised-trained LoRA weights (trained with contrastive learning and public E5 data) by changing the `peft_model_name_or_path`:

```python
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)
```

By default, the LLM2Vec model uses the `mean` pooling strategy. You can change the pooling strategy by passing the `pooling_mode` argument to the `from_pretrained` method. Similarly, you can change the maximum sequence length by passing the `max_length` argument (default is 512).
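For example, here is a minimal sketch overriding both defaults at load time (`"weighted_mean"` is an assumed `pooling_mode` value; the other arguments repeat the supervised example above):

```python
# A sketch overriding the defaults described above; "weighted_mean" is an
# assumed pooling_mode value, and max_length is shortened from its 512 default.
l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
    pooling_mode="weighted_mean",
    max_length=256,
)
```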

### Inference
This model now returns the text embedding for any input in the form of `[[instruction1, text1], [instruction2, text2]]` or `[text1, text2]`. During training, we provide instructions for both sentences in symmetric tasks, and only for queries in asymmetric tasks.
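As an illustrative sketch of both input forms (the instruction and texts below are made up for this example; `encode` is assumed to return one embedding per input):

```python
# Illustrative only: the instruction string and texts are invented for this sketch.
instruction = "Retrieve Wikipedia passages that answer the question:"
queries = [
    [instruction, "What is the capital of France?"],
    [instruction, "Who wrote Hamlet?"],
]
q_reps = l2v.encode(queries)  # instructed inputs, e.g. queries in asymmetric tasks

documents = [
    "Paris is the capital and largest city of France.",
    "Hamlet is a tragedy written by William Shakespeare.",
]
d_reps = l2v.encode(documents)  # plain inputs without instructions

print(q_reps.shape, d_reps.shape)  # one embedding vector per input
```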

2 changes: 1 addition & 1 deletion llm2vec/__init__.py
@@ -1 +1 @@
from .llm2vec import LLM2Vec
