Llama2 example + better no-code deployment docs (#433)
* better examples

* default open

* llama2 lint
margaretqian committed Aug 30, 2023
1 parent c233b54 commit 6c0ebe0
Showing 17 changed files with 253 additions and 589 deletions.
12 changes: 6 additions & 6 deletions examples/inference-deployments/instructor/README.md
@@ -57,7 +57,7 @@ If your model exists on a different cloud storage, then you can follow instructions
Once the deployment is ready, it's time to run inference!
<details>
<details open>
<summary> Using Python SDK </summary>
@@ -66,7 +66,7 @@ from mcli import predict

deployment = get_inference_deployment(<deployment-name>)
input = {
"input_strings": [
"inputs": [
["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]
]
}
@@ -80,7 +80,7 @@ predict(deployment, input)
<summary> Using MCLI </summary>

```bash
mcli predict <deployment-name> --input '{"input_strings": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking"]]}'
mcli predict <deployment-name> --input '{"inputs": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking"]]}'

```
</details>
@@ -91,7 +91,7 @@ mcli predict <deployment-name> --input '{"input_strings": [["Represent the Scie
```bash
curl https://<deployment-name>.inf.hosted-on.mosaicml.hosting/predict \
-H "Authorization: <your_api_key>" \
-d '{"input_strings": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]}'
-d '{"inputs": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]}'
```
</details>

@@ -127,13 +127,13 @@ print(f"Cosine similarity between document and query: {similarity}")
### Input parameters
| Parameters | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input_strings | List[Tuple[str, str]] | yes | N/A | A list of documents and instructions to embed. Each document is represented as tuple where the first item is the embedding instruction (e.g. "Represent the Science title:") and the second item is the document (e.g. "3D ActionSLAM: wearable person tracking in multi-floor environments"). |
| inputs | List[Tuple[str, str]] | yes | N/A | A list of documents and instructions to embed. Each document is represented as tuple where the first item is the embedding instruction (e.g. "Represent the Science title:") and the second item is the document (e.g. "3D ActionSLAM: wearable person tracking in multi-floor environments"). |


## Output
```
{
"data":[
"outputs":[
[
-0.06155527010560036,0.010419987142086029,0.005884397309273481...-0.03766140714287758,0.010227023623883724,0.04394740238785744
]
161 changes: 161 additions & 0 deletions examples/inference-deployments/llama2/README.md
@@ -0,0 +1,161 @@
## Inference with Llama2

[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to deploy any Llama2 model. Llama2 is a family of state-of-the-art language models, ranging from 7B to 70B parameters, with a context length of 4096 tokens, trained by Meta. Llama 2 is licensed under the [LLAMA 2 Community License](https://github.com/facebookresearch/llama/blob/main/LICENSE), Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.

Check out MosaicML's [Llama2 blog post](https://www.mosaicml.com/blog/llama2-inference) for more information!

You’ll find in this folder:

- Model YAMLs - read the [docs](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html) for an explanation of each field.
- `llama2_7b_chat.yaml` - an optimized yaml to deploy [Llama2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
- `llama2_13b.yaml` - an optimized yaml to deploy [Llama2 13B Base](https://huggingface.co/meta-llama/Llama-2-13b-hf).

## Setup

Please follow the instructions in the Inference Deployments [README](https://github.com/mosaicml/examples/tree/main/examples/inference-deployments/README.md) and make sure:
- You have access to our inference service.
- Your dev environment is set up with `mcli`.
- You have a cluster to work with.

## Deploying your model

To deploy, simply run `mcli deploy -f llama2_7b_chat.yaml --cluster <your_cluster>`.

Run `mcli get deployments` on the command line, or call `mcli.get_inference_deployments()` from the Python SDK, to get the name of your deployment.


Once deployed, you can ping the deployment using
```python
from mcli import ping
ping('deployment-name')
```
to check if it is ready (status 200).

More instructions can be found [here](https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/quick_start_inference.html).

You can also check the deployment logs with `mcli get deployment logs <deployment name>`.
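
For a quick status check from Python, the sketch below lists your inference deployments and pings each one. It assumes each returned deployment object exposes a `name` attribute; per the note above, a ready deployment reports status 200.

```python
from mcli import get_inference_deployments, ping

# List current inference deployments and ping each one.
# `deployment.name` is an assumption about the returned object's shape;
# adjust it to match how your mcli version exposes the deployment name.
for deployment in get_inference_deployments():
    print(deployment.name, ping(deployment.name))
```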

### Deploying from cloud storage
If your model exists on Amazon S3, GCP, or Hugging Face, you can edit the YAML's `checkpoint_path` to deploy it. Keep in mind that the checkpoint_path sources are mutually exclusive, so you can only set one of `hf_path`, `s3_path`, or `gcp_path`:

```yaml
default_model:
  checkpoint_path:
    hf_path: meta-llama/Llama-2-13b-hf
    s3_path: s3://<your-s3-path>
    gcp_path: gs://<your-gcp-path>

```

If your model exists on a different cloud storage provider, you can follow the instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) for writing a custom downloader function, and deploy the model with the [custom yaml format](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html#custom-model).

## Sending requests to your deployment

Once the deployment is ready, it's time to run inference! Detailed information about the Llama2 prompt format can be found [here](https://www.mosaicml.com/blog/llama2-inference).
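
If you want to build prompts programmatically, here is a minimal helper that wraps a system message and a single user turn in the `[INST]`/`<<SYS>>` format used in the example below. It is a sketch for single-turn prompts only, and the function name is illustrative.

```python
def format_llama2_prompt(system_message: str, user_message: str) -> str:
    """Wrap a system message and one user turn in the Llama2 chat format.

    Minimal single-turn sketch of the [INST] <<SYS>> format shown in the
    example below; multi-turn conversations need additional handling.
    """
    return (
        "[INST] <<SYS>>\n"
        f"{system_message}\n"
        "<</SYS>>\n"
        f"{user_message} [/INST]"
    )
```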

<details open>
<summary> Using Python SDK </summary>


```python
from mcli import get_inference_deployment, predict

prompt = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
How do I make a customer support bot using my product docs? [/INST]"""

deployment = get_inference_deployment("<deployment-name>")
input = {
    "inputs": prompt,
    "temperature": 0.01
}
predict(deployment, input)

```
</details>

<details>
<summary> Using MCLI </summary>

```bash
mcli predict <deployment-name> --input '{"inputs": ["hello world!"]}'

```
</details>

<details>
<summary> Using Curl </summary>

```bash
curl https://<deployment-name>.inf.hosted-on.mosaicml.hosting/predict \
-H "Authorization: <your_api_key>" \
-d '{"inputs": ["hello world!"]}'
```
</details>

<details>
<summary> Using Langchain </summary>

```python
import os
from getpass import getpass

from langchain import LLMChain, PromptTemplate
from langchain.llms import MosaicML

# Prompt for your MosaicML API token and expose it to LangChain.
MOSAICML_API_TOKEN = getpass()
os.environ["MOSAICML_API_TOKEN"] = MOSAICML_API_TOKEN

template = """Question: {question}"""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm = MosaicML(inject_instruction_format=True, model_kwargs={"do_sample": False})
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Write 3 reasons why you should train an AI model on a domain-specific dataset."
llm_chain.run(question)

```
</details>

### Input parameters
| Parameters | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| inputs | List[str] | yes | N/A | The prompt(s) to generate completions for. |
| top_p | float | no | 0.95 | Nucleus sampling: candidate tokens are added from most to least probable until their cumulative probability exceeds `top_p`, and only those tokens are sampled from. |
| temperature | float | no | 0.8 | Sampling temperature: 1 is standard sampling, values near 0 approach greedy decoding, and very large values approach uniform sampling. |
| max_new_tokens | int | no | 256 | The maximum number of new tokens to generate. |
| use_cache | bool | no | true | Whether to use KV caching during autoregressive decoding. This uses more memory but improves speed. |
| do_sample | bool | no | true | Whether to sample; if false, greedy decoding is used. |
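
As a sketch of how these parameters fit into a request, the example below sends a prompt with explicit sampling settings via the Python SDK. The values are illustrative rather than recommendations, and the request shape mirrors the curl example above.

```python
from mcli import get_inference_deployment, predict

deployment = get_inference_deployment("<deployment-name>")

# Illustrative sampling settings; see the table above for defaults.
input = {
    "inputs": ["[INST] Write three reasons to train an AI model on domain-specific data. [/INST]"],
    "temperature": 0.1,
    "top_p": 0.95,
    "max_new_tokens": 128,
    "do_sample": True,
}
predict(deployment, input)
```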


## Output

```
{
    'data': [
        '1. The model will be more accurate.\n2. The model will be more efficient.\n3. The model will be more interpretable.'
    ]
}
```
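
To pull the generated text out of that response in Python, here is a minimal sketch. It assumes `predict` returns the payload above as a Python dict, with the completions under the `'data'` key as in the sample output.

```python
from mcli import get_inference_deployment, predict

deployment = get_inference_deployment("<deployment-name>")
response = predict(deployment, {"inputs": ["hello world!"], "temperature": 0.01})

# Assumes the response matches the sample output above:
# a dict whose 'data' entry is a list of generated strings.
completion = response["data"][0]
print(completion)
```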

## Before you go

Your deployments will be live and using resources until you manually shut them down. In order to delete your deployment, remember to run:
```
mcli delete deployment --name <deployment_name>
```

## What's Next
- Check out our [LLM foundry](https://github.com/mosaicml/llm-foundry), which contains code to train, fine-tune, evaluate and deploy LLMs.
- Check out the [Prompt Engineering Guide](https://www.promptingguide.ai) to better understand LLMs and how to use them.


## Additional Resources
- Check out the [MosaicML Blog](https://www.mosaicml.com/blog) to learn more about large scale AI
- Follow us on [Twitter](https://twitter.com/mosaicml) and [LinkedIn](https://www.linkedin.com/company/mosaicml)
- Join our community on [Slack](https://mosaicml.me/slack)
13 changes: 13 additions & 0 deletions examples/inference-deployments/llama2/llama2_13b.yaml
@@ -0,0 +1,13 @@
name: llama2-13b
replicas: 1
command: |- # Note this command is a workaround until we build vllm into the inference image
  pip install vllm==0.1.3
  pip uninstall torch -y
  pip install torch==2.0.1
compute:
  gpus: 1
  instance: oci.vm.gpu.a10.1
image: mosaicml/inference:0.1.37
cluster: r7z15
default_model:
  model_type: llama2-13b
13 changes: 13 additions & 0 deletions examples/inference-deployments/llama2/llama2_7b_chat.yaml
@@ -0,0 +1,13 @@
name: llama2-7b-chat
replicas: 1
command: |- # Note this command is a workaround until we build vllm into the inference image
  pip install vllm==0.1.3
  pip uninstall torch -y
  pip install torch==2.0.1
compute:
  gpus: 1
  instance: oci.vm.gpu.a10.1
image: mosaicml/inference:0.1.37
cluster: r7z15
default_model:
  model_type: llama2-7b-chat
1 change: 1 addition & 0 deletions examples/inference-deployments/llama2/requirements.txt
@@ -0,0 +1 @@
torch==1.13.1
60 changes: 32 additions & 28 deletions examples/inference-deployments/mpt/README.md
@@ -1,24 +1,23 @@
## Inference with MPT-7B
> :exclamation: **If you are looking for the Faster Transformer model handler**: We have deprecated the `mpt_ft_handler.py` and the corresponding `mpt_7b_instruct_ft.yaml`. Instead, `mpt_7b_instruct.yaml` is the simplified replacement and it will spin up a deployment with the Faster Transformer backend.
[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to use MPT-7B, a family of 6.7B parameter large language models, including the base model, an instruction fine-tuned variant, and a variant fine-tuned on long context books.
## Inference with MPT

Check out [this blog post](https://www.mosaicml.com/blog/mpt-7b) for more information!
[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to deploy any MPT model. MPT is a family of large language models, from 7B to 30B parameters, including the base model, an instruction fine-tuned variant, and a variant fine-tuned on long context books.

Check out [the MPT-7B blog post](https://www.mosaicml.com/blog/mpt-7b) or [the MPT-30B blog post](https://www.mosaicml.com/blog/mpt-30b) for more information!

You’ll find in this folder:

- Model YAMLS - read [docs](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html) for an explanation of each field.
- `mpt_7b.yaml` - an optimized no-code yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b).
- `mpt_30b.yaml` - an optimized no-code yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b).
- `mpt_30b_ft.yaml` - a yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b).
- `mpt_30b_instruct_ft.yaml` - a yaml to deploy [MPT-30B Instruct](https://huggingface.co/mosaicml/mpt-30b-instruct).
- `mpt_7b_custom.yaml` - a custom yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b).
- `mpt_7b_instruct.yaml` - a yaml to deploy [MPT-7B Intstruct](https://huggingface.co/mosaicml/mpt-7b-instruct).
- `mpt_7b_storywriter.yaml` - a yaml to deploy [MPT-7B StoryWriter](https://huggingface.co/mosaicml/mpt-7b-storywriter).
- Model handlers - these define how your model should be loaded and how the model should be run when receiving a request. You can use the default handlers here or write your custom model handler as per instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#custom-model-handlers).
- `mpt_handler.py` - a model handler using DeepSpeed.
- `mpt_ft_handler.py` - a model handler using FasterTransformer.
- `requirements.txt` - package requirements to be able to run these models.

- `mpt_7b.yaml` - an optimized yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b).
- `mpt_7b_instruct.yaml` - an optimized yaml to deploy [MPT-7B Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct).
- `mpt_7b_storywriter.yaml` - an optimized yaml to deploy [MPT-7B Storywriter](https://huggingface.co/mosaicml/mpt-7b-storywriter).
- `mpt_30b.yaml` - an optimized yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b).
- `mpt_30b_instruct.yaml` - an optimized yaml to deploy [MPT-30B Instruct](https://huggingface.co/mosaicml/mpt-30b-instruct).
- `mpt_30b_chat.yaml` - an optimized yaml to deploy [MPT-30B Chat](https://huggingface.co/mosaicml/mpt-30b-chat).
- `mpt_7b_custom.yaml` - a custom yaml to deploy a vanilla [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b) without using an optimized backend.
- Model handlers - for custom models, these define how your model should be loaded and how the model should be run when receiving a request. You can use the default handlers here or write your custom model handler as per instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#custom-model-handlers).
- `mpt_handler.py` - an example model handler to load a huggingface MPT model. It is not recommended to use this handler in actual production deployments since it does not have the optimizations that we enable with the optimized yamls.

## Setup

@@ -46,33 +45,38 @@ More instructions can be found [here](https://docs.mosaicml.com/projects/mcli/en
You can also check the deployment logs with `mcli get deployment logs <deployment name>`.

### Deploying from cloud storage
If your model exists on Amazon S3 or Hugging Face, you can edit the YAML's model params to deploy it:
If your model exists on Amazon S3, GCP, or Hugging Face, you can edit the YAML's `checkpoint_path` to deploy it. Keep in mind that the checkpoint_path sources are mutually exclusive, so you can only set one of `hf_path`, `s3_path`, or `gcp_path`:

```yaml
model:
  download_parameters:
    s3_path: <your-s3-path>
  model_parameters:
    ...
    model_name_or_path: my/local/s3_path
default_model:
  checkpoint_path:
    hf_path: mosaicml/mpt-7b
    s3_path: s3://<your-s3-path>
    gcp_path: gs://<your-gcp-path>

```

If your model exists on a different cloud storage, then you can follow instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) on writing your custom downloader function, and deploy the model.
If your model exists on a different cloud storage, then you can follow instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) on writing your custom downloader function, and deploy the model with the [custom yaml format](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html#custom-model).

## Sending requests to your deployment

Once the deployment is ready, it's time to run inference!

<details>
<details open>
<summary> Using Python SDK </summary>


```python
from mcli import predict

prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: write 3 reasons why you should train an AI model on domain specific data set.
### Response: """

deployment = get_inference_deployment(<deployment-name>)
input = {
{
"input_strings": "Write 3 reasons why you should train an AI model on domain specific data set.",
"inputs": prompt,
"temperature": 0.01
}
}
@@ -85,7 +89,7 @@ predict(deployment, input)
<summary> Using MCLI </summary>

```bash
mcli predict <deployment-name> --input '{"input_strings": ["hello world!"]}'
mcli predict <deployment-name> --input '{"inputs": ["hello world!"]}'

```
</details>
@@ -96,7 +100,7 @@ mcli predict <deployment-name> --input '{"input_strings": ["hello world!"]}'
```bash
curl https://<deployment-name>.inf.hosted-on.mosaicml.hosting/predict \
-H "Authorization: <your_api_key>" \
-d '{"input_strings": ["hello world!"]}'
-d '{"inputs": ["hello world!"]}'
```
</details>

@@ -130,7 +134,7 @@ llm_chain.run(question)
| input_string | List[str] | yes | N/A | The prompt to generate a completion for. |
| top_p | float | no | 0.95 | Defines the tokens that are within the sample operation of text generation. Add tokens in the sample for more probable to least probable until the sum of the probabilities is greater than top_p |
| temperature | float | no | 0.8 | The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, 100.0 is getting closer to uniform probability |
| max_length | int | no | 256 | Defines the maximum length in tokens of the output summary |
| max_new_tokens | int | no | 256 | Defines the maximum length in tokens of the output summary |
| use_cache | bool | no | true | Whether to use KV caching during autoregressive decoding. This will use more memory but improve speed |
| do_sample | bool | no | true | Whether or not to use sampling, use greedy decoding otherwise |

13 changes: 2 additions & 11 deletions examples/inference-deployments/mpt/mpt_30b.yaml
@@ -1,17 +1,8 @@
name: mpt-30b-simple
name: mpt-30b
compute:
  gpus: 2
  gpu_type: a100_40gb
image: mosaicml/inference:0.1.16
image: mosaicml/inference:0.1.37
replicas: 1
command: |
  export PYTHONPATH=/code/llm-foundry:/code
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/llm-foundry
    git_commit: 496b50bd588b1a7231fe54b05d70babb3620fc72
    ssh_clone: false
default_model:
  model_type: mpt-30b
  checkpoint_path:
    hf_path: mosaicml/mpt-30b
8 changes: 8 additions & 0 deletions examples/inference-deployments/mpt/mpt_30b_chat.yaml
@@ -0,0 +1,8 @@
name: mpt-30b-chat
compute:
  gpus: 2
  gpu_type: a100_40gb
image: mosaicml/inference:0.1.37
replicas: 1
default_model:
  model_type: mpt-30b-chat