Llama2 example + better no-code deployment docs (#433)
* better examples

* default open

* llama2 lint
margaretqian committed Aug 30, 2023
1 parent c233b54 commit 6c0ebe0
Showing 17 changed files with 253 additions and 589 deletions.
12 changes: 6 additions & 6 deletions examples/inference-deployments/instructor/README.md
@@ -57,7 +57,7 @@ If your model exists on a different cloud storage, then you can follow instructions
Once the deployment is ready, it's time to run inference!
<details>
<details open>
<summary> Using Python SDK </summary>
@@ -66,7 +66,7 @@ from mcli import predict

deployment = get_inference_deployment(<deployment-name>)
input = {
"input_strings": [
"inputs": [
["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]
]
}
@@ -80,7 +80,7 @@ predict(deployment, input)
<summary> Using MCLI </summary>

```bash
mcli predict <deployment-name> --input '{"input_strings": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking"]]}'
mcli predict <deployment-name> --input '{"inputs": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking"]]}'

```
</details>
@@ -91,7 +91,7 @@ mcli predict <deployment-name> --input '{"input_strings": [["Represent the Scie
```bash
curl https://<deployment-name>.inf.hosted-on.mosaicml.hosting/predict \
-H "Authorization: <your_api_key>" \
-d '{"input_strings": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]}'
-d '{"inputs": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]}'
```
</details>

@@ -127,13 +127,13 @@ print(f"Cosine similarity between document and query: {similarity}")
### Input parameters
| Parameters | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input_strings | List[Tuple[str, str]] | yes | N/A | A list of documents and instructions to embed. Each document is represented as tuple where the first item is the embedding instruction (e.g. "Represent the Science title:") and the second item is the document (e.g. "3D ActionSLAM: wearable person tracking in multi-floor environments"). |
| inputs | List[Tuple[str, str]] | yes | N/A | A list of documents and instructions to embed. Each document is represented as tuple where the first item is the embedding instruction (e.g. "Represent the Science title:") and the second item is the document (e.g. "3D ActionSLAM: wearable person tracking in multi-floor environments"). |


## Output
```
{
"data":[
"outputs":[
[
-0.06155527010560036,0.010419987142086029,0.005884397309273481...-0.03766140714287758,0.010227023623883724,0.04394740238785744
]
161 changes: 161 additions & 0 deletions examples/inference-deployments/llama2/README.md
@@ -0,0 +1,161 @@
## Inference with Llama2

[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to deploy any Llama2 model. Llama2 is a family of state-of-the-art language models, ranging from 7B to 70B parameters, with a context length of 4096 tokens, trained by Meta. Llama 2 is licensed under the [LLAMA 2 Community License](https://github.com/facebookresearch/llama/blob/main/LICENSE), Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.

Check out MosaicML's [Llama2 blog post](https://www.mosaicml.com/blog/llama2-inference) for more information!

You’ll find in this folder:

- Model YAMLs - read the [docs](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html) for an explanation of each field.
- `llama2_7b_chat.yaml` - an optimized yaml to deploy [Llama2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
- `llama2_13b.yaml` - an optimized yaml to deploy [Llama2 13B Base](https://huggingface.co/meta-llama/Llama-2-13b-hf).

## Setup

Please follow the instructions in the Inference Deployments [README](https://github.com/mosaicml/examples/tree/main/examples/inference-deployments/README.md) and make sure:
- You have access to our inference service.
- Your dev environment is set up with `mcli`.
- You have a cluster to work with.

## Deploying your model

To deploy, simply run `mcli deploy -f llama2_7b_chat.yaml --cluster <your_cluster>`.

Run `mcli get deployments` on the command line, or call `mcli.get_inference_deployments()` from the Python SDK, to get the name of your deployment.


Once deployed, you can ping the deployment using
```python
from mcli import ping
ping('deployment-name')
```
to check if it is ready (status 200).

More instructions can be found [here](https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/quick_start_inference.html).

You can also check the deployment logs with `mcli get deployment logs <deployment name>`.
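
For a quick status check from Python, the sketch below lists your inference deployments and pings each one. It assumes each returned deployment object exposes a `name` attribute; per the note above, a ready deployment reports status 200.

```python
from mcli import get_inference_deployments, ping

# List current inference deployments and ping each one.
# `deployment.name` is an assumption about the returned object's shape;
# adjust it to match how your mcli version exposes the deployment name.
for deployment in get_inference_deployments():
    print(deployment.name, ping(deployment.name))
```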

### Deploying from cloud storage
If your model exists on Amazon S3, GCP, or Hugging Face, you can edit the YAML's `checkpoint_path` to deploy it. Keep in mind that the checkpoint_path sources are mutually exclusive, so you can only set one of `hf_path`, `s3_path`, or `gcp_path`:

```yaml
default_model:
  checkpoint_path:
    hf_path: meta-llama/Llama-2-13b-hf
    s3_path: s3://<your-s3-path>
    gcp_path: gs://<your-gcp-path>

```

If your model exists on a different cloud storage provider, you can follow the instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) for writing a custom downloader function, and deploy the model with the [custom yaml format](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html#custom-model).

## Sending requests to your deployment

Once the deployment is ready, it's time to run inference! Detailed information about the Llama2 prompt format can be found [here](https://www.mosaicml.com/blog/llama2-inference).
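
If you want to build prompts programmatically, here is a minimal helper that wraps a system message and a single user turn in the `[INST]`/`<<SYS>>` format used in the example below. It is a sketch for single-turn prompts only, and the function name is illustrative.

```python
def format_llama2_prompt(system_message: str, user_message: str) -> str:
    """Wrap a system message and one user turn in the Llama2 chat format.

    Minimal single-turn sketch of the [INST] <<SYS>> format shown in the
    example below; multi-turn conversations need additional handling.
    """
    return (
        "[INST] <<SYS>>\n"
        f"{system_message}\n"
        "<</SYS>>\n"
        f"{user_message} [/INST]"
    )
```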

<details open>
<summary> Using Python SDK </summary>


```python
from mcli import get_inference_deployment, predict

prompt = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
How do I make a customer support bot using my product docs? [/INST]"""

deployment = get_inference_deployment("<deployment-name>")
input = {
    "inputs": prompt,
    "temperature": 0.01
}
predict(deployment, input)

```
</details>

<details>
<summary> Using MCLI </summary>

```bash
mcli predict <deployment-name> --input '{"inputs": ["hello world!"]}'

```
</details>

<details>
<summary> Using Curl </summary>

```bash
curl https://<deployment-name>.inf.hosted-on.mosaicml.hosting/predict \
-H "Authorization: <your_api_key>" \
-d '{"inputs": ["hello world!"]}'
```
</details>

<details>
<summary> Using Langchain </summary>

```python
import os
from getpass import getpass

from langchain import LLMChain, PromptTemplate
from langchain.llms import MosaicML

# Prompt for your MosaicML API token and expose it to LangChain.
MOSAICML_API_TOKEN = getpass()
os.environ["MOSAICML_API_TOKEN"] = MOSAICML_API_TOKEN

template = """Question: {question}"""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm = MosaicML(inject_instruction_format=True, model_kwargs={"do_sample": False})
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Write 3 reasons why you should train an AI model on a domain-specific dataset."
llm_chain.run(question)

```
</details>

### Input parameters
| Parameters | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| inputs | List[str] | yes | N/A | The prompt(s) to generate completions for. |
| top_p | float | no | 0.95 | Nucleus sampling: candidate tokens are added from most to least probable until their cumulative probability exceeds `top_p`, and only those tokens are sampled from. |
| temperature | float | no | 0.8 | Sampling temperature: 1 is standard sampling, values near 0 approach greedy decoding, and very large values approach uniform sampling. |
| max_new_tokens | int | no | 256 | The maximum number of new tokens to generate. |
| use_cache | bool | no | true | Whether to use KV caching during autoregressive decoding. This uses more memory but improves speed. |
| do_sample | bool | no | true | Whether to sample; if false, greedy decoding is used. |
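
As a sketch of how these parameters fit into a request, the example below sends a prompt with explicit sampling settings via the Python SDK. The values are illustrative rather than recommendations, and the request shape mirrors the curl example above.

```python
from mcli import get_inference_deployment, predict

deployment = get_inference_deployment("<deployment-name>")

# Illustrative sampling settings; see the table above for defaults.
input = {
    "inputs": ["[INST] Write three reasons to train an AI model on domain-specific data. [/INST]"],
    "temperature": 0.1,
    "top_p": 0.95,
    "max_new_tokens": 128,
    "do_sample": True,
}
predict(deployment, input)
```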


## Output

```
{
    'data': [
        '1. The model will be more accurate.\n2. The model will be more efficient.\n3. The model will be more interpretable.'
    ]
}
```
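
To pull the generated text out of that response in Python, here is a minimal sketch. It assumes `predict` returns the payload above as a Python dict, with the completions under the `'data'` key as in the sample output.

```python
from mcli import get_inference_deployment, predict

deployment = get_inference_deployment("<deployment-name>")
response = predict(deployment, {"inputs": ["hello world!"], "temperature": 0.01})

# Assumes the response matches the sample output above:
# a dict whose 'data' entry is a list of generated strings.
completion = response["data"][0]
print(completion)
```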

## Before you go

Your deployments will be live and using resources until you manually shut them down. In order to delete your deployment, remember to run:
```
mcli delete deployment --name <deployment_name>
```

## What's Next
- Check out our [LLM foundry](https://github.com/mosaicml/llm-foundry), which contains code to train, fine-tune, evaluate and deploy LLMs.
- Check out the [Prompt Engineering Guide](https://www.promptingguide.ai) to better understand LLMs and how to use them.


## Additional Resources
- Check out the [MosaicML Blog](https://www.mosaicml.com/blog) to learn more about large scale AI
- Follow us on [Twitter](https://twitter.com/mosaicml) and [LinkedIn](https://www.linkedin.com/company/mosaicml)
- Join our community on [Slack](https://mosaicml.me/slack)
13 changes: 13 additions & 0 deletions examples/inference-deployments/llama2/llama2_13b.yaml
@@ -0,0 +1,13 @@
name: llama2-13b
replicas: 1
command: |- # Note this command is a workaround until we build vllm into the inference image
  pip install vllm==0.1.3
  pip uninstall torch -y
  pip install torch==2.0.1
compute:
  gpus: 1
  instance: oci.vm.gpu.a10.1
image: mosaicml/inference:0.1.37
cluster: r7z15
default_model:
  model_type: llama2-13b
13 changes: 13 additions & 0 deletions examples/inference-deployments/llama2/llama2_7b_chat.yaml
@@ -0,0 +1,13 @@
name: llama2-7b-chat
replicas: 1
command: |- # Note this command is a workaround until we build vllm into the inference image
  pip install vllm==0.1.3
  pip uninstall torch -y
  pip install torch==2.0.1
compute:
  gpus: 1
  instance: oci.vm.gpu.a10.1
image: mosaicml/inference:0.1.37
cluster: r7z15
default_model:
  model_type: llama2-7b-chat
1 change: 1 addition & 0 deletions examples/inference-deployments/llama2/requirements.txt
@@ -0,0 +1 @@
torch==1.13.1
60 changes: 32 additions & 28 deletions examples/inference-deployments/mpt/README.md
@@ -1,24 +1,23 @@
## Inference with MPT-7B
> :exclamation: **If you are looking for the Faster Transformer model handler**: We have deprecated the `mpt_ft_handler.py` and the corresponding `mpt_7b_instruct_ft.yaml`. Instead, `mpt_7b_instruct.yaml` is the simplified replacement and it will spin up a deployment with the Faster Transformer backend.
[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to use MPT-7B, a family of 6.7B parameter large language models, including the base model, an instruction fine-tuned variant, and a variant fine-tuned on long context books.
## Inference with MPT

Check out [this blog post](https://www.mosaicml.com/blog/mpt-7b) for more information!
[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to deploy any MPT model. MPT is a family of large language models, from 7B to 30B parameters, including the base model, an instruction fine-tuned variant, and a variant fine-tuned on long context books.

Check out [the MPT-7B blog post](https://www.mosaicml.com/blog/mpt-7b) or [the MPT-30B blog post](https://www.mosaicml.com/blog/mpt-30b) for more information!

You’ll find in this folder:

- Model YAMLS - read [docs](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html) for an explanation of each field.
- `mpt_7b.yaml` - an optimized no-code yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b).
- `mpt_30b.yaml` - an optimized no-code yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b).
- `mpt_30b_ft.yaml` - a yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b).
- `mpt_30b_instruct_ft.yaml` - a yaml to deploy [MPT-30B Instruct](https://huggingface.co/mosaicml/mpt-30b-instruct).
- `mpt_7b_custom.yaml` - a custom yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b).
- `mpt_7b_instruct.yaml` - a yaml to deploy [MPT-7B Intstruct](https://huggingface.co/mosaicml/mpt-7b-instruct).
- `mpt_7b_storywriter.yaml` - a yaml to deploy [MPT-7B StoryWriter](https://huggingface.co/mosaicml/mpt-7b-storywriter).
- Model handlers - these define how your model should be loaded and how the model should be run when receiving a request. You can use the default handlers here or write your custom model handler as per instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#custom-model-handlers).
- `mpt_handler.py` - a model handler using DeepSpeed.
- `mpt_ft_handler.py` - a model handler using FasterTransformer.
- `requirements.txt` - package requirements to be able to run these models.

- `mpt_7b.yaml` - an optimized yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b).
- `mpt_7b_instruct.yaml` - an optimized yaml to deploy [MPT-7B Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct).
- `mpt_7b_storywriter.yaml` - an optimized yaml to deploy [MPT-7B Storywriter](https://huggingface.co/mosaicml/mpt-7b-storywriter).
- `mpt_30b.yaml` - an optimized yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b).
- `mpt_30b_instruct.yaml` - an optimized yaml to deploy [MPT-30B Instruct](https://huggingface.co/mosaicml/mpt-30b-instruct).
- `mpt_30b_chat.yaml` - an optimized yaml to deploy [MPT-30B Chat](https://huggingface.co/mosaicml/mpt-30b-chat).
- `mpt_7b_custom.yaml` - a custom yaml to deploy a vanilla [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b) without using an optimized backend.
- Model handlers - for custom models, these define how your model should be loaded and how the model should be run when receiving a request. You can use the default handlers here or write your custom model handler as per instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#custom-model-handlers).
- `mpt_handler.py` - an example model handler to load a huggingface MPT model. It is not recommended to use this handler in actual production deployments since it does not have the optimizations that we enable with the optimized yamls.

## Setup

@@ -46,33 +45,38 @@ More instructions can be found [here](https://docs.mosaicml.com/projects/mcli/en
You can also check the deployment logs with `mcli get deployment logs <deployment name>`.

### Deploying from cloud storage
If your model exists on Amazon S3 or Hugging Face, you can edit the YAML's model params to deploy it:
If your model exists on Amazon S3, GCP, or Hugging Face, you can edit the YAML's `checkpoint_path` to deploy it. Keep in mind that the checkpoint_path sources are mutually exclusive, so you can only set one of `hf_path`, `s3_path`, or `gcp_path`:

```yaml
model:
  download_parameters:
    s3_path: <your-s3-path>
  model_parameters:
    ...
    model_name_or_path: my/local/s3_path
default_model:
  checkpoint_path:
    hf_path: mosaicml/mpt-7b
    s3_path: s3://<your-s3-path>
    gcp_path: gs://<your-gcp-path>

```

If your model exists on a different cloud storage, then you can follow instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) on writing your custom downloader function, and deploy the model.
If your model exists on a different cloud storage, then you can follow instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) on writing your custom downloader function, and deploy the model with the [custom yaml format](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html#custom-model).

## Sending requests to your deployment

Once the deployment is ready, it's time to run inference!

<details>
<details open>
<summary> Using Python SDK </summary>


```python
from mcli import predict

prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: write 3 reasons why you should train an AI model on domain specific data set.
### Response: """

deployment = get_inference_deployment(<deployment-name>)
input = {
{
"input_strings": "Write 3 reasons why you should train an AI model on domain specific data set.",
"inputs": prompt,
"temperature": 0.01
}
}
@@ -85,7 +89,7 @@ predict(deployment, input)
<summary> Using MCLI </summary>

```bash
mcli predict <deployment-name> --input '{"input_strings": ["hello world!"]}'
mcli predict <deployment-name> --input '{"inputs": ["hello world!"]}'

```
</details>
@@ -96,7 +100,7 @@ mcli predict <deployment-name> --input '{"input_strings": ["hello world!"]}'
```bash
curl https://<deployment-name>.inf.hosted-on.mosaicml.hosting/predict \
-H "Authorization: <your_api_key>" \
-d '{"input_strings": ["hello world!"]}'
-d '{"inputs": ["hello world!"]}'
```
</details>

@@ -130,7 +134,7 @@ llm_chain.run(question)
| input_string | List[str] | yes | N/A | The prompt to generate a completion for. |
| top_p | float | no | 0.95 | Defines the tokens that are within the sample operation of text generation. Add tokens in the sample for more probable to least probable until the sum of the probabilities is greater than top_p |
| temperature | float | no | 0.8 | The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, 100.0 is getting closer to uniform probability |
| max_length | int | no | 256 | Defines the maximum length in tokens of the output summary |
| max_new_tokens | int | no | 256 | Defines the maximum length in tokens of the output summary |
| use_cache | bool | no | true | Whether to use KV caching during autoregressive decoding. This will use more memory but improve speed |
| do_sample | bool | no | true | Whether or not to use sampling, use greedy decoding otherwise |

13 changes: 2 additions & 11 deletions examples/inference-deployments/mpt/mpt_30b.yaml
@@ -1,17 +1,8 @@
name: mpt-30b-simple
name: mpt-30b
compute:
  gpus: 2
  gpu_type: a100_40gb
image: mosaicml/inference:0.1.16
image: mosaicml/inference:0.1.37
replicas: 1
command: |
  export PYTHONPATH=/code/llm-foundry:/code
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/llm-foundry
    git_commit: 496b50bd588b1a7231fe54b05d70babb3620fc72
    ssh_clone: false
default_model:
  model_type: mpt-30b
  checkpoint_path:
    hf_path: mosaicml/mpt-30b
8 changes: 8 additions & 0 deletions examples/inference-deployments/mpt/mpt_30b_chat.yaml
@@ -0,0 +1,8 @@
name: mpt-30b-chat
compute:
  gpus: 2
  gpu_type: a100_40gb
image: mosaicml/inference:0.1.37
replicas: 1
default_model:
  model_type: mpt-30b-chat