diff --git a/examples/inference-deployments/instructor/README.md b/examples/inference-deployments/instructor/README.md index 3b2bf9c12..7a46e8420 100644 --- a/examples/inference-deployments/instructor/README.md +++ b/examples/inference-deployments/instructor/README.md @@ -57,7 +57,7 @@ If your model exists on a different cloud storage, then you can follow instructi Once the deployment is ready, it's time to run inference! -
+
Using Python SDK @@ -66,7 +66,7 @@ from mcli import predict deployment = get_inference_deployment() input = { - "input_strings": [ + "inputs": [ ["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"] ] } @@ -80,7 +80,7 @@ predict(deployment, input) Using MCLI ```bash -mcli predict --input '{"input_strings": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking"]]}' +mcli predict --input '{"inputs": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking"]]}' ```
@@ -91,7 +91,7 @@ mcli predict --input '{"input_strings": [["Represent the Scie ```bash curl https://.inf.hosted-on.mosaicml.hosting/predict \ -H "Authorization: " \ --d '{"input_strings": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]}' +-d '{"inputs": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]}' ```
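+If you are calling the endpoint from Python without the SDK, a minimal sketch along these lines (assuming the `requests` and `numpy` packages, and the placeholder URL and API key from the curl example above) turns the `outputs` field documented below into a NumPy array:
+
+```python
+import numpy as np
+import requests
+
+# Hypothetical helper: post (instruction, document) pairs and return their embeddings.
+def embed(url: str, api_key: str, pairs: list) -> np.ndarray:
+    resp = requests.post(
+        url,
+        headers={"Authorization": api_key},
+        json={"inputs": pairs},  # same request schema as the examples above
+    )
+    resp.raise_for_status()
+    return np.array(resp.json()["outputs"])  # one embedding per input pair
+```
+
+Each row of the returned array is the embedding for the corresponding instruction/document pair.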
@@ -127,13 +127,13 @@ print(f"Cosine similarity between document and query: {similarity}")
 ### Input parameters
 | Parameters | Type | Required | Default | Description |
 | --- | --- | --- | --- | --- |
-| input_strings | List[Tuple[str, str]] | yes | N/A | A list of documents and instructions to embed. Each document is represented as tuple where the first item is the embedding instruction (e.g. "Represent the Science title:") and the second item is the document (e.g. "3D ActionSLAM: wearable person tracking in multi-floor environments"). |
+| inputs | List[Tuple[str, str]] | yes | N/A | A list of documents and instructions to embed. Each document is represented as a tuple where the first item is the embedding instruction (e.g. "Represent the Science title:") and the second item is the document (e.g. "3D ActionSLAM: wearable person tracking in multi-floor environments"). |
 
 ## Output
 
 ```
 {
-    "data":[
+    "outputs":[
         [
             -0.06155527010560036,0.010419987142086029,0.005884397309273481...-0.03766140714287758,0.010227023623883724,0.04394740238785744
         ]
diff --git a/examples/inference-deployments/llama2/README.md b/examples/inference-deployments/llama2/README.md
new file mode 100644
index 000000000..8ffbec30a
--- /dev/null
+++ b/examples/inference-deployments/llama2/README.md
@@ -0,0 +1,161 @@
+## Inference with Llama2
+
+[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to deploy any Llama2 model. Llama2 is a family of state-of-the-art language models, ranging from 7B to 70B parameters, with a context length of 4096 tokens, trained by Meta. Llama 2 is licensed under the [LLAMA 2 Community License](https://github.com/facebookresearch/llama/blob/main/LICENSE), Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.
+
+Check out MosaicML's [Llama2 blog post](https://www.mosaicml.com/blog/llama2-inference) for more information!
+
+You’ll find in this folder:
+
+- Model YAMLs - read [docs](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html) for an explanation of each field.
+  - `llama2_7b_chat.yaml` - an optimized yaml to deploy [Llama2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
+  - `llama2_13b.yaml` - an optimized yaml to deploy [Llama2 13B Base](https://huggingface.co/meta-llama/Llama-2-13b-hf).
+
+## Setup
+
+Please follow the instructions in the Inference Deployments [README](https://github.com/mosaicml/examples/tree/main/examples/inference-deployments/README.md) and make sure
+- You have access to our inference service.
+- Your dev environment is set up with `mcli`.
+- You have a cluster to work with.
+
+## Deploying your model
+
+To deploy, simply run `mcli deploy -f llama2_7b_chat.yaml --cluster `.
+
+Run `mcli get deployments` on the command line or `mcli.get_inference_deployments()` from the Python SDK to get the name of your deployment.
+
+
+Once deployed, you can ping the deployment using
+```python
+from mcli import ping
+ping('deployment-name')
+```
+to check if it is ready (status 200).
+
+More instructions can be found [here](https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/quick_start_inference.html).
+
+You can also check the deployment logs with `mcli get deployment logs `.
+
+### Deploying from cloud storage
+If your model exists on Amazon S3, GCP, or Hugging Face, you can edit the YAML's `checkpoint_path` to deploy it. 
Keep in mind that the checkpoint_path sources are mutually exclusive, so you can only set one of `hf_path`, `s3_path`, or `gcp_path`: + +```yaml +default_model: + checkpoint_path: + hf_path: meta-llama/Llama-2-13b-hf + s3_path: s3:// + gcp_path: gs:// + +``` + +If your model exists on a different cloud storage, then you can follow instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) on writing your custom downloader function, and deploy the model with the [custom yaml format](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html#custom-model). + +## Sending requests to your deployment + +Once the deployment is ready, it's time to run inference! Detailed information about the Llama2 prompt format can be found [here](https://www.mosaicml.com/blog/llama2-inference). + +
+     Using Python SDK 
+
+
+```python
+from mcli import get_inference_deployments, predict
+
+prompt = """[INST] <<SYS>>
+You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
+Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
+If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
+<</SYS>>
+How do I make a customer support bot using my product docs? [/INST]"""
+
+# Pick your Llama2 deployment from the list returned by the SDK
+deployment = get_inference_deployments()[0]
+input = {
+    "inputs": [prompt],
+    "temperature": 0.01
+}
+predict(deployment, input)
+
+```
+
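+If you build prompts programmatically, a small helper like the sketch below (the function name and example messages are ours, not part of the SDK) keeps the `[INST]`/`<<SYS>>` markup consistent:
+
+```python
+# Illustrative helper for composing Llama2 chat prompts in the format shown above.
+def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
+    return (
+        "[INST] <<SYS>>\n"
+        f"{system_prompt}\n"
+        "<</SYS>>\n"
+        f"{user_message} [/INST]"
+    )
+
+input = {
+    "inputs": [build_llama2_prompt("You are a helpful, concise assistant.",
+                                   "How do I make a customer support bot using my product docs?")],
+    "temperature": 0.01
+}
+```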
+ +
+ Using MCLI + +```bash +mcli predict --input '{"inputs": ["hello world!"]}' + +``` +
+ +
+ Using Curl + +```bash +curl https://.inf.hosted-on.mosaicml.hosting/predict \ +-H "Authorization: " \ +-d '{"inputs": ["hello world!"]}' +``` +
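+The same request can also be sent from Python with the `requests` package. The sketch below is illustrative, with placeholder values for the deployment URL and API key, and passes a couple of the generation parameters documented in the table below:
+
+```python
+import requests
+
+DEPLOYMENT_URL = "https://<deployment-name>.inf.hosted-on.mosaicml.hosting/predict"  # placeholder
+API_KEY = "<your-api-key>"  # placeholder
+
+payload = {
+    "inputs": ["hello world!"],
+    "temperature": 0.8,       # see the parameters table below
+    "max_new_tokens": 64,
+}
+resp = requests.post(DEPLOYMENT_URL, headers={"Authorization": API_KEY}, json=payload)
+print(resp.json())
+```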
+ +
+ Using Langchain + +```python +from getpass import getpass + +MOSAICML_API_TOKEN = getpass() +import os + +os.environ["MOSAICML_API_TOKEN"] = MOSAICML_API_TOKEN +from langchain.llms import MosaicML +from langchain import PromptTemplate, LLMChain +template = """Question: {question}""" + +prompt = PromptTemplate(template=template, input_variables=["question"]) +llm = MosaicML(inject_instruction_format=True, model_kwargs={'do_sample': False}) +llm_chain = LLMChain(prompt=prompt, llm=llm) +question = "Write 3 reasons why you should train an AI model on domain specific data set." + +llm_chain.run(question) + +``` +
+ +### Input parameters +| Parameters | Type | Required | Default | Description | +| --- | --- | --- | --- | --- | +| input_string | List[str] | yes | N/A | The prompt to generate a completion for. | +| top_p | float | no | 0.95 | Defines the tokens that are within the sample operation of text generation. Add tokens in the sample for more probable to least probable until the sum of the probabilities is greater than top_p | +| temperature | float | no | 0.8 | The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, 100.0 is getting closer to uniform probability | +| max_new_tokens | int | no | 256 | Defines the maximum length in tokens of the output summary | +| use_cache | bool | no | true | Whether to use KV caching during autoregressive decoding. This will use more memory but improve speed | +| do_sample | bool | no | true | Whether or not to use sampling, use greedy decoding otherwise | + + +## Output + +``` +{ + 'data': [ + '1. The model will be more accurate.\n2. The model will be more efficient.\n3. The model will be more interpretable.' + ] +} +``` + +## Before you go + +Your deployments will be live and using resources until you manually shut them down. In order to delete your deployment, remember to run: +``` +mcli delete deployment --name +``` + +## What's Next + - Check out our [LLM foundry](https://github.com/mosaicml/llm-foundry), which contains code to train, fine-tune, evaluate and deploy LLMs. + - Check out the [Prompt Engineering Guide](https://www.promptingguide.ai) to better understand LLMs and how to use them. + + +## Additional Resources +- Check out the [MosaicML Blog](https://www.mosaicml.com/blog) to learn more about large scale AI +- Follow us on [Twitter](https://twitter.com/mosaicml) and [LinkedIn](https://www.linkedin.com/company/mosaicml) +- Join our community on [Slack](https://mosaicml.me/slack) diff --git a/examples/inference-deployments/llama2/llama2_13b.yaml b/examples/inference-deployments/llama2/llama2_13b.yaml new file mode 100644 index 000000000..a7fc1745e --- /dev/null +++ b/examples/inference-deployments/llama2/llama2_13b.yaml @@ -0,0 +1,13 @@ +name: llama2-13b +replicas: 1 +command: |- # Note this command is a workaround until we build vllm into the inference image + pip install vllm==0.1.3 + pip uninstall torch -y + pip install torch==2.0.1 +compute: + gpus: 1 + instance: oci.vm.gpu.a10.1 +image: mosaicml/inference:0.1.37 +cluster: r7z15 +default_model: + model_type: llama2-13b diff --git a/examples/inference-deployments/llama2/llama2_7b_chat.yaml b/examples/inference-deployments/llama2/llama2_7b_chat.yaml new file mode 100644 index 000000000..c9f8047d9 --- /dev/null +++ b/examples/inference-deployments/llama2/llama2_7b_chat.yaml @@ -0,0 +1,13 @@ +name: llama2-7b-chat +replicas: 1 +command: |- # Note this command is a workaround until we build vllm into the inference image + pip install vllm==0.1.3 + pip uninstall torch -y + pip install torch==2.0.1 +compute: + gpus: 1 + instance: oci.vm.gpu.a10.1 +image: mosaicml/inference:0.1.37 +cluster: r7z15 +default_model: + model_type: llama2-7b-chat diff --git a/examples/inference-deployments/llama2/requirements.txt b/examples/inference-deployments/llama2/requirements.txt new file mode 100644 index 000000000..422166bae --- /dev/null +++ b/examples/inference-deployments/llama2/requirements.txt @@ -0,0 +1 @@ +torch==1.13.1 diff --git a/examples/inference-deployments/mpt/README.md b/examples/inference-deployments/mpt/README.md index b317e55b5..c05e4b998 100644 
--- a/examples/inference-deployments/mpt/README.md +++ b/examples/inference-deployments/mpt/README.md @@ -1,24 +1,23 @@ -## Inference with MPT-7B +> :exclamation: **If you are looking for the Faster Transformer model handler**: We have deprecated the `mpt_ft_handler.py` and the corresponding `mpt_7b_instruct_ft.yaml`. Instead, `mpt_7b_instruct.yaml` is the simplified replacement and it will spin up a deployment with the Faster Transformer backend. -[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to use MPT-7B, a family of 6.7B parameter large language models, including the base model, an instruction fine-tuned variant, and a variant fine-tuned on long context books. +## Inference with MPT -Check out [this blog post](https://www.mosaicml.com/blog/mpt-7b) for more information! +[MosaicML’s inference service](https://www.mosaicml.com/inference) allows users to deploy their ML models and run inference on them. In this folder, we provide an example of how to deploy any MPT model, a family of large language models from 7B parameters to 30B parameters, including the base model, an instruction fine-tuned variant, and a variant fine-tuned on long context books. + +Check out [the MPT-7B blog post](https://www.mosaicml.com/blog/mpt-7b) or [the MPT-30B blog post](https://www.mosaicml.com/blog/mpt-30b) for more information! You’ll find in this folder: - Model YAMLS - read [docs](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html) for an explanation of each field. - - `mpt_7b.yaml` - an optimized no-code yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b). - - `mpt_30b.yaml` - an optimized no-code yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b). - - `mpt_30b_ft.yaml` - a yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b). - - `mpt_30b_instruct_ft.yaml` - a yaml to deploy [MPT-30B Instruct](https://huggingface.co/mosaicml/mpt-30b-instruct). - - `mpt_7b_custom.yaml` - a custom yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b). - - `mpt_7b_instruct.yaml` - a yaml to deploy [MPT-7B Intstruct](https://huggingface.co/mosaicml/mpt-7b-instruct). - - `mpt_7b_storywriter.yaml` - a yaml to deploy [MPT-7B StoryWriter](https://huggingface.co/mosaicml/mpt-7b-storywriter). -- Model handlers - these define how your model should be loaded and how the model should be run when receiving a request. You can use the default handlers here or write your custom model handler as per instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#custom-model-handlers). - - `mpt_handler.py` - a model handler using DeepSpeed. - - `mpt_ft_handler.py` - a model handler using FasterTransformer. -- `requirements.txt` - package requirements to be able to run these models. - + - `mpt_7b.yaml` - an optimized yaml to deploy [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b). + - `mpt_7b_instruct.yaml` - an optimized yaml to deploy [MPT-7B Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct). + - `mpt_7b_storywriter.yaml` - an optimized yaml to deploy [MPT-7B Storywriter](https://huggingface.co/mosaicml/mpt-7b-storywriter). + - `mpt_30b.yaml` - an optimized yaml to deploy [MPT-30B Base](https://huggingface.co/mosaicml/mpt-30b). + - `mpt_30b_instruct.yaml` - an optimized yaml to deploy [MPT-30B Instruct](https://huggingface.co/mosaicml/mpt-30b-instruct). 
+ - `mpt_30b_chat.yaml` - an optimized yaml to deploy [MPT-30B Chat](https://huggingface.co/mosaicml/mpt-30b-chat). + - `mpt_7b_custom.yaml` - a custom yaml to deploy a vanilla [MPT-7B Base](https://huggingface.co/mosaicml/mpt-7b) without using an optimized backend. +- Model handlers - for custom models, these define how your model should be loaded and how the model should be run when receiving a request. You can use the default handlers here or write your custom model handler as per instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#custom-model-handlers). + - `mpt_handler.py` - an example model handler to load a huggingface MPT model. It is not recommended to use this handler in actual production deployments since it does not have the optimizations that we enable with the optimized yamls. ## Setup @@ -46,33 +45,38 @@ More instructions can be found [here](https://docs.mosaicml.com/projects/mcli/en You can also check the deployment logs with `mcli get deployment logs `. ### Deploying from cloud storage -If your model exists on Amazon S3 or Hugging Face, you can edit the YAML's model params to deploy it: +If your model exists on Amazon S3, GCP, or Hugging Face, you can edit the YAML's `checkpoint_path` to deploy it. Keep in mind that the checkpoint_path sources are mutually exclusive, so you can only set one of `hf_path`, `s3_path`, or `gcp_path`: + ```yaml -model: - download_parameters: - s3_path: - model_parameters: - ... - model_name_or_path: my/local/s3_path +default_model: + checkpoint_path: + hf_path: mosaicml/mpt-7b + s3_path: s3:// + gcp_path: gs:// + ``` -If your model exists on a different cloud storage, then you can follow instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) on writing your custom downloader function, and deploy the model. +If your model exists on a different cloud storage, then you can follow instructions [here](https://docs.mosaicml.com/projects/mcli/en/latest/inference/deployment_features.html#id1) on writing your custom downloader function, and deploy the model with the [custom yaml format](https://docs.mosaicml.com/projects/mcli/en/latest/inference/inference_schema.html#custom-model). ## Sending requests to your deployment Once the deployment is ready, it's time to run inference! -
+
Using Python SDK ```python from mcli import predict +prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request. +### Instruction: write 3 reasons why you should train an AI model on domain specific data set. +### Response: """ + deployment = get_inference_deployment() input = { { - "input_strings": "Write 3 reasons why you should train an AI model on domain specific data set.", + "inputs": prompt, "temperature": 0.01 } } @@ -85,7 +89,7 @@ predict(deployment, input) Using MCLI ```bash -mcli predict --input '{"input_strings": ["hello world!"]}' +mcli predict --input '{"inputs": ["hello world!"]}' ```
@@ -96,7 +100,7 @@ mcli predict --input '{"input_strings": ["hello world!"]}' ```bash curl https://.inf.hosted-on.mosaicml.hosting/predict \ -H "Authorization: " \ --d '{"input_strings": ["hello world!"]}' +-d '{"inputs": ["hello world!"]}' ```
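+For a compact, self-contained version of the request above, a sketch using mcli's `get_inference_deployments` and `predict` (both already referenced in this README) might look like this; `max_new_tokens` comes from the parameters table that follows:
+
+```python
+from mcli import get_inference_deployments, predict
+
+# Instruct-style prompt, mirroring the template in the Python SDK example above.
+prompt = (
+    "Below is an instruction that describes a task. "
+    "Write a response that appropriately completes the request.\n"
+    "### Instruction: write 3 reasons why you should train an AI model on a domain specific data set.\n"
+    "### Response: "
+)
+
+deployment = get_inference_deployments()[0]  # pick your MPT deployment
+response = predict(deployment, {
+    "inputs": [prompt],
+    "temperature": 0.01,
+    "max_new_tokens": 128
+})
+print(response)
+```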
@@ -130,7 +134,7 @@ llm_chain.run(question) | input_string | List[str] | yes | N/A | The prompt to generate a completion for. | | top_p | float | no | 0.95 | Defines the tokens that are within the sample operation of text generation. Add tokens in the sample for more probable to least probable until the sum of the probabilities is greater than top_p | | temperature | float | no | 0.8 | The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, 100.0 is getting closer to uniform probability | -| max_length | int | no | 256 | Defines the maximum length in tokens of the output summary | +| max_new_tokens | int | no | 256 | Defines the maximum length in tokens of the output summary | | use_cache | bool | no | true | Whether to use KV caching during autoregressive decoding. This will use more memory but improve speed | | do_sample | bool | no | true | Whether or not to use sampling, use greedy decoding otherwise | diff --git a/examples/inference-deployments/mpt/mpt_30b.yaml b/examples/inference-deployments/mpt/mpt_30b.yaml index 13556e82d..3ba43376c 100644 --- a/examples/inference-deployments/mpt/mpt_30b.yaml +++ b/examples/inference-deployments/mpt/mpt_30b.yaml @@ -1,17 +1,8 @@ -name: mpt-30b-simple +name: mpt-30b compute: gpus: 2 gpu_type: a100_40gb -image: mosaicml/inference:0.1.16 +image: mosaicml/inference:0.1.37 replicas: 1 -command: | - export PYTHONPATH=/code/llm-foundry:/code -integrations: -- integration_type: git_repo - git_repo: mosaicml/llm-foundry - git_commit: 496b50bd588b1a7231fe54b05d70babb3620fc72 - ssh_clone: false default_model: model_type: mpt-30b - checkpoint_path: - hf_path: mosaicml/mpt-30b diff --git a/examples/inference-deployments/mpt/mpt_30b_chat.yaml b/examples/inference-deployments/mpt/mpt_30b_chat.yaml new file mode 100644 index 000000000..771eed921 --- /dev/null +++ b/examples/inference-deployments/mpt/mpt_30b_chat.yaml @@ -0,0 +1,8 @@ +name: mpt-30b-chat +compute: + gpus: 2 + gpu_type: a100_40gb +image: mosaicml/inference:0.1.37 +replicas: 1 +default_model: + model_type: mpt-30b-chat diff --git a/examples/inference-deployments/mpt/mpt_30b_ft.yaml b/examples/inference-deployments/mpt/mpt_30b_ft.yaml deleted file mode 100644 index ed6053969..000000000 --- a/examples/inference-deployments/mpt/mpt_30b_ft.yaml +++ /dev/null @@ -1,29 +0,0 @@ -name: mpt-30b-ft -compute: - gpus: 2 - gpu_type: a100_40gb -image: mosaicml/inference:0.1.16 -replicas: 1 -command: | - export PYTHONPATH=/code/llm-foundry:/code/examples:/code -integrations: -- integration_type: git_repo - git_repo: mosaicml/examples - git_commit: df65ce9448f2e4c7803f7082930f80c8dc4e8fe1 - ssh_clone: false -- integration_type: git_repo - git_repo: mosaicml/llm-foundry - git_commit: 496b50bd588b1a7231fe54b05d70babb3620fc72 - ssh_clone: false -model: - backend: faster_transformers - downloader: examples.inference-deployments.mpt.mpt_ft_handler.download_convert - download_parameters: - hf_path: mosaicml/mpt-30b - gpus: 2 - force_conversion: true - model_handler: examples.inference-deployments.mpt.mpt_ft_handler.MPTFTModelHandler - model_parameters: - model_name_or_path: mosaicml/mpt-30b - ft_lib_path: /code/FasterTransformer/build/lib/libth_transformer.so - gpus: 2 diff --git a/examples/inference-deployments/mpt/mpt_30b_instruct.yaml b/examples/inference-deployments/mpt/mpt_30b_instruct.yaml new file mode 100644 index 000000000..ad0e0b37a --- /dev/null +++ b/examples/inference-deployments/mpt/mpt_30b_instruct.yaml @@ -0,0 +1,8 @@ +name: mpt-30b-instruct +compute: + 
gpus: 2 + gpu_type: a100_40gb +image: mosaicml/inference:0.1.37 +replicas: 1 +default_model: + model_type: mpt-30b-instruct diff --git a/examples/inference-deployments/mpt/mpt_30b_instruct_ft.yaml b/examples/inference-deployments/mpt/mpt_30b_instruct_ft.yaml deleted file mode 100644 index a4f1d4bfa..000000000 --- a/examples/inference-deployments/mpt/mpt_30b_instruct_ft.yaml +++ /dev/null @@ -1,29 +0,0 @@ -name: mpt-30b-instruct-ft -compute: - gpus: 2 - gpu_type: a100_40gb -image: mosaicml/inference:0.1.16 -replicas: 1 -command: | - export PYTHONPATH=/code/llm-foundry:/code/examples:/code -integrations: -- integration_type: git_repo - git_repo: mosaicml/examples - git_commit: df65ce9448f2e4c7803f7082930f80c8dc4e8fe1 - ssh_clone: false -- integration_type: git_repo - git_repo: mosaicml/llm-foundry - git_commit: 496b50bd588b1a7231fe54b05d70babb3620fc72 - ssh_clone: false -model: - backend: faster_transformers - downloader: examples.inference-deployments.mpt.mpt_ft_handler.download_convert - download_parameters: - hf_path: mosaicml/mpt-30b-instruct - gpus: 2 - force_conversion: true - model_handler: examples.inference-deployments.mpt.mpt_ft_handler.MPTFTModelHandler - model_parameters: - model_name_or_path: mosaicml/mpt-30b-instruct - ft_lib_path: /code/FasterTransformer/build/lib/libth_transformer.so - gpus: 2 diff --git a/examples/inference-deployments/mpt/mpt_7b.yaml b/examples/inference-deployments/mpt/mpt_7b.yaml index 5524841b7..ecb195de5 100644 --- a/examples/inference-deployments/mpt/mpt_7b.yaml +++ b/examples/inference-deployments/mpt/mpt_7b.yaml @@ -1,12 +1,8 @@ -name: mpt-7b-simple +name: mpt-7b compute: gpus: 1 instance: oci.vm.gpu.a10.1 -image: mosaicml/inference:0.1.36 -command: | - export PYTHONPATH=/code +image: mosaicml/inference:0.1.37 replicas: 1 default_model: model_type: mpt-7b - checkpoint_path: - hf_path: mosaicml/mpt-7b diff --git a/examples/inference-deployments/mpt/mpt_7b_custom.yaml b/examples/inference-deployments/mpt/mpt_7b_custom.yaml index d6d4241ce..4a9c4c8dd 100644 --- a/examples/inference-deployments/mpt/mpt_7b_custom.yaml +++ b/examples/inference-deployments/mpt/mpt_7b_custom.yaml @@ -1,4 +1,4 @@ -name: mpt-7b +name: mpt-7b-custom compute: gpus: 1 instance: oci.vm.gpu.a10.1 diff --git a/examples/inference-deployments/mpt/mpt_7b_instruct.yaml b/examples/inference-deployments/mpt/mpt_7b_instruct.yaml index aded6ffb7..01f93a777 100644 --- a/examples/inference-deployments/mpt/mpt_7b_instruct.yaml +++ b/examples/inference-deployments/mpt/mpt_7b_instruct.yaml @@ -2,18 +2,7 @@ name: mpt-7b-instruct compute: gpus: 1 instance: oci.vm.gpu.a10.1 -image: mosaicml/inference:0.1.16 +image: mosaicml/inference:0.1.37 replicas: 1 -command: | - export PYTHONPATH=$PYTHONPATH:/code/examples -integrations: -- integration_type: git_repo - git_repo: mosaicml/examples - ssh_clone: false - git_commit: df65ce9448f2e4c7803f7082930f80c8dc4e8fe1 -model: - download_parameters: - hf_path: mosaicml/mpt-7b-instruct - model_handler: examples.inference-deployments.mpt.mpt_handler.MPTModelHandler - model_parameters: - model_name: mosaicml/mpt-7b-instruct +default_model: + model_type: mpt-7b-instruct diff --git a/examples/inference-deployments/mpt/mpt_7b_instruct_ft.yaml b/examples/inference-deployments/mpt/mpt_7b_instruct_ft.yaml deleted file mode 100644 index 1ccbffc91..000000000 --- a/examples/inference-deployments/mpt/mpt_7b_instruct_ft.yaml +++ /dev/null @@ -1,26 +0,0 @@ -name: mpt-7b-instruct-ft -compute: - gpus: 1 - instance: oci.vm.gpu.a10.1 -image: mosaicml/inference:0.1.16 -replicas: 1 
-command: | - export PYTHONPATH=/code/llm-foundry:/code/examples:/code -integrations: -- integration_type: git_repo - git_repo: mosaicml/examples - git_commit: df65ce9448f2e4c7803f7082930f80c8dc4e8fe1 - ssh_clone: false -- integration_type: git_repo - git_repo: mosaicml/llm-foundry - git_commit: 496b50bd588b1a7231fe54b05d70babb3620fc72 - ssh_clone: false -model: - backend: faster_transformers - downloader: examples.inference-deployments.mpt.mpt_ft_handler.download_convert - download_parameters: - hf_path: mosaicml/mpt-7b-instruct - model_handler: examples.inference-deployments.mpt.mpt_ft_handler.MPTFTModelHandler - model_parameters: - model_name_or_path: mosaicml/mpt-7b-instruct - ft_lib_path: /code/FasterTransformer/build/lib/libth_transformer.so diff --git a/examples/inference-deployments/mpt/mpt_7b_storywriter.yaml b/examples/inference-deployments/mpt/mpt_7b_storywriter.yaml index ba00b1ee4..9c235c1bf 100644 --- a/examples/inference-deployments/mpt/mpt_7b_storywriter.yaml +++ b/examples/inference-deployments/mpt/mpt_7b_storywriter.yaml @@ -2,18 +2,7 @@ name: mpt-7b-storywriter compute: gpus: 1 instance: oci.vm.gpu.a10.1 -image: mosaicml/inference:0.1.16 +image: mosaicml/inference:0.1.37 replicas: 1 -command: | - export PYTHONPATH=$PYTHONPATH:/code/examples -integrations: -- integration_type: git_repo - git_repo: mosaicml/examples - ssh_clone: false - git_commit: df65ce9448f2e4c7803f7082930f80c8dc4e8fe1 -model: - download_parameters: - hf_path: mosaicml/mpt-7b-storywriter - model_handler: examples.inference-deployments.mpt.mpt_handler.MPTModelHandler - model_parameters: - model_name: mosaicml/mpt-7b-storywriter +default_model: + model_type: mpt-7b-storywriter diff --git a/examples/inference-deployments/mpt/mpt_ft_handler.py b/examples/inference-deployments/mpt/mpt_ft_handler.py deleted file mode 100644 index 2325eb9e7..000000000 --- a/examples/inference-deployments/mpt/mpt_ft_handler.py +++ /dev/null @@ -1,425 +0,0 @@ -# Copyright 2022 MosaicML Examples authors -# SPDX-License-Identifier: Apache-2.0 - -import argparse -import configparser -import copy -import os -from pathlib import Path -from typing import Dict, List, Optional, Tuple -from urllib.parse import urlparse - -import boto3 -import botocore -import torch -import torch.distributed as dist -from huggingface_hub import snapshot_download -from torch.nn.utils.rnn import pad_sequence -from transformers import AutoTokenizer - -from FasterTransformer.examples.pytorch.gpt.utils import comm # isort: skip # yapf: disable # type: ignore -from FasterTransformer.examples.pytorch.gpt.utils.parallel_gpt import ParallelGPT # isort: skip # yapf: disable # type: ignore -from scripts.inference.convert_hf_mpt_to_ft import convert_mpt_to_ft # isort: skip # yapf: disable # type: ignore - -LOCAL_CHECKPOINT_DIR = '/tmp/mpt' -LOCAL_MODEL_PATH = os.path.join(LOCAL_CHECKPOINT_DIR, 'local_model') - - -def download_convert(s3_path: Optional[str] = None, - hf_path: Optional[str] = None, - gcp_path: Optional[str] = None, - gpus: int = 1, - force_conversion: bool = False): - """Download model and convert to FasterTransformer format. - - Args: - s3_path (str): Path for model location in an s3 bucket. - hf_path (str): Name of the model as on HF hub (e.g., mosaicml/mpt-7b-instruct) or local folder name containing - the model (e.g., mpt-7b-instruct) - gcp_path (str): Path for model location in a gcp bucket. 
- gpus (int): Number of gpus to use for inference (Default: 1) - force_conversion (bool): Force conversion to FT even if some features may not work as expected in FT (Default: False) - """ - if not s3_path and not gcp_path and not hf_path: - raise RuntimeError( - 'Either s3_path, gcp_path, or hf_path must be provided to download_convert' - ) - model_name_or_path: str = '' - - # If s3_path or gcp_path is provided, initialize the s3 client for download - s3 = None - download_from_path = None - if s3_path: - # s3 creds need to already be present as env vars - s3 = boto3.client('s3') - download_from_path = s3_path - if gcp_path: - s3 = boto3.client( - 's3', - region_name='auto', - endpoint_url='https://storage.googleapis.com', - aws_access_key_id=os.environ['GCS_KEY'], - aws_secret_access_key=os.environ['GCS_SECRET'], - ) - download_from_path = gcp_path - - # If either s3_path or gcp_path is provided, download files - if s3: - model_name_or_path = LOCAL_MODEL_PATH - - # Download model files - if os.path.exists(LOCAL_MODEL_PATH): - print( - f'[+] Path {LOCAL_MODEL_PATH} already exists, skipping download' - ) - else: - Path(LOCAL_MODEL_PATH).mkdir(parents=True, exist_ok=True) - - print(f'Downloading model from path: {download_from_path}') - - parsed_path = urlparse(download_from_path) - prefix = parsed_path.path.lstrip('/') # type: ignore - - objs = s3.list_objects_v2( - Bucket=parsed_path.netloc, - Prefix=prefix, - ) - downloaded_file_set = set(os.listdir(LOCAL_MODEL_PATH)) - for obj in objs['Contents']: - file_key = obj['Key'] - try: - file_name = os.path.basename(file_key) - if not file_name or file_name.startswith('.'): - # Ignore hidden files - continue - if file_name not in downloaded_file_set: - print( - f'Downloading {os.path.join(LOCAL_MODEL_PATH, file_name)}...' - ) - s3.download_file(Bucket=parsed_path.netloc, - Key=file_key, - Filename=os.path.join( - LOCAL_MODEL_PATH, file_name)) - except botocore.exceptions.ClientError as e: - print( - f'Error downloading file with key: {file_key} with error: {e}' - ) - elif hf_path: - print(f'Downloading HF model with name: {hf_path}') - model_name_or_path = hf_path - snapshot_download(repo_id=hf_path) - - # This is the format the the conversion script saves the converted checkpoint in - local_ft_model_path = os.path.join(LOCAL_CHECKPOINT_DIR, f'{gpus}-gpu') - ckpt_config_path = os.path.join(local_ft_model_path, 'config.ini') - - # Convert model to FT format - # If FT checkpoint doesn't exist, create it. - if not os.path.isfile(ckpt_config_path): - print('Converting model to FT format') - # Datatype of weights in the HF checkpoint - weight_data_type = 'fp32' - convert_mpt_to_ft(model_name_or_path, LOCAL_CHECKPOINT_DIR, gpus, - weight_data_type, force_conversion) - if not os.path.isfile(ckpt_config_path): - raise RuntimeError('Failed to create FT checkpoint') - else: - print(f'Reusing existing FT checkpoint at {local_ft_model_path}') - - -class MPTFTModelHandler: - - DEFAULT_GENERATE_KWARGS = { - # Output sequence length to generate. - 'output_len': 256, - # Beam width for beam search - 'beam_width': 1, - # top k candidate number - 'top_k': 0, - # top p probability threshold - 'top_p': 0.95, - # temperature parameter - 'temperature': 0.8, - # Penalty for repetitions - 'repetition_penalty': 1.0, - # Presence penalty. Similar to repetition, but additive rather than multiplicative. - 'presence_penalty': 0.0, - 'beam_search_diversity_rate': 0.0, - 'len_penalty': 0.0, - 'bad_words_list': None, - # A minimum number of tokens to generate. 
- 'min_length': 0, - # if True, use different random seed for sentences in a batch. - 'random_seed': True - } - - INPUT_KEY = 'input' - PARAMETERS_KEY = 'parameters' - - def __init__(self, - model_name_or_path: str, - ft_lib_path: str, - inference_data_type: str = 'bf16', - int8_mode: int = 0, - gpus: int = 1): - """Fastertransformer model handler for MPT foundation series. - - Args: - model_name_or_path (str): Name of the model as on HF hub (e.g., mosaicml/mpt-7b-instruct) or local model name (e.g., mpt-7b-instruct) - ft_lib_path (str): Path to the libth_transformer dynamic lib file(.e.g., build/lib/libth_transformer.so). - inference_data_type (str): Data type to use for inference (Default: bf16) - int8_mode (int): The level of quantization to perform. 0: No quantization. All computation in data_type, - 1: Quantize weights to int8, all compute occurs in fp16/bf16. Not supported when data_type is fp32 - gpus (int): Number of gpus to use for inference (Default: 1) - """ - self.model_name_or_path = model_name_or_path - - self.tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, - trust_remote_code=True) - - # Make sure the seed on all ranks is the same. This is important. - # Multi-gpu generate calls will hang without this. - torch.manual_seed(0) - - model_path = os.path.join(LOCAL_CHECKPOINT_DIR, f'{gpus}-gpu') - ckpt_config_path = os.path.join(model_path, 'config.ini') - - ckpt_config = configparser.ConfigParser() - ckpt_config.read(ckpt_config_path) - - # Disable this optimization. - # https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md#advanced-features - shared_contexts_ratio = 0.0 - - if 'gpt' in ckpt_config.keys(): - head_num = ckpt_config.getint('gpt', 'head_num') - size_per_head = ckpt_config.getint('gpt', 'size_per_head') - vocab_size = ckpt_config.getint('gpt', 'vocab_size') - start_id = ckpt_config.getint('gpt', 'start_id') - end_id = ckpt_config.getint('gpt', 'end_id') - layer_num = ckpt_config.getint('gpt', 'num_layer') - max_seq_len = ckpt_config.getint('gpt', 'max_pos_seq_len') - weights_data_type = ckpt_config.get('gpt', 'weight_data_type') - tensor_para_size = ckpt_config.getint('gpt', 'tensor_para_size') - pipeline_para_size = ckpt_config.getint('gpt', - 'pipeline_para_size', - fallback=1) - layernorm_eps = ckpt_config.getfloat('gpt', - 'layernorm_eps', - fallback=1e-5) - use_attention_linear_bias = ckpt_config.getboolean( - 'gpt', 'use_attention_linear_bias') - has_positional_encoding = ckpt_config.getboolean( - 'gpt', 'has_positional_encoding') - else: - raise RuntimeError( - 'Unexpected config.ini for the FT checkpoint. Expected FT checkpoint to contain the `gpt` key.' 
- ) - - self.end_id = end_id - - if not comm.is_model_parallel_initailized(): - comm.initialize_model_parallel(tensor_para_size, pipeline_para_size) - - print('Initializing FasterTransformer') - self.model = ParallelGPT( - head_num, - size_per_head, - vocab_size, - start_id, - end_id, - layer_num, - max_seq_len, - tensor_para_size, - pipeline_para_size, - lib_path=ft_lib_path, - inference_data_type=inference_data_type, - int8_mode=int8_mode, - weights_data_type=weights_data_type, - layernorm_eps=layernorm_eps, - use_attention_linear_bias=use_attention_linear_bias, - has_positional_encoding=has_positional_encoding, - shared_contexts_ratio=shared_contexts_ratio) - print(f'Loading FT checkpoint from {model_path}') - if not self.model.load(ckpt_path=model_path): - raise RuntimeError( - 'Could not load model from a FasterTransformer checkpoint') - print('FT initialization complete') - - self.device = comm.get_device() - - def _parse_model_request(self, model_request: Dict) -> Tuple[str, Dict]: - if self.INPUT_KEY not in model_request: - raise RuntimeError( - f'"{self.INPUT_KEY}" must be provided to generate call') - - generate_input = model_request[self.INPUT_KEY] - - # Set default generate kwargs - generate_kwargs = copy.deepcopy(self.DEFAULT_GENERATE_KWARGS) - # If request contains any additional kwargs, add them to generate_kwargs - for k, v in model_request.get(self.PARAMETERS_KEY, {}).items(): - generate_kwargs[k] = v - - return generate_input, generate_kwargs - - def _convert_kwargs(self, generate_inputs: List[str], - generate_kwargs: Dict): - """Converts generate_kwargs into required torch types.""" - batch_size = len(generate_inputs) - - # Allow 'max_length' to be an alias for 'output_len'. Makes it less - # likely clients break when we swap in the FT handler. - if 'max_length' in generate_kwargs: - generate_kwargs['output_len'] = generate_kwargs['max_length'] - del generate_kwargs['max_length'] - - # Integer args may be floats if the values are from a json payload. 
- generate_kwargs['output_len'] = int(generate_kwargs['output_len']) - generate_kwargs['top_k'] = int(generate_kwargs['top_k']) * torch.ones( - batch_size, dtype=torch.int32) - generate_kwargs['top_p'] *= torch.ones(batch_size, dtype=torch.float32) - generate_kwargs['temperature'] *= torch.ones(batch_size, - dtype=torch.float32) - repetition_penalty = generate_kwargs['repetition_penalty'] - generate_kwargs[ - 'repetition_penalty'] = None if repetition_penalty == 1.0 else repetition_penalty * torch.ones( - batch_size, dtype=torch.float32) - presence_penalty = generate_kwargs['presence_penalty'] - generate_kwargs[ - 'presence_penalty'] = None if presence_penalty == 0.0 else presence_penalty * torch.ones( - batch_size, dtype=torch.float32) - generate_kwargs['beam_search_diversity_rate'] *= torch.ones( - batch_size, dtype=torch.float32) - generate_kwargs['len_penalty'] *= torch.ones(size=[batch_size], - dtype=torch.float32) - generate_kwargs['min_length'] = int( - generate_kwargs['min_length']) * torch.ones(size=[batch_size], - dtype=torch.int32) - if generate_kwargs['random_seed']: - generate_kwargs['random_seed'] = torch.randint(0, - 10000, - size=[batch_size], - dtype=torch.int64) - - def _parse_model_requests( - self, model_requests: List[Dict]) -> Tuple[List[str], Dict]: - """Splits requests into a flat list of inputs and merged kwargs.""" - generate_inputs = [] - generate_kwargs = {} - for req in model_requests: - generate_input, generate_kwarg = self._parse_model_request(req) - generate_inputs += [generate_input] - - for k, v in generate_kwarg.items(): - if k in generate_kwargs and generate_kwargs[k] != v: - raise RuntimeError( - f'Request has conflicting values for kwarg {k}') - generate_kwargs[k] = v - - return generate_inputs, generate_kwargs - - @torch.no_grad() - def predict(self, model_requests: List[Dict]) -> List[str]: - generate_inputs, generate_kwargs = self._parse_model_requests( - model_requests) - self._convert_kwargs(generate_inputs, generate_kwargs) - - start_ids = [ - torch.tensor(self.tokenizer.encode(c), - dtype=torch.int32, - device=self.device) for c in generate_inputs - ] - start_lengths = [len(ids) for ids in start_ids] - start_ids = pad_sequence(start_ids, - batch_first=True, - padding_value=self.end_id) - start_lengths = torch.IntTensor(start_lengths) - tokens_batch = self.model(start_ids, start_lengths, **generate_kwargs) - outputs = [] - for tokens in tokens_batch: - for beam_id in range(generate_kwargs['beam_width']): - # Do not exclude context input from the output - # token = tokens[beam_id][start_lengths[i]:] - token = tokens[beam_id] - # stop at end_id; This is the same as eos_token_id - token = token[token != self.end_id] - output = self.tokenizer.decode(token, skip_special_tokens=True) - outputs.append(output) - return outputs - - def predict_stream(self, **model_requests: Dict): - raise RuntimeError('Streaming is not supported with FasterTransformer!') - - -if __name__ == '__main__': - parser = argparse.ArgumentParser( - formatter_class=argparse.RawTextHelpFormatter) - - parser.add_argument( - '--ft_lib_path', - type=str, - required=True, - help= - 'Path to the libth_transformer dynamic lib file(e.g., build/lib/libth_transformer.so.' 
- ) - parser.add_argument( - '--name_or_dir', - '-i', - type=str, - help= - 'HF hub Model name (e.g., mosaicml/mpt-7b) or local dir path to load checkpoint from', - required=True) - parser.add_argument('--inference_data_type', - '--data_type', - type=str, - choices=['fp32', 'fp16', 'bf16'], - default='bf16') - parser.add_argument( - '--int8_mode', - type=int, - default=0, - choices=[0, 1], - help= - 'The level of quantization to perform. 0: No quantization. All computation in data_type. 1: Quantize weights to int8, all compute occurs in fp16/bf16. Not supported when data_type is fp32' - ) - parser.add_argument('--gpus', - type=int, - default=1, - help='The number of gpus to use for inference.') - - parser.add_argument( - '--force', - action='store_true', - help= - 'Force conversion to FT even if some features may not work as expected in FT' - ) - - args = parser.parse_args() - - s3_path = None - hf_path = None - if 's3' in args.name_or_dir: - s3_path = args.name_or_dir - else: - hf_path = args.name_or_dir - - if not comm.is_model_parallel_initailized(): - # pipeline parallelism is 1 for now - comm.initialize_model_parallel(tensor_para_size=args.gpus, - pipeline_para_size=1) - - if comm.get_rank() == 0: - download_convert(s3_path=s3_path, - hf_path=hf_path, - gpus=args.gpus, - force_conversion=args.force) - if dist.is_initialized(): - dist.barrier() - - model_handle = MPTFTModelHandler(args.name_or_dir, args.ft_lib_path, - args.inference_data_type, args.int8_mode, - args.gpus) - inputs = {'input': 'Who is the president of the USA?'} - out = model_handle.predict([inputs]) - print(out[0])