perf(decoder): use newest models in benchmarks
dacorvo committed Sep 26, 2024
1 parent 6e2c988 commit c3eed64
Showing 23 changed files with 25 additions and 153 deletions.
Binary file removed docs/assets/benchmarks/inferentia-llama2-7b/ttft.png
Binary file removed docs/assets/benchmarks/inferentia-llama3-8b/ttft.png
(Other changed binary files are not shown.)
12 changes: 4 additions & 8 deletions docs/source/_toctree.yml
@@ -46,14 +46,10 @@
title: NeuronX Text-generation-inference for AWS inferentia2
title: How-To Guides
- sections:
-  - local: benchmarks/inferentia-llama2-7b
-    title: Llama2 7b on AWS Inferentia2
-  - local: benchmarks/inferentia-llama2-13b
-    title: Llama2 13b on AWS Inferentia2
-  - local: benchmarks/inferentia-mistral-v2
-    title: Mistral v0.2 7b on AWS Inferentia2
-  - local: benchmarks/inferentia-llama3-8b
-    title: Llama-3 8B on AWS Inferentia2
+  - local: benchmarks/inferentia-mistral-small
+    title: Mistral Small on AWS Inferentia2
+  - local: benchmarks/inferentia-llama3.1-8b
+    title: Llama-3.1 8B on AWS Inferentia2
title: Benchmarks
- sections:
- local: community/contributing
60 changes: 0 additions & 60 deletions docs/source/benchmarks/inferentia-llama2-13b.mdx

This file was deleted.

@@ -14,19 +14,19 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

-# Llama-3-8b performance on AWS Inferentia2 (Latency & Througput)
+# Llama-3.1-8b performance on AWS Inferentia2 (Latency & Througput)

-How fast is Llama-3-8b on Inferentia2? Let's figure out!
+How fast is Llama-3.1-8b on Inferentia2? Let's figure out!

For this benchmark we will use the following configurations:

-| Model type | batch_size | sequence_length |
-|----------------|------------|-----------------|
-| Llama3 8b BS1 | 1 | 4096 |
-| Llama3 8b BS4 | 4 | 4096 |
-| Llama3 8b BS8 | 8 | 4096 |
-| Llama3 8b BS16 | 16 | 4096 |
-| Llama3 8b BS32 | 32 | 4096 |
+| Model type | batch_size | sequence_length |
+|------------------|------------|-----------------|
+| Llama3.1 8b BS1 | 1 | 4096 |
+| Llama3.1 8b BS4 | 4 | 4096 |
+| Llama3.1 8b BS8 | 8 | 4096 |
+| Llama3.1 8b BS16 | 16 | 4096 |
+| Llama3.1 8b BS32 | 32 | 4096 |

*Note: all models are compiled to use 4 devices corresponding to 8 cores on the `inf2.48xlarge` instance.*
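For context, a configuration like the ones in the table above can be compiled ahead of time with optimum-neuron's `NeuronModelForCausalLM`. The snippet below is a minimal sketch, not the setup used for these numbers: the checkpoint id and fp16 cast are assumptions, while the core count and input shapes follow the note and table above.

```python
from optimum.neuron import NeuronModelForCausalLM

# Sketch only: the exact checkpoint and dtype behind these benchmarks are not
# stated in this commit; the model id and fp16 cast below are assumptions.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",  # assumed checkpoint
    export=True,                     # compile for Neuron at load time
    batch_size=4,                    # one of the benchmarked configurations
    sequence_length=4096,
    num_cores=8,                     # 4 devices = 8 cores on inf2.48xlarge
    auto_cast_type="fp16",           # assumed precision
)
model.save_pretrained("llama-3.1-8b-neuron-bs4")
```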

@@ -41,15 +41,15 @@ We test the time to first token for increasing context sizes, from a typical Q/A

Time to first token is expressed in **seconds**.

-![Llama3 8b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/ttft.png "Time to first token")
+![Llama3.1 8b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/ttft.png "Time to first token")
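As a rough illustration of how a time-to-first-token number can be obtained (a sketch, not the harness behind these plots; the local model path and prompt are placeholders):

```python
import time

from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# Assumptions: the tokenizer checkpoint matches the compiled model, and the
# model was previously exported and saved locally as in the sketch above.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
model = NeuronModelForCausalLM.from_pretrained("llama-3.1-8b-neuron-bs4")

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)  # generating one token approximates TTFT
ttft = time.perf_counter() - start
print(f"Time to first token: {ttft:.2f} s")
```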

## Inter-token Latency

The inter-token latency corresponds to the average time elapsed between two generated tokens.

It is expressed in **milliseconds**.

-![Llama3 8b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/latency.png "Inter-token latency")
+![Llama3.1 8b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/latency.png "Inter-token latency")
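Continuing the sketch above, the inter-token latency can be derived from an end-to-end generation run by subtracting the time to first token and averaging over the remaining tokens (again illustrative, not the measurement code used here):

```python
import time

# Reuses `model`, `inputs` and `ttft` from the TTFT sketch above.
new_tokens = 256
start = time.perf_counter()
# In practice generation may stop earlier at an EOS token; this sketch ignores that.
model.generate(**inputs, max_new_tokens=new_tokens)
end_to_end = time.perf_counter() - start

# Average gap between two consecutive generated tokens, in milliseconds.
inter_token_latency_ms = (end_to_end - ttft) / (new_tokens - 1) * 1000
print(f"Inter-token latency: {inter_token_latency_ms:.1f} ms")
```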

### Throughput

@@ -58,4 +58,4 @@ by the end-to-end latency.

Throughput is expressed in **tokens/second**.

-![Llama3 8b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/throughput.png "Throughput")
+![Llama3.1 8b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/throughput.png "Throughput")
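The throughput definition quoted above (total generated tokens divided by the end-to-end latency) works out as in this toy calculation; the numbers are made up for illustration, not measured results.

```python
# Illustrative numbers only, not results from this benchmark.
batch_size = 4
tokens_per_sequence = 256
end_to_end_seconds = 12.8

# All sequences in the batch contribute generated tokens.
throughput = batch_size * tokens_per_sequence / end_to_end_seconds
print(f"Throughput: {throughput:.0f} tokens/s")  # 1024 tokens / 12.8 s = 80 tokens/s
```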
@@ -14,19 +14,16 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

-# Llama-2-7b performance on AWS Inferentia2 (Latency & Througput)
+# Mistral-Small-Instruct performance on AWS Inferentia2 (Latency & Througput)

-How fast is Llama-2-7b on Inferentia2? Let's figure out!
+How fast is Mistral on Inferentia2? Let's figure out!

For this benchmark we will use the following configurations:

-| Model type | batch_size | sequence_length |
-|----------------|------------|-----------------|
-| Llama2 7B BS1 | 1 | 4096 |
-| Llama2 7B BS4 | 4 | 4096 |
-| Llama2 7B BS8 | 8 | 4096 |
-| Llama2 7B BS16 | 16 | 4096 |
-| Llama2 7B BS32 | 24 | 4096 |
+| Model type | batch_size | sequence_length |
+|--------------------|------------|-----------------|
+| Mistral-Small BS1 | 1 | 4096 |
+| Mistral-Small BS4 | 4 | 4096 |

*Note: all models are compiled to use 6 devices corresponding to 12 cores on the `inf2.48xlarge` instance.*
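The Mistral-Small configurations can be prepared with the same export pattern as in the earlier sketch, adjusted to the core count in the note above; the checkpoint id and bf16 cast are again assumptions.

```python
from optimum.neuron import NeuronModelForCausalLM

# Same export sketch as above, adapted to the Mistral-Small setup.
model = NeuronModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-Instruct-2409",  # assumed checkpoint
    export=True,
    batch_size=4,
    sequence_length=4096,
    num_cores=12,            # 6 devices = 12 cores on inf2.48xlarge
    auto_cast_type="bf16",   # assumed precision
)
model.save_pretrained("mistral-small-neuron-bs4")
```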

@@ -41,15 +38,15 @@ We test the time to first token for increasing context sizes, from a typical Q/A

Time to first token is expressed in **seconds**.

-![Llama2 7b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/ttft.png "Time to first token")
+![Mistral Small inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/ttft.png "Time to first token")

## Inter-token Latency

The inter-token latency corresponds to the average time elapsed between two generated tokens.

It is expressed in **milliseconds**.

-![Llama2 7b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/latency.png "Inter-token latency")
+![Mistral Small inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/latency.png "Inter-token latency")

### Throughput

@@ -58,4 +55,4 @@ by the end-to-end latency.

Throughput is expressed in **tokens/second**.

-![Llama2 7b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/throughput.png "Throughput")
+![Mistral Small inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/throughput.png "Throughput")
61 changes: 0 additions & 61 deletions docs/source/benchmarks/inferentia-mistral-v2.mdx

This file was deleted.

0 comments on commit c3eed64
