perf(decoder): use newest models in benchmarks
dacorvo committed Sep 26, 2024
1 parent 6e2c988 commit c3eed64
Showing 23 changed files with 25 additions and 153 deletions.
Binary file removed docs/assets/benchmarks/inferentia-llama2-7b/ttft.png
Binary file removed docs/assets/benchmarks/inferentia-llama3-8b/ttft.png
(Other changed binary files are not shown.)
12 changes: 4 additions & 8 deletions docs/source/_toctree.yml
@@ -46,14 +46,10 @@
title: NeuronX Text-generation-inference for AWS inferentia2
title: How-To Guides
- sections:
-  - local: benchmarks/inferentia-llama2-7b
-    title: Llama2 7b on AWS Inferentia2
-  - local: benchmarks/inferentia-llama2-13b
-    title: Llama2 13b on AWS Inferentia2
-  - local: benchmarks/inferentia-mistral-v2
-    title: Mistral v0.2 7b on AWS Inferentia2
-  - local: benchmarks/inferentia-llama3-8b
-    title: Llama-3 8B on AWS Inferentia2
+  - local: benchmarks/inferentia-mistral-small
+    title: Mistral Small on AWS Inferentia2
+  - local: benchmarks/inferentia-llama3.1-8b
+    title: Llama-3.1 8B on AWS Inferentia2
title: Benchmarks
- sections:
- local: community/contributing
60 changes: 0 additions & 60 deletions docs/source/benchmarks/inferentia-llama2-13b.mdx

This file was deleted.

@@ -14,19 +14,19 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

-# Llama-3-8b performance on AWS Inferentia2 (Latency & Througput)
+# Llama-3.1-8b performance on AWS Inferentia2 (Latency & Througput)

-How fast is Llama-3-8b on Inferentia2? Let's figure out!
+How fast is Llama-3.1-8b on Inferentia2? Let's figure out!

For this benchmark we will use the following configurations:

-| Model type | batch_size | sequence_length |
-|----------------|------------|-----------------|
-| Llama3 8b BS1 | 1 | 4096 |
-| Llama3 8b BS4 | 4 | 4096 |
-| Llama3 8b BS8 | 8 | 4096 |
-| Llama3 8b BS16 | 16 | 4096 |
-| Llama3 8b BS32 | 32 | 4096 |
+| Model type | batch_size | sequence_length |
+|------------------|------------|-----------------|
+| Llama3.1 8b BS1 | 1 | 4096 |
+| Llama3.1 8b BS4 | 4 | 4096 |
+| Llama3.1 8b BS8 | 8 | 4096 |
+| Llama3.1 8b BS16 | 16 | 4096 |
+| Llama3.1 8b BS32 | 32 | 4096 |

*Note: all models are compiled to use 4 devices corresponding to 8 cores on the `inf2.48xlarge` instance.*
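For context, a configuration like the ones in the table above can be compiled ahead of time with optimum-neuron's `NeuronModelForCausalLM`. The snippet below is a minimal sketch, not the setup used for these numbers: the checkpoint id and fp16 cast are assumptions, while the core count and input shapes follow the note and table above.

```python
from optimum.neuron import NeuronModelForCausalLM

# Sketch only: the exact checkpoint and dtype behind these benchmarks are not
# stated in this commit; the model id and fp16 cast below are assumptions.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",  # assumed checkpoint
    export=True,                     # compile for Neuron at load time
    batch_size=4,                    # one of the benchmarked configurations
    sequence_length=4096,
    num_cores=8,                     # 4 devices = 8 cores on inf2.48xlarge
    auto_cast_type="fp16",           # assumed precision
)
model.save_pretrained("llama-3.1-8b-neuron-bs4")
```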

@@ -41,15 +41,15 @@ We test the time to first token for increasing context sizes, from a typical Q/A

Time to first token is expressed in **seconds**.

-![Llama3 8b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/ttft.png "Time to first token")
+![Llama3.1 8b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/ttft.png "Time to first token")
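As a rough illustration of how a time-to-first-token number can be obtained (a sketch, not the harness behind these plots; the local model path and prompt are placeholders):

```python
import time

from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# Assumptions: the tokenizer checkpoint matches the compiled model, and the
# model was previously exported and saved locally as in the sketch above.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
model = NeuronModelForCausalLM.from_pretrained("llama-3.1-8b-neuron-bs4")

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)  # generating one token approximates TTFT
ttft = time.perf_counter() - start
print(f"Time to first token: {ttft:.2f} s")
```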

## Inter-token Latency

The inter-token latency corresponds to the average time elapsed between two generated tokens.

It is expressed in **milliseconds**.

-![Llama3 8b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/latency.png "Inter-token latency")
+![Llama3.1 8b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/latency.png "Inter-token latency")
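Continuing the sketch above, the inter-token latency can be derived from an end-to-end generation run by subtracting the time to first token and averaging over the remaining tokens (again illustrative, not the measurement code used here):

```python
import time

# Reuses `model`, `inputs` and `ttft` from the TTFT sketch above.
new_tokens = 256
start = time.perf_counter()
# In practice generation may stop earlier at an EOS token; this sketch ignores that.
model.generate(**inputs, max_new_tokens=new_tokens)
end_to_end = time.perf_counter() - start

# Average gap between two consecutive generated tokens, in milliseconds.
inter_token_latency_ms = (end_to_end - ttft) / (new_tokens - 1) * 1000
print(f"Inter-token latency: {inter_token_latency_ms:.1f} ms")
```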

### Throughput

@@ -58,4 +58,4 @@ by the end-to-end latency.

Throughput is expressed in **tokens/second**.

-![Llama3 8b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/throughput.png "Throughput")
+![Llama3.1 8b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/throughput.png "Throughput")
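The throughput definition quoted above (total generated tokens divided by the end-to-end latency) works out as in this toy calculation; the numbers are made up for illustration, not measured results.

```python
# Illustrative numbers only, not results from this benchmark.
batch_size = 4
tokens_per_sequence = 256
end_to_end_seconds = 12.8

# All sequences in the batch contribute generated tokens.
throughput = batch_size * tokens_per_sequence / end_to_end_seconds
print(f"Throughput: {throughput:.0f} tokens/s")  # 1024 tokens / 12.8 s = 80 tokens/s
```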
@@ -14,19 +14,16 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

-# Llama-2-7b performance on AWS Inferentia2 (Latency & Througput)
+# Mistral-Small-Instruct performance on AWS Inferentia2 (Latency & Througput)

-How fast is Llama-2-7b on Inferentia2? Let's figure out!
+How fast is Mistral on Inferentia2? Let's figure out!

For this benchmark we will use the following configurations:

-| Model type | batch_size | sequence_length |
-|----------------|------------|-----------------|
-| Llama2 7B BS1 | 1 | 4096 |
-| Llama2 7B BS4 | 4 | 4096 |
-| Llama2 7B BS8 | 8 | 4096 |
-| Llama2 7B BS16 | 16 | 4096 |
-| Llama2 7B BS32 | 24 | 4096 |
+| Model type | batch_size | sequence_length |
+|--------------------|------------|-----------------|
+| Mistral-Small BS1 | 1 | 4096 |
+| Mistral-Small BS4 | 4 | 4096 |

*Note: all models are compiled to use 6 devices corresponding to 12 cores on the `inf2.48xlarge` instance.*
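The Mistral-Small configurations can be prepared with the same export pattern as in the earlier sketch, adjusted to the core count in the note above; the checkpoint id and bf16 cast are again assumptions.

```python
from optimum.neuron import NeuronModelForCausalLM

# Same export sketch as above, adapted to the Mistral-Small setup.
model = NeuronModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-Instruct-2409",  # assumed checkpoint
    export=True,
    batch_size=4,
    sequence_length=4096,
    num_cores=12,            # 6 devices = 12 cores on inf2.48xlarge
    auto_cast_type="bf16",   # assumed precision
)
model.save_pretrained("mistral-small-neuron-bs4")
```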

@@ -41,15 +38,15 @@ We test the time to first token for increasing context sizes, from a typical Q/A

Time to first token is expressed in **seconds**.

-![Llama2 7b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/ttft.png "Time to first token")
+![Mistral Small inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/ttft.png "Time to first token")

## Inter-token Latency

The inter-token latency corresponds to the average time elapsed between two generated tokens.

It is expressed in **milliseconds**.

-![Llama2 7b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/latency.png "Inter-token latency")
+![Mistral Small inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/latency.png "Inter-token latency")

### Throughput

@@ -58,4 +55,4 @@ by the end-to-end latency.

Throughput is expressed in **tokens/second**.

-![Llama2 7b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/throughput.png "Throughput")
+![Mistral Small inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/throughput.png "Throughput")
61 changes: 0 additions & 61 deletions docs/source/benchmarks/inferentia-mistral-v2.mdx

This file was deleted.

0 comments on commit c3eed64
