From 5ffaa4a2ba480ca09a0359270c0acc0a0f39b82a Mon Sep 17 00:00:00 2001
From: Sasha Doubov
Date: Thu, 29 Jun 2023 11:13:03 -0700
Subject: [PATCH] Update scripts/inference/benchmarking/README.md

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
---
 scripts/inference/benchmarking/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/inference/benchmarking/README.md b/scripts/inference/benchmarking/README.md
index 2cf92952e8..70237a4711 100644
--- a/scripts/inference/benchmarking/README.md
+++ b/scripts/inference/benchmarking/README.md
@@ -63,7 +63,7 @@ Benchmark Setup:
 #### Long Inputs (2048 input tokens) on MPT-30B
 ![assets](assets/Latency-for-MPT-30B,-n_input_tok=2048.svg)
 
-Our real-world results match the theory! The latency grows nearly linearly with the number of output tokens, which is the _decode_ stage time. For large batch sizes and output lengths, the latency follows more of a quadratic, which is the attention operation overhead kicking in.
+Our real-world results match the theory! The latency grows nearly linearly with the number of output tokens, which is the _decode_ stage time. For large batch sizes and output lengths, the latency looks more quadratic, which shows the quadratic compute complexity of the attention operation.
 
 For longer input lengths and batch sizes, the _prefill_ stage starts to become more important, given that the model has to process a lot of input tokens in the forward pass. Despite the _prefill_ stage being highly efficient, the model still needs to perform a lot of computation in one forward pass, which results in the higher latency when increasing batch size and input length.
 
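
The paragraph touched by this patch reasons about how prefill and decode costs scale. As a rough illustration only (not part of the patch or the repository; the function name, constants, and cost model below are assumptions), a toy Python sketch shows why decode latency is nearly linear in output tokens but picks up a quadratic attention term at large batch sizes and output lengths, while prefill cost grows with batch size and input length:

```python
# Minimal sketch of a toy latency model for the scaling behavior described in
# the patched paragraph. Hypothetical names and constants; arbitrary units.

def estimated_latency(batch_size: int,
                      n_input_tok: int,
                      n_output_tok: int,
                      token_cost: float = 1.0,
                      attn_cost: float = 1e-4) -> float:
    """Return a rough latency estimate in arbitrary units."""
    # Prefill stage: one forward pass over all input tokens; cost grows with
    # batch size and input length, plus a quadratic self-attention term.
    prefill = batch_size * (token_cost * n_input_tok +
                            attn_cost * n_input_tok ** 2)

    # Decode stage: one forward pass per output token. Each step attends over
    # the context generated so far, so the summed attention cost is what makes
    # the latency curve bend quadratic for long outputs / large batches.
    decode = 0.0
    for step in range(n_output_tok):
        context_len = n_input_tok + step
        decode += batch_size * (token_cost + attn_cost * context_len)

    return prefill + decode


if __name__ == "__main__":
    # Nearly linear in output tokens for short generations ...
    print(estimated_latency(batch_size=1, n_input_tok=2048, n_output_tok=64))
    # ... more quadratic-looking for long generations at large batch sizes.
    print(estimated_latency(batch_size=64, n_input_tok=2048, n_output_tok=2048))
```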