Update scripts/inference/benchmarking/README.md
Co-authored-by: Vitaliy Chiley <[email protected]>
sashaDoubov and vchiley committed Jun 29, 2023
1 parent de1fc7c commit 5ffaa4a
Showing 1 changed file with 1 addition and 1 deletion.

scripts/inference/benchmarking/README.md
@@ -63,7 +63,7 @@ Benchmark Setup:
#### Long Inputs (2048 input tokens) on MPT-30B
![assets](assets/Latency-for-MPT-30B,-n_input_tok=2048.svg)

-Our real-world results match the theory! The latency grows nearly linearly with the number of output tokens, which is the _decode_ stage time. For large batch sizes and output lengths, the latency follows more of a quadratic, which is the attention operation overhead kicking in.
+Our real-world results match the theory! The latency grows nearly linearly with the number of output tokens, which is the _decode_ stage time. For large batch sizes and output lengths, the latency looks more quadratic, which shows the quadratic compute complexity of the attention operation.

For longer input lengths and batch sizes, the _prefill_ stage starts to become more important, given that the model has to process a lot of input tokens in the forward pass.
Despite the _prefill_ stage being highly efficient, the model still needs to perform a lot of computation in one forward pass, which results in higher latency as batch size and input length increase.
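
To make the scaling argument concrete, below is a toy latency model in Python (an illustrative sketch, not part of the benchmarking scripts; the function name, coefficients, and defaults are made-up assumptions). It reproduces the qualitative behaviour described above: prefill cost grows with batch size and input length, decode cost is roughly linear in the number of output tokens, and the attention term adds a quadratic component for large batches and long sequences.

```python
# Toy latency model (illustrative only; coefficients are invented, not measured).
def estimated_latency(batch_size: int, n_input: int, n_output: int,
                      prefill_cost: float = 1e-6,
                      decode_cost: float = 1e-4,
                      attn_cost: float = 1e-9) -> float:
    # Prefill: one forward pass over all input tokens in the batch.
    latency = prefill_cost * batch_size * n_input
    # Decode: one forward pass per output token; attention reads the
    # whole KV cache, which grows by one token per step.
    for step in range(n_output):
        seq_len = n_input + step
        latency += decode_cost + attn_cost * batch_size * seq_len
    return latency

# For small batches the decode term dominates (nearly linear in n_output);
# for large batches and long outputs the attention term bends the curve upward.
print(estimated_latency(batch_size=1, n_input=2048, n_output=512))
print(estimated_latency(batch_size=64, n_input=2048, n_output=512))
```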