Update scripts/inference/benchmarking/README.md
Co-authored-by: Vitaliy Chiley <[email protected]>
sashaDoubov and vchiley committed Jun 29, 2023
1 parent de1fc7c commit 5ffaa4a
Showing 1 changed file with 1 addition and 1 deletion.

scripts/inference/benchmarking/README.md
@@ -63,7 +63,7 @@ Benchmark Setup:
#### Long Inputs (2048 input tokens) on MPT-30B
![assets](assets/Latency-for-MPT-30B,-n_input_tok=2048.svg)

-Our real-world results match the theory! The latency grows nearly linearly with the number of output tokens, which is the _decode_ stage time. For large batch sizes and output lengths, the latency follows more of a quadratic, which is the attention operation overhead kicking in.
+Our real-world results match the theory! The latency grows nearly linearly with the number of output tokens, which is the _decode_ stage time. For large batch sizes and output lengths, the latency looks more quadratic, which shows the quadratic compute complexity of the attention operation.

For longer input lengths and batch sizes, the _prefill_ stage starts to become more important, given that the model has to process a lot of input tokens in the forward pass.
Despite the _prefill_ stage being highly efficient, the model still needs to perform a lot of computation in one forward pass, which results in higher latency as batch size and input length increase.
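
To make the scaling argument concrete, below is a toy latency model in Python (an illustrative sketch, not part of the benchmarking scripts; the function name, coefficients, and defaults are made-up assumptions). It reproduces the qualitative behaviour described above: prefill cost grows with batch size and input length, decode cost is roughly linear in the number of output tokens, and the attention term adds a quadratic component for large batches and long sequences.

```python
# Toy latency model (illustrative only; coefficients are invented, not measured).
def estimated_latency(batch_size: int, n_input: int, n_output: int,
                      prefill_cost: float = 1e-6,
                      decode_cost: float = 1e-4,
                      attn_cost: float = 1e-9) -> float:
    # Prefill: one forward pass over all input tokens in the batch.
    latency = prefill_cost * batch_size * n_input
    # Decode: one forward pass per output token; attention reads the
    # whole KV cache, which grows by one token per step.
    for step in range(n_output):
        seq_len = n_input + step
        latency += decode_cost + attn_cost * batch_size * seq_len
    return latency

# For small batches the decode term dominates (nearly linear in n_output);
# for large batches and long outputs the attention term bends the curve upward.
print(estimated_latency(batch_size=1, n_input=2048, n_output=512))
print(estimated_latency(batch_size=64, n_input=2048, n_output=512))
```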