Authors:
- LLMs in TT-NN
- Iterative update system
- When to use our fused op
- Replicated layernorm vs distributed layernorm
- LayerNorm/RMSNorm weights in row-major layout / the trick of wrapping them to the tile width (see the sketch below)
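
Possible snippet for the wrap-to-tile-width point above. This is a minimal sketch, assuming an already-available single device; the `dim` value is a placeholder and the commented `ttnn.rms_norm` call should be checked against the current ttnn API.

```python
import torch
import ttnn

# Wrap the 1D norm gamma into rows of 32 (the tile width) and keep it
# row-major, so the norm op can consume it without padding a [1, dim]
# tensor out to full tile height.
TILE_WIDTH = 32
dim = 4096  # assumed hidden size; must be a multiple of TILE_WIDTH

device = ttnn.open_device(device_id=0)

torch_gamma = torch.randn(dim)
gamma_tt = ttnn.from_torch(
    torch_gamma.reshape(1, 1, dim // TILE_WIDTH, TILE_WIDTH),
    dtype=ttnn.bfloat16,
    layout=ttnn.ROW_MAJOR_LAYOUT,
    device=device,
)

# Hypothetical usage with the fused norm op (verify the exact kwargs):
# out = ttnn.rms_norm(x, weight=gamma_tt, epsilon=1e-5)

ttnn.close_device(device)
```
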
- Flash Attention and Flash Decode (see the sketch after this list)
  - general description
  - limitations
  - which dims are parallelized
  - submodules, tests
  - how to combine prefill and decode
  - slicing prefill to fit in L1
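
Sketch to pair with the general description. Shapes are placeholders, and the decode variant is left as a comment because its exact argument names (current position handling, GQA head layout) should be taken from the installed ttnn version rather than from this note.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Prefill: the whole sequence (or an L1-sized slice of it) runs through the
# causal flash-attention kernel, parallelized over heads and query chunks.
batch, n_heads, seq_len, head_dim = 1, 8, 128, 64
q = ttnn.from_torch(torch.randn(batch, n_heads, seq_len, head_dim),
                    dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
k = ttnn.from_torch(torch.randn(batch, n_heads, seq_len, head_dim),
                    dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
v = ttnn.from_torch(torch.randn(batch, n_heads, seq_len, head_dim),
                    dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)

attn_out = ttnn.transformer.scaled_dot_product_attention(q, k, v, is_causal=True)

# Decode: a separate kernel parallelizes over the KV sequence length instead
# of the query dimension and consumes the current position of each user, e.g.
# out = ttnn.transformer.scaled_dot_product_attention_decode(q_step, k_cache, v_cache, cur_pos=positions)

ttnn.close_device(device)
```
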
- device mesh
- column parallel followed by row parallel (see the sketch after this group)
- sharding, CCL ops, reducing CCL overheads, etc.
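
Sketch for the column-parallel-then-row-parallel bullet above. The 1x8 mesh shape, shard dims, dtypes, and tensor sizes are assumptions for a T3K-style setup, not tuned model values.

```python
import torch
import ttnn

mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, 8))

# Column-parallel w1 shards the output features; row-parallel w2 shards the
# input features, so each device produces a partial sum of the final output.
w1 = torch.randn(1, 1, 4096, 14336)
w2 = torch.randn(1, 1, 14336, 4096)

w1_tt = ttnn.from_torch(w1, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT,
                        device=mesh_device,
                        mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=3))
w2_tt = ttnn.from_torch(w2, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT,
                        device=mesh_device,
                        mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=2))

x = ttnn.from_torch(torch.randn(1, 1, 32, 4096), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=mesh_device,
                    mesh_mapper=ttnn.ReplicateTensorToMesh(mesh_device))

y = ttnn.linear(x, w1_tt)   # each device: [1, 1, 32, 14336 / 8]
z = ttnn.linear(y, w2_tt)   # each device: a partial [1, 1, 32, 4096]

# The partials must then be summed across devices with a CCL op (an
# all-reduce, e.g. reduce-scatter followed by all-gather); this is where CCL
# overheads come from and where overlapping/fusing them pays off.

ttnn.close_mesh_device(mesh_device)
```
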
- quick intro and how it is implemented in demos.
- Our vLLM repo and what's needed to integrate with it.
- link to existing doc, why it helps decode more
- how to feed back output to input and read output asynchronously (see the sketch below)
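
Sketch for the feed-output-back-to-input point. The trace-region size passed to `ttnn.open_device` and the `ttnn.copy_host_to_device_tensor` helper are assumptions to verify against the current trace API; the `decode_step` function is a stand-in for a real model call.

```python
import torch
import ttnn

# Trace capture/replay with a persistent device input: each iteration only
# copies new host data into the same tensor and replays the captured program.
device = ttnn.open_device(device_id=0, trace_region_size=20000000)  # assumed kwarg

host_x = torch.randn(1, 1, 32, 4096)
dev_x = ttnn.from_torch(host_x, dtype=ttnn.bfloat16,
                        layout=ttnn.TILE_LAYOUT, device=device)

def decode_step(x):
    # Stand-in for one decode iteration of the model.
    return ttnn.silu(x)

# 1) Capture one iteration; the output tensor is reused on every replay.
trace_id = ttnn.begin_trace_capture(device, cq_id=0)
dev_out = decode_step(dev_x)
ttnn.end_trace_capture(device, trace_id, cq_id=0)

# 2) Replay: write the next input into the persistent device tensor, execute
#    the trace without blocking, then read the output back (with multiple
#    command queues the readback can overlap the next iteration).
for _ in range(8):
    ttnn.copy_host_to_device_tensor(
        ttnn.from_torch(host_x, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT), dev_x)
    ttnn.execute_trace(device, trace_id, cq_id=0, blocking=False)
    host_x = ttnn.to_torch(dev_out)  # feed the output back as the next input

ttnn.release_trace(device, trace_id)
ttnn.close_device(device)
```
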
- Writing correct program configs and shard specs
- Deciding how many cores to run an op on
- Why did we use 16 cores for MLP
- Which matmul to use when (@Colman Glagovich)
  - 1D, 2D, DRAM-sharded, ... (see the config sketch after this group)
- Implicitly padding weights in program config for matmuls
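
Illustrative program configs for the matmul bullets above. Field names follow what tt-metal model demos use at the time of writing and should be checked against your ttnn version; the grid sizes, block sizes, and per-core tile counts are placeholder assumptions, not tuned values.

```python
import ttnn

# 1D multicast matmul: good when one operand is short, e.g. decode-mode
# activations of shape [1, 1, 32, K] against a large weight.
pc_1d = ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig(
    compute_with_storage_grid_size=(8, 2),  # 16 cores, as in the MLP bullet above
    in0_block_w=4,            # K tiles per block
    out_subblock_h=1,
    out_subblock_w=4,
    per_core_M=1,             # M tiles per core
    per_core_N=4,             # N tiles per core; rounding this up past the true
                              # N implicitly pads the weights
    fuse_batch=True,
    fused_activation=None,
    mcast_in0=True,
)

# DRAM-sharded matmul: weights are sharded across DRAM banks; preferred for
# the small-M, large-K/N matmuls that dominate decode.
pc_dram = ttnn.MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig(
    in0_block_w=4,
    per_core_M=1,
    per_core_N=4,
    fused_activation=None,
)

# out = ttnn.linear(x, w, program_config=pc_1d, dtype=ttnn.bfloat16)
```
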
- How we measure it (PCC, perplexity, top-1/top-5, end-user tests, benchmarking); see the PCC sketch after this group
- How much PCC is enough? Rules of thumb.
- Accuracy tests
- Debugging PCC issues
- Performance tooling, tracy
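
A reference-style PCC helper to anchor the accuracy bullets above. This is a sketch of the metric itself (Pearson correlation between flattened reference and device outputs); tt-metal has its own comparison utilities, and the 0.99 threshold shown is only an example of a rule of thumb.

```python
import torch

def compute_pcc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Pearson correlation coefficient between the reference and device outputs.
    expected = expected.flatten().to(torch.float32)
    actual = actual.flatten().to(torch.float32)
    return torch.corrcoef(torch.stack([expected, actual]))[0, 1].item()

# Typical usage in a module test: compare against the reference model and
# assert a threshold (thresholds depend on dtype and the length of the op chain).
# assert compute_pcc(ref_out, ttnn.to_torch(tt_out)) > 0.99
```
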
- Which dim to shard matmuls on
- DRAM-sharding
- Avoiding sharded-to-interleaved calls (see the sketch after this group)
- Running out of L1
- Shard spec and program config mismatches
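
Sketch for keeping activations sharded between ops. The shape and the 8x4 core grid are assumptions; the point is that the producing op writes directly into the shard spec the consuming op expects, so no sharded-to-interleaved-to-sharded round trip is inserted.

```python
import ttnn

# Width-shard a [32, 4096] activation across a 32-core grid (128 elements,
# i.e. 4 tiles, per core).
sharded_mem_cfg = ttnn.create_sharded_memory_config(
    shape=(32, 4096),
    core_grid=ttnn.CoreGrid(y=4, x=8),
    strategy=ttnn.ShardStrategy.WIDTH,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)

# Produce the matmul output already in this config, and make sure the next
# op's program config / shard spec matches it exactly; mismatches are a common
# source of errors and of running out of L1.
# y = ttnn.linear(x, w, memory_config=sharded_mem_cfg, program_config=pc_1d)
```
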
- Some TT-NN ops (e.g. ttnn.all_gather) don't support passing -1 as the dim argument (see the example below)
  - You'll see an error at op invocation where the arguments don't match
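
Minimal illustration of the dim pitfall; the mesh shape, tensor shape, and num_links are placeholders.

```python
import torch
import ttnn

mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, 8))
x = ttnn.from_torch(torch.randn(1, 1, 32, 4096), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=mesh_device,
                    mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=3))

# x_gathered = ttnn.all_gather(x, dim=-1)             # not supported: fails at op invocation
x_gathered = ttnn.all_gather(x, dim=3, num_links=1)   # resolve -1 to the explicit dim instead

ttnn.close_mesh_device(mesh_device)
```
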
- link to any other description, and mention it is needed for N300 and T3K
- Host communications cause tracing to hang
- Running without async mode enabled causes tracing to hang
- Be careful with prints inside traced code
- Large matmuls hanging? Link to appropriate ticket with workaround
  - The issue is being investigated; the workaround is to set the output subblock to 1x1 and the grid size to 8x7 (see the sketch below)
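
Sketch of that workaround applied to a 2D multicast matmul config. Only the 1x1 output subblock and the 8x7 grid come from the note above; the other field values are placeholders, and the field names should be checked against your ttnn version.

```python
import ttnn

pc_workaround = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
    compute_with_storage_grid_size=(8, 7),  # workaround: shrink the grid to 8x7
    in0_block_w=4,
    out_subblock_h=1,                       # workaround: 1x1 output subblock
    out_subblock_w=1,
    per_core_M=4,
    per_core_N=4,
    transpose_mcast=False,
    fused_activation=None,
)

# out = ttnn.matmul(a, b, program_config=pc_workaround)
```
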