Authors:
- LLMs in TT-NN
- Iterative update system
- When to use our fused op
- Replicated layernorm vs distributed layernorm
- LayerNorm/RMSNorm weights in row-major layout / the trick of wrapping them to the tile width (see the sketch below)
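
Possible snippet for the wrap-to-tile-width point above. This is a minimal sketch, assuming an already-available single device; the `dim` value is a placeholder and the commented `ttnn.rms_norm` call should be checked against the current ttnn API.

```python
import torch
import ttnn

# Wrap the 1D norm gamma into rows of 32 (the tile width) and keep it
# row-major, so the norm op can consume it without padding a [1, dim]
# tensor out to full tile height.
TILE_WIDTH = 32
dim = 4096  # assumed hidden size; must be a multiple of TILE_WIDTH

device = ttnn.open_device(device_id=0)

torch_gamma = torch.randn(dim)
gamma_tt = ttnn.from_torch(
    torch_gamma.reshape(1, 1, dim // TILE_WIDTH, TILE_WIDTH),
    dtype=ttnn.bfloat16,
    layout=ttnn.ROW_MAJOR_LAYOUT,
    device=device,
)

# Hypothetical usage with the fused norm op (verify the exact kwargs):
# out = ttnn.rms_norm(x, weight=gamma_tt, epsilon=1e-5)

ttnn.close_device(device)
```
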
- Flash Attention and Flash Decode (see the sketch after this list)
  - general description
  - limitations
  - which dims are parallelized
  - submodules, tests
  - how to combine prefill and decode
  - slicing prefill to fit in L1
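
Sketch to pair with the general description. Shapes are placeholders, and the decode variant is left as a comment because its exact argument names (current position handling, GQA head layout) should be taken from the installed ttnn version rather than from this note.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Prefill: the whole sequence (or an L1-sized slice of it) runs through the
# causal flash-attention kernel, parallelized over heads and query chunks.
batch, n_heads, seq_len, head_dim = 1, 8, 128, 64
q = ttnn.from_torch(torch.randn(batch, n_heads, seq_len, head_dim),
                    dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
k = ttnn.from_torch(torch.randn(batch, n_heads, seq_len, head_dim),
                    dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
v = ttnn.from_torch(torch.randn(batch, n_heads, seq_len, head_dim),
                    dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)

attn_out = ttnn.transformer.scaled_dot_product_attention(q, k, v, is_causal=True)

# Decode: a separate kernel parallelizes over the KV sequence length instead
# of the query dimension and consumes the current position of each user, e.g.
# out = ttnn.transformer.scaled_dot_product_attention_decode(q_step, k_cache, v_cache, cur_pos=positions)

ttnn.close_device(device)
```
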
- device mesh
- column parallel followed by row parallel (see the sketch after this group)
- sharding, CCL ops, reducing CCL overheads, etc.
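
Sketch for the column-parallel-then-row-parallel bullet above. The 1x8 mesh shape, shard dims, dtypes, and tensor sizes are assumptions for a T3K-style setup, not tuned model values.

```python
import torch
import ttnn

mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, 8))

# Column-parallel w1 shards the output features; row-parallel w2 shards the
# input features, so each device produces a partial sum of the final output.
w1 = torch.randn(1, 1, 4096, 14336)
w2 = torch.randn(1, 1, 14336, 4096)

w1_tt = ttnn.from_torch(w1, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT,
                        device=mesh_device,
                        mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=3))
w2_tt = ttnn.from_torch(w2, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT,
                        device=mesh_device,
                        mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=2))

x = ttnn.from_torch(torch.randn(1, 1, 32, 4096), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=mesh_device,
                    mesh_mapper=ttnn.ReplicateTensorToMesh(mesh_device))

y = ttnn.linear(x, w1_tt)   # each device: [1, 1, 32, 14336 / 8]
z = ttnn.linear(y, w2_tt)   # each device: a partial [1, 1, 32, 4096]

# The partials must then be summed across devices with a CCL op (an
# all-reduce, e.g. reduce-scatter followed by all-gather); this is where CCL
# overheads come from and where overlapping/fusing them pays off.

ttnn.close_mesh_device(mesh_device)
```
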
- quick intro and how it is implemented in demos.
- Our vLLM repo and what's needed to integrate with it.
- link to existing doc, why it helps decode more
- how to feed back output to input and read output asynchronously (see the sketch below)
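
Sketch for the feed-output-back-to-input point. The trace-region size passed to `ttnn.open_device` and the `ttnn.copy_host_to_device_tensor` helper are assumptions to verify against the current trace API; the `decode_step` function is a stand-in for a real model call.

```python
import torch
import ttnn

# Trace capture/replay with a persistent device input: each iteration only
# copies new host data into the same tensor and replays the captured program.
device = ttnn.open_device(device_id=0, trace_region_size=20000000)  # assumed kwarg

host_x = torch.randn(1, 1, 32, 4096)
dev_x = ttnn.from_torch(host_x, dtype=ttnn.bfloat16,
                        layout=ttnn.TILE_LAYOUT, device=device)

def decode_step(x):
    # Stand-in for one decode iteration of the model.
    return ttnn.silu(x)

# 1) Capture one iteration; the output tensor is reused on every replay.
trace_id = ttnn.begin_trace_capture(device, cq_id=0)
dev_out = decode_step(dev_x)
ttnn.end_trace_capture(device, trace_id, cq_id=0)

# 2) Replay: write the next input into the persistent device tensor, execute
#    the trace without blocking, then read the output back (with multiple
#    command queues the readback can overlap the next iteration).
for _ in range(8):
    ttnn.copy_host_to_device_tensor(
        ttnn.from_torch(host_x, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT), dev_x)
    ttnn.execute_trace(device, trace_id, cq_id=0, blocking=False)
    host_x = ttnn.to_torch(dev_out)  # feed the output back as the next input

ttnn.release_trace(device, trace_id)
ttnn.close_device(device)
```
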
- Writing correct program configs and shard specs
- Deciding how many cores to run an op on
- Why did we use 16 cores for MLP
- Which matmul to use when (@Colman Glagovich)
  - 1D, 2D, DRAM-sharded, ... (see the config sketch after this group)
- Implicitly padding weights in program config for matmuls
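
Illustrative program configs for the matmul bullets above. Field names follow what tt-metal model demos use at the time of writing and should be checked against your ttnn version; the grid sizes, block sizes, and per-core tile counts are placeholder assumptions, not tuned values.

```python
import ttnn

# 1D multicast matmul: good when one operand is short, e.g. decode-mode
# activations of shape [1, 1, 32, K] against a large weight.
pc_1d = ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig(
    compute_with_storage_grid_size=(8, 2),  # 16 cores, as in the MLP bullet above
    in0_block_w=4,            # K tiles per block
    out_subblock_h=1,
    out_subblock_w=4,
    per_core_M=1,             # M tiles per core
    per_core_N=4,             # N tiles per core; rounding this up past the true
                              # N implicitly pads the weights
    fuse_batch=True,
    fused_activation=None,
    mcast_in0=True,
)

# DRAM-sharded matmul: weights are sharded across DRAM banks; preferred for
# the small-M, large-K/N matmuls that dominate decode.
pc_dram = ttnn.MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig(
    in0_block_w=4,
    per_core_M=1,
    per_core_N=4,
    fused_activation=None,
)

# out = ttnn.linear(x, w, program_config=pc_1d, dtype=ttnn.bfloat16)
```
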
- How we measure it (PCC, perplexity, top-1/top-5, end-user tests, benchmarking); see the PCC sketch after this group
- How much PCC is enough? Rules of thumb.
- Accuracy tests
- Debugging PCC issues
- Performance tooling, tracy
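
A reference-style PCC helper to anchor the accuracy bullets above. This is a sketch of the metric itself (Pearson correlation between flattened reference and device outputs); tt-metal has its own comparison utilities, and the 0.99 threshold shown is only an example of a rule of thumb.

```python
import torch

def compute_pcc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Pearson correlation coefficient between the reference and device outputs.
    expected = expected.flatten().to(torch.float32)
    actual = actual.flatten().to(torch.float32)
    return torch.corrcoef(torch.stack([expected, actual]))[0, 1].item()

# Typical usage in a module test: compare against the reference model and
# assert a threshold (thresholds depend on dtype and the length of the op chain).
# assert compute_pcc(ref_out, ttnn.to_torch(tt_out)) > 0.99
```
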
- Which dim to shard matmuls on
- DRAM-sharding
- Avoiding sharded-to-interleaved calls (see the sketch after this group)
- Running out of L1
- Shard spec and program config mismatches
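
Sketch for keeping activations sharded between ops. The shape and the 8x4 core grid are assumptions; the point is that the producing op writes directly into the shard spec the consuming op expects, so no sharded-to-interleaved-to-sharded round trip is inserted.

```python
import ttnn

# Width-shard a [32, 4096] activation across a 32-core grid (128 elements,
# i.e. 4 tiles, per core).
sharded_mem_cfg = ttnn.create_sharded_memory_config(
    shape=(32, 4096),
    core_grid=ttnn.CoreGrid(y=4, x=8),
    strategy=ttnn.ShardStrategy.WIDTH,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)

# Produce the matmul output already in this config, and make sure the next
# op's program config / shard spec matches it exactly; mismatches are a common
# source of errors and of running out of L1.
# y = ttnn.linear(x, w, memory_config=sharded_mem_cfg, program_config=pc_1d)
```
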
- Some TT-NN ops (e.g. ttnn.all_gather) don't support passing -1 as the dim argument (see the example below)
  - You'll see an error at op invocation where the arguments don't match
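
Minimal illustration of the dim pitfall; the mesh shape, tensor shape, and num_links are placeholders.

```python
import torch
import ttnn

mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, 8))
x = ttnn.from_torch(torch.randn(1, 1, 32, 4096), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=mesh_device,
                    mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=3))

# x_gathered = ttnn.all_gather(x, dim=-1)             # not supported: fails at op invocation
x_gathered = ttnn.all_gather(x, dim=3, num_links=1)   # resolve -1 to the explicit dim instead

ttnn.close_mesh_device(mesh_device)
```
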
- link to any other description, and mention it is needed for N300 and T3K
- Host communications cause tracing to hang
- Running without async mode enabled causes tracing to hang
- Be careful with prints inside traced code
- Large matmuls hanging? Link to appropriate ticket with workaround
  - The issue is being investigated; the workaround is to set the output subblock to 1x1 and the grid size to 8x7 (see the sketch below)
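
Sketch of that workaround applied to a 2D multicast matmul config. Only the 1x1 output subblock and the 8x7 grid come from the note above; the other field values are placeholders, and the field names should be checked against your ttnn version.

```python
import ttnn

pc_workaround = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
    compute_with_storage_grid_size=(8, 7),  # workaround: shrink the grid to 8x7
    in0_block_w=4,
    out_subblock_h=1,                       # workaround: 1x1 output subblock
    out_subblock_w=1,
    per_core_M=4,
    per_core_N=4,
    transpose_mcast=False,
    fused_activation=None,
)

# out = ttnn.matmul(a, b, program_config=pc_workaround)
```
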