LLMs in TT-NN

Authors:

Contents

1. Overview

2. Modules

2.1 Embedding

2.2 RoPE

  • Iterative update system (see the reference sketch below)
  • When to use our fused op
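
A minimal host-side reference for what the RoPE module computes, in plain PyTorch with the rotate-half convention (an assumption; the fused TT-NN op may use a different layout):

```python
import torch

def precompute_rope(head_dim: int, max_seq_len: int, theta: float = 10000.0):
    # Per-channel-pair frequencies: theta^(-2i/d) for i in [0, d/2).
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_seq_len).float(), inv_freq)
    angles = torch.cat([angles, angles], dim=-1)  # [max_seq_len, head_dim]
    return angles.cos(), angles.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x, cos, sin, pos):
    # x: [batch, n_heads, seq, head_dim]; pos: the positions covered by x.
    # In decode mode only one row of cos/sin is needed per step, which is
    # why iteratively updating these tables beats rebuilding them each step.
    return x * cos[pos] + rotate_half(x) * sin[pos]
```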

2.3 Norm

  • Replicated layernorm vs. distributed layernorm
    • Layernorm/rmsnorm weights in row major, wrapped around the tile size (see the sketch below)
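
For reference, a plain-PyTorch RMSNorm; the distributed variant combines per-device partial statistics before scaling. The reshape at the end is my reading of the row-major / tile-size note above, shown as an assumption:

```python
import torch

def rms_norm(x: torch.Tensor, gamma: torch.Tensor, eps: float = 1e-5):
    # Normalize by the root-mean-square over the hidden dim, then scale.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x / rms) * gamma

# Assumed "wrap around tile size" trick: reshape the 1-D weight into 32-wide
# rows so it packs cleanly into 32x32 tiles on device (illustrative only).
gamma = torch.ones(4096)
gamma_tiled = gamma.reshape(1, 1, gamma.shape[-1] // 32, 32)
```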

2.4 Attention

  • Flash Attention and Flash Decode
    • General description (reference sketch below)
    • Limitations
    • Which dims are parallelized
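
For orientation, the reference computation both ops implement, in plain PyTorch. Flash Attention produces the same result while tiling the work so the full score matrix never materializes in L1; Flash Decode additionally splits the single-query-position decode case across the KV sequence:

```python
import math
import torch

def sdpa_reference(q, k, v, is_causal=True):
    # q, k, v: [batch, n_heads, seq, head_dim] (GQA head-broadcast omitted).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if is_causal:
        mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), 1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```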

2.5 MLP

2.6 Decoder

2.7 LM Head

3. Features

3.1 Generative Decoding

3.2 Prefill and Decode

  • Submodules, tests
  • How to combine prefill and decode (see the sketch below)
  • Slicing prefill to fit in L1
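
A sketch of how the two phases combine into generation, against a hypothetical model API (the prefill/decode method names are placeholders, not TT-NN calls):

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens):
    # Prefill: run the whole prompt in one pass to fill the KV cache;
    # only the logits at the last prompt position are used.
    logits = model.prefill(prompt_ids)        # hypothetical API
    token = logits[:, -1].argmax(dim=-1)
    out, pos = [token], prompt_ids.shape[-1]
    for _ in range(max_new_tokens - 1):
        # Decode: one token in, one token out, reusing the cache.
        logits = model.decode(token, pos)     # hypothetical API
        token = logits[:, -1].argmax(dim=-1)
        out.append(token)
        pos += 1
    return torch.stack(out, dim=-1)
```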

3.3 Multi-Device

  • Device mesh
  • Column parallel followed by row parallel (see the sketch below)
  • Sharding, CCL ops, reducing CCL overheads, etc.
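
A host-side emulation of the column-parallel-then-row-parallel pattern (the per-shard loop stands in for the device mesh, and the final sum stands in for the CCL all-reduce):

```python
import torch
import torch.nn.functional as F

def mlp_column_then_row(x, w1_shards, w2_shards):
    # Column parallel: each device holds a column slice of W1 and produces
    # a disjoint slice of the hidden activation -- no communication needed.
    hidden = [F.silu(x @ w1) for w1 in w1_shards]
    # Row parallel: each device holds the matching row slice of W2; partial
    # outputs are summed, which on hardware is a CCL all-reduce.
    return sum(h @ w2 for h, w2 in zip(hidden, w2_shards))
```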

3.4 Continuous Batching

  • Quick intro and how it is implemented in the demos (see the scheduling sketch below)
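
A minimal scheduling sketch of the idea (step_fn is a hypothetical callback that advances every occupied slot one decode step and reports which slots finished):

```python
from collections import deque

def continuous_batching_loop(requests, num_slots, step_fn):
    waiting, slots = deque(requests), {}
    while waiting or slots:
        # Refill free slots immediately instead of waiting for the whole
        # batch to drain -- the core idea of continuous batching.
        for i in range(num_slots):
            if i not in slots and waiting:
                slots[i] = waiting.popleft()
        for i in step_fn(slots):
            del slots[i]
```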

3.5 vLLM Integration

  • Our vLLM repo and what's needed to integrate with it.

4. Best Practices and Optimizations

4.1 Tracing

  • Link to the existing doc; why tracing helps decode more than prefill

4.2 Async Mode

4.3 Multiple CQs

  • How to feed output back into the input and read the output asynchronously

4.4 Op Configs

  • Writing correct program configs and shard specs (illustrative config below)
  • Deciding how many cores to run an op on
    • Why we used 16 cores for the MLP
  • Which matmul to use when (@Colman Glagovich)
    • 1D, 2D, DRAM-sharded, ...
  • Implicitly padding weights in the program config for matmuls
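
To show the shape of such a config: a 1D multicast matmul program config of the kind used in the tt-metal model demos. The class and field names are recalled from those demos and may differ across ttnn versions; every value here is a placeholder, not a recommendation:

```python
import ttnn

# Illustrative 1D multicast matmul config on an 8x2 grid (16 cores, matching
# the MLP core count above); all values are placeholders.
program_config = ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig(
    compute_with_storage_grid_size=(8, 2),
    in0_block_w=4,       # K blocking per core, in tiles
    out_subblock_h=1,    # output subblock height, in tiles
    out_subblock_w=4,    # output subblock width, in tiles
    per_core_M=1,        # output rows per core, in tiles
    per_core_N=4,        # output columns per core, in tiles
    fuse_batch=True,
    fused_activation=None,
    mcast_in0=True,
)
output = ttnn.matmul(activations, weights, program_config=program_config)
```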

4.5 Accuracy

  • How we measure it (PCC, perplexity, top-1/top-5, end-user tests, benchmarking); see the PCC snippet below
  • How much PCC is enough? Rules of thumb.
  • Accuracy tests
  • Debugging PCC issues
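
The PCC check is simple enough to show inline:

```python
import numpy as np

def pcc(golden, actual):
    # Pearson correlation coefficient between the flattened golden (host)
    # output and the device output; 1.0 means perfectly correlated.
    g = np.asarray(golden, dtype=np.float64).ravel()
    a = np.asarray(actual, dtype=np.float64).ravel()
    return float(np.corrcoef(g, a)[0, 1])

# Typical module-test usage (the 0.99 threshold is illustrative):
# assert pcc(torch_out.numpy(), tt_out.numpy()) > 0.99
```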

4.6 Performance Analysis

  • Performance tooling, tracy

4.7 Misc. Performance Optimizations

  • Which dim to shard matmuls on
  • DRAM-sharding
  • Avoiding sharded to interleaved calls

4.8 Module Tests

4.9 Performance Testing

4.10 Common Pitfalls

4.10.1 Error Messages

  • Running out of L1
  • Shard spec and program config mismatches
  • Some TT-NN ops (e.g. ttnn.all_gather) do not support passing -1 as the dim argument (see the workaround below)
    • You'll see an op-invocation error saying the arguments don't match
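
A small guard for the ttnn.all_gather case, based on the limitation described above (the keyword usage is an assumption; check the op's current signature):

```python
import ttnn

def safe_all_gather(tensor, dim, **kwargs):
    # ttnn.all_gather does not accept -1 for dim, so normalize negative
    # dims against the tensor's rank before invoking the op.
    if dim < 0:
        dim += len(tensor.shape)
    return ttnn.all_gather(tensor, dim=dim, **kwargs)
```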

4.10.2 Shard Spec Mismatches

4.10.3 Ethernet Dispatch Cores

  • Link to any existing description, and mention that it is needed for N300 and T3K

4.10.4 Hangs

4.10.4.1 Tracing

  • Host communications cause tracing to hang
  • Running without async mode enabled causes tracing to hang
  • Be careful with prints inside traced regions

4.10.4.2 Large Matmuls

  • Large matmuls hanging? Link to the appropriate ticket with the workaround
  • The issue is under investigation; the workaround is to set the output subblock to 1x1 and the grid size to 8x7