Tokenizers

This is a repo for implementing tokenizers for fun.


Tokenization is at the heart of much of the weirdness of LLMs.

  • LLMs struggle to spell words
  • LLMs struggle to reverse a string
  • LLMs are bad at simple arithmetic

Cool link to visualize different tokenizers.

Character level encoding

With this tokenizer, each character in the text corresponds to an integer. Training is incredibly simple: go through the training text and assign each previously unseen character the next unused integer ID.
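
As a rough illustration (the function names here are hypothetical, not this repo's API), training and encoding could look like this:

```python
# Minimal sketch of a character-level tokenizer.

def train_char_tokenizer(text: str) -> dict[str, int]:
    """Assign each distinct character the next unused integer ID."""
    stoi: dict[str, int] = {}
    for ch in text:
        if ch not in stoi:
            stoi[ch] = len(stoi)
    return stoi

def encode(text: str, stoi: dict[str, int]) -> list[int]:
    return [stoi[ch] for ch in text]

def decode(ids: list[int], stoi: dict[str, int]) -> str:
    itos = {i: ch for ch, i in stoi.items()}
    return "".join(itos[i] for i in ids)

if __name__ == "__main__":
    vocab = train_char_tokenizer("hello world")
    ids = encode("hello", vocab)
    print(ids)                 # [0, 1, 2, 2, 3]
    print(decode(ids, vocab))  # "hello"
```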

Advantages:

  • Very small vocabulary size, which leads to small embedding tables in transformers.
  • Fast to train. Requires one run through the text.

Disadvantages:

  • There is no compression of the sequence length, which reduces the effective context length of a transformer (less text fits in a fixed window) and increases computational cost.

Byte pair encoding

Algorithm: iteratively replace the most frequent pair of adjacent tokens in the training text with a new, unused 'placeholder' token, repeating until the desired vocabulary size is reached.
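
A minimal sketch of the training loop, starting from raw UTF-8 bytes (function names are illustrative, not necessarily this repo's):

```python
from collections import Counter

def get_pair_counts(ids: list[int]) -> Counter:
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text: str, num_merges: int) -> dict[tuple[int, int], int]:
    ids = list(text.encode("utf-8"))       # start from raw bytes (IDs 0..255)
    merges: dict[tuple[int, int], int] = {}
    for i in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        new_id = 256 + i                    # next unused token ID
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges
```

Encoding a new string then just replays the recorded merges in order; decoding expands each merged token back into its constituent pair until only bytes remain.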

Considerations

Vocabulary size

The number of parameters in a transformer model depends on the vocabulary size: a higher vocab_size means more parameters, specifically in the token embedding table and in the final linear (head) layer of the decoder.
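
For a rough sense of scale (GPT-2-like sizes, purely illustrative):

```python
# Back-of-the-envelope parameter count for the vocab-dependent layers.
d_model = 768
vocab_size = 50257

embedding_params = vocab_size * d_model  # token embedding table
head_params = d_model * vocab_size       # final linear layer (if not weight-tied)

print(embedding_params)                # 38,597,376
print(embedding_params + head_params)  # 77,194,752 without weight tying
```

Tying the head weights to the embedding table halves this cost, but the embedding table itself still grows linearly with vocab_size.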
