This repository contains an implementation of speculative sampling, a method for faster LLM inference using a draft model.
The implementation is based on my own interpretation of the paper *Accelerating Large Language Model Decoding with Speculative Sampling* by DeepMind.
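For context, the core idea from the paper works roughly as follows: a small draft model proposes `K` tokens, the target model scores all of them in a single forward pass, and each draft token is accepted with probability min(1, p(x)/q(x)); on the first rejection, a replacement token is resampled from the adjusted distribution (p - q)+. The snippet below is a simplified illustration of that acceptance step, not this repository's exact code; the tensor shapes and names (`target_probs`, `draft_probs`, `draft_tokens`) are assumptions.

```python
import torch

def accept_or_resample(target_probs, draft_probs, draft_tokens):
    """Simplified acceptance loop in the spirit of the DeepMind paper.

    target_probs: (K+1, vocab) distributions from the target model
    draft_probs:  (K, vocab)   distributions from the draft model
    draft_tokens: (K,)         tokens sampled from the draft model
    Returns the accepted tokens plus one extra (resampled or bonus) token.
    """
    accepted = []
    for t in range(draft_tokens.shape[0]):
        x = draft_tokens[t].item()
        p, q = target_probs[t, x], draft_probs[t, x]
        # Accept the draft token with probability min(1, p/q).
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(x)
        else:
            # On rejection, resample from the adjusted distribution (p - q)+.
            adjusted = torch.clamp(target_probs[t] - draft_probs[t], min=0.0)
            adjusted /= adjusted.sum()
            accepted.append(torch.multinomial(adjusted, 1).item())
            return accepted
    # All K draft tokens accepted: sample one bonus token from the target model.
    accepted.append(torch.multinomial(target_probs[-1], 1).item())
    return accepted
```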
This project uses uv for dependency management. To install uv, run one of the following commands:
```sh
# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows.
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# With pip.
pip install uv

# With pipx.
pipx install uv

# With Homebrew.
brew install uv

# With Pacman.
pacman -S uv
```
Thereafter, install the rest of the dependencies using uv:
```sh
# Create a virtual environment.
uv venv

# Install dependencies from requirements.txt.
uv pip install -r requirements.txt

# Check CLI options.
python main.py --help
```
```text
usage: main.py [-h] --target-model TARGET_MODEL --draft-model DRAFT_MODEL --input-str INPUT_STR [--num-runs NUM_RUNS] [--N N] [--K K] [--temperature TEMPERATURE]
               [--top-k TOP_K] [--top-p TOP_P]

optional arguments:
  -h, --help            show this help message and exit
  --target-model TARGET_MODEL
                        Target model
  --draft-model DRAFT_MODEL
                        Draft model
  --input-str INPUT_STR
                        Input string
  --num-runs NUM_RUNS   Number of LLM inference runs
  --N N                 Number of tokens to generate
  --K K                 Number of tokens to speculate
  --temperature TEMPERATURE
                        Temperature
  --top-k TOP_K         Top k sampling
  --top-p TOP_P         Top p sampling
```
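For reference, a minimal argparse setup that matches the options above would look roughly like this sketch; the defaults are illustrative and the repository's actual `main.py` may differ:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--target-model", required=True, help="Target model")
    parser.add_argument("--draft-model", required=True, help="Draft model")
    parser.add_argument("--input-str", required=True, help="Input string")
    parser.add_argument("--num-runs", type=int, default=1, help="Number of LLM inference runs")
    parser.add_argument("--N", type=int, default=40, help="Number of tokens to generate")
    parser.add_argument("--K", type=int, default=4, help="Number of tokens to speculate")
    parser.add_argument("--temperature", type=float, default=1.0, help="Temperature")
    parser.add_argument("--top-k", type=int, default=0, help="Top k sampling")
    parser.add_argument("--top-p", type=float, default=0.0, help="Top p sampling")
    return parser.parse_args()
```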
Run the LLM inference comparison script:
```sh
python main.py --target-model gpt2-xl \
    --draft-model gpt2 \
    --input-str "Alan Turing theorized that computers would one day become" \
    --num-runs 50 \
    --N 40 \
    --K 4 \
    --temperature 0.6 \
    --top-k 25 \
    --top-p 0.9
```
- With `--num-runs 1`, the script runs the LLM inference `num-runs + 1` times to account for the warm-up time (a timing sketch in this spirit follows below).
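A benchmark loop in this spirit looks roughly like the sketch below, where the first (warm-up) run is discarded before timing statistics are computed; the function and variable names are illustrative, not the repository's exact code.

```python
import time
import numpy as np

def benchmark(generate_fn, num_runs):
    """Time generate_fn over num_runs + 1 calls, discarding the warm-up run."""
    timings = []
    for i in range(num_runs + 1):
        start = time.perf_counter()
        generate_fn()
        elapsed = time.perf_counter() - start
        if i > 0:  # skip the first (warm-up) run
            timings.append(elapsed)
    return float(np.mean(timings)), float(np.std(timings))
```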
The following results were obtained on a MacBook Pro M2 Pro Max with 32 GB RAM, comparing speculative sampling with naive autoregressive sampling over multiple LLM inference runs:
- `N`: 40
- `K`: 4
- `temperature`: 0.6
- `top-k`: 25
- `top-p`: 0.9
> **Note:** This serves as a sanity check for the speculative sampling method. In this case, since the target model and draft model are the same, there should be no rejection of the speculative samples.
Method | num_runs | time (s) | +/- std | speedup |
---|---|---|---|---|
Autoregressive Sampling | 50 | 2.84 | 0.20 | 1.00 |
Speculative Sampling | 50 | 2.96 | 0.22 | 0.96 |
Method | num_runs | time (s) | +/- std | speedup |
---|---|---|---|---|
Autoregressive Sampling | 50 | 2.86 | 0.16 | 1.00 |
Speculative Sampling | 50 | 2.17 | 0.32 | 1.31 |
The following results were obtained on 2x A6000 GPUs, comparing speculative sampling with naive autoregressive sampling in a single LLM inference run:
- `N`: 50
- `K`: 4
- `temperature`: 0
- `top-k`: 0
- `top-p`: 0
- Target Model: `Meta-Llama-3.1-70B-bnb-4bit` and Draft Model: `Meta-Llama-3.1-8B-bnb-4bit` (a loading sketch follows below)
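These checkpoints are pre-quantized 4-bit (bitsandbytes) variants, so loading them with Hugging Face transformers looks roughly like the sketch below. The `unsloth/` namespace and the `device_map` setting are assumptions; adjust them to wherever the checkpoints are hosted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint locations; the bnb-4bit weights load via bitsandbytes.
target_id = "unsloth/Meta-Llama-3.1-70B-bnb-4bit"
draft_id = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(target_id)
target_model = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft_model = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")
```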
Demo video: `autogressive-vs-speculative.mp4`
Method | time (s) | tokens/sec | speedup |
---|---|---|---|
Autoregressive Sampling | 27.7 | 1.79 | 1.00 |
Speculative Sampling | 11.3 | 4.68 | ~2.61 |
Based on the mini experiments and results above, we observe that speculative sampling offers a significant speedup over autoregressive sampling.
In our sanity check, we confirmed that when the target and draft models are identical, speculative sampling produces no rejected samples, since the draft tokens are sampled from exactly the same probability distribution as the target's. Additionally, because the models are identical in size and we are essentially running the same model twice (with extra forward passes for the draft), speculative sampling is expected to be slower than autoregressive sampling in this setting.
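Concretely, the acceptance rule accepts a draft token x with probability min(1, p(x)/q(x)); when the target distribution p and the draft distribution q are identical, that ratio is 1 for every token, so nothing is ever rejected. A toy check (illustrative only):

```python
import torch

vocab = 8
p = torch.softmax(torch.randn(vocab), dim=-1)  # target distribution
q = p.clone()                                  # identical draft distribution

accept_prob = torch.clamp(p / q, max=1.0)
assert torch.allclose(accept_prob, torch.ones(vocab))  # every token is accepted
```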
```bibtex
@misc{chen2023acceleratinglargelanguagemodel,
      title={Accelerating Large Language Model Decoding with Speculative Sampling},
      author={Charlie Chen and Sebastian Borgeaud and Geoffrey Irving and Jean-Baptiste Lespiau and Laurent Sifre and John Jumper},
      year={2023},
      eprint={2302.01318},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2302.01318},
}
```
The implementation for speculative sampling is built upon the following repository: