Step-by-Step

This document provides step-by-step instructions for running large language models (LLMs) on 4th Gen Intel® Xeon® Scalable Processors (code-named Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.

We currently support two models, GPT-J and BLOOM-176B, and we are adding more models and more advanced techniques (distributed inference, model compression, etc.) to better unleash LLM inference on Intel platforms.

Prerequisite

Create Environment

# Install oneMKL and the memory-allocator libraries (jemalloc, gperftools) from conda
conda install mkl mkl-include -y
conda install jemalloc gperftools -c conda-forge -y

# Install CPU-only PyTorch, Intel Extension for PyTorch, and the example's dependencies
pip install torch==1.13.1 --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel_extension_for_pytorch==1.13.0
pip install -r requirements.txt
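
As a quick sanity check (an optional step, not part of the original instructions), you can verify that both packages import cleanly and report the expected versions:

python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"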

Setup Environment Variables

# Intel OpenMP (libiomp5) runtime tuning
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0

# IOMP
export OMP_NUM_THREADS=<number of cores to use>
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so
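
For example, on a machine with one 56-core socket you might set the following (the core count here is illustrative; check your topology with lscpu and adjust):

# illustrative example: use all 56 physical cores of one socket
export OMP_NUM_THREADS=56
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so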

Performance Benchmark

GPT-J

Performance

# use jemalloc
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"
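
If you're unsure whether the preload path is correct in your environment, a simple check (illustrative, not part of the original instructions) is to confirm the library exists:

ls ${CONDA_PREFIX}/lib/libjemalloc.so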

# default is beam search with num_beams=4; if you need to use greedy search for comparison, add "--greedy" to the args.
numactl -m <node N> -C <cpu list> \
    python run_gptj.py \
        --precision <fp32/bf16> \
        --max-new-tokens 32
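
For instance, a concrete invocation with the placeholders filled in (NUMA node 0 and cores 0-55 are illustrative; adjust to your machine's topology) could look like:

numactl -m 0 -C 0-55 \
    python run_gptj.py \
        --precision bf16 \
        --max-new-tokens 32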

BLOOM-176B

Performance

We don't enable jemalloc here since BLOOM-176B requires a large amount of memory and would run into memory contention with jemalloc.

numactl -m <node N> -C <cpu list> python3 run_bloom.py --batch_size 1 --benchmark

By default the searcher is set to beam search with num_beams=4; if you'd like to use greedy search for comparison, add "--greedy" to the args.
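
As with GPT-J, a filled-in example (node and core list are illustrative) might be:

numactl -m 0 -C 0-55 python3 run_bloom.py --batch_size 1 --benchmark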

Note: Inference performance speeds up with Intel DL Boost (VNNI/AMX) on Intel® Xeon® hardware. Please refer to the Performance Tuning Guide for more optimizations.