
TensorRT-LLM shows no speedup for the starcoder7b model (Hackathon 2023) #98

Open

young-955 opened this issue Sep 21, 2023 · 1 comment
Environment

CPU architecture: x86_64
GPU name: NVIDIA A10
TensorRT branch: 9.0.0
TensorRT-LLM: 0.1.3
CUDA: 12.1.66
cuDNN: 8.9.0
Container: registry.cn-hangzhou.aliyuncs.com/trt-hackathon/trt-hackathon:final_v1
NVIDIA driver version: 525.105.17
OS: Ubuntu 22.04.3 LTS x86_64
Kernel: 5.15.0-73-generic

Brief description of the problem

After pulling the https://huggingface.co/bigcode/starcoderbase-7b model, running inference directly with PyTorch and running inference after converting the model to TensorRT-LLM show no significant performance difference.

Reproduction code

PyTorch inference code

import time

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./"
device = "cuda"  # for GPU usage, or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).half().cuda()

end_token = "<fim_suffix>"

# First inference (includes CUDA warmup), followed by a second timed inference.
t1 = time.time()
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").cuda()
outputs = model.generate(inputs, max_new_tokens=20, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.convert_tokens_to_ids(end_token))
print(tokenizer.decode(outputs[0]))
t2 = time.time()
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").cuda()
outputs = model.generate(inputs, max_new_tokens=20, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.convert_tokens_to_ids(end_token))
print(tokenizer.decode(outputs[0]))
t3 = time.time()
print(f'cost: 1st infer: {t2-t1}, 2nd infer: {t3-t2}')
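One caveat on the timing above: CUDA work is asynchronous, so time.time() around generate() folds warmup and host-side overhead into the measurement. Below is a minimal sketch of a more controlled measurement, assuming the model and tokenizer objects from the script above (timed_generate is a hypothetical helper, not part of either codebase):

import time
import torch

def timed_generate(model, tokenizer, prompt, max_new_tokens=20, warmup=1, iters=5):
    # Measure steady-state latency: discard warmup runs, then average several runs.
    inputs = tokenizer.encode(prompt, return_tensors="pt").cuda()
    eos_id = tokenizer.convert_tokens_to_ids("<fim_suffix>")
    for _ in range(warmup):
        model.generate(inputs, max_new_tokens=max_new_tokens,
                       pad_token_id=tokenizer.pad_token_id, eos_token_id=eos_id)
    torch.cuda.synchronize()  # drain queued GPU work before starting the clock
    t0 = time.time()
    for _ in range(iters):
        outputs = model.generate(inputs, max_new_tokens=max_new_tokens,
                                 pad_token_id=tokenizer.pad_token_id, eos_token_id=eos_id)
    torch.cuda.synchronize()
    dt = (time.time() - t0) / iters
    new_tokens = outputs.shape[1] - inputs.shape[1]
    print(f"{dt * 1000:.1f} ms/run, {dt * 1000 / new_tokens:.1f} ms/token")

timed_generate(model, tokenizer, "def print_hello_world():")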

PyTorch performance

[Screenshot: pytorch-starcoder7b timing output]

TensorRT-LLM model conversion and inference code

python3 hf_gpt_convert.py -p 1 --model starcoder -i ../../starcoderbase-7b -o ./c-model/starcoder --tensor-parallelism 1 --storage-type float16

python3 build.py \
    --model_dir ./c-model/starcoder/1-gpu \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_layernorm_plugin \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp1 \
    --world_size 1

mpirun -np 1 --allow-run-as-root python3 run.py --engine_dir starcoder_outputs_tp1 --tokenizer ../../starcoderbase-7b --input_text "def print_hello_world():" --max_output_len 20
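Note that each run.py invocation above is a fresh process, so any end-to-end timing of this command also includes MPI startup and engine deserialization. One way to make per-token differences more visible (same script and flags as above, only max_output_len changed) is to generate a longer sequence:

mpirun -np 1 --allow-run-as-root python3 run.py --engine_dir starcoder_outputs_tp1 --tokenizer ../../starcoderbase-7b --input_text "def print_hello_world():" --max_output_len 256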

TensorRT-LLM performance

[Screenshot: trtllm-starcoder7b timing output]

From the results above, the second inference run shows no significant difference between the PyTorch version and the TensorRT-LLM version.
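With only 20 output tokens, fixed per-call overhead (tokenization, Python dispatch, kernel launch) can dominate the wall time and mask a per-token speedup. A back-of-envelope sketch for separating the two, assuming you time the same prompt at two different generation lengths (split_overhead is a hypothetical helper, not from either codebase):

# Given wall times t1, t2 (seconds) for generation lengths n1, n2 (tokens),
# solve t = overhead + n * per_token for the two unknowns.
def split_overhead(n1, t1, n2, t2):
    per_token = (t2 - t1) / (n2 - n1)  # marginal cost of one extra token
    overhead = t1 - n1 * per_token     # fixed cost: load, tokenize, launch
    return overhead, per_token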

shm007g commented Oct 19, 2023
