Is 7B llama speed expected to be slow? #19

Open
w32zhong opened this issue Aug 30, 2024 · 2 comments

w32zhong commented Aug 30, 2024

Hello, thank you for open-sourcing such solid work! Feel free to add my WeChat (hellozhongwei) for an offline chat!

I know that, in the paper, the inference speed in Figure 2 is measured only on the gate_proj linear operation of the 70B LLaMA. The speedup bars look impressive, even though I would assume the de-quantization and re-scaling in the CUDA kernel carry significant overhead.

My hypothesis is that the speedup comes from single-batch decoding being memory-bound, so reading fewer weight bytes makes it faster. But if that is the case, shouldn't full-model single-batch inference be faster as well? I do not have enough hardware resources, so I tested the smaller LLaMA 7B checkpoint ChenMnZ/Llama-2-7b-EfficientQAT-w2g64-BitBLAS. However, the 2-bit BitBLAS version only reaches around 14.5 tokens/s, while the Hugging Face native fp16 model is faster (20 tokens/s), even though the latter runs with model parallelism across the two GPUs.

My question is whether this is expected. Since BitBLAS already applies efficient schedules to its CUDA kernels, I would expect it to deliver the higher inference speed reported in Figure 2. Why doesn't it?

Test devices: 2x RTX 3060
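
For context, here is a rough back-of-envelope sketch of the memory-bound decode ceiling I have in mind. The numbers are assumptions, not measurements: ~360 GB/s is the nominal bandwidth of a single RTX 3060 12 GB, and the 2-bit byte count ignores scale/zero-point overhead.

# Rough roofline-style estimate: single-batch decoding reads every
# weight once per token, so tokens/s <= bandwidth / weight bytes.
# All numbers below are nominal assumptions, not measurements.
bandwidth_bytes_s = 360e9      # ~RTX 3060 12GB memory bandwidth
params = 7e9                   # LLaMA-2 7B parameter count

bytes_fp16 = params * 2        # 2 bytes per fp16 weight
bytes_w2 = params * 2 / 8      # 2-bit weights (scales/zeros ignored)

print("fp16  ceiling: %.1f tokens/s" % (bandwidth_bytes_s / bytes_fp16))
print("w2g64 ceiling: %.1f tokens/s" % (bandwidth_bytes_s / bytes_w2))

Under this (admittedly crude) model, the fp16 run is close to its ceiling while the 2-bit run is far below it.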

Test code:

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import TextStreamer
from gptqmodel import GPTQModel

# ref model
ref_model_path = "NousResearch/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(ref_model_path)
model = AutoModelForCausalLM.from_pretrained(ref_model_path,
    torch_dtype=torch.float16, device_map='auto', load_in_8bit=False)
streamer = TextStreamer(tokenizer)

start = time.time()
output = model.generate(
    **tokenizer("Solar eclipse is ", return_tensors="pt").to(model.device),
    max_new_tokens=256, streamer=streamer, use_cache=True
)
end = time.time()

output_len = output.shape[-1]
delta_time = end - start
print(output_len, delta_time, output_len / delta_time)

# 2-bit model in BitBLAS
model_path = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g64-BitBLAS"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = GPTQModel.from_quantized(model_path)
streamer = TextStreamer(tokenizer)

start = time.time()
output = model.generate(
    **tokenizer("Solar eclipse is ", return_tensors="pt").to(model.device),
    max_new_tokens=256, streamer=streamer, use_cache=True
)
end = time.time()

output_len = output.shape[-1]
delta_time = end - start
print(output_len, delta_time, output_len / delta_time)
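
A possible refinement of the measurement above (a sketch, not what produced the numbers I reported): warm up once, synchronize around the timed region, and count only newly generated tokens, since output.shape[-1] also includes the prompt tokens.

import time
import torch

def tokens_per_second(model, tokenizer, prompt="Solar eclipse is ", max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Warm-up run so kernel compilation / tuning is not timed.
    model.generate(**inputs, max_new_tokens=8, use_cache=True)
    torch.cuda.synchronize()
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    torch.cuda.synchronize()
    delta = time.time() - start
    # Count only the newly generated tokens, not the prompt.
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / delta
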
ChenMnZ (Collaborator) commented Sep 11, 2024

Yeah, it is weird. I am waiting for BitBLAS to resolve this problem (see microsoft/BitBLAS#90 for details).

w32zhong (Author) commented
@ChenMnZ thanks for sharing the related issue. Feel free to share my code with them for replication.

I have recently been learning Nsight Systems to better understand the trade-offs. Hardware specs may play a large role here, although it still boils down to de-quantization overhead versus the reduction in memory traffic from the smaller weights. I may be able to share some insights here once I have learned more.
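
For reference, a minimal sketch (my own, assuming the standard torch.cuda.nvtx wrapper and an `nsys profile -t cuda,nvtx` launch) of how the two generate calls could be bracketed with NVTX ranges so Nsight Systems can attribute the time:

import torch

def timed_generate(tag, model, tokenizer, prompt="Solar eclipse is "):
    # Bracket the whole decode with a named NVTX range so it shows up
    # as a labeled region on the Nsight Systems timeline.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.nvtx.range_push(tag)
    output = model.generate(**inputs, max_new_tokens=256, use_cache=True)
    torch.cuda.synchronize()
    torch.cuda.nvtx.range_pop()
    return output

# e.g. timed_generate("fp16_generate", model_fp16, tokenizer)
#      timed_generate("bitblas_w2_generate", model_w2, tokenizer)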
