Hello, thank you for open-sourcing such solid work! Feel free to add me on WeChat (hellozhongwei) for an offline chat!

I understand that, in the paper, the inference speed in Figure 2 is measured only on the `gate_proj` linear operation of the 70B LLaMA model. The speed bars look impressive, although I would expect the de-quantization and re-scaling inside the CUDA kernel to add considerable overhead.

My hypothesis is that the speedup comes from single-batch decoding being memory-bound, so the smaller 2-bit weights reduce memory traffic. But if that is the case, full-model single-batch inference should be faster as well. I do not have enough hardware resources, so I tested the smaller LLaMA 7B checkpoint ChenMnZ/Llama-2-7b-EfficientQAT-w2g64-BitBLAS. However, the 2-bit BitBLAS version only reaches around 14.5 tokens/s, while the native Hugging Face fp16 model is faster (20 tokens/s), even though the latter runs with model parallelism across two GPUs.

My question is whether this is expected. Since BitBLAS already applies efficient schedules to its CUDA kernels, I would expect the higher inference speed you report in Figure 2. Why is that not the case here?

Test devices: 2x RTX 3060
Test code:
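(The original snippet did not survive the page extraction. Below is a minimal sketch of how such a single-batch tokens/s measurement might look; the model id, prompt, and generation settings are assumptions, and the 2-bit BitBLAS checkpoint would need to be loaded through the EfficientQAT/BitBLAS loading path rather than a plain `from_pretrained`.)

```python
# Sketch of a single-batch tokens/s benchmark (fp16 baseline shown;
# the quantized checkpoint would be loaded via the repo's own loader).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumption: fp16 baseline model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the trade-offs of 2-bit weight quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
max_new_tokens = 256

# Warm-up to keep kernel compilation / allocator effects out of the timing.
model.generate(**inputs, max_new_tokens=8, do_sample=False)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated / elapsed:.1f} tokens/s")
```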
@ChenMnZ thanks for sharing this related post. Feel free to share my code with them for replication.

I have recently been learning Nsight Systems to better understand the trade-offs. Hardware specs may play a large role here, although it ultimately boils down to dequantization overhead versus the memory-transfer reduction from the smaller weights. I may be able to share some insights once I have learned more about it.
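(Not from the original comment, just an illustration: one way to attribute time between the dequantizing kernels and the surrounding memory traffic in the Nsight Systems timeline is to wrap the suspect calls in NVTX ranges; the function and range names below are placeholders.)

```python
# Sketch: NVTX ranges show up as named spans in the nsys timeline,
# making it easier to see where the 2-bit linear spends its time.
import torch

def timed_forward(quant_linear, x):
    torch.cuda.nvtx.range_push("gate_proj_2bit")  # placeholder range name
    y = quant_linear(x)                           # dequant + matmul run inside the kernel(s)
    torch.cuda.nvtx.range_pop()
    return y

# Then profile with e.g.:  nsys profile -t cuda,nvtx -o report python bench.py
```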