
v0.2.0 (CUDA 12) performance regression compared to v0.1.13 (CUDA 11) #74

Open
invisifire opened this issue Jun 26, 2024 · 1 comment
Comments

@invisifire

Environment configuration
Environment A: CUDA 12.1, v0.2.0
Environment B: CUDA 11.8, v0.1.13

Hardware
Tested on a single A800 GPU

Model: Qwen 14B
Loaded on a single GPU with INT8 inference; environment variables configured as follows:
```shell
export CUDA_VISIBLE_DEVICES=1
export MODEL_TYPE=qwen_2
export ACT_TYPE=BF16
export WEIGHT_TYPE=INT8
export INT8_KV_CACHE=1
export MAX_SEQ_LEN=32000
export CONCURRENCY_LIMIT=50
export TOKENIZER_PATH="/data/models/Qwen1.5-14B-Chat"
export CHECKPOINT_PATH="/data/models/Qwen1.5-14B-Chat"
export START_PORT=8020
export KV_CACHE_MEM_MB=8000
export PP_SIZE=1
export TP_SIZE=1

python -m maga_transformer.start_server
```

Test data: 10 input tokens, 50 output tokens (very short sequence scenario)

(screenshot: benchmark results)

In testing, other models such as DeepSeek also show some degree of slowdown.
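To compare the two environments client-side, a minimal latency harness like the sketch below can replay the 10-in/50-out workload. It times an arbitrary `generate` callable, so it makes no assumption about the server's request format; the HTTP route and payload for the server started above depend on the deployed maga_transformer version and must be filled in by the user.

```python
import statistics
import time


def benchmark(generate, prompt, n_runs=20):
    """Time repeated calls to `generate` (a callable that sends one
    inference request and blocks until the response arrives) and
    return latency statistics in milliseconds."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate(prompt)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(latencies),
        "p50_ms": statistics.median(latencies),
        "max_ms": max(latencies),
    }


# Usage: wrap an HTTP call to the server launched above, e.g. with
# requests.post("http://localhost:8020/<route>", json={...}) -- the
# exact route and JSON schema are deployment-specific assumptions.
```

Running the same harness against both Environment A and Environment B on identical prompts makes the regression directly comparable.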

@invisifire
Author

Additional note: when loading, v0.2.0 emits many warnings that v0.1.13 does not, as shown below:
(screenshot: loading warnings)
