Releases: alibaba/rtp-llm
v0.2.0
We are releasing version 0.2.0 of rtp-llm, featuring several major updates:
- RPC mode for the scheduler
- a device backend implementation for models
- more quantization methods
rpc mode
RPC mode reimplements the inference scheduler in C++, eliminating the performance bottleneck of query batching.
To use RPC mode, start with the environment variable USE_RPC_MODEL=1.
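A minimal launch sketch (the maga_transformer.start_server entrypoint follows the project README; your command and model-specific variables may differ):

```bash
# assumed entrypoint from the rtp-llm README; model env vars omitted here
USE_RPC_MODEL=1 python3 -m maga_transformer.start_server
```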
device backend with fully managed GPU memory
The newly introduced device implementation preallocates all GPU memory and optimizes GPU memory usage.
To use the device backend, you must first enable RPC mode, then start with the environment variable USE_NEW_DEVICE_IMPL=1.
Set DEVICE_RESERVE_MEMORY_BYTES to change the number of bytes of GPU memory reserved for rtp-llm. A negative value reserves all available memory but leaves that many bytes free. The default is -134217728 (preallocate all GPU memory but leave 128MB free).
HOST_RESERVE_MEMORY_BYTES works the same way but reserves host memory, which improves framework performance. The default is 2GB.
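As an illustration, a launch enabling both features with tuned reservations might look like the following sketch (the entrypoint is assumed from the project README; model-specific variables are omitted):

```bash
export USE_RPC_MODEL=1                         # RPC mode is required by the new device backend
export USE_NEW_DEVICE_IMPL=1                   # enable the preallocating device backend
export DEVICE_RESERVE_MEMORY_BYTES=-268435456  # negative: preallocate all GPU memory but keep 256MB (256*1024*1024 bytes) free
export HOST_RESERVE_MEMORY_BYTES=4294967296    # reserve 4GB of host memory for the framework
python3 -m maga_transformer.start_server       # assumed entrypoint; see the project README
```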
quantization
SmoothQuant and OmniQuant are supported on LLaMA and Qwen models.
Using SmoothQuant requires a smoothquant.ini file under the checkpoint directory.
Using OmniQuant, GPTQ, or AWQ requires adding quantization fields to the config:
"quantization_config": {
"bits": 8,
"quant_method": "omni_quant"
}
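For GPTQ or AWQ checkpoints the same field names the method; a GPTQ fragment might look like this (values are illustrative and follow the Hugging Face quantization_config convention; method-specific fields such as group_size are omitted):

```json
"quantization_config": {
    "bits": 4,
    "quant_method": "gptq"
}
```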
All quantization methods now support GPUs from SM70 (Volta) onward.
other improvements
- GLM4, GLM4V, LLaVA-NeXT, and Qwen2 supported
- optimized performance on NVIDIA A100
v0.1.13
v0.1.12
v0.1.11
v0.1.10
feat
- sp supports TP (tensor parallelism)
- support the tie_word_embeddings option in HF config.json (see the snippet after this list)
- update transformers version to 4.39.3
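For reference, tie_word_embeddings is a top-level flag in the Hugging Face config.json; a minimal illustrative fragment:

```json
{
    "tie_word_embeddings": true
}
```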
refactor
- add logging for weight loading: LoRA apply success / missing weight
fix
- LoRA supports the case where one of the q/k/v weights is missing
docs
- add Quantization docs
v0.1.9
feat
- support AWQ
- remove the attention mask when FMHA is used
- support sparse and RoBERTa embeddings, and support computing similarity
refactor
- use asyncio.Future to avoid exclusive resource locking
- move the asyncio lock to asyncmodel
fix
- temporary fix for the filelock version
- MoE model size
- add headers for image downloading
- update whl version
- CUTLASS interface
docs
- update pipeline usage
v0.1.8
v0.1.7
features
- support int4 (experimental) on Qwen GPTQ
- support FMHA on V100
- support Bert
- optimize the ViT engine with TensorRT
refactor
- refactor the scheduling strategy: allocate KV cache when scheduling a new stream
- refactor MoE
docs
- update supported models