
Releases: alibaba/rtp-llm

v0.2.0

24 Jun 06:31

We are releasing version 0.2.0 of rtp-llm, featuring several major updates:

  • RPC mode for the scheduler
  • device backend implementation for models
  • more quantization methods

rpc mode

RPC mode refactors the inference scheduler in C++, eliminating the performance bottleneck of query batching.

To use RPC mode, start with the environment variable USE_RPC_MODEL=1.
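
For example, a minimal Python sketch of enabling RPC mode (assuming you set the variable in the process that launches the server; exporting it in the shell before startup works the same way):

import os

# Enable the C++ RPC scheduler; must be set before rtp-llm starts.
os.environ["USE_RPC_MODEL"] = "1"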

device backend with fully managed gpu memory

The newly introduced device implementation preallocates all GPU memory and optimizes GPU memory usage.

To use the device backend, you must enable RPC mode, then also set the environment variable USE_NEW_DEVICE_IMPL=1.

Set DEVICE_RESERVE_MEMORY_BYTES to change the number of bytes of GPU memory reserved for rtp-llm. A negative value means reserving all available memory while leaving that many bytes free. The default is -134217728 (preallocate all GPU memory but leave 128MB free).

HOST_RESERVE_MEMORY_BYTES works the same way but reserves host memory, which improves framework performance. The default is 2GB.
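
Putting the device-backend variables together, a minimal sketch using the documented defaults (adjust the reserve sizes to your hardware):

import os

os.environ["USE_RPC_MODEL"] = "1"        # the device backend requires RPC mode
os.environ["USE_NEW_DEVICE_IMPL"] = "1"  # enable the new device backend
# Negative value: preallocate all GPU memory but leave this many bytes free.
os.environ["DEVICE_RESERVE_MEMORY_BYTES"] = str(-128 * 1024 * 1024)  # default: 128MB free
# Host memory reserved for the framework; improves performance.
os.environ["HOST_RESERVE_MEMORY_BYTES"] = str(2 * 1024 * 1024 * 1024)  # default: 2GB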

quantization

SmoothQuant and OmniQuant are supported on LLaMA and Qwen models.

Using SmoothQuant requires a smoothquant.ini file under the checkpoint directory.

Using OmniQuant, GPTQ, or AWQ requires adding quantization fields to the config:

"quantization_config": {
    "bits": 8,
    "quant_method": "omni_quant"
}
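
The same field also covers GPTQ and AWQ checkpoints. A hypothetical 4-bit GPTQ example follows; the exact quant_method string for your checkpoint is an assumption here, based on common Hugging Face checkpoint conventions:

"quantization_config": {
    "bits": 4,
    "quant_method": "gptq"
}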

All quantization methods now support GPUs from SM70 onward.

other improvements

  • support GLM4, GLM4V, LLaVA-NeXT, and Qwen2
  • optimize performance on NVIDIA A100

v0.1.13

30 Apr 06:38

feat

  • support gte-Qwen1.5-7B-instruct
  • support Qwen1.5-MoE

fix

  • fix V100 performance
  • fix MULTI_TASK_PROMPT and MULTI_TASK_PROMPT_STR env
  • fix starcoder-7b load failure
  • fix llava renderer sep
  • fix split_k_factor

v0.1.12

21 Apr 11:08

feature:

  • support new models: llama3 / code-qwen2 / cohere

bug fix:

  • bloom weight loading error
  • temperature not taking effect

v0.1.11

12 Apr 09:50

fix

  • int4 tp issue

v0.1.10

07 Apr 15:05

feat

  • speculative decoding (sp) supports TP
  • support tie_word_embeddings option in hf config.json
  • update transformers version to 4.39.3

refactor

  • add logs for weight loading: lora apply success / missing weight

fix

  • lora: support the case where one of the q/k/v weights is missing

docs

  • add Quantization docs

v0.1.9

01 Apr 03:42

feat

  • support awq
  • move attention mask when using FMHA
  • support sparse & RoBERTa embeddings; support similarity calculation

refactor

  • use asyncio.Future to avoid resource exclusivity
  • move asyncio lock to asyncmodel

fix

  • temporarily pin filelock version
  • fix MoE model size
  • add headers for image downloading
  • update whl version
  • cutlass interface

docs

  • update pipeline usage

v0.1.8

25 Mar 13:32

feat

  • support qwen2 gptq
  • update multi_task_prompt creation
  • speculative decoding supports TP
  • support roberta

refactor

  • refactor multimodal model processing

fix

  • fix kv cache int8 bug: add dequantization method in reuse block scenario
  • fix stream output stop words
  • fix lora

v0.1.7

19 Mar 02:53

features

  • support int4 (experimental) on Qwen GPTQ
  • support V100 fmha
  • support Bert
  • optimize ViT engine with TensorRT

refactor

  • refactor scheduling strategy: allocate kv cache when scheduling a new stream
  • refactor MOE

docs

  • update supported models

v0.1.6

09 Mar 07:06

features

  • support starcoder2
  • support gemma

fixes

  • fix lora merge
  • fix num_return_sequences=1
  • fix query cancellation not releasing resources
  • fix tp block num sync
  • fix rotary embedding dim 64 for some models

v0.1.5

01 Mar 09:25

features

  • refactor a large amount of server code

fixes

  • fix inference server concurrency limit not decreasing
  • cancel request correctly when client disconnected
  • fix ptuning with separate path