Releases: alibaba/rtp-llm
v0.2.0
We are releasing version 0.2.0 of rtp-llm, featuring several major updates:
- RPC mode for the scheduler
- a device backend implementation for models
- more quantization methods
rpc mode
RPC mode reimplements the inference scheduler in C++, eliminating the performance bottleneck of query batching.
To use RPC mode, start with the environment variable USE_RPC_MODEL=1.
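A minimal launch sketch (the maga_transformer.start_server entrypoint follows the project README; your command and model-specific variables may differ):

```bash
# assumed entrypoint from the rtp-llm README; model env vars omitted here
USE_RPC_MODEL=1 python3 -m maga_transformer.start_server
```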
device backend with fully managed GPU memory
The newly introduced device implementation preallocates all GPU memory and optimizes GPU memory usage.
To use the device backend, you must first enable RPC mode, then start with the environment variable USE_NEW_DEVICE_IMPL=1.
Set DEVICE_RESERVE_MEMORY_BYTES to change the number of bytes of GPU memory reserved for rtp-llm. A negative value reserves all available memory but leaves that many bytes free. The default is -134217728 (preallocate all GPU memory but leave 128MB free).
HOST_RESERVE_MEMORY_BYTES works the same way but reserves host memory, which improves framework performance. The default is 2GB.
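As an illustration, a launch enabling both features with tuned reservations might look like the following sketch (the entrypoint is assumed from the project README; model-specific variables are omitted):

```bash
export USE_RPC_MODEL=1                         # RPC mode is required by the new device backend
export USE_NEW_DEVICE_IMPL=1                   # enable the preallocating device backend
export DEVICE_RESERVE_MEMORY_BYTES=-268435456  # negative: preallocate all GPU memory but keep 256MB (256*1024*1024 bytes) free
export HOST_RESERVE_MEMORY_BYTES=4294967296    # reserve 4GB of host memory for the framework
python3 -m maga_transformer.start_server       # assumed entrypoint; see the project README
```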
quantization
SmoothQuant and OmniQuant are supported on LLaMA and Qwen models.
Using SmoothQuant requires a smoothquant.ini file under the checkpoint directory.
Using OmniQuant, GPTQ, or AWQ requires adding quantization fields to the config:
"quantization_config": {
"bits": 8,
"quant_method": "omni_quant"
}
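For GPTQ or AWQ checkpoints the same field names the method; a GPTQ fragment might look like this (values are illustrative and follow the Hugging Face quantization_config convention; method-specific fields such as group_size are omitted):

```json
"quantization_config": {
    "bits": 4,
    "quant_method": "gptq"
}
```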
All quantization methods now support GPUs from SM70 (Volta) onward.
other improvements
- GLM4, GLM4V, LLaVA-NeXT, and Qwen2 supported
- optimized performance on NVIDIA A100
v0.1.13
v0.1.12
v0.1.11
v0.1.10
feat
- sp supports TP (tensor parallelism)
- support the tie_word_embeddings option in HF config.json (see the snippet after this list)
- update transformers version to 4.39.3
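For reference, tie_word_embeddings is a top-level flag in the Hugging Face config.json; a minimal illustrative fragment:

```json
{
    "tie_word_embeddings": true
}
```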
refactor
- add logging for weight loading: LoRA apply success / missing weight
fix
- LoRA supports the case where one of the q/k/v weights is missing
docs
- add Quantization docs
v0.1.9
feat
- support AWQ
- remove the attention mask when FMHA is used
- support sparse and RoBERTa embeddings, and support computing similarity
refactor
- use asyncio.Future to avoid exclusive resource locking
- move the asyncio lock to asyncmodel
fix
- temporary fix for the filelock version
- MoE model size
- add headers for image downloading
- update whl version
- CUTLASS interface
docs
- update pipeline usage
v0.1.8
v0.1.7
features
- support int4 (experimental) on Qwen GPTQ
- support FMHA on V100
- support Bert
- optimize the ViT engine with TensorRT
refactor
- refactor the scheduling strategy: allocate KV cache when scheduling a new stream
- refactor MoE
docs
- update supported models