Releases · deepjavalibrary/djl-serving
DJLServing v0.29.0 Release
Key Features
Details regarding the latest LMI container image_uris can be found here
DJL Serving Changes (applicable to all containers)
- Allows configuring health checks to fail based on various types of error rates
- When not streaming responses, all invocation errors will respond with the appropriate 4xx or 5xx HTTP response code
- Previously, for some inference backends (vllm, lmi-dist, tensorrt-llm) the behavior was to return 2xx HTTP responses when errors occurred during inference
- HTTP Response Codes are now configurable if you require a specific 4xx or 5xx status to be returned in certain situations
- Introduced the @input_formatter and @output_formatter annotations to bring your own script for pre- and post-processing.
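A minimal sketch of a custom script using these annotations, assuming the decorators are importable from the djl_python package; the module paths, function signatures, and attribute access shown here are illustrative assumptions rather than the documented API:

```python
# model.py shipped alongside the model artifacts -- a minimal, illustrative sketch.
# The import paths and the exact RequestOutput attributes are assumptions; check
# the LMI input/output schema documentation for the authoritative API.
import json

from djl_python.input_formatter import input_formatter
from djl_python.output_formatter import output_formatter


@input_formatter
def parse_custom_schema(raw_input, **kwargs):
    # Map a custom request body onto the prompt/parameters the handler expects.
    # The shape of the returned object is an assumption for illustration only.
    return {
        "inputs": raw_input["prompt"],
        "parameters": {"max_new_tokens": raw_input.get("max_tokens", 128)},
    }


@output_formatter
def format_custom_schema(request_output) -> str:
    # RequestOutput carries the generated tokens, log probabilities, and finish
    # reason; serialize whatever subset of it your clients need. str() is used
    # here only as a placeholder for the real attribute access.
    return json.dumps({"generated": str(request_output)}) + "\n"
```

Because the decorated functions live in your own script, you only override pre- and post-processing; the backend's handler continues to run inference as usual.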
LMI Container (vllm, lmi-dist)
- vLLM updated to version 0.5.3.post1
- Added MultiModal support for Vision Language Models using the OpenAI Chat Completions Schema (see the example request after this list).
- More details available here
- Supports Llama 3.1 models
- Supports beam search, best_of, and n with non-streaming output.
- Supports chunked prefill in both vllm and lmi-dist.
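A minimal sketch of such a multimodal request, assuming a local DJL Serving endpoint; the endpoint path, image URL, and parameter values are placeholders rather than documented defaults (on SageMaker the same payload goes through invoke-endpoint instead):

```python
# Illustrative multimodal request following the OpenAI Chat Completions schema.
import requests

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 256,
}
resp = requests.post("http://localhost:8080/invocations", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```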
TensorRT-LLM Container
- TensorRT-LLM updated to version 0.11.0
- [Breaking change] Flan-T5 is now supported with the C++ Triton backend. Removed Flan-T5 support for the TRTLLM Python backend.
Transformers NeuronX Container
- Upgraded to Transformers NeuronX 2.19.1
Text Embedding (using the LMI container)
- Various performance improvements
Enhancements
- best_of support in output formatter by @sindhuvahinis in #1992
- Convert cuda env tgi variables to lmi by @sindhuvahinis in #2013
- [serving] Update python typing to support py38 by @tosterberg in #2034
- Refactor lmi_dist and vllm to support best_of with RequestOutput by @sindhuvahinis in #2011
- [metrics] Improve prometheus metric type handling by @frankfliu in #2039
- [serving] Fail ping if error rate exceed by @frankfliu in #2040
- [serving] Update default max worker to 1 for GPU by @xyang16 in #2048
- [Serving] Implement SageMaker Secure Mode & support for multiple data sources by @ethnzhng in #2042
- AutoAWQ Integration Script by @a-ys in #2038
- [secure-mode] Refactor secure mode plugin by @frankfliu in #2058
- [vllm, lmi-dist] add support for top_n_tokens by @sindhuvahinis in #2051
- [python] refactor rolling batch inference method by @sindhuvahinis in #2090
- Record telemetry including acceptance rate by @zachgk in #2088
- bump up trtllm to 0.10.0 by @ydm-amazon in #2043
- [fix] Set tokenizer on output_formatter for TRT-LLM Handlers by @maaquib in #2100
- [dockerfile] pin datasets to 2.19.1 in trtllm by @sindhuvahinis in #2104
- [serving] make http response codes configurable for exception cases by @siddvenk in #2114
- update flags to prevent deprecation by @lanking520 in #2118
- [docker] Update DJL to 0.29.0-SNAPSHOT by @frankfliu in #2119
- [awscurl] Handles Bedrock special url case by @frankfliu in #2120
- [secure-mode] Add properties allowlist validation by @ethnzhng in #2129
- [secure-mode] add per-model configs to allowlist by @ethnzhng in #2132
- [docker] remove tensorflow native from cpu-full image by @frankfliu in #2136
- [onnx] Allows to customize onnxruntime optimization level by @frankfliu in #2137
- [python] add support for 3p use-case by @siddvenk in #2122
- [python] move parse input functions to input_parser.py by @sindhuvahinis in #2092
- [python] log exception stacktrace for exceptions in python, improve r… by @siddvenk in #2142
- [engine] include lmi recommended entrypoint when model.py exists by @sindhuvahinis in #2148
- [3p][python]add metering and error details to 3p outputs by @siddvenk in #2143
- Support multi node for lmi-dist by @xyang16 in #2125
- [python] refactor input parser to support Request by @sindhuvahinis in #2145
- [python] add max_logprobs vllm configuration to EngineArgs by @sindhuvahinis in #2154
- [python] parse input only when new requests are received by @sindhuvahinis in #2155
- [lmi] remove redundant auto logic from python handler by @siddvenk in #2152
- [python] support multimodal models openai api in vllm by @sindhuvahinis in #2147
- Add stronger typing for chat completions use-cases by @siddvenk in #2161
- [awscurl] Supports full jsonquery syntax by @frankfliu in #2163
- [Neo] Neo compilation/quantization script bugfixes by @a-ys in #2115
- [docker] bump neuron to 2.19 SDK by @tosterberg in #2160
- [python] add input formatter decorator by @sindhuvahinis in #2158
- [docker] bump neuron vllm to 5.0 by @tosterberg in #2169
- [lmi] Upgrade lmi dockerfile for 0.29.0 release by @maaquib in #2156
- add 0.5.1 supported models by @lanking520 in #2151
- [python] update max_logprobs default for vllm 0.5.1 by @sindhuvahinis in #2159
- [wlm] Trim whitespce for model_id by @frankfliu in #2175
- [fix] optimum update stable diffusion support by @tosterberg in #2179
- [serving][python] Support non 200 HTTP response codes for non-streami… by @siddvenk in #2173
- [awscurl] Includes input data in output file by @frankfliu in #2184
- [multimodal] support specifying image_token, infering default if not … by @siddvenk in #2183
- [Neo] Refactor Neo TRT-LLM partition script by @ethnzhng in #2166
- [Partiton] Don't output option.parallel_loading when partitioning by @a-ys in #2189
- Introduce pipeline parallel degree config by @nikhil-sk in #2171
- add trtllm container update by @lanking520 in #2191
- [serving] Adds mutliple node cluster configuration support by @frankfliu in #2190
- [aot] fix aot partition args, add pipeline parallel by @tosterberg in #2196
- bump up bench to 0.29.0 by @ydm-amazon in #2199
- [serving] Download model while initialize multi-node cluster by @frankfliu in #2198
- [lmi] Dependencies upgrade for 0.29.0 by @maaquib in #2194
- [chat][lmi] use generation prompt in tokenizer to avoid bot prompt re… by @siddvenk in #2195
- [lmi] support multimodal in lmi-dist by @siddvenk in #2182
- Revert "Update max_model_len for llama-3 lora test" by @lanking520 in #2207
- [wlm] Fixes retrieve config.json error by @frankfliu in #2212
- determine bedrock usage based on explicit property rather than inferr… by @siddvenk in #2214
- [post-7/22]Add chunked prefill support in vllm and lmi-dist by @rohithkrn in #2202
- lazy compute input ids by @lanking520 in #2216
- [docker] neuron bump to 2.19.1 by @tosterberg in https://...
v0.28.0
Key Features
Check out our latest Large Model Inference Containers.
LMI container
- Provided general performance optimization.
- Added text embedding support
- Our text embedding solution is 5% faster than the HF TEI solution.
- The Multi-LoRA feature now supports Llama 3 and AWS models
TensorRT-LLM container
- Upgraded to TensorRT-LLM 0.9.0
- AWQ, FP8 support for Llama3 models on G6/P5 machines
- The default max_new_tokens is now 16384
- Fixed critical memory leaks during long runs.
- Fixed model hanging issues.
Transformers NeuronX container
- Upgraded to Transformers NeuronX 2.18.2
DeepSpeed container (deprecated)
The DeepSpeed container is now deprecated. If you are not using the DeepSpeed engine, all you need is the 0.28.0-lmi container; switch to it and continue as before.
New Model Support
- LMI container
- Arctic, DBRX, Falcon 2, Command-R, InternLM2, Phi-3, Qwen2MoE, StableLM, StarCoder2, Xverse, and Jais
- TensorRT-LLM container
- Gemma
CX Usability Enhancements/Changes
- Model loading CX:
- SERVING_LOAD_MODELS env is deprecated, use HF_MODEL_ID instead.
- Inference CX:
- Input/Output schema changes:
- Speculative decoding is now streamed, returning multiple jsonlines tokens at each generation step
- Standardized the output formatter signature:
- We reduced the parameters of output_formatter by introducing the RequestOutput class.
- RequestOutput contains all input information, such as text, token_ids, and parameters, as well as output information such as output tokens, log probabilities, and other details like the finish reason. Check this doc to know more.
- Introduced prompt details in the details field of the response for the vLLM and lmi-dist rolling batch options. These prompt details contain the prompt token_ids, their corresponding text, and their log probabilities. Check this doc to know more.
- New error handling mechanism:
- Improved our error handling for container responses for rolling batch. Check this doc to know more
- New CX capability:
- We introduced the OPTION_TGI_COMPAT env var, which enables you to get the same response format as TGI.
- We also now support SSE text/event-stream data format.
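A minimal sketch of how a client can consume these two additions together, assuming a container started with OPTION_TGI_COMPAT=true and a model deployed via HF_MODEL_ID; the endpoint path and payload field names follow the usual LMI schema but should be verified against the docs referenced above:

```python
# Illustrative streaming client. With SSE enabled, the body arrives as
# text/event-stream "data: ..." lines; with OPTION_TGI_COMPAT=true the
# events follow the TGI-compatible format.
import json
import requests

payload = {
    "inputs": "What is Deep Java Library?",
    "parameters": {"max_new_tokens": 128},
    "stream": True,
}
with requests.post("http://localhost:8080/invocations", json=payload,
                   stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            print(json.loads(line[len(b"data:"):].strip()))
```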
Breaking Changes
- Inference CX for rolling batch:
- Token id changed from a list to an integer in the rolling batch response.
- Error handling: “finish_reason: error” during rolling batch inference
- DeepSpeed container has been deprecated; its functionality is generally available in the LMI container now
Known Issues
- TRTLLM periodically crashes during model compilation when using A100 GPU
- TRTLLM AWQ quantization currently crashes due to an internal error
Enhancements
- [serving] Avoid unecessary copy plugin jars. by @frankfliu in #1793
- [serving] Refactor rolling batch detection logic by @frankfliu in #1781
- [serving] Uses Utils.openUrl() to download files by @frankfliu in #1810
- [docker] Install onnxruntime in docker by @frankfliu in #1790
- [lmi] infer entry point for lmi models in LmiConfigRecommender by @siddvenk in #1779
- [trt][lmi] always set mpi mode for trt container by @siddvenk in #1803
- [awscurl] Support TGI response with coral streaming by @frankfliu in #1769
- [awscurl] Fix output for server sent event content-type by @frankfliu in #1784
- [tnx] installing tiktoken and blobfile for TNX by @lanking520 in #1762
- [tnx] upgrade neuron sdk to 2.18.1 by @tosterberg in #1765
- [tnx] add torchvision for resnet tests by @tosterberg in #1768
- [tnx] adding additional neuron config options by @tosterberg in #1777
- [tnx] add vllm to container by @tosterberg in #1786
- update lora test to tp=max and variable prompt lengths by @rohithkrn in #1766
- [Neo] Add more logging to Neo Neuron partition script by @a-ys in #1802
- Migrate to pydantic v2 by @sindhuvahinis in #1764
- replace peft with open source version by @lanking520 in #1813
- use corretto java by @lanking520 in #1818
- [Java][DLC] try with CA cert update by @lanking520 in #1820
- upgrade trtllm to 0.9.0 by @lanking520 in #1795
- Remove libnccl-dev installation in trtllm container by @nskool in #1822
- [tnx] version bump Neuron SDK and Optimum by @tosterberg in #1826
- [serving] Support set arguments via env var by @frankfliu in #1829
- Creates initial benchmark suite by @zachgk in #1831
- Resolve protected_namespaces warning for pydantic by @sindhuvahinis in #1834
- [chat] Remove unused parameters by @xyang16 in #1835
- Updated s3url gpt4all lora adapter_config.json by @sindhuvahinis in #1836
- Refactor parse_input for handlers by @sindhuvahinis in #1788
- [awscurl] Adds seed and test duration by @frankfliu in #1844
- [djl-bench] Refactor benchmark for arbitary inputs by @frankfliu in #1845
- [djl-bench] Uses Shape.parseShapes() by @frankfliu in #1849
- [Neo] Add JumpStart Integration to SM Neo Neuron AOT compilation flow by @a-ys in #1854
- update container for LMI by @lanking520 in #1814
- [tnx] Default to generation.config generation settings by @tosterberg in #1894
- [tnx] Read-only Neuron cache workaround for SM by @a-ys in #1879
- [tnx] add greedy speculative decoding by @tosterberg in #1902
- [awscurl] Adds extra parameters to dataset by @frankfliu in #1862
- [awscurl] allow write to json file by @lanking520 in #1859
- [DLC] update deps and wheel by @lanking520 in #1860
- Stop the model server when model download fails by @sindhuvahinis in #1842
- [vllm, lmi-dist] Support input log_probs by @sindhuvahinis in #1861
- Use s3 models for llama gptq by @rohithkrn in #1872
- [serving] Avoid using tab in logging by @frankfliu in #1871
- Add placeholder workflow by @ydm-amazon in #1874
- [T5] TRTLLM python repo model check by @sindhuvahinis in #1870
- add TGI compat feature for rollingbatch by @lanking520 in #1866
- Add better token handling under hazardous condition by @lanking520 in #1875
- Update TRT-LLM Container to build with Triton 24.04 by @nskool in #1878
- Remove docker daemon restarts from p4d workflow by @nskool in #1882
- [RollingBatch] allow appending for token by @lanking520 in #1885
- [RollingBatch] remove pending requests by @lanking520 in #1884
- [awscurl] Allows override stream parameter for dataset by @frankfliu in #1895
- allow TGI compat to work with output token ids by @lanking520 in #1900
- Add max num tokens workflow by @ydm-amazon in #1899
- Convert huggingface model to onnx by @xyang16 in #1888
- Improve the artifact...
DJLServing v0.27.0 Release
Key Changes
- Large Model Inference Containers 0.27.0 release
- DeepSpeed container
- Added DBRX and Gemma model support.
- Provided general performance optimization.
- Added new performance enhancing features support like Speculative Decoding.
- TensorRT-LLM container
- Upgraded to TensorRT-LLM 0.8.0
- Transformers NeuronX container
- Upgraded to Transformers NeuronX 2.18.0
- Multi-Adapter LoRA Support
- Provided multi-adapter inference functionality in LMI DLC.
- CX Usability Enhancements
- Provided a seamless migration experience across different LMI DLCs.
- Implemented the Low code No code experience.
- Supported OpenAI compatible chat completions API.
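A minimal sketch of what a chat completions request against a DJL Serving endpoint can look like, assuming a local deployment; the endpoint path and parameter values are placeholders, and on SageMaker the same payload is sent through invoke-endpoint:

```python
# Illustrative request using the OpenAI-compatible chat completions schema.
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what DJL Serving does."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8080/invocations", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```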
Enhancement
- [LMIDist] Allow passing in ignore_eos_token param by @xyang16 in #1489
- [LMIDist] Make ignore_eos_token default to false by @xyang16 in #1492
- Makes netty buffer size configurable by @frankfliu in #1494
- translate few vllm and trtllm params to HF format by @sindhuvahinis in #1500
- Align properties parsing to be similar to java by @ydm-amazon in #1502
- check for rolling batch "disable" value by @sindhuvahinis in #1506
- add max model length support on vLLM by @lanking520 in #1510
- Creates auto increment ID for models by @zachgk in #1109
- making default dtype to fp16 for compilation by @lanking520 in #1512
- Use server-provided seed if not in request params in deepspeed handler by @davidthomas426 in #1520
- [Neuron][WIP] add log probs calulation in Neuron by @lanking520 in #1516
- remove a bit of unused code by @ydm-amazon in #1536
- Remove unused device parameter from all rolling batch classes by @ydm-amazon in #1538
- dump OPTION_ env vars at startup by @siddvenk in #1541
- [vLLM] add enforce eager as an option by @lanking520 in #1547
- [awscurl] Prints invaild response by @frankfliu in #1550
- roll back enum changes by @ydm-amazon in #1551
- [awscurl] Refactor time to first byte calculation by @frankfliu in #1557
- [serving] Allows configure log level at runtime by @frankfliu in #1560
- Stream returns only after putting placeholder finishes by @zachgk in #1567
- type hints and comments for trtllm handler by @ydm-amazon in #1558
- [vLLM] add speculative configs by @lanking520 in #1553
- Add rolling batch type hints: Part 1 by @ydm-amazon in #1564
- Remove adapters preview flag by @zachgk in #1573
- [serving] Adds model event listener by @frankfliu in #1570
- [serving] Add model loading metrics by @frankfliu in #1576
- Speculative decoding in LMI-Dist by @KexinFeng in #1505
- type hints for scheduler rolling batch by @ydm-amazon in #1577
- [serving] Uses model.intProperty() api by @frankfliu in #1582
- [serving] Ignore CUDA OOM when collecting metrics by @frankfliu in #1581
- Remove test as the model is incompatible with transformers upgrade by @rohithkrn in #1575
- [serving] Adds rolling batch metrics by @frankfliu in #1583
- [serving] Uses dimension for model metric by @frankfliu in #1587
- rolling batch type hints part 3 by @ydm-amazon in #1584
- [SD][vLLM] record acceptance by @lanking520 in #1586
- [serving] Adds promtheus metrics support by @frankfliu in #1593
- [feat] Benchmark code for speculative decoding in lmi-dist by @KexinFeng in #1591
- [lmi] add generated token count to details by @siddvenk in #1600
- [console] use StandardCharset instead of deprecated Charset by @siddvenk in #1601
- [awscurl] add download steps to README.md by @siddvenk in #1605
- [lmi][deprecated] remove option.s3url since it has been deprecated fo… by @siddvenk in #1610
- [serving] Skip testPrometheusMetrics when run in IDE by @frankfliu in #1611
- Use workflow template for workflow model_dir by @zachgk in #1612
- [Partition] Remove redudant model splitting, Improve Input Model Parsing by @a-ys in #1609
- Add handler for new lmi-dist by @rohithkrn in #1595
- [lmi] add parameter to allow full text including prompt to be returne… by @siddvenk in #1602
- support cuda driver on sagemaker by @lanking520 in #1618
- remove checker for awq with enforce eager by @lanking520 in #1620
- Add pytorch-gpu for security patching by @maaquib in #1621
- Refactor vllm and rubikon engine rolling batch by @rohithkrn in #1623
- Update TRT-LLM Dockerfile for v0.8.0 by @nskool in #1622
- [UX] sampling with vllm by @sindhuvahinis in #1624
- [vLLM] reduce speculative decoding gpu util to leave room for draft model by @lanking520 in #1628
- [lmi] update auto engine logic for vllm and lmi-dist by @siddvenk in #1617
- [python] Encode error in single line for jsonlines case. by @frankfliu in #1630
- Single model adapter API by @zachgk in #1616
- remove all current no-code test cases by @siddvenk in #1635
- Update the build script to use vLLM 0.3.3 by @lanking520 in #1637
- Update lmi-dist rolling batch to use rubikon engine by @rohithkrn in #1639
- Adds adapter registration options by @zachgk in #1634
- Supports vLLM LoRA adapters by @zachgk in #1633
- add customer required field by @lanking520 in #1640
- [tnx] bump optimum version by @tosterberg in #1632
- Updates dependencies version to latest by @frankfliu in #1647
- updated dependencies for LMI by @lanking520 in #1648
- [DO NOT MERGE][CAN APPROVE]change flash attn url by @lanking520 in #1650
- [cache] Remove gson from fatjar of cache by @frankfliu in #1649
- [python] Move output formatter to request level by @xyang16 in #1644
- [tnx] improve model partitioning time by @tosterberg in #1652
- [tnx] support codellama 70b instruct tokenizer by @tosterberg in #1653
- [python] Remove output_formatter from vllm and lmi-dist sampling para… by @xyang16 in #1654
- [wlm] Makes generateHuggingFaceConfigUri public by @frankfliu in #1656
- [tnx] fix output formatter as param implementation by @tosterberg in #1657
- [lmi] use hf token to get model config for gated/private models by @siddvenk in #1658
- [UX] Changing some default parameters by @sindhuvahinis in #1659
- add parameters to part of the field by @lanking520 in https://github.com/deepja...
DJLServing v0.26.0 Release
Key Changes
- TensorRT-LLM 0.7.1 Upgrade, including support for Mixtral 8x7B MOE model
- Optimum Neuron Support
- Transformers-NeuronX 2.16 Upgrade, including support for continuous batching
- LlamaCPP support
- Many Documentation updates with updated model deployment configurations
- Refactor of configuration management across different backends
- CUDA 12.1 support for DeepSpeed and TensorRT-LLM containers
Enhancements
- [UX][RollingBatch] add details function to the rolling batch by @lanking520 in #1353
- [TRTLLM][UX] add trtllm changes to support stop reason and also log prob by @lanking520 in #1355
- [Docker] upgrade cuda 12.1 support for DJLServing by @lanking520 in #1370
- [feat] optimum handler creation by @tosterberg in #1362
- [python] Update lmi_dist warmup logic by @xyang16 in #1367
- [RollingBatch] optimize rolling batch result by @lanking520 in #1372
- [python] Sets mpi_model property for python to consume by @frankfliu in #1360
- [vLLM] add load_format to support for mixtral model by @lanking520 in #1391
- [python] Sets rolling batch threads as daemon thread by @frankfliu in #1371
- [awscurl] Adds awscurl to repo by @frankfliu in #1408
- Add config passing in lmi-dist by @xyang16 in #1382
- Upgrade flash attention to 2.3.0 by @xyang16 in #1402
- [TRTLLM] Bump up trtllm to version 0.7.1 by @ydm-amazon in #1452
- [tnx] add gqa to properties by @siddvenk in #1478
- [TRTLLM] add enable kv cache reuse by @lanking520 in #1460
- [serving] Adds llama.cpp support by @frankfliu in #1464
- [serving] Allows plugin to override default HTTP handler by @frankfliu in #1424
- [wlm] enable max workers env var for MPI mode by @frankfliu in #1438
- Support AWQ quantization in LMI Dist by @xyang16 in #1435
- [python] Excludes test code from jar by @frankfliu in #1449
- [Refactor][UX] Refactoring vllm rolling batch properties by @sindhuvahinis in #1369
- [DLC][TNX] inf2 stable diffusion handler refactor by @tosterberg in #1393
- [Refactor] lmi dist rolling batch properties by @sindhuvahinis in #1409
- [Refactor] scheduler rolling batch refactor by @sindhuvahinis in #1411
- [awscurl] Allows search nested json key by @frankfliu in #1453
- make jsonline outputs generated tokens by @lanking520 in #1454
- [serving] Loads model zoo and engine from deps folder on startup by @frankfliu in #1457
- [RollingBatch] add customized rollingbatch by @lanking520 in #1468
Bug Fixes
- Fix rolling batch properties by @xyang16 in #1326
- [fix] tnx quantization and docs by @tosterberg in #1332
- [fix] Context length estimate datatype by @sindhuvahinis in #1350
- [UX][CI] fix a few bugs by @lanking520 in #1357
- [fix] inf2 container freeze compiler versions by @tosterberg in #1389
- [Fix] fix the lmi dist device by @sindhuvahinis in #1387
- [vllm] pass hf revision to vllm engine, pin phi2 model revision for test by @siddvenk in #1485
- [python] Fixes mpi_mode properties by @frankfliu in #1368
- [python] Fixes mpi_mode issues by @frankfliu in #1373
- [wlm] Fixes get maxWorkers bug for python engine by @frankfliu in #1375
- [RollingBatch] fix request id in rolling batch by @lanking520 in #1481
- [TRTLLM] Fix bug in handler by @ydm-amazon in #1459
- Works with manual initialization by @zachgk in #1473
- [TNX] version update to 2.16.0 sdk and continuous batching by @tosterberg in #1437
Documentation Updates
- [doc] Update current properties for TNX handler by @tosterberg in #1322
- [doc] lmi configurations readme by @sindhuvahinis in #1323
- [doc] Placeholder for TrtLLM tutorial and tuning guide by @sindhuvahinis in #1333
- [doc] LMI environment variable instruction by @sindhuvahinis in #1334
- [doc] TransformerNeuronX tuning guide by @sindhuvahinis in #1335
- [doc] TensorRt-Llm tuning guide by @sindhuvahinis in #1339
- [doc] Updating new TensorRT-LLM configurations by @sindhuvahinis in #1340
- [doc] DeepSpeed tuning guide by @sindhuvahinis in #1342
- [doc] LMI dist tuning guide by @sindhuvahinis in #1341
- [doc] seq_scheduler_document by @KexinFeng in #1336
- [doc] large model inference document by @sindhuvahinis in #1343
- [doc] fix docker image uri for trtllm tutorial by @sindhuvahinis in #1348
- [docs] Adds option.max_output_size document by @frankfliu in #1354
- [docs] fix tnx n_positions description by @tosterberg in #1401
- [doc] instruction on adding new properties to default handlers by @sindhuvahinis in #1419
- Add AOT Tutorial by @ydm-amazon in #1338
- [docker] Avoid JVM consume GPU memory by @frankfliu in #1365
- [LMI] DJLServing side placeholder by @lanking520 in #1330
- [Tutorials] add tensorrt llm manual by @lanking520 in #1412
- [TRTLLM] add line in docs for chatglm by @ydm-amazon in #1425
- Update LMI dist tuning guide by @xyang16 in #1428
- [TRTLLM] update the docs and more model support by @lanking520 in #1415
- [TRT-LLM] Update docs for newly added TRT-LLM build args in 0.7.1 by @rohithkrn in #1461
- [TNX][config] update rolling batch batch size behavior and docs by @tosterberg in #1404
- [TRTLLM] Update the docs - add mixtral by @ydm-amazon in #1434
- [TRTLLM] Add gpt model to docs and ci by @ydm-amazon in #1475
CI/CD Updates
- [tnx] version bump to 2.15.2 by @tosterberg in #1363
- [CI][IB] Support variables by @zachgk in #1356
- Bump up DJL version to 0.26.0 by @xyang16 in #1364
- [ci] Fixes nightly gpu integration test by @frankfliu in #1378
- [CI] update the model to fp16 by @lanking520 in #1390
- update models for TRT-LLM 0.6.1 by @rohithkrn in #1392
- [CI][fix] Sagemaker integration test cloudwatch metrics fix by @sindhuvahinis in #1385
- [CI][fix] Inf2 AOT integration test fix by @tosterberg in #1395
- [ci] Fixes flaky async token test by @frankfliu in #1429
- [ci] Fixes merge conflict issue by @frankfliu in #1431
- [ci] Upgrades CI to use JDK 17 by @frankfliu in #1413
- [CI][fix] remove g5xl and introduce rolling batch in lmic by @sindhuvahinis in htt...
DJLServing v0.25.0 Release
Key Changes
- TensorRT LLM Integration. DJLServing now supports using the TensorRT LLM backend to deploy Large Language Models.
- See the documentation here
- Llama2-13b using TRTLLM example notebook
- SmoothQuant support in DeepSpeed (see the example configuration after this list)
- Llama2-13b using SmoothQuant with DeepSpeed example notebook
- Rolling batch support in DeepSpeed to boost throughput
- Updated Documentation on using DJLServing to deploy LLMs
- We have added documentation for supported configurations per container, as well as many new examples
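The SmoothQuant path above is configured through serving.properties. A minimal sketch for the DeepSpeed container, where the model id is a placeholder and the quantization option names are assumptions based on the LMI DeepSpeed guide of this release (verify the exact keys against the documentation linked above):

```
engine=DeepSpeed
# placeholder model id
option.model_id=TheBloke/Llama-2-13B-fp16
option.tensor_parallel_degree=4
option.dtype=fp16
# assumed option names for SmoothQuant-based int8 quantization
option.quantize=smoothquant
option.smoothquant_alpha=0.65
```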
Enhancements
- Add context length estimate for Neuron handler by @lanking520 in #1184
- [INF2] allow neuron to load split model directly by @lanking520 in #1186
- Adding INF2 (transformers-neuronx) compilation latencies to SageMaker Health Metrics by @Lokiiiiii in #1185
- [serving] Auto detect XGBoost engine with .xgb extension by @frankfliu in #1196
- add memory checking in place to identify max by @lanking520 in #1191
- [python] Do not set default value for truncate by @xyang16 in #1193
- Add aiccl support by @maaquib in #1179
- Setting default datatype for deepspeed handlers by @sindhuvahinis in #1203
- add trtllm container build by @lanking520 in #1215
- Add TRTLLM TRT build from our managed source by @lanking520 in #1199
- [python] Remove generation_dict in lmi_dist_rolling_batch by @xyang16 in #1217
- install s5cmd to trtllm by @lanking520 in #1219
- Update mpirun options by @xyang16 in #1220
- [python] Optimize batch serialization by @frankfliu in #1223
- upgrade vllm by @lanking520 in #1238
- Supports docker build with local .deb by @zachgk in #1231
- Do warmup in multiple requests by @xyang16 in #1216
- [python] Update PublisherBytesSupplier API by @frankfliu in #1242
- remove tensorrt installation by @lanking520 in #1243
- Use CUDA runtime image instead of CUDA devel. by @chen3933 in #1201
- remove unused components by @lanking520 in #1245
- [DeepSpeed DLC] separate container build with multi-layers by @lanking520 in #1246
- New PR for tensorrt llm by @ydm-amazon in #1240
- [python] Buffer tokens for rolling batch by @frankfliu in #1249
- Add trt-llm engine build step during model initialization by @rohithkrn in #1235
- [serving] Adds token latency metric by @frankfliu in #1251
- install trtllm toolkit by @lanking520 in #1254
- [TRTLLM] some clean up on trtllm handler by @lanking520 in #1248
- [TRTLLM] use tensorrt wheel by @lanking520 in #1255
- Adds versions as labels in dockerfiles by @zachgk in #1160
- [TRTLLM] add trtllm with no deps by @lanking520 in #1256
- [TRT partition] add realtime stream reader for the conversion script by @lanking520 in #1259
- [TRTLLM] always setting request output length by @lanking520 in #1258
- Update trtllm toolkit path by @rohithkrn in #1260
- allow gpu detection by @lanking520 in #1261
- add trtllm cuda-compat by @lanking520 in #1247
- [feat] Add serving.properties parameter for compiled graph path inf2 by @tosterberg in #1262
- Inf2 properties refactoring using pydantic by @sindhuvahinis in #1252
- MME - deviceId while creating workers by @sindhuvahinis in #1257
- [serving] Refactor TensorRT-LLM partition code by @frankfliu in #1267
- [DS] Deepspeed rolling batch support by @maaquib in #1295
- Allow user to pass in max_batch_prefill_tokens by @xyang16 in #1320
- add smoothquant as options by @lanking520 in #1285
- Deepspeed configurations refactoring by @sindhuvahinis in #1280
- update smoothquant arg by @rohithkrn in #1291
- [python] Adds do_sample support for trtllm by @frankfliu in #1290
- [wlm] Supports model_id point to a local directory by @frankfliu in #1276
- [SageMaker Galactus developer experience] model load integration to DJL serving by @haNa-meister in #1230
- [feat] Better output format from seq-scheduler by @KexinFeng in #1305
- [serving] Upgrades AWSSDK version to 2.21.19 by @frankfliu in #1313
- [serving] Uses seconds for ChunkedBytesSupplier timeout by @frankfliu in #1311
- install datasets in trtllm container by @rohithkrn in #1270
- TensorRrt Configs refactoring by @sindhuvahinis in #1275
- [TRTLLM] fix corner case that model_id point to local path by @lanking520 in #1317
- Huggingface configurations refactoring by @sindhuvahinis in #1283
- Calculate max_seq_length in warmup dynamically by @xyang16 in #1298
- Increase memory limit for rolling batch integration octocoder model by @xyang16 in #1319
- [TRTLLM] remove default repetition penalty by @lanking520 in #1321
- [feat] Expose max sparse params by @KexinFeng in #1273
- [NeuronX] add attention mask porting from optimum-neuron by @lanking520 in #1206
- [partition] extract properties files by @sindhuvahinis in #1293
- add checkpoint to ds properties by @sindhuvahinis in #1296
- [vllm] standardize input parameters by @frankfliu in #1301
- [TRTLLM] format better for logging by @lanking520 in #1309
- Change default top_k and temperature parameters in TRTLLM rolling batch by @ydm-amazon in #1312
- Add tokenizer check for triton repo by @rohithkrn in #1274
- [SageMaker Galactus developer experience] use python backend when schema is customized by @haNa-meister in #1286
Bug Fixes
- [bug fix] add entrypoint camel case recovery by @lanking520 in #1181
- Fix max tensor_parallel_degree by @zachgk in #1182
- Fix lmi_dist garbage output issue by @xyang16 in #1187
- [fix] update context estimate interface by @tosterberg in #1194
- Check logs for aiccl usage in integ test by @maaquib in #1202
- [serving] Revert management URI matching regex by @frankfliu in #1209
- Update datasets version in deepspeed.Dockerfile by @maaquib in #1211
- [console] Fixes bug for docker port mapping case by @frankfliu in https://github.com/deepjavalibrary/djl-ser...
DJLServing v0.24.0 release
Key Features
- Updates Components
- Updates Neuron to 2.14.1
- Updates DeepSpeed to 0.10.0
- Improved Python logging
- Improved SeqScheduler
- Adds DeepSpeed dynamic int8 quantization with SmoothQuant
- Supports Llama 2
- Supports Safetensors
- Adds Neuron dynamic batching and rolling batch
- Adds Adapter API Preview
- Supports HuggingFace Stopwords
Enhancement
- Allow overriding truncate parameter in request by @maaquib in #953
- Enable multi-gpu inference (device_map='auto') on seq_batch_scheduler by @KexinFeng in #960
- [wlm] Allows set defatul options with environment variable by @frankfliu in #961
- Enable MPI model by environment variable by @frankfliu in #964
- Add built-in json formatter by @frankfliu in #965
- [serving] Update tnx handler for 2.12 supported models by @tosterberg in #896
- [serving] Adds more built-in logging options by @frankfliu in #974
- Bump up DJL version to 0.24.0 by @frankfliu in #979
- [serving] Print out CUDA and Neuron device information by @frankfliu in #978
- [docker] bump transformers-neuronx for small llama-2 support by @tosterberg in #980
- [python] Update lmi-dist by @xyang16 in #975
- Install flash attention using wheel by @xyang16 in #982
- [python] Make paged attention configurable by @xyang16 in #986
- [python] Refactor lmi_dist rolling batch by @xyang16 in #987
- [docker] Upgrade to DJL 0.24.0 by @frankfliu in #989
- Set jsonlines formatter for lmi-dist rolling batch test by @xyang16 in #991
- Install FasterTransformer libs with llama support by @rohithkrn in #993
- Add trust_remote_code to ft handler by @siddvenk in #994
- [serving] Improves PyProcess lifecycle logging by @frankfliu in #996
- [python] Adds pid to python process log by @frankfliu in #997
- [python] Includes individual headers for server side batching by @frankfliu in #1001
- update ft python wheel with llama support by @rohithkrn in #1002
- [serving] Install commong-loggings dependency for XGBoost engine by @frankfliu in #1004
- [python] Finds optimal batch partition by @bryanktliu in #984
- add error handling for rolling batch by @lanking520 in #1005
- [serving] Allows print access log to console by @frankfliu in #1009
- [serving] Adds unregister model log by @frankfliu in #1010
- [python] validate each request in the batch by @frankfliu in #1008
- Update dependencies version by @frankfliu in #1012
- [serving] Return proper HTTP status code for each batch by @frankfliu in #1013
- [HF Streaming] use decode instead batch decode for streaming by @lanking520 in #1016
- [docker] disable TORCH_CUDNN_V8_API_DISABLED for PyTorch 2.0.1 by @frankfliu in #1018
- Allows set TENSOR_PARALLEL_DEGREE=max by @frankfliu in #1019
- Simplify handling of min/max workers by @zachgk in #1021
- [docker] Updates cache directory by @frankfliu in #1027
- [benchmark] Adds safetensors support by @frankfliu in #1031
- [VLLM] use more complex logic to ensure all result are captured by @lanking520 in #1035
- [VLLM] add option to set batched tokens by @lanking520 in #1036
- update inf2 dependencies to 2.13.1 by @lanking520 in #1044
- add data collection and some inf2 bug fixes by @lanking520 in #1047
- [RollingBatch] create request simulator to batch by @lanking520 in #1050
- [DeepSpeed] upgrade dependencies by @lanking520 in #1049
- [docker] Upgrades to inf2 2.13.2 version by @frankfliu in #1052
- add revision to handler by @lanking520 in #1056
- [docker] Change default OMP_NUM_THREADS back to 1 for GPU by @frankfliu in #1073
- Worker type by @zachgk in #1022
- [Handler] add dynamic batching to transformers neuronx by @lanking520 in #1076
- add Neuron RollingBatch implementation by @lanking520 in #1078
- [Neuron] upgrade to Neuron 2.14.0 SDK by @lanking520 in #1089
- [vLLM] add pyarrow dependency by @lanking520 in #1093
- [Handler] formalize all engines with same settings by @lanking520 in #1077
- Removes quick abort of python reader threads by @zachgk in #1095
- Adds adapter support by @zachgk in #1082
- Add unmerged lora support in HF handler by @rohithkrn in #1088
- Cleans some unused pieces of PyProcess by @zachgk in #1100
- Creates adapters by directory by @zachgk in #1094
- Use custom peft wheel by @rohithkrn in #1103
- [feature] Enable model sharding on seq_scheduler tested on gpt_neox_20B by @KexinFeng in #1086
- [vLLM] capture max_rolling_batch settting issues by @lanking520 in #1112
- [RollingBatch] add active requests and pending requests for skip tokens by @lanking520 in #1113
- Upgrade lmi_dist by @xyang16 in #1108
- [INF2][Handler] added optimization level per Neuron instruction by @lanking520 in #1107
- [Handler] add neuron int8 quantization by @lanking520 in #1115
- [Docker] upgrade dependencies version by @lanking520 in #1119
- Upgrade flash attention v2 version to 2.3.0 by @xyang16 in #1123
- [Handler] bump up vllm version and fix some bugs by @lanking520 in #1124
- Integrate with seq_scheduler wheel by @KexinFeng in #1122
- [INF2] remove neuron settings on cache hit for the folder by @lanking520 in #1126
- [python] Make rolling batch output not escape unicode characters by @xyang16 in #1135
- [vLLM][Handler] add quantization option for vLLM by @lanking520 in #1136
- [INF2][Handler] remove type conversion in Neuron by @lanking520 in #1134
- Update vllm_rolling_batch.py by @lanking520 in #1140
- Add support for stopwords in huggingface handler by @ydm-amazon in #1118
- Give a version of seq scheduler by @KexinFeng in #1146
- Support adapters by properties by @zachgk in #1148
- [serving] Allow model_id point to djl model zoo by @frankfliu in #1150
- Assert local lora models in the handler by @rohithkrn in #1153
- Block remote adapter url and handler override by @zachgk in #1147
- Add feature flag for adapters by @zachgk in #1152
- [feat] Modify deepspeed handler to support smoothQuant. by @chen3933 in https://github.com/deepjavalibrary/djl-servi...
DJLServing v0.23.0 release
Key Features
- Introduces Rolling Batch (see the example configuration after this list)
- SeqBatchScheduler with rolling batch #803
- Sampling SeqBatcher design #842
- Max Seqbatcher number threshold api #843
- Adds rolling batch support #828
- Max new length #845
- Rolling batch for huggingface handler #857
- Compute kv cache utility function #863
- Sampling decoding implementation #878
- Uses multinomial to choose from topK samples and improve topP sampling #891
- Falcon support #890
- Unit test with random seed failure #909
- KV cache support in default handler #929
- Introduces LMI Dist library for rolling batch
- Introduces vLLM library for rolling batch
- [VLLM] add vllm rolling batch and add hazard handling #877
- Introduces PEFT and LoRA support in handlers
- Introduces streaming support to FasterTransformer
- Add Streaming support #820
- Introduces S3 Cache Engine
- S3 Cache Engine #719
- Upgrades component versions:
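The rolling batch support introduced here is enabled through serving.properties. A minimal sketch for the vLLM backend, where the model id is a placeholder and the option names should be checked against the configuration docs of this release:

```
engine=Python
# placeholder model id
option.model_id=TheBloke/Llama-2-7B-fp16
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.max_rolling_batch_size=32
```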
Enhancement
Serving and python engine enhancements
- Adds workflow model loading for SageMaker #661
- Allows model being shared between workflows #665
- Prints out error message if pip install failed #666
- Install fixed version for transformers and accelerate #672
- Add numpy fix #674
- SM Training job changes for AOT #667
- Creates model dir to prevent issues with no code experience in SageMaker #675
- Don't mount model dir for no code tests #676
- AOT upload checkpoints tests #678
- Add stable diffusion support on INF2 #683
- Unset omp thread to prevent CLIP model delay #688
- Update ChunkedBytesSupplier API #692
- Fixes log file charset issue in management console #693
- Adds neuronx new feature for generation #694
- [INF2] adding clip model support #696
- [plugin] Include djl s3 extension in djl-serving distribution #699
- [INF2] add bf16 support to SD #700
- Adds support for streaming Seq2Seq models #698
- Add SageMaker MCE support #706
- [INF2] give better room for more tokens #710
- [INF2] Bump up n positions #713
- Refactor logic for supporting HF_MODEL_ID to support MME use case #712
- Support load model from workflow directory #714
- Add support for se2seq model loading in HF handler #715
- Load function from workflow directory #718
- Add vision components for DeepSpeed and inf2 #725
- Support pip install in offline mode #729
- Add --no-index to pip install in offline mode #731
- Adding llama model support #727
- Change the dependencies so for FasterTransformer #734
- Adds text/plain content-type support #741
- Skeleton structure for sequence batch scheduler #745
- Handles torch.cuda.OutOfMemoryError #749
- Improves model loading logging #750
- Asynchronous with PublisherBytesSupplier #730
- Renames env var DDB_TABLE_NAME to SERVING_DDB_TABLE_NAME #753
- Sets default minWorkers to 1 for GPU python model #755
- Fixes log message #765
- Adds more logs to LMI engine detection #766
- Uses predictable model name for HF model #771
- Adds parallel loading support for Python engine #770
- Updates management console UI: file input are not required in form data #773
- Sets default maxWorkers based on OMP_NUM_THREADS #776
- Support non-gpu models for huggingface #772
- Use huggingface standard generation for tnx streaming #778
- Add trust remote code option #781
- Handles invalid return type case #790
- Add application/jsonlines as content-type for streaming #791
- Fixes trust_remote_code issue #793
- Add einops for supporting falcon models #792
- Adds content-type response for DeepSpeed and FasterTransformer handler #797
- Sets default maxWorkers the same as earlier version #799
- Add stream generation for huggingface streamer #801
- Add server side batching #795
- Add safetensors #808
- Improvements in AOT UX #787
- Add pytorch kernel cache default directory #810
- Improves partition script error message #826
- Add -XX:-UseContainerSupport flag only for SageMaker #868
- Move TP detection logic to PyModel from LmiUtils #840
- Set tensor_parallel_degree property when not specified #847
- Add workflow dispatch #870
- Create model level virtualenv #811
- Refactor createVirtualEnv() #875
- Add MPI Engine as generic name for dist...
0.23.0-Alpha Release
This release solves several issues in the DJLServing library and also brings some new features.
- Supporting load from workflow directory #714
- Fixed MME support with HF_MODEL_ID #712
- Added parallel loading for python models #770
- Fixed device mismatch issue #805
And more
What's Changed
- [serving] Adds workflow model loading for SageMaker by @frankfliu in #661
- [workflow] Allows model being shared between workflows by @frankfliu in #665
- [python] prints out error message if pip install failed by @frankfliu in #666
- update to djl 0.23.0 by @siddvenk in #668
- [docker] Fixes fastertransformer docker file by @frankfliu in #671
- [kserve] Fixes unit test for extra data type by @frankfliu in #673
- install fixed version for transformers and accelerate by @lanking520 in #672
- [ci] add performance testing by @tosterberg in #558
- add numpy fix by @lanking520 in #674
- SM Training job changes for AOT by @sindhuvahinis in #667
- Create model dir to prevent issues with no code experience in SageMaker by @siddvenk in #675
- Don't mount model dir for no code tests by @siddvenk in #676
- AOT upload checkpoints tests by @sindhuvahinis in #678
- [INF2][DLC] Update Neuron to 2.10 by @lanking520 in #681
- add stable diffusion support on INF2 by @lanking520 in #683
- [CI] add small fixes by @lanking520 in #684
- Add HuggingFace TGI publish and test pipeline by @xyang16 in #650
- Add shared memory arg to docker launch command in README by @rohithkrn in #685
- Update github-slug-action to v4.4.1 by @xyang16 in #686
- unset omp thread to prevent CLIP model delay by @lanking520 in #688
- Change the bucket for different object by @sindhuvahinis in #691
- [ci] make performance tests run in parallel by @tosterberg in #690
- [api] Update ChunkedBytesSupplier API by @frankfliu in #692
- [console] Fixes log file charset issue by @frankfliu in #693
- add neuronx new feature for generation by @lanking520 in #694
- [tgi] Add more models to TGI test pipeline by @xyang16 in #695
- [INF2] adding clip model support by @lanking520 in #696
- [plugin] Include djl s3 extension in djl-serving distribution by @frankfliu in #699
- [INF2] add bf16 support to SD by @lanking520 in #700
- [ci] Upgrade spotbugs to 5.0.14 by @frankfliu in #704
- Add support for streaming Seq2Seq models by @rohithkrn in #698
- add SageMaker MCE support by @lanking520 in #706
- fix the device mapping issue if visible devices is set by @lanking520 in #707
- fix the start gpu bug by @lanking520 in #709
- [INF2] give better room for more tokens by @lanking520 in #710
- bump up n positions by @lanking520 in #713
- Refactor logic for supporting HF_MODEL_ID to support MME use case by @siddvenk in #712
- [ci] reconfigure performance test time and machines by @tosterberg in #711
- [workflow] Support load model from workflow directory by @frankfliu in #714
- Add support for se2seq model loading in HF handler by @rohithkrn in #715
- Add unit test for empty model store initialization by @siddvenk in #716
- Fix no code tests in lmi test suite by @siddvenk in #717
- [serving] Load function from workflow directory by @frankfliu in #718
- [test] Reformat python code by @frankfliu in #720
- Creates S3 Cache Engine by @zachgk in #719
- [test] Refactor client.py by @frankfliu in #721
- update fastertransformers build instruction by @lanking520 in #722
- Add seq2seq streaming integ test by @rohithkrn in #724
- [test] Update tranformser-neuxornx gpt-j-b mode options by @frankfliu in #723
- [DeepSpeed][INF2] add vision components by @lanking520 in #725
- [python] Support pip install in offline mode by @frankfliu in #729
- [python] Add --no-index to pip install in offline mode by @frankfliu in #731
- adding llama model support by @lanking520 in #727
- tokenizer bug fixes by @lanking520 in #732
- [FT] change the dependencies so by @lanking520 in #734
- Remove TGI build and test pipeline by @xyang16 in #735
- ft_handler fix by @rohithkrn in #736
- [docker] Uses the same convention as tritonserver by @frankfliu in #738
- [ci] Upgrade jacoco to 0.8.8 to support JDK17+ by @frankfliu in #739
- [python] Fixes typo in fastertransformer handler by @frankfliu in #740
- [python] Adds text/plain content-type support by @frankfliu in #741
- [serving] Avoid unit-test hang by @frankfliu in #744
- Skeleton structure for sequence batch scheduler by @sindhuvahinis in #745
- update the wheel to have path fixed by @lanking520 in #747
- Adding project diagrams link to architecture.md by @alexkarezin in #742
- Add SageMaker integration test by @siddvenk in #705
- [python] Handle torch.cuda.OutOfMemoryError by @frankfliu in #749
- fix permissions for sm pysdk install script by @siddvenk in #751
- [serving] Improves model loading logging by @frankfliu in #750
- Asynchronous with PublisherBytesSupplier by @zachgk in #730
- [cache] Rename evn var DDB_TABLE_NAME to SERVING_DDB_TABLE_NAME by @frankfliu in #753
- [serving] Sets default minWorkers to 1 for GPU python model by @frankfliu in #755
- SM AOT Tests by @sindhuvahinis in #756
- [docker] Pin bitsandbytes version to 0.38.1 by @xyang16 in #754
- [fix] bump versions for new deepspeed wheel by @tosterberg in #733
- [fix] Fix bitsandbytes pip install by @xyang16 in #758
- [serving] Fixes log message by @frankfliu in #765
- add triton components in the nightly by @lanking520 in #767
- Add mme tests to sagemaker tests by @siddvenk in #763
- [wlm] Adds more logs to LMI engine detection by @frankfliu in #766
- fix typos with get default bucket prefix for sm session by @siddvenk in #768
- [serving] Uses predictable model name for HF model by @frankfliu in #771
- [serving] Adds parallel loading support for Python engine by @frankfliu in https://github.com/deepjavalibrary/d...
DJLServing v0.22.1 release
Key Features
- Add pytorch inf2 by @lanking520 in #535
- Adds chunked encoding support by @frankfliu in #551
- Ahead of Time Partitioning Support in FT default handler and test cases by @sindhuvahinis in #539
- Python engine streaming initial support by @rohithkrn in #573
- Adds async inference API by @frankfliu in #570
- Optimize batch inference for text generation by @siddvenk in #586
- Add default handler for AOT by @sindhuvahinis in #588
- Support text2text-generation task in deepspeed by @siddvenk in #606
- Throttles request if all workers are busy by @frankfliu in #656
- Infer recommended LMI engine by @siddvenk in #623
Bug Fixes
- [fix] requirements.txt install check testcase by @sindhuvahinis in #537
- [python] Fixes typo in unit test by @frankfliu in #554
- [serving] Fixes GPU auto scaling bug by @frankfliu in #561
- Fix typo in streaming utils by @rohithkrn in #581
- KServe data to bytes fix by @sindhuvahinis in #577
- [serving] Fixes NeuronUtils for SageMaker by @frankfliu in #583
- [python] Fixes python startup race condition by @frankfliu in #589
- [serving] Avoid download from s3 multiple time by @frankfliu in #596
- make output consistent by @lanking520 in #616
- [workflow] Fixes workflow loading issue by @frankfliu in #662
Enhancement
- [ci] Upgrades gradle to 8.0.2 by @frankfliu in #540
- [ci] Uses recommended way to create task in build.gradle by @frankfliu in #541
- update deepspeed container python version to 3.9 by @rohithkrn in #546
- [inf2] Adding gptj to transformers handler by @maaquib in #542
- install git by default for all python releases by @lanking520 in #555
- Load external dependencies for workflows by @xyang16 in #556
- [python] Infer default entryPoint if not provided by @frankfliu in https://github.com/deepjavalibrary/djl-serving/pull/5631
- [python] flush logging output before process end by @frankfliu in #567
- [serving] support load entryPoint with url by @frankfliu in #566
- [serving] deprecate s3Url and replace it with model_id by @frankfliu in #568
- Sets huggingface cache directory to /tmp in container by @lanking520 in #571
- add finalize callback function by @lanking520 in #572
- add pad token if not set by @lanking520 in #550
- Include Kserve plugins to distribution by @sindhuvahinis in #552
- [python] Passing arguments to model.py by @frankfliu in #560
- update pytorch docker to py3.9 by @rohithkrn in #547
- [serving] Detect triton engine by @frankfliu in #574
- [python] Refactor PyEngine with PassthroughNDManager by @frankfliu in #578
- Minimal followup for BytesSupplier changes by @zachgk in #580
- [serving] Sets djl cache directory to /tmp by @frankfliu in #585
- [python] Makes download entryPoint atomic by @frankfliu in #587
- [python] Use NeuronUtils to detect neuron cores by @frankfliu in #593
- [python] Fixes visible neuron cores environment variable by @frankfliu in #595
- [serving] Refactor per model configuration initialization by @frankfliu in #594
- Refactor CacheManager, Working Async by @zachgk in #591
- [ci] bump up deepspeed version by @tosterberg in #597
- [serving] Avoid compile time dependency on log4j by @frankfliu in #603
- [serving] add default dtype when running in deepspeed by @tosterberg in #617
- [serving] Adds deps folder to classpath in MutableClassLoader constructor by @frankfliu in #611
- Add support for streaming batch size > 1 by @rohithkrn in #605
- add ddb paginator for DJLServing by @lanking520 in #609
- update fastertransformer to follow huggingface parameters by @lanking520 in #610
- Change billing model to pay per request by @frankfliu in #612
- Upgrade dependencies version by @frankfliu in #613
- clean up docker build script and remove transformers docker image build by @lanking520 in #61
- [AOT] Upload sharded checkpoints to S3 by @sindhuvahinis in #604
- [serving] Upgrade to DJL 0.22.0 by @frankfliu in #622
- Unify tnx experience by @lanking520 in #619
- [serving] Update DJL version to 0.22.1 by @frankfliu in #627
- [Docker] update a few versions by @lanking520 in #620
- [serving] Make chunked read timeout configurable by @frankfliu in #652
- [python][streaming]Do best effort model type validation to fix configs without arch list by @rohithkrn in #649
- [AOT] Entrypoint download from url by @sindhuvahinis in #628
- [wlm] Moves LmiUtils.inferLmiEngine() into separate class by @frankfliu in #630
- [python][streaming]Batching fix and validate model architecture by @rohithkrn in #626
- skip special tokens by default by @lanking520 in #635
- [serving] Read x-synchronus and x-starting-token from input payload by @frankfliu in #637
- add torchvision by @lanking520 in #638
- [serving] Keep original content-type header by @frankfliu in #642
- [serving] Override inferred options in criteria by @frankfliu in #644
- Pinning aws-neuronx-* packages for Inf2 containers by @maaquib in #621
- [serving] Stop model server if plugin init failed by @frankfliu in #655
Documentation
- [docs] Fix serving doc by @xyang16 in #548
- [docs] Adds streaming configuration document by @rohithkrn in #659
- update docs to djl 0.22.1 by @siddvenk in #664
Full Changelog: v0.21.0...v0.22.1
DJLServing v0.21.0 release
Key Features
- Adds faster transformer support (#424)
- Adds Deepspeed ahead of time partition script in DLC (#466)
- Adds SageMaker MME support (#479)
- Adds support for stable-diffusion-2-1-base model (#484)
- Adds support for stable diffusion depth model (#488)
- Adds out of memory protection for model loading (#496)
- Makes load_on_devices per model setting (#493)
- Improves several per model settings
- Improves management console model loading and inference UI (#431, #432)
- Updates deepspeed to 0.8.0 (#465)
- Upgrades PyTorch to 1.13.1 (#414)
Enhancement
- Adds model_id support for huggingface models (#406)
- Adds AI template package (#485)
- Improves snakeyaml error message (#400)
- Improves s5cmd error handling (#442)
- Emits model inference metrics to log file (#452)
- Supports model.pt and model.onnx file name (#459)
- Makes batch a per-model setting (#456)
- Keeps failed worker status for 1 minute (#463)
- Detects engine to avoid unnecessarily downloading the MXNet engine (#481)
- Uses temp directory instead of /tmp (#404)
- Adds better logging and error handling for s5cmd process execution (#409)
- Uses jacoco aggregation report plugin (#421)
- Rolls back model if worker fails to start in synchronous mode (#427)
- Adds fastertransformer t5 integration test (#469)
- Print better stacktrace if channel is closed (#473)
- Supports running FasterTransformer in MPI mode (#474)
Bug fixes
- Adds fix to workaround SageMaker changes (#401)
- Treats empty HTTP parameter as absent (#429)
- Fixes inference console UI bug (#439)
- Fixed gpt-neox model name typo (#441)
- Fixes wrong onnx configuration (#449)
- Fixes issue with passing dtype in huggingface handler. Refactor dtype_f…
- Fixes issues with model_dir and model_id usage that occur when s3url is…
- Fixes broken vue tags (#453)
Breaking change
- Removes unnecessary Java engine adapter (#448)
- Removes djl-central module in favor of management console (#447)
- Sets model status to failure after exceed retry threshold (#455)
- Removes DLR support (#468)