Releases · deepjavalibrary/djl-serving
DJLServing v0.29.0 Release
Key Features
Details regarding the latest LMI container image_uris can be found here
DJL Serving Changes (applicable to all containers)
- Allows configuring health checks to fail based on various types of error rates
- When not streaming responses, all invocation errors will respond with the appropriate 4xx or 5xx HTTP response code
- Previously, for some inference backends (vllm, lmi-dist, tensorrt-llm) the behavior was to return 2xx HTTP responses when errors occurred during inference
- HTTP Response Codes are now configurable if you require a specific 4xx or 5xx status to be returned in certain situations
- Introduced the @input_formatter and @output_formatter annotations to bring your own script for pre- and post-processing.
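A minimal sketch of a custom script using these annotations, assuming the decorators are importable from the djl_python package; the module paths, function signatures, and attribute access shown here are illustrative assumptions rather than the documented API:

```python
# model.py shipped alongside the model artifacts -- a minimal, illustrative sketch.
# The import paths and the exact RequestOutput attributes are assumptions; check
# the LMI input/output schema documentation for the authoritative API.
import json

from djl_python.input_formatter import input_formatter
from djl_python.output_formatter import output_formatter


@input_formatter
def parse_custom_schema(raw_input, **kwargs):
    # Map a custom request body onto the prompt/parameters the handler expects.
    # The shape of the returned object is an assumption for illustration only.
    return {
        "inputs": raw_input["prompt"],
        "parameters": {"max_new_tokens": raw_input.get("max_tokens", 128)},
    }


@output_formatter
def format_custom_schema(request_output) -> str:
    # RequestOutput carries the generated tokens, log probabilities, and finish
    # reason; serialize whatever subset of it your clients need. str() is used
    # here only as a placeholder for the real attribute access.
    return json.dumps({"generated": str(request_output)}) + "\n"
```

Because the decorated functions live in your own script, you only override pre- and post-processing; the backend's handler continues to run inference as usual.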
LMI Container (vllm, lmi-dist)
- vLLM updated to version 0.5.3.post1
- Added MultiModal support for Vision Language Models using the OpenAI Chat Completions Schema (see the example request after this list).
- More details available here
- Supports Llama 3.1 models
- Supports beam search, best_of, and n with non-streaming output.
- Supports chunked prefill in both vllm and lmi-dist.
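A minimal sketch of such a multimodal request, assuming a local DJL Serving endpoint; the endpoint path, image URL, and parameter values are placeholders rather than documented defaults (on SageMaker the same payload goes through invoke-endpoint instead):

```python
# Illustrative multimodal request following the OpenAI Chat Completions schema.
import requests

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 256,
}
resp = requests.post("http://localhost:8080/invocations", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```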
TensorRT-LLM Container
- TensorRT-LLM updated to version 0.11.0
- [Breaking change] Flan-T5 is now supported with the C++ Triton backend. Removed Flan-T5 support for the TRTLLM Python backend.
Transformers NeuronX Container
- Upgraded to Transformers NeuronX 2.19.1
Text Embedding (using the LMI container)
- Various performance improvements
Enhancements
- best_of support in output formatter by @sindhuvahinis in #1992
- Convert cuda env tgi variables to lmi by @sindhuvahinis in #2013
- [serving] Update python typing to support py38 by @tosterberg in #2034
- Refactor lmi_dist and vllm to support best_of with RequestOutput by @sindhuvahinis in #2011
- [metrics] Improve prometheus metric type handling by @frankfliu in #2039
- [serving] Fail ping if error rate exceed by @frankfliu in #2040
- [serving] Update default max worker to 1 for GPU by @xyang16 in #2048
- [Serving] Implement SageMaker Secure Mode & support for multiple data sources by @ethnzhng in #2042
- AutoAWQ Integration Script by @a-ys in #2038
- [secure-mode] Refactor secure mode plugin by @frankfliu in #2058
- [vllm, lmi-dist] add support for top_n_tokens by @sindhuvahinis in #2051
- [python] refactor rolling batch inference method by @sindhuvahinis in #2090
- Record telemetry including acceptance rate by @zachgk in #2088
- bump up trtllm to 0.10.0 by @ydm-amazon in #2043
- [fix] Set tokenizer on output_formatter for TRT-LLM Handlers by @maaquib in #2100
- [dockerfile] pin datasets to 2.19.1 in trtllm by @sindhuvahinis in #2104
- [serving] make http response codes configurable for exception cases by @siddvenk in #2114
- update flags to prevent deprecation by @lanking520 in #2118
- [docker] Update DJL to 0.29.0-SNAPSHOT by @frankfliu in #2119
- [awscurl] Handles Bedrock special url case by @frankfliu in #2120
- [secure-mode] Add properties allowlist validation by @ethnzhng in #2129
- [secure-mode] add per-model configs to allowlist by @ethnzhng in #2132
- [docker] remove tensorflow native from cpu-full image by @frankfliu in #2136
- [onnx] Allows to customize onnxruntime optimization level by @frankfliu in #2137
- [python] add support for 3p use-case by @siddvenk in #2122
- [python] move parse input functions to input_parser.py by @sindhuvahinis in #2092
- [python] log exception stacktrace for exceptions in python, improve r… by @siddvenk in #2142
- [engine] include lmi recommended entrypoint when model.py exists by @sindhuvahinis in #2148
- [3p][python]add metering and error details to 3p outputs by @siddvenk in #2143
- Support multi node for lmi-dist by @xyang16 in #2125
- [python] refactor input parser to support Request by @sindhuvahinis in #2145
- [python] add max_logprobs vllm configuration to EngineArgs by @sindhuvahinis in #2154
- [python] parse input only when new requests are received by @sindhuvahinis in #2155
- [lmi] remove redundant auto logic from python handler by @siddvenk in #2152
- [python] support multimodal models openai api in vllm by @sindhuvahinis in #2147
- Add stronger typing for chat completions use-cases by @siddvenk in #2161
- [awscurl] Supports full jsonquery syntax by @frankfliu in #2163
- [Neo] Neo compilation/quantization script bugfixes by @a-ys in #2115
- [docker] bump neuron to 2.19 SDK by @tosterberg in #2160
- [python] add input formatter decorator by @sindhuvahinis in #2158
- [docker] bump neuron vllm to 5.0 by @tosterberg in #2169
- [lmi] Upgrade lmi dockerfile for 0.29.0 release by @maaquib in #2156
- add 0.5.1 supported models by @lanking520 in #2151
- [python] update max_logprobs default for vllm 0.5.1 by @sindhuvahinis in #2159
- [wlm] Trim whitespce for model_id by @frankfliu in #2175
- [fix] optimum update stable diffusion support by @tosterberg in #2179
- [serving][python] Support non 200 HTTP response codes for non-streami… by @siddvenk in #2173
- [awscurl] Includes input data in output file by @frankfliu in #2184
- [multimodal] support specifying image_token, infering default if not … by @siddvenk in #2183
- [Neo] Refactor Neo TRT-LLM partition script by @ethnzhng in #2166
- [Partiton] Don't output option.parallel_loading when partitioning by @a-ys in #2189
- Introduce pipeline parallel degree config by @nikhil-sk in #2171
- add trtllm container update by @lanking520 in #2191
- [serving] Adds mutliple node cluster configuration support by @frankfliu in #2190
- [aot] fix aot partition args, add pipeline parallel by @tosterberg in #2196
- bump up bench to 0.29.0 by @ydm-amazon in #2199
- [serving] Download model while initialize multi-node cluster by @frankfliu in #2198
- [lmi] Dependencies upgrade for 0.29.0 by @maaquib in #2194
- [chat][lmi] use generation prompt in tokenizer to avoid bot prompt re… by @siddvenk in #2195
- [lmi] support multimodal in lmi-dist by @siddvenk in #2182
- Revert "Update max_model_len for llama-3 lora test" by @lanking520 in #2207
- [wlm] Fixes retrieve config.json error by @frankfliu in #2212
- determine bedrock usage based on explicit property rather than inferr… by @siddvenk in #2214
- [post-7/22]Add chunked prefill support in vllm and lmi-dist by @rohithkrn in #2202
- lazy compute input ids by @lanking520 in #2216
- [docker] neuron bump to 2.19.1 by @tosterberg in https://...
v0.28.0
Key Features
Check out our latest Large Model Inference Containers.
LMI container
- Provided general performance optimization.
- Added text embedding support
- Our text embedding solution is 5% faster than the HF TEI solution.
- The Multi-LoRA feature now supports Llama 3 and AWS models
TensorRT-LLM container
- Upgraded to TensorRT-LLM 0.9.0
- AWQ, FP8 support for Llama3 models on G6/P5 machines
- The default max_new_tokens is now 16384
- Fixed critical memory leaks during long runs.
- Fixed model hanging issues.
Transformers NeuronX container
- Upgraded to Transformers NeuronX 2.18.2
DeepSpeed container (deprecated)
The DeepSpeed container is now deprecated. If you are not using the DeepSpeed engine, all you need is the 0.28.0-lmi container; switch to it and continue as before.
New Model Support
- LMI container
- Arctic, DBRX, Falcon 2, Command-R, InternLM2, Phi-3, Qwen2MoE, StableLM, StarCoder2, Xverse, and Jais
- TensorRT-LLM container
- Gemma
CX Usability Enhancements/Changes
- Model loading CX:
- SERVING_LOAD_MODELS env is deprecated, use HF_MODEL_ID instead.
- Inference CX:
- Input/Output schema changes:
- Speculative decoding is now streamed, returning multiple jsonlines tokens at each generation step
- Standardized the output formatter signature:
- We reduced the parameters of output_formatter by introducing the RequestOutput class.
- RequestOutput contains all input information, such as text, token_ids, and parameters, as well as output information such as output tokens, log probabilities, and other details like the finish reason. Check this doc to know more.
- Introduced prompt details in the details field of the response for the vLLM and lmi-dist rolling batch options. These prompt details contain the prompt token_ids, their corresponding text, and their log probabilities. Check this doc to know more.
- New error handling mechanism:
- Improved our error handling for container responses for rolling batch. Check this doc to know more
- New CX capability:
- We introduced the OPTION_TGI_COMPAT env var, which enables you to get the same response format as TGI.
- We also now support SSE text/event-stream data format.
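A minimal sketch of how a client can consume these two additions together, assuming a container started with OPTION_TGI_COMPAT=true and a model deployed via HF_MODEL_ID; the endpoint path and payload field names follow the usual LMI schema but should be verified against the docs referenced above:

```python
# Illustrative streaming client. With SSE enabled, the body arrives as
# text/event-stream "data: ..." lines; with OPTION_TGI_COMPAT=true the
# events follow the TGI-compatible format.
import json
import requests

payload = {
    "inputs": "What is Deep Java Library?",
    "parameters": {"max_new_tokens": 128},
    "stream": True,
}
with requests.post("http://localhost:8080/invocations", json=payload,
                   stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            print(json.loads(line[len(b"data:"):].strip()))
```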
Breaking Changes
- Inference CX for rolling batch:
- Token id changed from a list to an integer in the rolling batch response.
- Error handling: “finish_reason: error” during rolling batch inference
- DeepSpeed container has been deprecated; its functionality is generally available in the LMI container now
Known Issues
- TRTLLM periodically crashes during model compilation when using A100 GPU
- TRTLLM AWQ quantization currently crashes due to an internal error
Enhancements
- [serving] Avoid unecessary copy plugin jars. by @frankfliu in #1793
- [serving] Refactor rolling batch detection logic by @frankfliu in #1781
- [serving] Uses Utils.openUrl() to download files by @frankfliu in #1810
- [docker] Install onnxruntime in docker by @frankfliu in #1790
- [lmi] infer entry point for lmi models in LmiConfigRecommender by @siddvenk in #1779
- [trt][lmi] always set mpi mode for trt container by @siddvenk in #1803
- [awscurl] Support TGI response with coral streaming by @frankfliu in #1769
- [awscurl] Fix output for server sent event content-type by @frankfliu in #1784
- [tnx] installing tiktoken and blobfile for TNX by @lanking520 in #1762
- [tnx] upgrade neuron sdk to 2.18.1 by @tosterberg in #1765
- [tnx] add torchvision for resnet tests by @tosterberg in #1768
- [tnx] adding additional neuron config options by @tosterberg in #1777
- [tnx] add vllm to container by @tosterberg in #1786
- update lora test to tp=max and variable prompt lengths by @rohithkrn in #1766
- [Neo] Add more logging to Neo Neuron partition script by @a-ys in #1802
- Migrate to pydantic v2 by @sindhuvahinis in #1764
- replace peft with open source version by @lanking520 in #1813
- use corretto java by @lanking520 in #1818
- [Java][DLC] try with CA cert update by @lanking520 in #1820
- upgrade trtllm to 0.9.0 by @lanking520 in #1795
- Remove libnccl-dev installation in trtllm container by @nskool in #1822
- [tnx] version bump Neuron SDK and Optimum by @tosterberg in #1826
- [serving] Support set arguments via env var by @frankfliu in #1829
- Creates initial benchmark suite by @zachgk in #1831
- Resolve protected_namespaces warning for pydantic by @sindhuvahinis in #1834
- [chat] Remove unused parameters by @xyang16 in #1835
- Updated s3url gpt4all lora adapter_config.json by @sindhuvahinis in #1836
- Refactor parse_input for handlers by @sindhuvahinis in #1788
- [awscurl] Adds seed and test duration by @frankfliu in #1844
- [djl-bench] Refactor benchmark for arbitary inputs by @frankfliu in #1845
- [djl-bench] Uses Shape.parseShapes() by @frankfliu in #1849
- [Neo] Add JumpStart Integration to SM Neo Neuron AOT compilation flow by @a-ys in #1854
- update container for LMI by @lanking520 in #1814
- [tnx] Default to generation.config generation settings by @tosterberg in #1894
- [tnx] Read-only Neuron cache workaround for SM by @a-ys in #1879
- [tnx] add greedy speculative decoding by @tosterberg in #1902
- [awscurl] Adds extra parameters to dataset by @frankfliu in #1862
- [awscurl] allow write to json file by @lanking520 in #1859
- [DLC] update deps and wheel by @lanking520 in #1860
- Stop the model server when model download fails by @sindhuvahinis in #1842
- [vllm, lmi-dist] Support input log_probs by @sindhuvahinis in #1861
- Use s3 models for llama gptq by @rohithkrn in #1872
- [serving] Avoid using tab in logging by @frankfliu in #1871
- Add placeholder workflow by @ydm-amazon in #1874
- [T5] TRTLLM python repo model check by @sindhuvahinis in #1870
- add TGI compat feature for rollingbatch by @lanking520 in #1866
- Add better token handling under hazardous condition by @lanking520 in #1875
- Update TRT-LLM Container to build with Triton 24.04 by @nskool in #1878
- Remove docker daemon restarts from p4d workflow by @nskool in #1882
- [RollingBatch] allow appending for token by @lanking520 in #1885
- [RollingBatch] remove pending requests by @lanking520 in #1884
- [awscurl] Allows override stream parameter for dataset by @frankfliu in #1895
- allow TGI compat to work with output token ids by @lanking520 in #1900
- Add max num tokens workflow by @ydm-amazon in #1899
- Convert huggingface model to onnx by @xyang16 in #1888
- Improve the artifact...
DJLServing v0.27.0 Release
Key Changes
- Large Model Inference Containers 0.27.0 release
- DeepSpeed container
- Added DBRX and Gemma model support.
- Provided general performance optimization.
- Added new performance enhancing features support like Speculative Decoding.
- TensorRT-LLM container
- Upgraded to TensorRT-LLM 0.8.0
- Transformers NeuronX container
- Upgraded to Transformers NeuronX 2.18.0
- Multi-Adapter LoRA Support
- Provided multi-adapter inference functionality in LMI DLC.
- CX Usability Enhancements
- Provided a seamless migration experience across different LMI DLCs.
- Implemented the Low code No code experience.
- Supported OpenAI compatible chat completions API.
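A minimal sketch of what a chat completions request against a DJL Serving endpoint can look like, assuming a local deployment; the endpoint path and parameter values are placeholders, and on SageMaker the same payload is sent through invoke-endpoint:

```python
# Illustrative request using the OpenAI-compatible chat completions schema.
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what DJL Serving does."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8080/invocations", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```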
Enhancement
- [LMIDist] Allow passing in ignore_eos_token param by @xyang16 in #1489
- [LMIDist] Make ignore_eos_token default to false by @xyang16 in #1492
- Makes netty buffer size configurable by @frankfliu in #1494
- translate few vllm and trtllm params to HF format by @sindhuvahinis in #1500
- Align properties parsing to be similar to java by @ydm-amazon in #1502
- check for rolling batch "disable" value by @sindhuvahinis in #1506
- add max model length support on vLLM by @lanking520 in #1510
- Creates auto increment ID for models by @zachgk in #1109
- making default dtype to fp16 for compilation by @lanking520 in #1512
- Use server-provided seed if not in request params in deepspeed handler by @davidthomas426 in #1520
- [Neuron][WIP] add log probs calulation in Neuron by @lanking520 in #1516
- remove a bit of unused code by @ydm-amazon in #1536
- Remove unused device parameter from all rolling batch classes by @ydm-amazon in #1538
- dump OPTION_ env vars at startup by @siddvenk in #1541
- [vLLM] add enforce eager as an option by @lanking520 in #1547
- [awscurl] Prints invaild response by @frankfliu in #1550
- roll back enum changes by @ydm-amazon in #1551
- [awscurl] Refactor time to first byte calculation by @frankfliu in #1557
- [serving] Allows configure log level at runtime by @frankfliu in #1560
- Stream returns only after putting placeholder finishes by @zachgk in #1567
- type hints and comments for trtllm handler by @ydm-amazon in #1558
- [vLLM] add speculative configs by @lanking520 in #1553
- Add rolling batch type hints: Part 1 by @ydm-amazon in #1564
- Remove adapters preview flag by @zachgk in #1573
- [serving] Adds model event listener by @frankfliu in #1570
- [serving] Add model loading metrics by @frankfliu in #1576
- Speculative decoding in LMI-Dist by @KexinFeng in #1505
- type hints for scheduler rolling batch by @ydm-amazon in #1577
- [serving] Uses model.intProperty() api by @frankfliu in #1582
- [serving] Ignore CUDA OOM when collecting metrics by @frankfliu in #1581
- Remove test as the model is incompatible with transformers upgrade by @rohithkrn in #1575
- [serving] Adds rolling batch metrics by @frankfliu in #1583
- [serving] Uses dimension for model metric by @frankfliu in #1587
- rolling batch type hints part 3 by @ydm-amazon in #1584
- [SD][vLLM] record acceptance by @lanking520 in #1586
- [serving] Adds promtheus metrics support by @frankfliu in #1593
- [feat] Benchmark code for speculative decoding in lmi-dist by @KexinFeng in #1591
- [lmi] add generated token count to details by @siddvenk in #1600
- [console] use StandardCharset instead of deprecated Charset by @siddvenk in #1601
- [awscurl] add download steps to README.md by @siddvenk in #1605
- [lmi][deprecated] remove option.s3url since it has been deprecated fo… by @siddvenk in #1610
- [serving] Skip testPrometheusMetrics when run in IDE by @frankfliu in #1611
- Use workflow template for workflow model_dir by @zachgk in #1612
- [Partition] Remove redudant model splitting, Improve Input Model Parsing by @a-ys in #1609
- Add handler for new lmi-dist by @rohithkrn in #1595
- [lmi] add parameter to allow full text including prompt to be returne… by @siddvenk in #1602
- support cuda driver on sagemaker by @lanking520 in #1618
- remove checker for awq with enforce eager by @lanking520 in #1620
- Add pytorch-gpu for security patching by @maaquib in #1621
- Refactor vllm and rubikon engine rolling batch by @rohithkrn in #1623
- Update TRT-LLM Dockerfile for v0.8.0 by @nskool in #1622
- [UX] sampling with vllm by @sindhuvahinis in #1624
- [vLLM] reduce speculative decoding gpu util to leave room for draft model by @lanking520 in #1628
- [lmi] update auto engine logic for vllm and lmi-dist by @siddvenk in #1617
- [python] Encode error in single line for jsonlines case. by @frankfliu in #1630
- Single model adapter API by @zachgk in #1616
- remove all current no-code test cases by @siddvenk in #1635
- Update the build script to use vLLM 0.3.3 by @lanking520 in #1637
- Update lmi-dist rolling batch to use rubikon engine by @rohithkrn in #1639
- Adds adapter registration options by @zachgk in #1634
- Supports vLLM LoRA adapters by @zachgk in #1633
- add customer required field by @lanking520 in #1640
- [tnx] bump optimum version by @tosterberg in #1632
- Updates dependencies version to latest by @frankfliu in #1647
- updated dependencies for LMI by @lanking520 in #1648
- [DO NOT MERGE][CAN APPROVE]change flash attn url by @lanking520 in #1650
- [cache] Remove gson from fatjar of cache by @frankfliu in #1649
- [python] Move output formatter to request level by @xyang16 in #1644
- [tnx] improve model partitioning time by @tosterberg in #1652
- [tnx] support codellama 70b instruct tokenizer by @tosterberg in #1653
- [python] Remove output_formatter from vllm and lmi-dist sampling para… by @xyang16 in #1654
- [wlm] Makes generateHuggingFaceConfigUri public by @frankfliu in #1656
- [tnx] fix output formatter as param implementation by @tosterberg in #1657
- [lmi] use hf token to get model config for gated/private models by @siddvenk in #1658
- [UX] Changing some default parameters by @sindhuvahinis in #1659
- add parameters to part of the field by @lanking520 in https://github.com/deepja...
DJLServing v0.26.0 Release
Key Changes
- TensorRT-LLM 0.7.1 Upgrade, including support for Mixtral 8x7B MOE model
- Optimum Neuron Support
- Transformers-NeuronX 2.16 Upgrade, including support for continuous batching
- LlamaCPP support
- Many Documentation updates with updated model deployment configurations
- Refactor of configuration management across different backends
- CUDA 12.1 support for DeepSpeed and TensorRT-LLM containers
Enhancements
- [UX][RollingBatch] add details function to the rolling batch by @lanking520 in #1353
- [TRTLLM][UX] add trtllm changes to support stop reason and also log prob by @lanking520 in #1355
- [Docker] upgrade cuda 12.1 support for DJLServing by @lanking520 in #1370
- [feat] optimum handler creation by @tosterberg in #1362
- [python] Update lmi_dist warmup logic by @xyang16 in #1367
- [RollingBatch] optimize rolling batch result by @lanking520 in #1372
- [python] Sets mpi_model property for python to consume by @frankfliu in #1360
- [vLLM] add load_format to support for mixtral model by @lanking520 in #1391
- [python] Sets rolling batch threads as daemon thread by @frankfliu in #1371
- [awscurl] Adds awscurl to repo by @frankfliu in #1408
- Add config passing in lmi-dist by @xyang16 in #1382
- Upgrade flash attention to 2.3.0 by @xyang16 in #1402
- [TRTLLM] Bump up trtllm to version 0.7.1 by @ydm-amazon in #1452
- [tnx] add gqa to properties by @siddvenk in #1478
- [TRTLLM] add enable kv cache reuse by @lanking520 in #1460
- [serving] Adds llama.cpp support by @frankfliu in #1464
- [serving] Allows plugin to override default HTTP handler by @frankfliu in #1424
- [wlm] enable max workers env var for MPI mode by @frankfliu in #1438
- Support AWQ quantization in LMI Dist by @xyang16 in #1435
- [python] Excludes test code from jar by @frankfliu in #1449
- [Refactor][UX] Refactoring vllm rolling batch properties by @sindhuvahinis in #1369
- [DLC][TNX] inf2 stable diffusion handler refactor by @tosterberg in #1393
- [Refactor] lmi dist rolling batch properties by @sindhuvahinis in #1409
- [Refactor] scheduler rolling batch refactor by @sindhuvahinis in #1411
- [awscurl] Allows search nested json key by @frankfliu in #1453
- make jsonline outputs generated tokens by @lanking520 in #1454
- [serving] Loads model zoo and engine from deps folder on startup by @frankfliu in #1457
- [RollingBatch] add customized rollingbatch by @lanking520 in #1468
Bug Fixes
- Fix rolling batch properties by @xyang16 in #1326
- [fix] tnx quantization and docs by @tosterberg in #1332
- [fix] Context length estimate datatype by @sindhuvahinis in #1350
- [UX][CI] fix a few bugs by @lanking520 in #1357
- [fix] inf2 container freeze compiler versions by @tosterberg in #1389
- [Fix] fix the lmi dist device by @sindhuvahinis in #1387
- [vllm] pass hf revision to vllm engine, pin phi2 model revision for test by @siddvenk in #1485
- [python] Fixes mpi_mode properties by @frankfliu in #1368
- [python] Fixes mpi_mode issues by @frankfliu in #1373
- [wlm] Fixes get maxWorkers bug for python engine by @frankfliu in #1375
- [RollingBatch] fix request id in rolling batch by @lanking520 in #1481
- [TRTLLM] Fix bug in handler by @ydm-amazon in #1459
- Works with manual initialization by @zachgk in #1473
- [TNX] version update to 2.16.0 sdk and continuous batching by @tosterberg in #1437
Documentation Updates
- [doc] Update current properties for TNX handler by @tosterberg in #1322
- [doc] lmi configurations readme by @sindhuvahinis in #1323
- [doc] Placeholder for TrtLLM tutorial and tuning guide by @sindhuvahinis in #1333
- [doc] LMI environment variable instruction by @sindhuvahinis in #1334
- [doc] TransformerNeuronX tuning guide by @sindhuvahinis in #1335
- [doc] TensorRt-Llm tuning guide by @sindhuvahinis in #1339
- [doc] Updating new TensorRT-LLM configurations by @sindhuvahinis in #1340
- [doc] DeepSpeed tuning guide by @sindhuvahinis in #1342
- [doc] LMI dist tuning guide by @sindhuvahinis in #1341
- [doc] seq_scheduler_document by @KexinFeng in #1336
- [doc] large model inference document by @sindhuvahinis in #1343
- [doc] fix docker image uri for trtllm tutorial by @sindhuvahinis in #1348
- [docs] Adds option.max_output_size document by @frankfliu in #1354
- [docs] fix tnx n_positions description by @tosterberg in #1401
- [doc] instruction on adding new properties to default handlers by @sindhuvahinis in #1419
- Add AOT Tutorial by @ydm-amazon in #1338
- [docker] Avoid JVM consume GPU memory by @frankfliu in #1365
- [LMI] DJLServing side placeholder by @lanking520 in #1330
- [Tutorials] add tensorrt llm manual by @lanking520 in #1412
- [TRTLLM] add line in docs for chatglm by @ydm-amazon in #1425
- Update LMI dist tuning guide by @xyang16 in #1428
- [TRTLLM] update the docs and more model support by @lanking520 in #1415
- [TRT-LLM] Update docs for newly added TRT-LLM build args in 0.7.1 by @rohithkrn in #1461
- [TNX][config] update rolling batch batch size behavior and docs by @tosterberg in #1404
- [TRTLLM] Update the docs - add mixtral by @ydm-amazon in #1434
- [TRTLLM] Add gpt model to docs and ci by @ydm-amazon in #1475
CI/CD Updates
- [tnx] version bump to 2.15.2 by @tosterberg in #1363
- [CI][IB] Support variables by @zachgk in #1356
- Bump up DJL version to 0.26.0 by @xyang16 in #1364
- [ci] Fixes nightly gpu integration test by @frankfliu in #1378
- [CI] update the model to fp16 by @lanking520 in #1390
- update models for TRT-LLM 0.6.1 by @rohithkrn in #1392
- [CI][fix] Sagemaker integration test cloudwatch metrics fix by @sindhuvahinis in #1385
- [CI][fix] Inf2 AOT integration test fix by @tosterberg in #1395
- [ci] Fixes flaky async token test by @frankfliu in #1429
- [ci] Fixes merge conflict issue by @frankfliu in #1431
- [ci] Upgrades CI to use JDK 17 by @frankfliu in #1413
- [CI][fix] remove g5xl and introduce rolling batch in lmic by @sindhuvahinis in htt...
DJLServing v0.25.0 Release
Key Changes
- TensorRT LLM Integration. DJLServing now supports using the TensorRT LLM backend to deploy Large Language Models.
- See the documentation here
- Llama2-13b using TRTLLM example notebook
- SmoothQuant support in DeepSpeed (see the example configuration after this list)
- Llama2-13b using SmoothQuant with DeepSpeed example notebook
- Rolling batch support in DeepSpeed to boost throughput
- Updated Documentation on using DJLServing to deploy LLMs
- We have added documentation for supported configurations per container, as well as many new examples
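The SmoothQuant path above is configured through serving.properties. A minimal sketch for the DeepSpeed container, where the model id is a placeholder and the quantization option names are assumptions based on the LMI DeepSpeed guide of this release (verify the exact keys against the documentation linked above):

```
engine=DeepSpeed
# placeholder model id
option.model_id=TheBloke/Llama-2-13B-fp16
option.tensor_parallel_degree=4
option.dtype=fp16
# assumed option names for SmoothQuant-based int8 quantization
option.quantize=smoothquant
option.smoothquant_alpha=0.65
```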
Enhancements
- Add context length estimate for Neuron handler by @lanking520 in #1184
- [INF2] allow neuron to load split model directly by @lanking520 in #1186
- Adding INF2 (transformers-neuronx) compilation latencies to SageMaker Health Metrics by @Lokiiiiii in #1185
- [serving] Auto detect XGBoost engine with .xgb extension by @frankfliu in #1196
- add memory checking in place to identify max by @lanking520 in #1191
- [python] Do not set default value for truncate by @xyang16 in #1193
- Add aiccl support by @maaquib in #1179
- Setting default datatype for deepspeed handlers by @sindhuvahinis in #1203
- add trtllm container build by @lanking520 in #1215
- Add TRTLLM TRT build from our managed source by @lanking520 in #1199
- [python] Remove generation_dict in lmi_dist_rolling_batch by @xyang16 in #1217
- install s5cmd to trtllm by @lanking520 in #1219
- Update mpirun options by @xyang16 in #1220
- [python] Optimize batch serialization by @frankfliu in #1223
- upgrade vllm by @lanking520 in #1238
- Supports docker build with local .deb by @zachgk in #1231
- Do warmup in multiple requests by @xyang16 in #1216
- [python] Update PublisherBytesSupplier API by @frankfliu in #1242
- remove tensorrt installation by @lanking520 in #1243
- Use CUDA runtime image instead of CUDA devel. by @chen3933 in #1201
- remove unused components by @lanking520 in #1245
- [DeepSpeed DLC] separate container build with multi-layers by @lanking520 in #1246
- New PR for tensorrt llm by @ydm-amazon in #1240
- [python] Buffer tokens for rolling batch by @frankfliu in #1249
- Add trt-llm engine build step during model initialization by @rohithkrn in #1235
- [serving] Adds token latency metric by @frankfliu in #1251
- install trtllm toolkit by @lanking520 in #1254
- [TRTLLM] some clean up on trtllm handler by @lanking520 in #1248
- [TRTLLM] use tensorrt wheel by @lanking520 in #1255
- Adds versions as labels in dockerfiles by @zachgk in #1160
- [TRTLLM] add trtllm with no deps by @lanking520 in #1256
- [TRT partition] add realtime stream reader for the conversion script by @lanking520 in #1259
- [TRTLLM] always setting request output length by @lanking520 in #1258
- Update trtllm toolkit path by @rohithkrn in #1260
- allow gpu detection by @lanking520 in #1261
- add trtllm cuda-compat by @lanking520 in #1247
- [feat] Add serving.properties parameter for compiled graph path inf2 by @tosterberg in #1262
- Inf2 properties refactoring using pydantic by @sindhuvahinis in #1252
- MME - deviceId while creating workers by @sindhuvahinis in #1257
- [serving] Refactor TensorRT-LLM partition code by @frankfliu in #1267
- [DS] Deepspeed rolling batch support by @maaquib in #1295
- Allow user to pass in max_batch_prefill_tokens by @xyang16 in #1320
- add smoothquant as options by @lanking520 in #1285
- Deepspeed configurations refactoring by @sindhuvahinis in #1280
- update smoothquant arg by @rohithkrn in #1291
- [python] Adds do_sample support for trtllm by @frankfliu in #1290
- [wlm] Supports model_id point to a local directory by @frankfliu in #1276
- [SageMaker Galactus developer experience] model load integration to DJL serving by @haNa-meister in #1230
- [feat] Better output format from seq-scheduler by @KexinFeng in #1305
- [serving] Upgrades AWSSDK version to 2.21.19 by @frankfliu in #1313
- [serving] Uses seconds for ChunkedBytesSupplier timeout by @frankfliu in #1311
- install datasets in trtllm container by @rohithkrn in #1270
- TensorRrt Configs refactoring by @sindhuvahinis in #1275
- [TRTLLM] fix corner case that model_id point to local path by @lanking520 in #1317
- Huggingface configurations refactoring by @sindhuvahinis in #1283
- Calculate max_seq_length in warmup dynamically by @xyang16 in #1298
- Increase memory limit for rolling batch integration octocoder model by @xyang16 in #1319
- [TRTLLM] remove default repetition penalty by @lanking520 in #1321
- [feat] Expose max sparse params by @KexinFeng in #1273
- [NeuronX] add attention mask porting from optimum-neuron by @lanking520 in #1206
- [partition] extract properties files by @sindhuvahinis in #1293
- add checkpoint to ds properties by @sindhuvahinis in #1296
- [vllm] standardize input parameters by @frankfliu in #1301
- [TRTLLM] format better for logging by @lanking520 in #1309
- Change default top_k and temperature parameters in TRTLLM rolling batch by @ydm-amazon in #1312
- Add tokenizer check for triton repo by @rohithkrn in #1274
- [SageMaker Galactus developer experience] use python backend when schema is customized by @haNa-meister in #1286
Bug Fixes
- [bug fix] add entrypoint camel case recovery by @lanking520 in #1181
- Fix max tensor_parallel_degree by @zachgk in #1182
- Fix lmi_dist garbage output issue by @xyang16 in #1187
- [fix] update context estimate interface by @tosterberg in #1194
- Check logs for aiccl usage in integ test by @maaquib in #1202
- [serving] Revert management URI matching regex by @frankfliu in #1209
- Update datasets version in deepspeed.Dockerfile by @maaquib in #1211
- [console] Fixes bug for docker port mapping case by @frankfliu in https://github.com/deepjavalibrary/djl-ser...
DJLServing v0.24.0 release
Key Features
- Updates Components
- Updates Neuron to 2.14.1
- Updates DeepSpeed to 0.10.0
- Improved Python logging
- Improved SeqScheduler
- Adds DeepSpeed dynamic int8 quantization with SmoothQuant
- Supports Llama 2
- Supports Safetensors
- Adds Neuron dynamic batching and rolling batch
- Adds Adapter API Preview
- Supports HuggingFace Stopwords
Enhancement
- Allow overriding truncate parameter in request by @maaquib in #953
- Enable multi-gpu inference (device_map='auto') on seq_batch_scheduler by @KexinFeng in #960
- [wlm] Allows set defatul options with environment variable by @frankfliu in #961
- Enable MPI model by environment variable by @frankfliu in #964
- Add built-in json formatter by @frankfliu in #965
- [serving] Update tnx handler for 2.12 supported models by @tosterberg in #896
- [serving] Adds more built-in logging options by @frankfliu in #974
- Bump up DJL version to 0.24.0 by @frankfliu in #979
- [serving] Print out CUDA and Neuron device information by @frankfliu in #978
- [docker] bump transformers-neuronx for small llama-2 support by @tosterberg in #980
- [python] Update lmi-dist by @xyang16 in #975
- Install flash attention using wheel by @xyang16 in #982
- [python] Make paged attention configurable by @xyang16 in #986
- [python] Refactor lmi_dist rolling batch by @xyang16 in #987
- [docker] Upgrade to DJL 0.24.0 by @frankfliu in #989
- Set jsonlines formatter for lmi-dist rolling batch test by @xyang16 in #991
- Install FasterTransformer libs with llama support by @rohithkrn in #993
- Add trust_remote_code to ft handler by @siddvenk in #994
- [serving] Improves PyProcess lifecycle logging by @frankfliu in #996
- [python] Adds pid to python process log by @frankfliu in #997
- [python] Includes individual headers for server side batching by @frankfliu in #1001
- update ft python wheel with llama support by @rohithkrn in #1002
- [serving] Install commong-loggings dependency for XGBoost engine by @frankfliu in #1004
- [python] Finds optimal batch partition by @bryanktliu in #984
- add error handling for rolling batch by @lanking520 in #1005
- [serving] Allows print access log to console by @frankfliu in #1009
- [serving] Adds unregister model log by @frankfliu in #1010
- [python] validate each request in the batch by @frankfliu in #1008
- Update dependencies version by @frankfliu in #1012
- [serving] Return proper HTTP status code for each batch by @frankfliu in #1013
- [HF Streaming] use decode instead batch decode for streaming by @lanking520 in #1016
- [docker] disable TORCH_CUDNN_V8_API_DISABLED for PyTorch 2.0.1 by @frankfliu in #1018
- Allows set TENSOR_PARALLEL_DEGREE=max by @frankfliu in #1019
- Simplify handling of min/max workers by @zachgk in #1021
- [docker] Updates cache directory by @frankfliu in #1027
- [benchmark] Adds safetensors support by @frankfliu in #1031
- [VLLM] use more complex logic to ensure all result are captured by @lanking520 in #1035
- [VLLM] add option to set batched tokens by @lanking520 in #1036
- update inf2 dependencies to 2.13.1 by @lanking520 in #1044
- add data collection and some inf2 bug fixes by @lanking520 in #1047
- [RollingBatch] create request simulator to batch by @lanking520 in #1050
- [DeepSpeed] upgrade dependencies by @lanking520 in #1049
- [docker] Upgrades to inf2 2.13.2 version by @frankfliu in #1052
- add revision to handler by @lanking520 in #1056
- [docker] Change default OMP_NUM_THREADS back to 1 for GPU by @frankfliu in #1073
- Worker type by @zachgk in #1022
- [Handler] add dynamic batching to transformers neuronx by @lanking520 in #1076
- add Neuron RollingBatch implementation by @lanking520 in #1078
- [Neuron] upgrade to Neuron 2.14.0 SDK by @lanking520 in #1089
- [vLLM] add pyarrow dependency by @lanking520 in #1093
- [Handler] formalize all engines with same settings by @lanking520 in #1077
- Removes quick abort of python reader threads by @zachgk in #1095
- Adds adapter support by @zachgk in #1082
- Add unmerged lora support in HF handler by @rohithkrn in #1088
- Cleans some unused pieces of PyProcess by @zachgk in #1100
- Creates adapters by directory by @zachgk in #1094
- Use custom peft wheel by @rohithkrn in #1103
- [feature] Enable model sharding on seq_scheduler tested on gpt_neox_20B by @KexinFeng in #1086
- [vLLM] capture max_rolling_batch settting issues by @lanking520 in #1112
- [RollingBatch] add active requests and pending requests for skip tokens by @lanking520 in #1113
- Upgrade lmi_dist by @xyang16 in #1108
- [INF2][Handler] added optimization level per Neuron instruction by @lanking520 in #1107
- [Handler] add neuron int8 quantization by @lanking520 in #1115
- [Docker] upgrade dependencies version by @lanking520 in #1119
- Upgrade flash attention v2 version to 2.3.0 by @xyang16 in #1123
- [Handler] bump up vllm version and fix some bugs by @lanking520 in #1124
- Integrate with seq_scheduler wheel by @KexinFeng in #1122
- [INF2] remove neuron settings on cache hit for the folder by @lanking520 in #1126
- [python] Make rolling batch output not escape unicode characters by @xyang16 in #1135
- [vLLM][Handler] add quantization option for vLLM by @lanking520 in #1136
- [INF2][Handler] remove type conversion in Neuron by @lanking520 in #1134
- Update vllm_rolling_batch.py by @lanking520 in #1140
- Add support for stopwords in huggingface handler by @ydm-amazon in #1118
- Give a version of seq scheduler by @KexinFeng in #1146
- Support adapters by properties by @zachgk in #1148
- [serving] Allow model_id point to djl model zoo by @frankfliu in #1150
- Assert local lora models in the handler by @rohithkrn in #1153
- Block remote adapter url and handler override by @zachgk in #1147
- Add feature flag for adapters by @zachgk in #1152
- [feat] Modify deepspeed handler to support smoothQuant. by @chen3933 in https://github.com/deepjavalibrary/djl-servi...
DJLServing v0.23.0 release
Key Features
- Introduces Rolling Batch (see the example configuration after this list)
- SeqBatchScheduler with rolling batch #803
- Sampling SeqBatcher design #842
- Max Seqbatcher number threshold api #843
- Adds rolling batch support #828
- Max new length #845
- Rolling batch for huggingface handler #857
- Compute kv cache utility function #863
- Sampling decoding implementation #878
- Uses multinomial to choose from topK samples and improve topP sampling #891
- Falcon support #890
- Unit test with random seed failure #909
- KV cache support in default handler #929
- Introduces LMI Dist library for rolling batch
- Introduces vLLM library for rolling batch
- [VLLM] add vllm rolling batch and add hazard handling #877
- Introduces PEFT and LoRA support in handlers
- Introduces streaming support to FasterTransformer
- Add Streaming support #820
- Introduces S3 Cache Engine
- S3 Cache Engine #719
- Upgrades component versions:
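The rolling batch support introduced here is enabled through serving.properties. A minimal sketch for the vLLM backend, where the model id is a placeholder and the option names should be checked against the configuration docs of this release:

```
engine=Python
# placeholder model id
option.model_id=TheBloke/Llama-2-7B-fp16
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.max_rolling_batch_size=32
```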
Enhancement
Serving and python engine enhancements
- Adds workflow model loading for SageMaker #661
- Allows model being shared between workflows #665
- Prints out error message if pip install failed #666
- Install fixed version for transformers and accelerate #672
- Add numpy fix #674
- SM Training job changes for AOT #667
- Creates model dir to prevent issues with no code experience in SageMaker #675
- Don't mount model dir for no code tests #676
- AOT upload checkpoints tests #678
- Add stable diffusion support on INF2 #683
- Unset omp thread to prevent CLIP model delay #688
- Update ChunkedBytesSupplier API #692
- Fixes log file charset issue in management console #693
- Adds neuronx new feature for generation #694
- [INF2] adding clip model support #696
- [plugin] Include djl s3 extension in djl-serving distribution #699
- [INF2] add bf16 support to SD #700
- Adds support for streaming Seq2Seq models #698
- Add SageMaker MCE support #706
- [INF2] give better room for more tokens #710
- [INF2] Bump up n positions #713
- Refactor logic for supporting HF_MODEL_ID to support MME use case #712
- Support load model from workflow directory #714
- Add support for se2seq model loading in HF handler #715
- Load function from workflow directory #718
- Add vision components for DeepSpeed and inf2 #725
- Support pip install in offline mode #729
- Add --no-index to pip install in offline mode #731
- Adding llama model support #727
- Change the dependencies so for FasterTransformer #734
- Adds text/plain content-type support #741
- Skeleton structure for sequence batch scheduler #745
- Handles torch.cuda.OutOfMemoryError #749
- Improves model loading logging #750
- Asynchronous with PublisherBytesSupplier #730
- Renames env var DDB_TABLE_NAME to SERVING_DDB_TABLE_NAME #753
- Sets default minWorkers to 1 for GPU python model #755
- Fixes log message #765
- Adds more logs to LMI engine detection #766
- Uses predictable model name for HF model #771
- Adds parallel loading support for Python engine #770
- Updates management console UI: file input are not required in form data #773
- Sets default maxWorkers based on OMP_NUM_THREADS #776
- Support non-gpu models for huggingface #772
- Use huggingface standard generation for tnx streaming #778
- Add trust remote code option #781
- Handles invalid return type case #790
- Add application/jsonlines as content-type for streaming #791
- Fixes trust_remote_code issue #793
- Add einops for supporting falcon models #792
- Adds content-type response for DeepSpeed and FasterTransformer handler #797
- Sets default maxWorkers the same as earlier version #799
- Add stream generation for huggingface streamer #801
- Add server side batching #795
- Add safetensors #808
- Improvements in AOT UX #787
- Add pytorch kernel cache default directory #810
- Improves partition script error message #826
- Add -XX:-UseContainerSupport flag only for SageMaker #868
- Move TP detection logic to PyModel from LmiUtils #840
- Set tensor_parallel_degree property when not specified #847
- Add workflow dispatch #870
- Create model level virtualenv #811
- Refactor createVirtualEnv() #875
- Add MPI Engine as generic name for dist...
0.23.0-Alpha Release
This release solves several issues in the DJLServing library and also brings some new features.
- Supporting load from workflow directory #714
- Fixed MME support with HF_MODEL_ID #712
- Added parallel loading for python models #770
- Fixed device mismatch issue #805
And more
What's Changed
- [serving] Adds workflow model loading for SageMaker by @frankfliu in #661
- [workflow] Allows model being shared between workflows by @frankfliu in #665
- [python] prints out error message if pip install failed by @frankfliu in #666
- update to djl 0.23.0 by @siddvenk in #668
- [docker] Fixes fastertransformer docker file by @frankfliu in #671
- [kserve] Fixes unit test for extra data type by @frankfliu in #673
- install fixed version for transformers and accelerate by @lanking520 in #672
- [ci] add performance testing by @tosterberg in #558
- add numpy fix by @lanking520 in #674
- SM Training job changes for AOT by @sindhuvahinis in #667
- Create model dir to prevent issues with no code experience in SageMaker by @siddvenk in #675
- Don't mount model dir for no code tests by @siddvenk in #676
- AOT upload checkpoints tests by @sindhuvahinis in #678
- [INF2][DLC] Update Neuron to 2.10 by @lanking520 in #681
- add stable diffusion support on INF2 by @lanking520 in #683
- [CI] add small fixes by @lanking520 in #684
- Add HuggingFace TGI publish and test pipeline by @xyang16 in #650
- Add shared memory arg to docker launch command in README by @rohithkrn in #685
- Update github-slug-action to v4.4.1 by @xyang16 in #686
- unset omp thread to prevent CLIP model delay by @lanking520 in #688
- Change the bucket for different object by @sindhuvahinis in #691
- [ci] make performance tests run in parallel by @tosterberg in #690
- [api] Update ChunkedBytesSupplier API by @frankfliu in #692
- [console] Fixes log file charset issue by @frankfliu in #693
- add neuronx new feature for generation by @lanking520 in #694
- [tgi] Add more models to TGI test pipeline by @xyang16 in #695
- [INF2] adding clip model support by @lanking520 in #696
- [plugin] Include djl s3 extension in djl-serving distribution by @frankfliu in #699
- [INF2] add bf16 support to SD by @lanking520 in #700
- [ci] Upgrade spotbugs to 5.0.14 by @frankfliu in #704
- Add support for streaming Seq2Seq models by @rohithkrn in #698
- add SageMaker MCE support by @lanking520 in #706
- fix the device mapping issue if visible devices is set by @lanking520 in #707
- fix the start gpu bug by @lanking520 in #709
- [INF2] give better room for more tokens by @lanking520 in #710
- bump up n positions by @lanking520 in #713
- Refactor logic for supporting HF_MODEL_ID to support MME use case by @siddvenk in #712
- [ci] reconfigure performance test time and machines by @tosterberg in #711
- [workflow] Support load model from workflow directory by @frankfliu in #714
- Add support for se2seq model loading in HF handler by @rohithkrn in #715
- Add unit test for empty model store initialization by @siddvenk in #716
- Fix no code tests in lmi test suite by @siddvenk in #717
- [serving] Load function from workflow directory by @frankfliu in #718
- [test] Reformat python code by @frankfliu in #720
- Creates S3 Cache Engine by @zachgk in #719
- [test] Refactor client.py by @frankfliu in #721
- update fastertransformers build instruction by @lanking520 in #722
- Add seq2seq streaming integ test by @rohithkrn in #724
- [test] Update tranformser-neuxornx gpt-j-b mode options by @frankfliu in #723
- [DeepSpeed][INF2] add vision components by @lanking520 in #725
- [python] Support pip install in offline mode by @frankfliu in #729
- [python] Add --no-index to pip install in offline mode by @frankfliu in #731
- adding llama model support by @lanking520 in #727
- tokenizer bug fixes by @lanking520 in #732
- [FT] change the dependencies so by @lanking520 in #734
- Remove TGI build and test pipeline by @xyang16 in #735
- ft_handler fix by @rohithkrn in #736
- [docker] Uses the same convention as tritonserver by @frankfliu in #738
- [ci] Upgrade jacoco to 0.8.8 to support JDK17+ by @frankfliu in #739
- [python] Fixes typo in fastertransformer handler by @frankfliu in #740
- [python] Adds text/plain content-type support by @frankfliu in #741
- [serving] Avoid unit-test hang by @frankfliu in #744
- Skeleton structure for sequence batch scheduler by @sindhuvahinis in #745
- update the wheel to have path fixed by @lanking520 in #747
- Adding project diagrams link to architecture.md by @alexkarezin in #742
- Add SageMaker integration test by @siddvenk in #705
- [python] Handle torch.cuda.OutOfMemoryError by @frankfliu in #749
- fix permissions for sm pysdk install script by @siddvenk in #751
- [serving] Improves model loading logging by @frankfliu in #750
- Asynchronous with PublisherBytesSupplier by @zachgk in #730
- [cache] Rename evn var DDB_TABLE_NAME to SERVING_DDB_TABLE_NAME by @frankfliu in #753
- [serving] Sets default minWorkers to 1 for GPU python model by @frankfliu in #755
- SM AOT Tests by @sindhuvahinis in #756
- [docker] Pin bitsandbytes version to 0.38.1 by @xyang16 in #754
- [fix] bump versions for new deepspeed wheel by @tosterberg in #733
- [fix] Fix bitsandbytes pip install by @xyang16 in #758
- [serving] Fixes log message by @frankfliu in #765
- add triton components in the nightly by @lanking520 in #767
- Add mme tests to sagemaker tests by @siddvenk in #763
- [wlm] Adds more logs to LMI engine detection by @frankfliu in #766
- fix typos with get default bucket prefix for sm session by @siddvenk in #768
- [serving] Uses predictable model name for HF model by @frankfliu in #771
- [serving] Adds parallel loading support for Python engine by @frankfliu in https://github.com/deepjavalibrary/d...
DJLServing v0.22.1 release
Key Features
- Add pytorch inf2 by @lanking520 in #535
- Adds chunked encoding support by @frankfliu in #551
- Ahead of Time Partitioning Support in FT default handler and test cases by @sindhuvahinis in #539
- Python engine streaming initial support by @rohithkrn in #573
- Adds async inference API by @frankfliu in #570
- Optimize batch inference for text generation by @siddvenk in #586
- Add default handler for AOT by @sindhuvahinis in #588
- Support text2text-generation task in deepspeed by @siddvenk in #606
- Throttles request if all workers are busy by @frankfliu in #656
- Infer recommended LMI engine by @siddvenk in #623
Bug Fixes
- [fix] requirements.txt install check testcase by @sindhuvahinis in #537
- [python] Fixes typo in unit test by @frankfliu in #554
- [serving] Fixes GPU auto scaling bug by @frankfliu in #561
- Fix typo in streaming utils by @rohithkrn in #581
- KServe data to bytes fix by @sindhuvahinis in #577
- [serving] Fixes NeuronUtils for SageMaker by @frankfliu in #583
- [python] Fixes python startup race condition by @frankfliu in #589
- [serving] Avoid download from s3 multiple time by @frankfliu in #596
- make output consistent by @lanking520 in #616
- [workflow] Fixes workflow loading issue by @frankfliu in #662
Enhancement
- [ci] Upgrades gradle to 8.0.2 by @frankfliu in #540
- [ci] Uses recommended way to create task in build.gradle by @frankfliu in #541
- update deepspeed container python version to 3.9 by @rohithkrn in #546
- [inf2] Adding gptj to transformers handler by @maaquib in #542
- install git by default for all python releases by @lanking520 in #555
- Load external dependencies for workflows by @xyang16 in #556
- [python] Infer default entryPoint if not provided by @frankfliu in https://github.com/deepjavalibrary/djl-serving/pull/5631
- [python] flush logging output before process end by @frankfliu in #567
- [serving] support load entryPoint with url by @frankfliu in #566
- [serving] deprecate s3Url and replace it with model_id by @frankfliu in #568
- Sets huggingface cache directory to /tmp in container by @lanking520 in #571
- add finalize callback function by @lanking520 in #572
- add pad token if not set by @lanking520 in #550
- Include Kserve plugins to distribution by @sindhuvahinis in #552
- [python] Passing arguments to model.py by @frankfliu in #560
- update pytorch docker to py3.9 by @rohithkrn in #547
- [serving] Detect triton engine by @frankfliu in #574
- [python] Refactor PyEngine with PassthroughNDManager by @frankfliu in #578
- Minimal followup for BytesSupplier changes by @zachgk in #580
- [serving] Sets djl cache directory to /tmp by @frankfliu in #585
- [python] Makes download entryPoint atomic by @frankfliu in #587
- [python] Use NeuronUtils to detect neuron cores by @frankfliu in #593
- [python] Fixes visible neuron cores environment variable by @frankfliu in #595
- [serving] Refactor per model configuration initialization by @frankfliu in #594
- Refactor CacheManager, Working Async by @zachgk in #591
- [ci] bump up deepspeed version by @tosterberg in #597
- [serving] Avoid compile time dependency on log4j by @frankfliu in #603
- [serving] add default dtype when running in deepspeed by @tosterberg in #617
- [serving] Adds deps folder to classpath in MutableClassLoader constructor by @frankfliu in #611
- Add support for streaming batch size > 1 by @rohithkrn in #605
- add ddb paginator for DJLServing by @lanking520 in #609
- update fastertransformer to follow huggingface parameters by @lanking520 in #610
- Change billing model to pay per request by @frankfliu in #612
- Upgrade dependencies version by @frankfliu in #613
- clean up docker build script and remove transformers docker image build by @lanking520 in #61
- [AOT] Upload sharded checkpoints to S3 by @sindhuvahinis in #604
- [serving] Upgrade to DJL 0.22.0 by @frankfliu in #622
- Unify tnx experience by @lanking520 in #619
- [serving] Update DJL version to 0.22.1 by @frankfliu in #627
- [Docker] update a few versions by @lanking520 in #620
- [serving] Make chunked read timeout configurable by @frankfliu in #652
- [python][streaming]Do best effort model type validation to fix configs without arch list by @rohithkrn in #649
- [AOT] Entrypoint download from url by @sindhuvahinis in #628
- [wlm] Moves LmiUtils.inferLmiEngine() into separate class by @frankfliu in #630
- [python][streaming]Batching fix and validate model architecture by @rohithkrn in #626
- skip special tokens by default by @lanking520 in #635
- [serving] Read x-synchronus and x-starting-token from input payload by @frankfliu in #637
- add torchvision by @lanking520 in #638
- [serving] Keep original content-type header by @frankfliu in #642
- [serving] Override inferred options in criteria by @frankfliu in #644
- Pinning aws-neuronx-* packages for Inf2 containers by @maaquib in #621
- [serving] Stop model server if plugin init failed by @frankfliu in #655
Documentation
- [docs] Fix serving doc by @xyang16 in #548
- [docs] Adds streaming configuration document by @rohithkrn in #659
- update docs to djl 0.22.1 by @siddvenk in #664
Full Changelog: v0.21.0...v0.22.1
DJLServing v0.21.0 release
Key Features
- Adds faster transformer support (#424)
- Adds Deepspeed ahead of time partition script in DLC (#466)
- Adds SageMaker MME support (#479)
- Adds support for stable-diffusion-2-1-base model (#484)
- Adds support for stable diffusion depth model (#488)
- Adds out of memory protection for model loading (#496)
- Makes load_on_devices per model setting (#493)
- Improves several per model settings
- Improves management console model loading and inference UI (#431, #432)
- Updates deepspeed to 0.8.0 (#465)
- Upgrades PyTorch to 1.13.1 (#414)
Enhancement
- Adds model_id support for huggingface models (#406)
- Adds AI template package (#485)
- Improves snakeyaml error message (#400)
- Improves s5cmd error handling (#442)
- Emits model inference metrics to log file (#452)
- Supports model.pt and model.onnx file name (#459)
- Makes batch a per-model setting (#456)
- Keeps failed worker status for 1 minute (#463)
- Detects engine to avoid unnecessarily downloading the MXNet engine (#481)
- Uses temp directory instead of /tmp (#404)
- Adds better logging and error handling for s5cmd process execution (#409)
- Uses jacoco aggregation report plugin (#421)
- Rolls back model if worker fails to start in synchronous mode (#427)
- Adds fastertransformer t5 integration test (#469)
- Print better stacktrace if channel is closed (#473)
- Supports running FasterTransformer in MPI mode (#474)
Bug fixes
- Adds fix to workaround SageMaker changes (#401)
- Treats empty HTTP parameter as absent (#429)
- Fixes inference console UI bug (#439)
- Fixed gpt-neox model name typo (#441)
- Fixes wrong onnx configuration (#449)
- Fixes issue with passing dtype in huggingface handler. Refactor dtype_f…
- Fixes issues with model_dir and model_id usage that occur when s3url is…
- Fixes broken vue tags (#453)
Breaking change
- Removes unnecessary Java engine adapter (#448)
- Removes djl-central module in favor of management console (#447)
- Sets model status to failure after exceed retry threshold (#455)
- Removes DLR support (#468)