
Releases: deepjavalibrary/djl-serving

DJLServing v0.29.0 Release

16 Aug 15:57
c343d60

Key Features

Details regarding the latest LMI container image_uris can be found here

DJL Serving Changes (applicable to all containers)

  • Allows configuring health checks to fail based on various types of error rates
  • When not streaming responses, all invocation errors will respond with the appropriate 4xx or 5xx HTTP response code
    • Previously, for some inference backends (vllm, lmi-dist, tensorrt-llm) the behavior was to return 2xx HTTP responses when errors occurred during inference
  • HTTP Response Codes are now configurable if you require a specific 4xx or 5xx status to be returned in certain situations
  • Introduced the @input_formatter and @output_formatter annotations to bring your own script for pre- and post-processing (a hedged sketch follows this list).
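These annotations are intended for a custom model.py shipped alongside the model. The snippet below is a minimal, hedged sketch: the import paths, the shape returned by the input formatter, and the attributes read from the request output object are assumptions drawn from these release notes, not the verified 0.29.0 API, so check the LMI "bring your own script" docs before using it.

```python
# model.py -- hedged sketch of the 0.29.0 formatter annotations.
# Import paths and attribute names below are assumptions; verify against
# the LMI documentation for the exact API.
import json

from djl_python.input_parser import input_formatter        # assumed module path
from djl_python.output_formatter import output_formatter   # assumed module path


@input_formatter
def my_input_formatter(input_item, **kwargs):
    """Pre-processing: map a custom payload onto the prompt/parameters the engine expects."""
    payload = json.loads(input_item)              # assumes the raw body is JSON text
    return {
        "inputs": payload["question"],            # hypothetical custom field
        "parameters": payload.get("parameters", {}),
    }


@output_formatter
def my_output_formatter(request_output):
    """Post-processing: shape each generation step into the response the client expects."""
    # Per the 0.28.0 notes, RequestOutput carries generated tokens, log
    # probabilities, and the finish reason; the attribute names here are assumed.
    text = "".join(token.text for token in request_output.output_tokens)
    return json.dumps({"generated_text": text}) + "\n"
```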

LMI Container (vllm, lmi-dist)

  • vLLM updated to version 0.5.3.post1
  • Added multimodal support for Vision Language Models using the OpenAI Chat Completions schema (see the sketch after this list).
    • More details available here
  • Supports Llama 3.1 models
  • Supports beam search, best_of, and n with non-streaming output.
  • Supports chunked prefill in both vLLM and lmi-dist.
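Multimodal requests for vision language models follow the OpenAI Chat Completions message schema, with images supplied as image_url content parts. A minimal client sketch follows; the host and endpoint path are placeholders, and only the message schema is taken from the release notes.

```python
# Hedged client sketch: send an OpenAI Chat Completions style multimodal
# request to an LMI endpoint. The URL is a placeholder.
import json
import urllib.request

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8080/invocations",          # placeholder endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```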

TensorRT-LLM Container

  • TensorRT-LLM updated to version 0.11.0
  • [Breaking change] Flan-T5 is now supported with the C++ Triton backend. Flan-T5 support for the TRT-LLM Python backend has been removed.

Transformers NeuronX Container

  • Upgraded to Transformers NeuronX 2.19.1

Text Embedding (using the LMI container)

  • Various performance improvements

Enhancements

Read more

v0.28.0

19 Jun 21:16
73a782d

Key Features

Check out our latest Large Model Inference Containers.

LMI container

  • Provided general performance optimization.
  • Added text embedding support
    • Our text embedding solution is 5% faster than the HuggingFace TEI solution.
  • The multi-LoRA feature now supports Llama 3 and AWS models

TensorRT-LLM container

  • Upgraded to TensorRT-LLM 0.9.0
  • AWQ, FP8 support for Llama3 models on G6/P5 machines
  • The default max_new_tokens is now 16384
  • Fixed critical memory leaks during long runs.
  • Fixed model hanging issues.

Transformers NeuronX container

  • Upgraded to Transformers NeuronX 2.18.2

DeepSpeed container (deprecated)

The DeepSpeed container is now deprecated. If you are not using the DeepSpeed engine, the 0.28.0-lmi container is all you need; you can continue using it as before.

New Model Support

  • LMI container
    • Arctic, DBRX, Falcon 2, Command-R, InternLM2, Phi-3, Qwen2MoE, StableLM, StarCoder2, Xverse, and Jais
  • TensorRT-LLM container
    • Gemma

CX Usability Enhancements/Changes

  • Model loading CX:
    • The SERVING_LOAD_MODELS env var is deprecated; use HF_MODEL_ID instead.
  • Inference CX:
    • Input/Output schema changes:
      • Speculative decoding now streams, returning multiple JSON Lines tokens at each generation step
      • Standardized the output formatter signature:
        • We reduced the parameters of output_formatter by introducing RequestOutput class.
        • RequestOutput contains all input information, such as text, token_ids, and parameters, as well as output information, such as output tokens, log probabilities, and other details like the finish reason. Check this doc to learn more.
        • Introduced prompt details in the details field of the response for the vLLM and lmi-dist rolling batch options. These prompt details contain the prompt token_ids and their corresponding text and log probabilities. Check this doc to learn more.
    • New error handling mechanism:
      • Improved error handling of container responses for rolling batch. Check this doc to learn more
    • New CX capability:
      • Introduced the OPTION_TGI_COMPAT env var, which enables you to get the same response format as TGI.
      • We also now support the SSE text/event-stream data format (a hedged client sketch follows this list).
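For the SSE format, a client reads the stream line by line and parses each data: event. The sketch below is illustrative only: the endpoint path, the stream flag, and the event payload fields (including the TGI-style token object expected when OPTION_TGI_COMPAT is enabled) are assumptions.

```python
# Hedged sketch: consume an SSE (text/event-stream) response line by line.
# Endpoint path, the "stream" flag, and the event payload shape are assumptions.
import json
import urllib.request

payload = {
    "inputs": "Write a haiku about rivers",
    "parameters": {"max_new_tokens": 64},
    "stream": True,                                          # assumed streaming flag
}
req = urllib.request.Request(
    "http://localhost:8080/invocations",                     # placeholder endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
)
with urllib.request.urlopen(req) as resp:
    for raw_line in resp:
        line = raw_line.decode("utf-8").strip()
        if not line or not line.startswith("data:"):
            continue                                          # skip keep-alives / comments
        event = json.loads(line[len("data:"):].strip())
        # With OPTION_TGI_COMPAT enabled, each event is expected to carry a
        # TGI-style "token" object; otherwise fall back to a generic field.
        print(event.get("token", {}).get("text", event.get("generated_text", "")))
```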

Breaking Changes

  • Inference CX for rolling batch:
    • The token id changed from a list to an integer in the rolling batch response.
    • Error handling: errors during rolling batch inference are now reported with "finish_reason: error"
  • The DeepSpeed container has been deprecated; its functionality is now generally available in the LMI container

Known Issues

  • TRTLLM periodically crashes during model compilation when using A100 GPU
  • TRTLLM AWQ quantization currently crashes due to an internal error

Enhancements

Read more

DJLServing v0.27.0 Release

15 Apr 21:03

Key Changes

  • Large Model Inference Containers 0.27.0 release
    • DeepSpeed container
      • Added DBRX and Gemma model support.
      • Provided general performance optimization.
      • Added support for new performance-enhancing features such as speculative decoding.
    • TensorRT-LLM container
      • Upgraded to TensorRT-LLM 0.8.0
    • Transformers NeuronX container
      • Upgraded to Transformers NeuronX 2.18.0
  • Multi-Adapter LoRA Support
    • Provided multi-adapter inference functionality in the LMI DLC (a hedged sketch follows this list).
  • CX Usability Enhancements
    • Provided a seamless migration experience across different LMI DLCs.
    • Implemented the low-code/no-code experience.
    • Supported OpenAI compatible chat completions API.
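Multi-adapter LoRA inference selects an adapter on a per-request basis. The sketch below is a hedged illustration: the adapters request field, the adapter name, and the endpoint path are assumptions and should be checked against the LMI adapter documentation.

```python
# Hedged sketch: per-request LoRA adapter selection with the LMI container.
# The "adapters" field name and the endpoint are assumptions; verify against
# the adapter docs for this release.
import json
import urllib.request

payload = {
    "inputs": "Translate to French: Good morning",
    "parameters": {"max_new_tokens": 64},
    "adapters": "french-lora",                 # assumed field naming a registered adapter
}
req = urllib.request.Request(
    "http://localhost:8080/invocations",       # placeholder endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```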

Enhancement

Read more

DJLServing v0.26.0 Release

27 Feb 21:02

Key Changes

  • TensorRT-LLM 0.7.1 Upgrade, including support for Mixtral 8x7B MOE model
  • Optimum Neuron Support
  • Transformers-NeuronX 2.16 Upgrade, including support for continuous batching
  • LlamaCPP support
  • Many documentation updates, including updated model deployment configurations
  • Refactor of configuration management across different backends
  • CUDA 12.1 support for DeepSpeed and TensorRT-LLM containers

Enhancements

Bug Fixes

Documentation Updates

CI/CD Updates

Read more

DJLServing v0.25.0 Release

20 Dec 19:41
043e7af

Key Changes

  • TensorRT-LLM integration: DJLServing now supports using the TensorRT-LLM backend to deploy large language models.
  • SmoothQuant support in DeepSpeed
  • Rolling batch support in DeepSpeed to boost throughput
  • Updated Documentation on using DJLServing to deploy LLMs
    • We have added documentation for supported configurations per container, as well as many new examples

Enhancements

Bug Fixes

Read more

DJLServing v0.24.0 release

17 Oct 22:10

Key Features

  • Updates Components
    • Updates Neuron to 2.14.1
    • Updates DeepSpeed to 0.10.0
  • Improved Python logging
  • Improved SeqScheduler
  • Adds DeepSpeed dynamic int8 quantization with SmoothQuant
  • Supports Llama 2
  • Supports Safetensors
  • Adds Neuron dynamic batching and rolling batch
  • Adds Adapter API Preview
  • Supports HuggingFace Stopwords

Enhancement

Read more

DJLServing v0.23.0 release

18 Jul 15:22
e69b646

Key Features

  • Introduces Rolling Batch
    • SeqBatchScheduler with rolling batch #803
    • Sampling SeqBatcher design #842
    • Max Seqbatcher number threshold api #843
    • Adds rolling batch support #828
    • Max new length #845
    • Rolling batch for huggingface handler #857
    • Compute kv cache utility function #863
    • Sampling decoding implementation #878
    • Uses multinomial to choose from topK samples and improve topP sampling #891
    • Falcon support #890
    • Unit test with random seed failure #909
    • KV cache support in default handler #929
  • Introduces LMI Dist library for rolling batch
    • Rolling batch support for flash models #865
    • Assign random seed for lmi dist #912
    • JSON format for rolling batch #899
    • Add quantization parameter for lmi_dist rolling batch backend for HF #888
  • Introduces vLLM library for rolling batch
    • [VLLM] add vllm rolling batch and add hazard handling #877
  • Introduces PEFT and LoRA support in handlers
    • Add peft to fastertransformer container #889
    • Add peft support to default deepspeed and huggingface handlers #884
    • Add lora support to ft default handler #932
  • Introduces streaming support to FasterTransformer
    • Add Streaming support #820
  • Introduces S3 Cache Engine
    • S3 Cache Engine #719
  • Upgrades component versions:
    • Upgrade PyTorch to 2.0.1 #804
    • Update Neuron to 2.10 #681
    • Upgrade deepspeed to 0.9.5 #804

Enhancement

Serving and python engine enhancements

  • Adds workflow model loading for SageMaker #661
  • Allows model being shared between workflows #665
  • Prints out error message if pip install failed #666
  • Install fixed version for transformers and accelerate #672
  • Add numpy fix #674
  • SM Training job changes for AOT #667
  • Creates model dir to prevent issues with no code experience in SageMaker #675
  • Don't mount model dir for no code tests #676
  • AOT upload checkpoints tests #678
  • Add stable diffusion support on INF2 #683
  • Unset omp thread to prevent CLIP model delay #688
  • Update ChunkedBytesSupplier API #692
  • Fixes log file charset issue in management console #693
  • Adds neuronx new feature for generation #694
  • [INF2] adding clip model support #696
  • [plugin] Include djl s3 extension in djl-serving distribution #699
  • [INF2] add bf16 support to SD #700
  • Adds support for streaming Seq2Seq models #698
  • Add SageMaker MCE support #706
  • [INF2] give better room for more tokens #710
  • [INF2] Bump up n positions #713
  • Refactor logic for supporting HF_MODEL_ID to support MME use case #712
  • Support load model from workflow directory #714
  • Add support for se2seq model loading in HF handler #715
  • Load function from workflow directory #718
  • Add vision components for DeepSpeed and inf2 #725
  • Support pip install in offline mode #729
  • Add --no-index to pip install in offline mode #731
  • Adding llama model support #727
  • Change the dependencies for FasterTransformer #734
  • Adds text/plain content-type support #741
  • Skeleton structure for sequence batch scheduler #745
  • Handles torch.cuda.OutOfMemoryError #749
  • Improves model loading logging #750
  • Asynchronous with PublisherBytesSupplier #730
  • Renames env var DDB_TABLE_NAME to SERVING_DDB_TABLE_NAME #753
  • Sets default minWorkers to 1 for GPU python model #755
  • Fixes log message #765
  • Adds more logs to LMI engine detection #766
  • Uses predictable model name for HF model #771
  • Adds parallel loading support for Python engine #770
  • Updates management console UI: file inputs are not required in form data #773
  • Sets default maxWorkers based on OMP_NUM_THREADS #776
  • Support non-gpu models for huggingface #772
  • Use huggingface standard generation for tnx streaming #778
  • Add trust remote code option #781
  • Handles invalid return type case #790
  • Add application/jsonlines as content-type for streaming #791
  • Fixes trust_remote_code issue #793
  • Add einops for supporting falcon models #792
  • Adds content-type response for DeepSpeed and FasterTransformer handler #797
  • Sets default maxWorkers the same as earlier version #799
  • Add stream generation for huggingface streamer #801
  • Add server side batching #795
  • Add safetensors #808
  • Improvements in AOT UX #787
  • Add pytorch kernel cache default directory #810
  • Improves partition script error message #826
  • Add -XX:-UseContainerSupport flag only for SageMaker #868
  • Move TP detection logic to PyModel from LmiUtils #840
  • Set tensor_parallel_degree property when not specified #847
  • Add workflow dispatch #870
  • Create model level virtualenv #811
  • Refactor createVirtualEnv() #875
  • Add MPI Engine as generic name for dist...
Read more

0.23.0-Alpha Release

14 Jun 00:54
80a7154
Pre-release

This release resolves several issues in the DJLServing library and also brings some new features.

  • Supported loading from the workflow directory #714
  • Fixed MME support with HF_MODEL_ID #712
  • Added parallel loading for python models #770
  • Fixed device mismatch issue #805
  • And more

What's Changed

Read more

DJLServing v0.22.1 release

25 Apr 22:55
e5c2113

Key Features

Bug Fixes

Enhancement

Documentation

Full Changelog: v0.21.0...v0.22.1

DJLServing v0.21.0 release

25 Feb 18:14
583dbc0

Key Features

  • Adds faster transformer support (#424)
  • Adds Deepspeed ahead of time partition script in DLC (#466)
  • Adds SageMaker MME support (#479)
  • Adds support for stable-diffusion-2-1-base model (#484)
  • Adds support for stable diffusion depth model (#488)
  • Adds out of memory protection for model loading (#496)
  • Makes load_on_devices per model setting (#493)
  • Improves several per model settings
  • Improves management console model loading and inference UI (#431, #432)
  • Updates deepspeed to 0.8.0 (#465)
  • Upgrades PyTorch to 1.13.1 (#414)

Enhancement

  • Adds model_id support for huggingface models (#406)
  • Adds AI template package (#485)
  • Improves snakeyaml error message (#400)
  • Improves s5cmd error handling (#442)
  • Emits model inference metrics to log file (#452)
  • Supports model.pt and model.onnx file name (#459)
  • Makes batch per model setting (#456)
  • Keeps failure worker status for 1 minute (#463)
  • Detects engine to avoid unnecessarily downloading the MXNet engine (#481)
  • Uses temp directory instead of /tmp (#404)
  • Adds better logging and error handling for s5cmd process execution (#409)
  • Uses jacoco aggregation report plugin (#421)
  • Rolls back model if it fails to start a worker in synchronous mode (#427)
  • Adds fastertransformer t5 integration test (#469)
  • Print better stacktrace if channel is closed (#473)
  • Supports running FasterTransformer in MPI mode (#474)

Bug fixes

  • Adds fix to workaround SageMaker changes (#401)
  • Treats empty HTTP parameter as absent (#429)
  • Fixes inference console UI bug (#439)
  • Fixed gpt-neox model name typo (#441)
  • Fixes wrong onnx configuration (#449)
  • Fixes issue with passing dtype in huggingface handler. Refactor dtype_f…
  • Fixes issues with model_dir and model_id usage that occur when s3url is…
  • Fixes broken vue tags (#453)

Breaking change

  • Removes unnecessary Java engine adapter (#448)
  • Removes djl-central module in favor of management console (#447)
  • Sets model status to failure after exceeding retry threshold (#455)
  • Removes DLR support (#468)

Documentation

  • Updates management api document (#436)
  • Adds dynamic batching settings to document (#462)
  • Improves plugin README (#477)
  • Fixes management_api.md broken list (#478)
  • Updates serving configuration document (#437)