v0.12.0: Multi-LoRA prefix caching, fp8 kv cache, Mllama, function calling

Latest

tgaddair released this 06 Nov 21:21

🎉 Enhancements

Prompt prefix caching for multi-LoRA by @tgaddair in #655
Convert to Triton Punica kernels by @tgaddair in #658
Support FP8 KV Cache by @ajtejankar in #652
Added Mllama by @tgaddair in #619
Flash mllama by @tgaddair in #622
support MRL embeddings for qwen2 by @magdyksaleh in #621
Support for Embeddings with XLM-RoBERTa and Adapters by @jfhetzer in #656
Merge weights by @magdyksaleh in #600
feat: Function calling with output schema enforcement by @jeffreyftang in #536
Chunked prefill by @tgaddair in #653
add num inputs to metrics by @magdyksaleh in #615
Add --predibase-api-token CLI arg by @joseph-predibase in #617
Add --disable-sgmv flag by @joseph-predibase in #639
Enhance Structured Output Interface by @GirinMan in #644

🐛 Bugfixes

Add done message to openai endpoints by @magdyksaleh in #618
Fix CUDA graph compilation by @tgaddair in #627
Fix CUDA graphs for Medusa by @tgaddair in #628
Fix retrace message by @tgaddair in #629
Fix prefix plumbing and BGMV compiler dimensions by @tgaddair in #631
Fix punica kernel compilation by @tgaddair in #632
Fix FlashInfer when not using prefix caching by @tgaddair in #633
Fix cuda graph tracing without lora ranks by @tgaddair in #634
Added ranks 96 and 128 to BGMV kernel by @tgaddair in #630
Look for language model lm head by @Infernaught in #640
Return n choices for chat completions API by @tgaddair in #638
Fix llava_next for llama 3.2 vision cross attention states by @tgaddair in #641
Fix compile for qwen-2.5-32b by @tgaddair in #645
Added backwards compatible field to OpenAI json_object API by @tgaddair in #648
Fix PREDIBASE_API_TOKEN env var being thrown away by @joseph-predibase in #654
Fix absent fp8_kv property on llama and qwen models by @ajtejankar in #662
Fix seqlen bug for sliding window models like Mistral v0.1 by @ajtejankar in #660
Fix sliding window + compile bug by @ajtejankar in #666

📝 Docs

added metrics docs, updated links in main docs by @noyoshi in #663

🔧 Maintenance

upgrade poetry by @magdyksaleh in #613
Fix deps4 by @magdyksaleh in #614
Remove LD_PRELOAD from Docker and improve error message by @tgaddair in #623
add label to id this as a lorax image by @noyoshi in #626
pass correct stuff to predibase-reporter by @magdyksaleh in #635
try using arc runner for build by @noyoshi in #646
change runner 2 by @magdyksaleh in #650

New Contributors

@joseph-predibase made their first contribution in #617
@jfhetzer made their first contribution in #656

Full Changelog: v0.11.0...v0.12.0

Contributors

jeffreyftang, tgaddair, and 7 other contributors