🎉 Enhancements
- Prompt prefix caching for multi-LoRA by @tgaddair in #655
- Convert to Triton Punica kernels by @tgaddair in #658
- Support FP8 KV Cache by @ajtejankar in #652
- Added Mllama by @tgaddair in #619
- Flash mllama by @tgaddair in #622
- support MRL embeddings for qwen2 by @magdyksaleh in #621
- Support for Embeddings with XLM-RoBERTa and Adapters by @jfhetzer in #656
- Merge weights by @magdyksaleh in #600
- feat: Function calling with output schema enforcement by @jeffreyftang in #536
- Chunked prefill by @tgaddair in #653
- add num inputs to metrics by @magdyksaleh in #615
- Add --predibase-api-token CLI arg by @joseph-predibase in #617
- Add --disable-sgmv flag by @joseph-predibase in #639
- Enhance Structured Output Interface by @GirinMan in #644
🐛 Bugfixes
- Add done message to openai endpoints by @magdyksaleh in #618
- Fix CUDA graph compilation by @tgaddair in #627
- Fix CUDA graphs for Medusa by @tgaddair in #628
- Fix retrace message by @tgaddair in #629
- Fix prefix plumbing and BGMV compiler dimensions by @tgaddair in #631
- Fix punica kernel compilation by @tgaddair in #632
- Fix FlashInfer when not using prefix caching by @tgaddair in #633
- Fix cuda graph tracing without lora ranks by @tgaddair in #634
- Added ranks 96 and 128 to BGMV kernel by @tgaddair in #630
- Look for language model lm head by @Infernaught in #640
- Return n choices for chat completions API by @tgaddair in #638
- Fix llava_next for llama 3.2 vision cross attention states by @tgaddair in #641
- Fix compile for qwen-2.5-32b by @tgaddair in #645
- Added backwards compatible field to OpenAI json_object API by @tgaddair in #648
- Fix PREDIBASE_API_TOKEN env var being thrown away by @joseph-predibase in #654
- Fix absent fp8_kv property on llama and qwen models by @ajtejankar in #662
- Fix seqlen bug for sliding window models like Mistral v0.1 by @ajtejankar in #660
- Fix sliding window + compile bug by @ajtejankar in #666
📝 Docs
🔧 Maintenance
- upgrade poetry by @magdyksaleh in #613
- Fix deps4 by @magdyksaleh in #614
- Remove LD_PRELOAD from Docker and improve error message by @tgaddair in #623
- add label to id this as a lorax image by @noyoshi in #626
- pass correct stuff to predibase-reporter by @magdyksaleh in #635
- try using arc runner for build by @noyoshi in #646
- change runner 2 by @magdyksaleh in #650
New Contributors
- @joseph-predibase made their first contribution in #617
- @jfhetzer made their first contribution in #656
Full Changelog: v0.11.0...v0.12.0