- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. [Paper][Github]
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao. [Paper][Github]
- Efficiently Scaling Transformer Inference. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean. [Paper]
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang. [Paper][Github]
- Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica. [Paper][Github]
- Efficient LLM Inference on CPUs. Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng. [Paper][Github]
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models. Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu. [Paper]
- GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models. Yonggan Fu, Yongan Zhang, Zhongzhi Yu, Sixu Li, Zhifan Ye, Chaojian Li, Cheng Wan, Yingyan Lin. [Paper]
- Rethinking Memory and Communication Cost for Efficient Large Language Model Training. Chan Wu, Hanxiao Zhang, Lin Ju, Jinjing Huang, Youshao Xiao, Zhaoxin Huan, Siyuan Li, Fanzhuang Meng, Lei Liang, Xiaolu Zhang, Jun Zhou. [Paper]
- Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models. Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, Gustavo Alonso. [Paper]
- FlashDecoding++: Faster Large Language Model Inference on GPUs. Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Hanyu Dong, Yu Wang. [Paper]
- Striped Attention: Faster Ring Attention for Causal Transformers. William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley. [Paper][Github]
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen. [Paper][Github]
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory. Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar. [Paper]
- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA. Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang. [Paper]
- Efficient LLM inference solution on Intel GPU. Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu. [Paper][Github]
- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models. Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li. [Paper][Github]
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He. [Paper][Github]
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference. Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim. [Paper][Github]
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning. Xupeng Miao, Gabriele Oliaro, Xinhao Cheng, Mengdi Wu, Colin Unger, Zhihao Jia. [Paper][Github]
- BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences. Sun Ao, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun, Shengnan Wang, Teng Su. [Paper]
- Efficiently Programming Large Language Models using SGLang. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng. [Paper][Github]
- MELTing point: Mobile Evaluation of Language Transformers. Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi. [Paper]
- DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference. Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin. [Paper]
- Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs. Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie. [Paper]
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism. Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin. [Paper]
- Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica. [Paper][Github]
- Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification. Josef Pichlmeier, Philipp Ross, Andre Luckow. [Paper]
- Efficient and Economic Large Language Model Inference with Attention Offloading. Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu. [Paper]
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu. [Paper]
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone. Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen. [Paper]
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization. Jungi Lee, Wonbeom Lee, Jaewoong Sim. [Paper]
- EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting. Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang Katie Zhao, Yingyan Celine Lin. [Paper][Github]
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving. Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang. [Paper]
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. [Paper][Github][Blog]
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation. Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari. [Paper]
- Designing Efficient LLM Accelerators for Edge Devices. Jude Haris, Rappy Saha, Wenhao Hu, José Cano. [Paper]
- SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving. Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris. [Paper]
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference. Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff. [Paper]
- LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration. Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang. [Paper]
- Accelerating Large Language Model Training with Hybrid GPU-based Compression. Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha R. Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda. [Paper]
- OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models. Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung. [Paper]
- Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores. Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang. [Paper]
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices. Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu. [Paper][Github]
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar. [Paper]
- FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs. Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu. [Paper]
- SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training. Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao. [Paper]
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models. Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie. [Paper]