Authors:
(1) Siqi Kou, Shanghai Jiao Tong University (equal contribution);
(2) Lanxiang Hu, University of California, San Diego (equal contribution);
(3) Zhezhi He, Shanghai Jiao Tong University;
(4) Zhijie Deng, Shanghai Jiao Tong University;
(5) Hao Zhang, University of California, San Diego.
Table of Links
3. Methodology and 3.1. Preliminary: Jacobi Decoding
3.2. Consistency Large Language Models (CLLMs)
3.3. Acceleration Mechanisms in CLLMs
4. Experiments
4.2. Acceleration Mechanisms in CLLMs
4.4. Limitations and Discussion
5. Conclusion, Impact Statement, and References
A. Illustration of Consistency Loss Learning Objectives
B. Comparison with Baseline Algorithms
C. Pseudo Code for Jacobi Decoding with KV Cache
2. Related Work
Efficient LLM Inference. The high AR inference cost of LLMs has sparked a surge in research on efficient LLM inference, primarily focused on accelerating the AR decoding process. This body of work can be broadly categorized into two streams: methods that necessitate additional training and those that do not.
The methods that do not require additional training include speculative decoding, as introduced in studies by Leviathan et al. (2023) and Chen et al. (2023). These techniques enhance LLM decoding speed by leveraging a smaller draft model to predict the outputs of a larger target model, which subsequently verifies these predictions. Another category of training-free approaches involves system- or hardware-oriented optimizations. Notable examples include PagedAttention (Kwon et al., 2023), which optimizes KV cache management for throughput using memory paging, and FlashAttention (Dao et al., 2022; Dao, 2023), which accelerates attention module computations by reducing HBM access via softmax tiling. Other strategies enhance LLM inference speed by optimizing model designs, reducing weight/activation precision, and utilizing sparsity, including multi-query and grouped-query attention mechanisms with fused heads (Shazeer, 2019; Ainslie et al., 2023), post-training quantization (Dettmers et al., 2022; Xiao et al., 2023; Frantar et al., 2022; Lin et al., 2023), and various pruning techniques (Sun et al., 2023; Frantar & Alistarh, 2023; Ashkboos et al., 2024).
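To make the draft-and-verify scheme above concrete, the following is a minimal greedy-decoding sketch of one speculative step, not the implementation from Leviathan et al. (2023) or Chen et al. (2023). It assumes Hugging Face-style models whose outputs expose `.logits`; the function name and the draft length `k` are illustrative.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix, k=4):
    """One draft-and-verify step of speculative decoding (greedy, simplified).

    `prefix` has shape [1, seq_len]; both models return logits of shape
    [1, seq_len, vocab_size].
    """
    # 1) Draft: the small model proposes k tokens autoregressively.
    draft = prefix.clone()
    for _ in range(k):
        logits = draft_model(draft).logits[:, -1, :]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2) Verify: the large model scores all proposed positions in one forward pass.
    target_logits = target_model(draft).logits
    # Predictions for the k proposed tokens plus one bonus token after them.
    target_preds = target_logits[:, prefix.shape[1] - 1:, :].argmax(-1)
    proposed = draft[:, prefix.shape[1]:]

    # 3) Accept the longest prefix of proposals agreeing with the target model,
    #    then append the target model's own next token.
    matches = (target_preds[:, :k] == proposed)[0]
    n_accept = int(matches.long().cumprod(dim=0).sum())
    return torch.cat(
        [prefix, proposed[:, :n_accept], target_preds[:, n_accept:n_accept + 1]],
        dim=-1,
    )
```

Because the target model verifies all k draft positions in a single forward pass, each accepted token costs far less than a full AR step on the large model.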
Methods that necessitate training often require integrating auxiliary components, such as additional LM or AR heads, to facilitate faster AR generation (Cai et al., 2024; Li et al., 2024). They may also involve significant modifications to the model weights or architecture, as seen in various pruning approaches (Ma et al., 2023; Xia et al., 2022; 2023). Moreover, training can enhance certain training-free techniques, like speculative decoding, by distilling the behavior of the original, larger model into a smaller student model, thereby retaining performance at a reduced size (Zhou et al., 2023b; Liu et al., 2023). A detailed analysis comparing CLLMs with different SOTA baseline methods is provided in Section B and Table 7. It is worth noting that CLLMs require neither modifications to the pre-trained model nor any auxiliary components, which brings higher memory efficiency and adaptability for users at inference time.
LLM Distillation. Knowledge distillation (KD) serves as a technique for creating smaller models that replicate the functionality of larger ones. While traditional KD approaches often fall short for LLMs, Gu et al. (2023) adapt KD to autoregressive LLMs, focusing on minimizing the reverse KL divergence between student and teacher models through student-driven decoding. In another advancement, Agarwal et al. (2023) introduce generalized knowledge distillation (GKD), which balances forward and reverse KL divergences by employing a mix of data sampled from both teacher and student models.
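For intuition on the two divergences mentioned above, here is a minimal PyTorch sketch of per-token forward and reverse KL losses between student and teacher logits. The function names are illustrative and this is not the training objective of Gu et al. (2023) or Agarwal et al. (2023).

```python
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits):
    """KL(teacher || student): mode-covering, the classic distillation loss."""
    teacher_prob = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean")

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): mode-seeking, favored by Gu et al. (2023)."""
    student_prob = F.softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    return F.kl_div(teacher_logp, student_prob, reduction="batchmean")
```

GKD can be viewed as interpolating between these two behaviors while also mixing teacher- and student-sampled training data.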
CLLMs are distinct from these works: our proposed method can be regarded as a self-distillation approach trained on a Jacobi trajectory dataset that matches the target LLM's output distribution.
Consistency Models. Diffusion models (Ho et al., 2020; Song et al., 2021b) suffer from a slow iterative sampling process. Consistency models overcome this limitation by mapping any point along the probability flow ODE of the diffusion process back to the original point, corresponding to the initial image, in a single step (Song et al., 2023). In this work, we highlight the parallel between the few-step generation capability of CLLMs and that of consistency models.
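For reference, the defining self-consistency property from Song et al. (2023) can be written as below; the notation follows that paper and is included only as a sketch of the analogy, not as the objective used for CLLMs.

```latex
% A consistency model f_\theta maps every point on the same
% probability-flow ODE trajectory to the same origin:
\[
  f_\theta(\mathbf{x}_t, t) = f_\theta(\mathbf{x}_{t'}, t')
  \quad \forall\, t, t' \in [\epsilon, T],
\]
% subject to the boundary condition
\[
  f_\theta(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon .
\]
```

In CLLMs, the analogous property is that any intermediate point on a Jacobi trajectory should map to the same converged (fixed-point) sequence in a small number of steps.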
This paper is available on arxiv under CC0 1.0 Universal license.