LionW Outperforms AdamW in LoRA and Full Fine-Tuning Tasks

18 Jun 2025

Abstract and 1 Introduction

2 Background

3 Experimental Setup and 3.1 Datasets for Continued Pretraining (CPT) and Instruction Finetuning (IFT)

3.2 Measuring Learning with Coding and Math Benchmarks (target domain evaluation)

3.3 Forgetting Metrics (source domain evaluation)

4 Results

4.1 LoRA underperforms full finetuning in programming and math tasks

4.2 LoRA forgets less than full finetuning

4.3 The Learning-Forgetting Tradeoff

4.4 LoRA’s regularization properties

4.5 Full finetuning on code and math does not learn low-rank perturbations

4.6 Practical takeaways for optimally configuring LoRA

5 Related Work

6 Discussion

7 Conclusion and References

Appendix

A. Experimental Setup

B. Learning rate searches

C. Training Datasets

D. Theoretical Memory Efficiency Gains with LoRA for Single and Multi-GPU Settings

A Experimental Setup

Code CPT.

Math CPT.

Code IFT.

Math IFT. Same as Code IFT, except that:

• maximum sequence length = 1024

We compared the two optimizers by training for two epochs on the Magicoder-Evol-Instruct-110K dataset with different learning rates. We found that Decoupled LionW outperformed Decoupled AdamW on HumanEval for both LoRA and full finetuning, and across learning rates, as seen in Fig. S1.
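To make the shape of this comparison concrete, below is a minimal sketch of such an optimizer/learning-rate sweep. It is not the training code used in the paper: the helper functions (`build_model`, `train_fn`, `eval_humaneval_fn`), the weight-decay value, and the exact learning-rate grids are illustrative assumptions, and the Lion import assumes the open-source `lion-pytorch` package.

```python
# Hedged sketch of an optimizer / learning-rate sweep; not the paper's code.
import torch
from torch.optim import AdamW
# Assumption: the Lion optimizer is taken from the `lion-pytorch` package.
from lion_pytorch import Lion

def run_sweep(build_model, train_fn, eval_humaneval_fn):
    """Train with each optimizer at several peak LRs and record HumanEval pass@1."""
    # Lion-style optimizers typically prefer smaller LRs than AdamW,
    # so the two grids are shifted accordingly (illustrative values).
    sweep = {
        "adamw": [1e-5, 5e-5, 1e-4, 5e-4],
        "lionw": [1e-6, 5e-6, 1e-5, 5e-5],
    }
    results = {}
    for name, lrs in sweep.items():
        for lr in lrs:
            model = build_model()  # fresh Llama-2-7B, full FT or with LoRA adapters
            if name == "adamw":
                opt = AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
            else:
                opt = Lion(model.parameters(), lr=lr, weight_decay=1e-4)
            train_fn(model, opt, epochs=2)                  # two epochs of the IFT dataset
            results[(name, lr)] = eval_humaneval_fn(model)  # HumanEval pass@1
    return results
```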

B Learning rate searches

For IFT, we find that the best LoRA learning rates are roughly an order of magnitude higher than the best full finetuning learning rates. For the longer CPT runs, these effects are more subtle.
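As a rough illustration only, the per-method grids and the best-LR selection could look like the sketch below. The specific learning-rate values are assumptions chosen to reflect the roughly 10x offset between LoRA and full finetuning; they are not the grids searched in the paper.

```python
# Hedged sketch: per-method LR grids and best-LR selection (values are illustrative).
LR_GRIDS = {
    "full_ft": [1e-6, 5e-6, 1e-5, 5e-5],
    "lora":    [1e-5, 5e-5, 1e-4, 5e-4],  # roughly an order of magnitude higher
}

def best_lr(scores: dict[tuple[str, float], float], method: str) -> float:
    """Return the LR with the highest target-domain score for a given method.

    `scores` maps (method, lr) -> benchmark accuracy (e.g. HumanEval or GSM8K).
    """
    candidates = {lr: s for (m, lr), s in scores.items() if m == method}
    return max(candidates, key=candidates.get)
```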

B.1 Learning rate sensitivity analysis across optimizers

Figure S1: Comparing LionW to AdamW across learning rates for two epochs of the Magicoder-Evol-Instruct-110K dataset. Left: HumanEval; right: average of "Language Understanding" benchmarks. Both methods peak at the learning rate used in the original paper (Wei et al., 2023).

Figure S2: Sample-efficiency curves matching Fig. 2, with all individual LoRA configurations.

Figure S3: Pareto curves for continued pretraining of Llama-2-13B on up to 20B tokens of the Starcoder-Python dataset (Code CPT).

Figure S4: Same data as Fig. 3, shown as individual forgetting plots.

(a) Code CPT: Individual forgetting plots for Llama-2-7B on Starcoder-Python.

(b) Code IFT: Individual forgetting plots for Llama-2-7B on Magicoder-Evol-Instruct-110K.

(c) Math CPT: Individual forgetting plots for Llama-2-7B on OpenWebMath.

(d) Math IFT: Individual forgetting plots for Llama-2-7B on MetaMathQA.

Figure S5: Same data as in Fig. 4, plotted for the individual tasks HellaSwag, ARC-Challenge, and WinoGrande.

(a) Code CPT: Individual Pareto curves for Llama-2-7B on Starcoder-Python.

(b) Code IFT: Individual Pareto curves for Llama-2-7B on Magicoder-Evol-Instruct-110K.

(c) Math CPT: Individual Pareto curves for Llama-2-7B on OpenWebMath.

(d) Math IFT: Individual Pareto curves for Llama-2-7B on MetaMathQA.

Figure S6: SVD analysis for 4096 × 4096 matrix Wq at layer 26. Left: singular values for base weights, finetuned weights, and their difference. Right: cumulative explained variance. Notice that for all three matrices, a rank > 1500 is needed to explain 90% of the variance.
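For readers who want to run this kind of analysis on their own checkpoints, here is a minimal sketch, assuming the base and finetuned weight matrices have already been extracted from the two state dicts; the function names and the 90% threshold argument are placeholders, not the paper's exact script.

```python
# Hedged sketch of the SVD / explained-variance analysis for a single weight
# matrix (e.g. the 4096 x 4096 query projection Wq at layer 26).
import torch

def rank_to_explain(W: torch.Tensor, threshold: float = 0.90) -> int:
    """Smallest rank whose singular values explain `threshold` of the variance."""
    s = torch.linalg.svdvals(W.float())
    var = s ** 2
    cum = torch.cumsum(var, dim=0) / var.sum()
    return int((cum < threshold).sum().item()) + 1

def analyze(W_base: torch.Tensor, W_finetuned: torch.Tensor) -> dict[str, int]:
    """Report the 90%-variance rank for the base, finetuned, and difference matrices."""
    W_delta = W_finetuned - W_base
    return {name: rank_to_explain(W)
            for name, W in [("base", W_base),
                            ("finetuned", W_finetuned),
                            ("difference", W_delta)]}
```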

Figure S7: Analyzing the spectra of the sum of two 1000 × 1000 i.i.d. Gaussian matrices. A and B are 1000 × 1000 random matrices with i.i.d. standard normal entries.
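This random-matrix baseline is easy to recreate. The sketch below generates the two matrices as described in the caption and prints the top singular values; the seed is arbitrary and plotting details are omitted.

```python
# Hedged sketch: singular-value spectra of two i.i.d. Gaussian matrices and their sum.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

spectra = {name: np.linalg.svd(M, compute_uv=False)
           for name, M in [("A", A), ("B", B), ("A + B", A + B)]}

for name, s in spectra.items():
    print(f"{name}: largest singular value ~ {s[0]:.1f}")

# The top singular value of an n x n matrix with i.i.d. N(0, sigma^2) entries
# concentrates around 2 * sigma * sqrt(n), so A + B (entries N(0, 2)) has a
# larger but still fully spread-out spectrum: the sum is not low-rank.
```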

Authors:

(1) Dan Biderman, Columbia University and Databricks Mosaic AI (db3236@columbia.edu);

(2) Jose Gonzalez Ortiz, Databricks Mosaic AI (j.gonzalez@databricks.com);

(3) Jacob Portes, Databricks Mosaic AI (jportes@databricks.com);

(4) Mansheej Paul, Databricks Mosaic AI (mansheej.paul@databricks.com);

(5) Philip Greengard, Columbia University (pg2118@columbia.edu);

(6) Connor Jennings, Databricks Mosaic AI (connor.jennings@databricks.com);

(7) Daniel King, Databricks Mosaic AI (daniel.king@databricks.com);

(8) Sam Havens, Databricks Mosaic AI (sam.havens@databricks.com);

(9) Vitaliy Chiley, Databricks Mosaic AI (vitaliy.chiley@databricks.com);

(10) Jonathan Frankle, Databricks Mosaic AI (jfrankle@databricks.com);

(11) Cody Blakeney, Databricks Mosaic AI (cody.blakeney);

(12) John P. Cunningham, Columbia University (jpc2181@columbia.edu).


This paper is available on arxiv under CC BY 4.0 DEED license.