LLM Training Hyperparameters: Detailed Overview

Table S13: Overview of all training hyperparameters used. We schedule all learning rates with a linear warmup followed by cosine decay (Loshchilov and Hutter, 2017) to a fraction of the peak learning rate; this fraction is given in the last column (“decay ratio”). All experiments use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.95, and decoupled L2 weight decay (Loshchilov and Hutter, 2019) with coefficient 0.1. We clip gradients to a maximum Euclidean norm of 1.0 in all experiments except the CodeContests finetunings, where we use 0.1 instead. Summarization finetunings run for three epochs on all datasets except BigPatent (one epoch). Byte-level models use the architecture with replicated unembeddings from Appendix B.
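The schedule and clipping rule described in the caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training code: the function names `lr_at_step` and `clip_by_global_norm`, the exact warmup formula, and all numeric values in the usage lines (peak learning rate, step counts, decay ratio) are assumptions chosen for the example and are not taken from Table S13.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, decay_ratio):
    """Linear warmup to peak_lr, then cosine decay to decay_ratio * peak_lr.

    Hypothetical helper: the warmup convention (reaching the peak on the
    last warmup step) is an assumption, not something the caption specifies.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    min_lr = decay_ratio * peak_lr
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradient values so its Euclidean norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]


# Illustrative values only (assumed, not from the paper's table).
PEAK_LR, WARMUP, TOTAL, DECAY_RATIO = 3e-4, 100, 1000, 0.1

lr_start = lr_at_step(0, PEAK_LR, WARMUP, TOTAL, DECAY_RATIO)
lr_peak = lr_at_step(WARMUP, PEAK_LR, WARMUP, TOTAL, DECAY_RATIO)
lr_end = lr_at_step(TOTAL, PEAK_LR, WARMUP, TOTAL, DECAY_RATIO)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

In a PyTorch training loop, the same pieces would typically map to `torch.optim.AdamW` with `betas=(0.9, 0.95)` and `weight_decay=0.1`, and to `torch.nn.utils.clip_grad_norm_` with `max_norm=1.0` (or 0.1 for the CodeContests finetunings); whether the authors used these exact utilities is not stated in the source.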

Table S14: Overview of model architectures used for scaling analyses.

This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech, and contributed equally;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay, and contributed equally;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and the last author;

(5) Gabriel Synnaeve, FAIR at Meta and the last author.
