Information-Theoretic Argument for Multi-Token Prediction Benefits

Language models are typically trained with teacher forcing, where the model receives the ground truth for each future token during training. At test time, however, generation is unguided and autoregressive, so errors accumulate. We argue that teacher forcing encourages models to focus on predicting well in the very short term, potentially at the expense of longer-term dependencies in the overall structure of the generated sequence.

To illustrate the impact of multi-token prediction, consider the following information-theoretic argument. Here, X denotes the next future token and Y the second-next future token. The production of both tokens is conditioned on some observed input context C, which we omit from our equations for simplicity. For a model positioned just before token X, vanilla next-token prediction concerns the quantity H(X), while multi-token prediction with n = 2 aims at H(X) + H(Y). We decompose these two quantities as:

H(X) = H(X | Y) + I(X; Y),

H(X) + H(Y) = H(X | Y) + 2 I(X; Y) + H(Y | X).

The second identity follows by applying the first decomposition to both H(X) and H(Y). By discarding the term H(Y | X), which appears again when predicting at the following position, we observe that 2-token prediction doubles the weight of I(X; Y). Multi-token predictors are therefore more accurate at predicting tokens X that are relevant for the remainder of the text to come. In Appendix L.2, we give a relative version of the above equations that shows the increased weight of relative mutual information in a loss decomposition of the 2-token prediction loss.
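As an illustration only, the two identities can be checked numerically on a small joint distribution. The sketch below is not from the paper: the joint table `joint` and the helper names `entropy`, `conditional_entropy`, and `mutual_information` are hypothetical, and the context C is omitted as in the text. Each quantity is computed from its own definition, so the agreement of both sides is a genuine check rather than a restatement.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def conditional_entropy(joint_rows):
    """H(A | B), where joint_rows[b][a] = P(B = b, A = a)."""
    total = 0.0
    for row in joint_rows:
        p_b = sum(row)
        if p_b > 0:
            # Weight the entropy of P(A | B = b) by P(B = b).
            total += p_b * entropy([p / p_b for p in row])
    return total


def mutual_information(joint):
    """I(X; Y) from its KL-divergence definition; joint[x][y] = P(X = x, Y = y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


# Hypothetical joint distribution over two binary tokens (assumption for illustration).
joint = [[0.4, 0.1],
         [0.1, 0.4]]
joint_t = [list(col) for col in zip(*joint)]  # indexed as [y][x]

h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(row) for row in joint_t])
h_y_given_x = conditional_entropy(joint)    # rows indexed by X
h_x_given_y = conditional_entropy(joint_t)  # rows indexed by Y
mi = mutual_information(joint)

print(f"H(X)        = {h_x:.4f}  vs  H(X|Y) + I(X;Y)           = {h_x_given_y + mi:.4f}")
print(f"H(X) + H(Y) = {h_x + h_y:.4f}  vs  H(X|Y) + 2I(X;Y) + H(Y|X) = "
      f"{h_x_given_y + 2 * mi + h_y_given_x:.4f}")
```

In this toy setting the mutual information term is counted once in the next-token quantity and twice in the 2-token quantity, which is the doubling the argument relies on.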

This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech, and contributed equally;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay, and contributed equally;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and the last author;

(5) Gabriel Synnaeve, FAIR at Meta and the last author.
