Abstract

Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn, or to relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only the information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open-weights models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs, the optimality of a model’s compression, and the information present in it, can predict downstream performance across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case, the work presented here offers a unified information-theoretic framing for how these models learn that is deployable at scale.

Note

This is a web version of our ICLR paper Learning is Forgetting: LLM Training As Lossy Compression. A less technical companion piece is available here

1 Introduction

We still have a limited understanding of how Large Language Models (LLMs) achieve impressive results across a wide array of tasks (Devlin et al. 2019; Grattafiori et al. 2024). While a growing body of work interprets LLMs using behavioural experiments, probing, or causal interventions, the scale of these models makes understanding how their representation spaces are structured a continued challenge. Here we look at an LLM as an instance of lossy compression, offering an account of how models represent information during training and what information matters for performance.

Lossy compression represents data efficiently by preserving only the information from a source relevant to a goal. While uncompressed audio recordings intended for human listeners can be gigabytes in size, MP3 files save space by discarding frequencies typically outside the range of human hearing (Jayant, Johnston, and Safranek 1993); similarly, a JPEG file omits subtle colour variations that are difficult for the human eye to perceive. We draw a parallel with LLMs, which are expected to generate responses humans prefer, after being trained on trillions of tokens — more language data than a human hears in 200 lifetimes. More generally, compression is thought to underpin learning in both humans and models (Feldman 2016), and giving a formal account of LLM pre-training in terms of compression allows us to work towards a unified theory of representation learning. We present results showing that over the course of pre-training LLMs optimally compress the information present in their training data for next sequence prediction.

Compression is inherently opinionated — some information from the source is preserved, some is forgotten to save space. Information Theory (Shannon 1948) provides a formal framework to describe this process, letting us both quantify the information present in a representation and compute a bound where it is optimally compressed with respect to the data it represents. Our results build on the Information Bottleneck (IB) theory of deep learning (Tishby and Zaslavsky 2015), showing pre-training follows a two phase trajectory: first increasing mutual information with the training objective, before compressing input information. Across a wide array of LLMs we find each model compresses differently, with the optimality of a model’s compression and the information it preserves predicting performance on downstream benchmarks.

**LLMs Learn an Optimal Compression of the Internet.** **(Left)** The information plane for pre-training of the OLMo2 7B model. The horizontal axis shows mutual information between representations and the input (complexity), the vertical axis shows mutual information with the predicted output (expressivity). The dotted line indicates the bound where models are optimally compressed; hue indicates timepoint in training in terms of tokens in billions. Estimates are based on 10,000 samples from the C4 dataset. **(Right)** The vertical axis shows the OLMo2 7B model’s loss on next-token-prediction of C4. The horizontal axis shows the model’s proximity to the bound. Representations begin to approach the bound as the loss saturates.

A hallmark of large-scale distributed systems, like neural networks, is that they are difficult to understand as a function of their parts alone (Anderson 1972; Mitchell 2009). Our approach to interpretability allows us to consider learning and generalisation at the scale of an entire model, rather than studying individual circuits, heads, or neurons within it. Additionally, it allows us to frame how models do so well at so much in terms of existing theories of learning and compression, while providing actionable insights at LLM scale.

In what follows we focus on offering concrete answers to three questions: Do LLMs optimally compress their representations? What information survives that compression? What representational structures drive performance? In summary, our core findings are:

Pre-training dynamics for LLMs closely follow theoretical predictions from the Information Bottleneck, with models first expanding representations before slowly approaching optimal compression.
Scale conditions these dynamics, with smaller models (below 7 billion parameters) struggling to achieve meaningful compression later in training.
How optimally compressed a model is correlates significantly with performance across six benchmarks for six families of open-weights large language models, letting us directly relate representation structure to behaviour.
Quantifying the amount of preference information in a model measures how aligned its representations are with preference distinctions; this measure significantly predicts downstream performance across 47 LLMs (\(r=0.76, p<0.001\)).
Finally, we compare a wide array of open-weight models across 5 model families, showing they all converge near optimal compression.

2 Background & Related Work

2.1 Learning, Inference, and Compression

Compression has been argued to underpin learning and inference in humans (Chater 1997; Chater and Vitányi 2003; Feldman 2000; Pothos and Chater 2001) and models (Poggio et al. 2004; MacKay 2003). Increasingly, probabilistic inference and complexity minimisation are seen as deeply intertwined (Feldman 2016) — a point perhaps made clearest by Bayesian inference, which implicitly prefers the simplest hypotheses consistent with observed data (Jeffreys 1939; Edwards 1972; Vitányi and Li 2000). Bayesian approaches to human cognition offer accounts of how a broad array of human behaviour can be productively thought of as this kind of inference (Griffiths, Chater, and Tenenbaum 2024). In machine learning Occam’s Razor has long been used as a model selection criterion, where the best model is the simplest one consistent with the data (Wallace and Boulton 1968; Rissanen 1978; Burnham and Anderson 2002). The bias variance trade-off (Geman, Bienenstock, and Doursat 1992) makes this explicit in the context of neural networks, showing more complex models may achieve better fit to the training data, but they also generalise worse than their simpler counterparts.

While some work has studied whether or not LLMs can match lossless compression algorithms in-context (Delétang et al. 2023), this is distinct from giving an account of LLM training itself as a process of lossy compression — the object of study here. It is worth noting that there is not universal agreement about how to assess compression (MacKay 2003), but here we follow in the information-theoretic tradition (Shannon 1948).

2.2 Rate Distortion Theory

Consider a function \(\theta\) that encodes an input \(X\) in a representation \(Z\), \(Z = \theta(X)\). This representation is then decoded by a function \(\phi\) to produce predictions \(\hat{Y}\) for an output with true label \(Y\), \(\hat{Y} = \phi(Z)\). Assuming that \(X\) and \(Y\) are not independent, if \(\theta\) were to losslessly preserve all the information from the input, we would expect \(\phi\) to be able to precisely recover the corresponding output, with \(\hat{Y} = Y\). Rate Distortion Theory (RDT) (Shannon 1948) instead considers the lossy case \(\hat{Y} \neq Y\), where some amount of error in the prediction — distortion — is acceptable. It then becomes a question of how much information about the input — termed the rate — the encoder needs to preserve to achieve a given level of distortion.

2.2.1 The Information Bottleneck (IB)

Tishby, Pereira, and Bialek (2000) look at a particular case, where the rate is given as the mutual information between inputs and their representation \(I(X; Z)\), and distortion as the mutual information between a representation and the corresponding target prediction \(I(Y; Z)\); the 2D space this creates is called the information plane. Since \(I(X; Z)\) reflects how much information about the input space is preserved, it can be referred to as complexity (Zaslavsky et al. 2018). Likewise, \(I(Y; Z)\) is referred to as expressivity: how uniquely a representation can refer to its target (Kirby et al. 2015). Optimal compression within the IB occurs when an encoding \(Z|X\) preserves only the information about \(X\) relevant to predicting \(Y\), or when \(Z|X\) minimises

\[ \mathcal{F}_\beta[p(Z|X)] = I(X;Z) - \beta I(Y; Z) \]

where \(\beta\) is a trade-off parameter controlling the allowable level of distortion. When \(\beta\) approaches 0 all inputs are compressed to a single point; as \(\beta \rightarrow \infty\) we approach the lossless case, where \(I(Y;Z) = I(Y;X)\). The curve traced by varying \(\beta\) draws a bound, where the encoding \(p(Z|X)\) is optimally compressed: everything above the curve is unachievable and everything below it is suboptimal. This bound starts off with a linear relationship where \(I(X; Z) = I(Z; Y)\), until \(Z\) captures all information shared between \(X\) and \(Y\). Intuitively, in an optimal encoding each additional bit of complexity gets you an additional bit of expressivity, until all information shared by \(X\) and \(Y\) is represented, such that \(I(Y;Z) = I(Y;X)\).¹ In the cases studied here all models stay well below this saturation point, so for clarity we refer to the bound as the line \(I(X; Z) = I(Z; Y)\).

¹ The bound \(I(X;Z) = I(Z;Y)\) follows from the data processing inequality and is generally looser than the true information bottleneck frontier, which for a given joint distribution \(p(X,Y)\) may require \(I(X;Z) > I(Z;Y)\). We use this simpler bound here as it suffices to illustrate the key intuition.
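As a minimal illustration of the objective above, the following sketch evaluates \(I(X;Z)\), \(I(Y;Z)\) and \(\mathcal{F}_\beta\) for three hand-written encoders on a toy discrete problem. The toy distribution (four equiprobable inputs, with \(Y\) the parity of \(X\)), the function names `mutual_info` and `ib_objective`, and the choice \(\beta = 2\) are all assumptions made for illustration; this is not the estimator used in our experiments.

```python
import numpy as np

def mutual_info(joint):
    """Mutual information (in bits) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_row = joint.sum(axis=1, keepdims=True)
    p_col = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # skip zero cells, whose contribution is 0
    ratio = joint[mask] / (p_row @ p_col)[mask]
    return float((joint[mask] * np.log2(ratio)).sum())

def ib_objective(p_xy, p_z_given_x, beta):
    """Return (I(X;Z), I(Y;Z), F_beta) for an encoder table p(z|x)."""
    p_xy = np.asarray(p_xy, dtype=float)
    enc = np.asarray(p_z_given_x, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_xz = p_x[:, None] * enc   # p(x, z) = p(x) p(z|x)
    p_yz = p_xy.T @ enc         # p(y, z) = sum_x p(x, y) p(z|x)
    i_xz, i_yz = mutual_info(p_xz), mutual_info(p_yz)
    return i_xz, i_yz, i_xz - beta * i_yz

# Toy problem (assumed): X uniform on {0,1,2,3}, Y = X mod 2.
p_xy = np.zeros((4, 2))
for x in range(4):
    p_xy[x, x % 2] = 0.25

identity = np.eye(4)                          # Z = X, nothing forgotten
parity = np.zeros((4, 2))
parity[np.arange(4), np.arange(4) % 2] = 1.0  # Z keeps only the parity of X
constant = np.ones((4, 1))                    # Z forgets everything

for name, enc in [("identity", identity), ("parity", parity), ("constant", constant)]:
    print(name, ib_objective(p_xy, enc, beta=2.0))
```

In this toy case the parity encoder sits on the line \(I(X;Z) = I(Z;Y)\) and attains the lowest value of \(\mathcal{F}_\beta\) of the three, while the identity encoder spends an extra bit of complexity without gaining any expressivity.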

2.2.2 Applying the Information Bottleneck to Deep Learning

Tishby and Zaslavsky (2015) offer a theoretical characterisation of training a multi-layered neural network as optimising an Information Bottleneck. They theorise two phases of training: first, a fitting phase during which representations increase mutual information with the target labels \(I(Y; Z)\); and second, a compression phase, during which models compress irrelevant information about the input \(I(X; Z)\) and in so doing begin to approach the optimal bound. It is this latter phase that is hypothesised to result in representations that generalise robustly.

Shwartz-Ziv and Tishby (2017) confirm the two-phase prediction from the IB theory of deep-learning empirically in feed forward networks trained on MNIST. Subsequent work has questioned the generality of these findings, showing how — at least in linear networks — the compression phase can be driven by the type of non-linearity used (Saxe et al. 2019), or that compression is not necessarily required for generalisation (Goldfeld et al. 2019). Prior work has not investigated whether these dynamics extend beyond simple feed-forward networks to sequence models (e.g. Transformers) trained on complex tasks — the object of study here.

2.3 Interpreting Neural Networks

A broad literature on the theory of deep learning tries to give an accounting of learning dynamics in small multi-layer networks (Frankle and Carbin 2018; Saxe, McClelland, and Ganguli 2019). While there has been some extension of these kinds of representational analyses to larger models — like applying information theoretic methods to transformers (Voita, Sennrich, and Titov 2019) — much of the work on interpretability in LLMs leverages behavioural or probing evidence. Behavioural approaches treat models as akin to psycholinguistic subjects (Futrell et al. 2019, 2018), taking model outputs as behaviours (Marvin and Linzen 2018; Warstadt et al. 2019; Hu et al. 2020). Probing (Veldhoen, Hupkes, and Zuidema 2016; Pimentel et al. 2020; Voita and Titov 2020) trains a smaller model — like a linear classifier — to predict labels from a model’s latent representations, as evidence that information relevant to those labels is present. While valuable, these approaches are removed from the models’ representations themselves, characterising downstream behaviours rather than the representational structures that drive them.

Mechanistic interpretability follows in a similar vein but aims to describe how circuits within a model implement the functions that solve a task. These analyses have given accounts of how two-layer linear and non-linear models represent features from synthetic data (Elhage et al. 2021) or how single-layer attention-only transformers solve modular addition (Nanda et al. 2023). When deployed at scale, to LLMs, this work often relies on training unsupervised probes termed sparse auto-encoders (Elhage et al. 2022) to identify correspondences between parameters and different words or concepts from the training data (Bricken et al. 2023). In the general case this work often looks for ‘mono-semanticity’, that is, lossless, one-to-one correspondences between input features and parts of a model. More recently, studies of when features emerge during pre-training have aligned with the expansion/compression pattern described by the IB theory (Ge et al. 2025).

To be sure, there is an abundance of methods for analysing deep-learning models. Here, we highlight a disconnect between work on the theory of learning in humans and neural networks, and work on interpretability. Interpretability methods can be deployed at scale on complex models and tasks, but lack a clear relationship to existing theoretical work. In the sections that follow we operationalise Rate Distortion Theory, and related work on learning as compression, at a scale commensurate with current LLMs. This allows us to analyse training dynamics while contextualising our conclusions in pre-existing and well-studied theoretical frameworks. Our approach is both theoretically motivated and applicable to any model at any scale.

3 Methods

3.1 Entropy Estimation

Let \(T \in \mathbb{Z}^{B\times S}\) be a batch of \(B\) tokenized samples with sequence length \(S\), drawn from a corpus of text data \(\mathcal{T}\), and let \(\theta\) be a model with \(L\) layers and representation dimension \(h\); the corresponding encoded representations are \(Z \in \mathbb{R}^{L \times B \times S \times h}\). Let \(X\in \mathbb{Z}^{B\times S}\) be feature labels for the text in \(T\). For example, when we look at optimal compression with respect to the IB bound, these labels \(X\) are the token ids for the model inputs; however, when analysing representation information more generally, these can be other input features, such as preference label or language ID. It is desirable to compute the mutual information \(I(X;Z)\) using Shannon entropy as opposed to differential entropy. Previous work quantises \(Z\) into \(n\) bins, to get a discrete encoding \(\hat{Z}\) (Voita, Sennrich, and Titov 2019; Shwartz-Ziv and Tishby 2017). Unfortunately these approaches have memory and resource requirements that make them difficult to apply at LLM scale.²

² For discussion of Shannon entropy and why previous approaches are not scalable see Appendix E.6 and E.7.

As a result we use the soft-entropy estimator from Conklin (2025) — an efficient differentiable relaxation of a binning-based estimate that has been shown to converge to the true entropy of a distribution. This estimator is not original to our work, but we are the first to apply it to analyse LLMs using rate distortion theory.

We first compute \(\bar{Z}\), the normalization of \(Z\) to lie on the surface of the unit sphere \(\mathbb{S}^h\) in \(\mathbb{R}^h\). Then we compute \(W\) by sampling \(n\) points \(\{w_i\}_{i=1}^n\) uniformly at random from \(\mathbb{S}^h\).³ For each normalized representation \(\bar{z} \in \mathbb{R}^h\), we compute a vector whose \(i^{th}\) entry is the cosine between \(\bar{z}\) and \(w_i\), then apply softmax to that vector, softly assigning each embedding \(\bar{z}\) to the points in \(W\). More formally, the tensor \(\bar{Z}\) (whose shape coincides with \(Z\)) is defined so that \(\bar{Z}_{l,b,s, :} = Z_{l,b,s, :} / \| Z_{l,b,s, :} \|\) for each \((l,b,s) \in [L] \times [B] \times [S]\), and we stack the uniform samples \(\{ w_i\}_{i=1}^n\) into a matrix \(W \in \mathbb{R}^{h \times n}\). The soft-quantisation of \(Z\) is then given by \(\check{Z} \in \mathbb{R}^{L \times B \times S \times n}\), where for each \((l,b,s) \in [L] \times [B] \times [S]\):

³ This is equivalent to sampling from an isotropic \(h\)-dimensional multivariate normal, \(\tilde{w}_i \sim \mathcal{N}(0, Id_h)\), and scaling to unit length, \(w_i = \frac{\tilde{w}_i}{||\tilde{w}_i||}\).

\[ \{w_i\}_{i=1}^n \sim \text{Unif}(\mathbb{S}^h), \qquad W_{:, i} = w_i, \qquad \check{Z}_{l, b, s, :} = \mathrm{softmax}\!\Big(\frac{\sum_{j=1}^h \bar{Z}_{l,b,s,j} W_{j,:}}{\epsilon}\Big) \]

where \(\epsilon\) is a temperature parameter, which we set to enable direct comparison of representations with different dimensionalities following the calibration procedure described in Appendix E.1. Each vector \(\check{Z}_{l, b, s,:}\) is a probability vector. Let \(\hat{Z} \in \mathbb{R}^{L \times n}\) be the matrix obtained from tensor \(\check{Z}\) by averaging over the batch and sequence dimensions, and let \(\hat{z}_l\) be the \(l\)-th row of this matrix:

\[ \hat{Z} = \frac{1}{BS} \sum_{b=1}^B \sum_{s=1}^S \check{Z}_{:,b,s,:}, \qquad \hat{z}_l = \hat{Z}_{l, :}, \quad H(\hat{z}_l) = - \sum_{j=1}^n \hat{z}_{l, j} \log \hat{z}_{l,j} \]

Vectors \(\hat{z}_l\) are probability vectors for each layer \(l \in [L]\) describing a categorical distribution over \(n\) categories, so we can compute the Shannon entropy \(H(\hat{z}_l)\) as above.
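The steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the reference implementation from Conklin (2025): the function names, the random stand-in for model activations, and the temperature `eps=0.1` are hypothetical, and the calibration procedure from Appendix E.1 is not reproduced.

```python
import numpy as np

def sample_sphere(h, n, rng):
    """Draw n points uniformly from the unit sphere in R^h, stacked as columns of W (h x n)."""
    w = rng.standard_normal((h, n))  # isotropic Gaussian samples
    return w / np.linalg.norm(w, axis=0, keepdims=True)

def soft_assign(Z, W, eps):
    """Soft-quantise Z (L, B, S, h) against W (h, n); returns probabilities (L, B, S, n)."""
    z_bar = Z / np.linalg.norm(Z, axis=-1, keepdims=True)  # project onto the unit sphere
    logits = (z_bar @ W) / eps                             # cosine similarities / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def layer_entropies(Z, W, eps):
    """Shannon entropy (nats) of the averaged soft assignment, one value per layer."""
    z_hat = soft_assign(Z, W, eps).mean(axis=(1, 2))       # average over batch and sequence: (L, n)
    safe = np.clip(z_hat, 1e-12, None)                     # avoid log(0)
    return -(z_hat * np.log(safe)).sum(axis=-1)

# Random activations standing in for real model representations (assumption).
rng = np.random.default_rng(0)
Z = rng.standard_normal((2, 3, 4, 8))   # L=2 layers, B=3, S=4, h=8
W = sample_sphere(h=8, n=16, rng=rng)
print(layer_entropies(Z, W, eps=0.1))
```

Because each \(\hat{z}_l\) is a distribution over \(n\) categories, every returned value lies between 0 and \(\log(n)\).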

**Illustration of Soft Entropy Estimation.** (Top) These facets illustrate the normalisation, sampling, and soft assignment. (Bottom) Soft assignments are aggregated into a distribution describing the space \(P(\hat{Z})\) of which we take the Shannon entropy. An interactive visual of this process is available here.

Due to the normalisation step during quantisation, this distribution intuitively estimates the probability that a representation in a layer \(l\) lies along a particular angle with respect to the origin. To estimate the entropy in an entire model, denoted \(H(Z)\), we average entropy across layers. To aid interpretability we convert \(H(Z)\) to an efficiency \(\mathcal{H}(Z)\) (Wilcox 1967), normalising the entropy at each layer by the entropy of a uniform distribution over the \(n\) categories, \(\log(n)\), which bounds the quantity between 0 and 1. These definitions can also be conditioned on the feature labels \(X\):

\[ \mathcal{H}(Z) := \frac{1}{L\log(n)}\sum_{l=1}^{L} H(\hat{z}_l), \qquad \mathcal{H}(Z| X=x) := \frac{1}{L \log (n)} \sum_{l=1}^L H(\hat{z}_l | X=x) \]

This now allows us to efficiently compute the mutual information between input features \(X\) and encodings across an entire model, regardless of model size:

\[ I(X; Z) := \mathcal{H}(Z) - \sum_{x \in X} P(X=x)\,\mathcal{H}(Z| X=x) \]
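The mutual information above can be sketched directly from soft assignments \(\check{Z}\) and a grid of labels. This is an illustrative sketch only: the function names are hypothetical, the soft assignments are drawn from a Dirichlet distribution as a stand-in for real model output, and \(P(X=x)\) is estimated by the empirical label frequency, which is an assumption about how the sum is weighted.

```python
import numpy as np

def efficiency(check):
    """Mean per-layer entropy of averaged soft assignments, normalised by log(n).

    check: array of shape (L, N, n), soft assignments for N token positions.
    """
    z_hat = check.mean(axis=1)                                   # (L, n)
    h = -(z_hat * np.log(np.clip(z_hat, 1e-12, None))).sum(axis=-1)
    return float(h.mean() / np.log(check.shape[-1]))

def mutual_information(check, labels):
    """I(X;Z) = H(Z) - sum_x P(X=x) H(Z | X=x), in efficiency units.

    check: (L, B, S, n) soft assignments; labels: (B, S) integer feature labels.
    """
    L, B, S, n = check.shape
    flat = check.reshape(L, B * S, n)
    lab = np.asarray(labels).reshape(-1)
    h_cond = 0.0
    for x, count in zip(*np.unique(lab, return_counts=True)):
        # Weight each conditional entropy by the empirical frequency of its label.
        h_cond += (count / lab.size) * efficiency(flat[:, lab == x, :])
    return efficiency(flat) - h_cond

# Stand-in data (assumption): random soft assignments and random labels.
rng = np.random.default_rng(0)
check = rng.dirichlet(np.ones(5), size=(2, 3, 4))   # L=2, B=3, S=4, n=5
labels = rng.integers(0, 3, size=(3, 4))
print(mutual_information(check, labels))
```

Since the overall distribution \(\hat{z}_l\) is the label-weighted mixture of the conditional distributions, concavity of entropy guarantees the estimate is non-negative, and it is zero when every position shares the same label.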

3.2 Mutual Informations & Back-off

To determine whether or not a model is optimally compressed with respect to some data we need to compute mutual informations with respect to input and output labels. LLMs are trained with inputs as preceding context and outputs as trailing context. Maintaining conditional estimates of a token embedding given a preceding context \(P(Z|X)\) for every possible context window proves intractable, and many contexts occur only once in the training data. Accordingly, like many other works on language modelling, we approximate the distribution over possible sequences using n-grams with a kind of back-off (Katz 1987). By conditioning on finite widths of preceding context we can tractably approximate \(P(Z|X)\); the maximum width we consider here is the quad-gram, by which point \(I(X;Z)\) begins to converge. By backing off further (e.g. to trigrams, bigrams, and tokens) we can also estimate how much different context widths contribute to information in a model. We vary the degree of back-off equally for both the input \(P(Z|X)\) and output \(P(Z|Y)\) distributions, because during training a model receives gradient information from the full trailing context \(Y\) due to teacher forcing.
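One way to build the conditioning labels for a given context width is sketched below. It is a hedged illustration rather than our exact procedure: the function name `ngram_labels`, the `trailing` flag, and the convention of marking positions whose window runs off the sequence with `-1` (so that they can be excluded) are all assumptions.

```python
import numpy as np

def ngram_labels(tokens, width, trailing=False):
    """Map each position to an integer id for its n-gram context.

    tokens:   (B, S) integer token ids.
    width:    1 = token, 2 = bigram, 3 = trigram, 4 = quad-gram.
    trailing: False -> window of `width` tokens ending at the current token (input side);
              True  -> window of `width` tokens following the current token (output side).
    Positions without a full window receive the label -1.
    """
    tokens = np.asarray(tokens)
    B, S = tokens.shape
    labels = np.full((B, S), -1, dtype=np.int64)
    vocab = {}  # n-gram tuple -> integer id, assigned in order of first appearance
    for b in range(B):
        for s in range(S):
            lo, hi = (s + 1, s + 1 + width) if trailing else (s - width + 1, s + 1)
            if lo < 0 or hi > S:
                continue
            key = tuple(tokens[b, lo:hi].tolist())
            labels[b, s] = vocab.setdefault(key, len(vocab))
    return labels

tokens = np.array([[5, 6, 5, 6, 7]])
print(ngram_labels(tokens, width=1))                 # token-level labels
print(ngram_labels(tokens, width=2))                 # bigram context ending at each token
print(ngram_labels(tokens, width=1, trailing=True))  # next-token labels
```

Each label identifies the context in which the current token's embedding occurs, so conditioning on a bigram label groups embeddings of the current token by preceding context rather than pooling the embeddings of two tokens.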

**Illustration of conditional probability estimates.** An example sentence is provided, assuming word-level tokenization for simplicity. At left are the indices for the input and output tokens when the current input word is *wherefore*. At right is shown the sub-setting procedure for estimating conditional probabilities. This illustrates how bigram estimates do not compute entropy of two token embeddings, but rather the embedding for the current token conditioned on preceding context.

In comparing different models we would like to be able to determine how close a given representation system is to the IB bound and, by extension, how optimally compressed it is. When on the bound, representations preserve only the information from the input relevant to predicting the output. We quantify this with a summary statistic, optimality:

\[ \text{Optimality} = \frac{\text{Expressivity}}{\text{Complexity}} = \frac{I(Y;Z)}{I(X;Z)} \]

Intuitively this quantity approaches 1.0 as a representation system approaches the bound, regardless of where along the bound the system is placed. More generally this is a relative quantity reflecting how many bits of expressivity a system has for each bit of complexity.
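As a small sketch, the statistic is a single ratio; the function name and the input values below are hypothetical and serve only to show the calculation.

```python
def optimality(expressivity, complexity):
    """Ratio I(Y;Z) / I(X;Z): bits of expressivity per bit of complexity."""
    if complexity <= 0:
        # The ratio is undefined when the representation carries no input information.
        raise ValueError("complexity I(X;Z) must be positive")
    return expressivity / complexity

# Hypothetical estimates, expressed in the same units.
print(optimality(expressivity=0.25, complexity=0.5))
```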

In addition to mutual information with input and output labels, we also consider preference information. A growing body of work stresses the importance of post-training approaches for aligning models with human preference (Bai et al. 2022; Rafailov et al. 2023; Ouyang et al. 2022). We can quantify this information in a model using preference data, where a prompt has two continuations — one labelled preferred and one labelled rejected. Conditioning on this label lets us compute \(P(Z|\text{preferred})\) and \(I(Z;\text{preferred})\).

Data and Sampling. Getting a true estimate of the entropy of a vector space remains a major challenge, with most approaches underestimating the true entropy (Paninski 2003). As a result we do not claim that our experiments estimate the entropy of a model’s true latent distribution; rather, they estimate the entropy with respect to a particular sample of data. By holding the data constant across models and experiments we can compute an estimate that is useful for comparisons, even if it does not exactly match the true entropy. Unless otherwise noted, token, bigram, trigram and quad-gram estimates are with respect to 10,000 samples from C4 (Raffel et al. 2020), and preference estimates are based on 10,000 samples from Tulu (Lambert et al. 2024); in both cases we consider a maximum context length of 512.

4 Experiments

In order to study training time-courses our pre-training analyses look at the OLMo2 family of models (OLMo et al. 2025), which makes available intermediate checkpoints.⁴ We focus analysis on the 7B model unless otherwise noted, while including results for the 32B and 1B variants to show where conclusions hold or differ across model scales. In addition, to show our conclusions hold outside of this particular family of models we compare a wide array of open-weights LLMs (which do not make intermediate training checkpoints available), showing where they lie on the information plane at the end of training.

⁴ Appendix D includes additional pre-training analyses of the Smol LM2 (Allal et al. 2025) and Pythia (Biderman et al. 2023) models, which also make intermediate checkpoints available. These follow a similar pattern to the results presented here.

4.1 Pre-training Approaches Optimal Compression

The majority of pre-training appears to be a slow compression of a model’s training data. The Information Bottleneck theory of deep learning predicts two phases: a fitting phase during which output information \(I(Y;Z)\) increases, followed by a compression phase during which input information \(I(X;Z)\) decreases and representations approach the bound. This transition to compression is believed to occur when error on the training set saturates.

Shown in Figure 1 (and reproduced below) is the training trajectory for the OLMo2 7B model with respect to data from English C4. Strikingly, the 7B model closely follows the two-phase prediction from the Information Bottleneck, first increasing mutual information with outputs, before compressing input information and progressing towards the bound on optimal compression. Additionally this transition appears to happen as the model’s loss on next-token prediction begins to saturate. This shows how, even at scale, deep-learning models appear to thread a needle between representational complexity and expressivity. It also demonstrates how LLMs can be effectively studied from the perspective of Rate Distortion Theory, as they try to converge to an optimal lossy compression of their training data.

**Models Largely Encode Local Context.** (Top) The information plane over pre-training for the different levels of backoff. By changing how many tokens we condition the mutual information on in the context window, we see how the OLMo2 7B model compresses not just token but also local context information. Across all context windows we see the same two phase pattern predicted by the Information Bottleneck — with more contextual representations approaching greater optimality, indicated by hue. (Bottom) By computing the conditional mutual information for a level of back-off given the others we can quantify what proportion of a model’s information encodes each level of context information. Each facet shows a different model size, with the horizontal axis reflecting training step and the vertical axis reflecting proportion of information from the source — hue indicates level of back-off.

Models More Optimally Compress Contextual Information. By varying the degree of back-off in the source and target distributions used to compute mutual information, we can examine how contextual information evolves over pre-training at the token, bigram, trigram, and quad-gram levels. All cases result in a similar two-phase pattern of expansion and compression, with larger conditioning context converging closer to the bound. For token-level back-off late training aligns with previous work on MNIST (Shwartz-Ziv and Tishby 2017), with models compressing the source distribution — reducing complexity — while maintaining expressivity. At higher levels of contextualisation both complexity and expressivity are reduced. We hypothesise this is because in language modelling the source and target are sampled from the same distribution; what counts as an ‘input’ vs. an ‘output’ is a product of what point in the sequence the model is during generation. The higher degree of optimality in contextual encodings likely reflects an inherent pressure in the pre-training objective for models to develop representations of a token in context.

Embeddings Largely Encode Local Context. We compute the proportion of information in a model explained by each level of back-off in the source distribution independently. As shown in the figure above (bottom), the majority of information in a model encodes local context (token to quadgram), likely reflecting the information locality of the natural language on which they’re trained (Gibson 1998; Gibson et al. 2000; Hahn et al. 2022). The 1 billion parameter model also has more token information and less contextual information than its larger counterparts. The residual information likely encodes the finer-grained contextual distinctions found in the remainder of the 512 token context window — given the sparsity of n-grams greater than a quadgram those mutual informations are intractable for us to compute. This gives an interpretation of an LLM from the perspective of earlier work in NLP as akin to a context-window-width n-gram model that is smoothed enough to be tractable to train from finite data.

**Models Converge Along the Bound With Smaller Models Struggling to Compress.** **(Top Left)** Open-weights models across 6 families at the end of training lie along the bound on optimal compression. Hue indicates performance on MMLU Pro. **(Top Right)** The vertical axis indicates mutual information with preference, with models with more preference information exhibiting better performance. **(Bottom)** Zooming in on later pre-training for each model size, the 1B model matches Phase 1 but struggles to achieve meaningful compression later on, oscillating for much of pre-training off the frontier. All results use back-off to the trigram level. A full legend identifying each dot, with additional levels of back-off, is given in Appendix A.

The Effect of Scale: Smaller Models Struggle to Compress. Parameter count shows a marked effect on the degree of compression achievable by a model. The larger models both closely follow the hypothesized Information Bottleneck trajectory, exhibiting phases of expansion and compression, ultimately approaching optimal compression. The 1B parameter model exhibits markedly different behaviour. While it successfully completes the initial expansion phase — increasing output information \(I(Y;Z)\) — it fails to approach optimal compression. Instead, in the second phase the smaller model oscillates while moving slowly away from the bound. This suggests that for a given level of data complexity, a certain parameter threshold may be necessary for models to achieve an optimal compression — an observation in line with work on scaling laws (Kaplan et al. 2020).

A Wide Array of Open-Weight Models Converge Along the Bound. In addition to looking at the OLMo2 family of models, we compute complexity and expressivity estimates across a diverse array of open-weight models. A striking convergence pattern emerges: across different model families, hyper-parameters, and training methodologies, representations ultimately converge and cluster near the bound on compression. Furthermore, models all approach the same point on the bound, suggesting they all converge to a similar information structure. This suggests that training as a process of compression is not an artifact of a single LLM’s training trajectory, but more fundamentally applies to deep-learning models as a class, and to the data and the objectives used to train them.

4.2 Relating Representation Structure to Performance

So far we have studied how information in an LLM is structured; we now consider how that structure relates to downstream performance. We look at how representational information for 47 open weights models from 6 different families relates to performance across six benchmarks (Fourrier et al. 2024): MMLU Pro, BBH, Math LVL5, IFEval, GPQA, and MuSR.

**Representation Information Relates Significantly to Performance.** Vertical axes, shared across all plots, show aggregate performance across 6 benchmarks. Horizontal axes use token back-off to show complexity significantly correlates with downstream performance **(Top Left)**, while expressivity alone does not **(Top Right)**. The ratio of how many bits of expressivity a model has per bit of complexity, which indicates how optimally compressed a model is, does correlate significantly **(Bottom Left)**. The amount of preference information in a model also correlates with downstream performance **(Bottom Right)**. Above each facet are results from a Spearman correlation between its axes, and a partial Spearman correlation that treats a model’s number of parameters as a covariate.

At the token level, lower complexity relates significantly to performance (\(r=-0.38\), \(p=0.006\)), while expressivity alone does not (\(r=0.08\), \(p=0.575\)). However, the ratio between expressivity and complexity — a measure of how close a model is to optimal compression — is a significant predictor of downstream performance (\(r=0.52\), \(p<0.001\)). This tells us that better performing models have less token complexity, and are more optimally compressed.
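As a minimal sketch of the kind of statistic reported here, a partial Spearman correlation can be computed by rank-transforming the variables and correlating the residuals left after regressing the covariate's ranks out of both. This is one common construction and may differ from the exact procedure behind the figures. The function names and the synthetic data below are hypothetical placeholders, not the paper's measurements.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    return (np.bincount(inverse, weights=ranks) / counts)[inverse]

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def partial_spearman(x, y, covariate):
    """Spearman correlation of x and y after regressing the covariate's ranks out of both."""
    rx, ry, rc = (average_ranks(v) for v in (x, y, covariate))
    design = np.column_stack([np.ones_like(rc), rc])
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Synthetic placeholder data for 47 hypothetical models.
rng = np.random.default_rng(0)
log_params = rng.uniform(8.5, 11.0, size=47)               # log10 parameter counts
optimality = 0.1 * log_params + rng.normal(0.0, 0.2, 47)   # stand-in for I(Y;Z) / I(X;Z)
score = 5.0 * optimality + 2.0 * log_params + rng.normal(0.0, 1.0, 47)

print("spearman:", spearman(optimality, score))
print("partial spearman:", partial_spearman(optimality, score, log_params))
```

Treating parameter count as a covariate in this way checks whether the relationship survives once the shared dependence on model size is removed.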

More Contextual Information Improves Performance. Looking beyond token-level backoff we see a pattern emerge. Proportionally less token information correlates with downstream performance but having more bigram, trigram, and quadgram information correlates positively with performance. Intuitively this shows how models that have a better representation of context, allocating less of their representations to token-level distinctions, perform better downstream. This aligns with our finding that larger models allocate more of their representation space to contextual information.

**Proportions of Information vs. Performance.** Across 47 models, less token information and more local contextual information relate significantly to performance, based on the Spearman correlation reported above each facet. Hue indicates model; a legend is provided in Figure 5.

Optimal Compression. At all levels of back-off we see a consistent relationship between how close a model is to the information bottleneck bound on compression, and its performance downstream. Closer to the bound, representations preserve only the information from the input that is relevant to predicting the output. While higher performing models have more context information and less token information, how close to the bound representations are at each level of context is consistently positively related to performance. This allows us to link the simplest representation, at a given level of expressivity, to the most generalisable representation across the benchmarks considered here. Critically these compression estimates are computed with general data from the internet (C4) rather than data from the evaluations themselves, showing our methods can identify sufficiently general properties of a compression that we can predict downstream performance without knowledge of the test distribution.

Preference. While LLMs approach optimal compression for next sequence prediction over pre-training, a large body of work also tries to improve their ability to follow instructions, and generate responses humans prefer (Ouyang et al. 2022). We use preference data (Lambert et al. 2024) to compute mutual information with preference. The amount of preference information in a model proves a significant predictor of downstream performance (\(r=0.76\), \(p=0.001\)). This suggests that not only does the optimality of a model’s compression matter, but exactly what information survives that compression does too. In Appendix B we include results showing that post-training can increase the amount of preference information across different models in two different families while minimally changing their complexity. This suggests that pre-training is responsible for the broad compression learned by a model, while post-training edits the information it contains.

Results here focus on aggregate performance across 6 benchmarks; in Appendix C we discuss each of the benchmarks individually. At the individual task level optimal compression of C4 significantly predicts performance across math, reasoning, and factual knowledge benchmarks — but not instruction following. Instruction following is, however, predicted by the amount of preference information in a model. This helps us better understand what behaviours optimal compression of C4 is likely to give rise to — like broad factual knowledge — and what it is unlikely to give rise to, e.g. the ability to respond to questions with precise formatting and word counts.

More broadly these results indicate how the information theoretic approach taken here could potentially be leveraged during training. Optimality could be used as a stopping criterion (ceasing pre-training when distance to the bound no longer decreases) or as a model-selection criterion (picking the checkpoint that is the most optimally compressed, or the one with the highest proportion of preference information). Given the estimates here are computed with a single forward pass using teacher forcing, computing an entropy estimate for candidate selection would be substantively less costly than evaluating a model across a suite of benchmarks.
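The two uses suggested above could be sketched as follows. This is an illustration only: it takes the ratio \(I(Y;Z)/I(X;Z)\) as the measure of optimality, and the dictionary keys, the `patience` and `min_delta` parameters, and the checkpoint values are all hypothetical.

```python
def optimality(expressivity_bits, complexity_bits):
    """Bits of expressivity I(Y;Z) per bit of complexity I(X;Z)."""
    return expressivity_bits / complexity_bits

def select_checkpoint(checkpoints):
    """Return the checkpoint with the highest optimality ratio."""
    return max(checkpoints, key=lambda c: optimality(c["i_yz"], c["i_xz"]))

def should_stop(ratios, patience=3, min_delta=1e-3):
    """True when the last `patience` ratios fail to beat the earlier best by min_delta."""
    if len(ratios) <= patience:
        return False
    best_before = max(ratios[:-patience])
    return max(ratios[-patience:]) < best_before + min_delta

# Hypothetical estimates for three checkpoints (not measured values).
checkpoints = [
    {"step": 1000, "i_xz": 0.30, "i_yz": 0.10},
    {"step": 2000, "i_xz": 0.25, "i_yz": 0.12},
    {"step": 3000, "i_xz": 0.20, "i_yz": 0.11},
]
best = select_checkpoint(checkpoints)
ratios = [optimality(c["i_yz"], c["i_xz"]) for c in checkpoints]
print(best["step"], should_stop(ratios))
```

A plateau rule with a patience window is used here because estimates from individual checkpoints are likely to be noisy; the window length would need tuning in practice.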

5 Conclusion

The work presented here bridges the gap between theoretical accounts of learning and the practical complexities of LLMs. We show that LLMs learn an optimal compression of the data on which they are trained, with a wide array of open-weights models converging along the IB bound — with the optimality of a model’s compression predicting downstream performance. Each compression is different; we can account for the information that survives the compressive process, showing how representations encode information about different levels of local context and human preferences.

The approach to interpretability we introduce here interprets a model as a whole — rather than focussing on a particular circuit, or attention head, or representational measures for just the final embedding from the final layer — because complex distributed systems are not best understood in terms of their parts alone. Giving a holistic account of what it means to train an entire model on the entire internet is a challenge, but we argue that LLMs are best understood as lossy compression. In doing so, we place them in the context of a long history of work on representation learning across the sciences.

A — Open-Weights Models, Detailed Visuals

**Token & Bigram Information Plane for Open-Weights Models.** Shown here is the full, labelled token information plane for 75 open-weights models. While models lie at different levels of complexity and expressivity, they broadly approach the IB Bound on optimal compression. Hue indicates optimality — proximity to the bound.

**Trigram and Quadgram Information Plane.** Shown here is the full, labelled trigram and quadgram information plane for 75 open-weights models. Compared with the token case above, here models lie even closer to the frontier. The quadgram estimates are noisy due to sample sparsity; this combined with the fact that all models are close to the bound results in some estimates appearing to cross the bound.

B — Post-Training and Preference Information

**Post-Training and Preference Information.** (Above) Preference information on the vertical axis against whether or not the model is post-trained on the horizontal axis, with significance values from a paired permutation test above. (Below) Again, preference information on the vertical axis against optimality of a model’s compression on the horizontal axis.

While LLMs become optimally compressed for next sequence prediction over pre-training, the final phase of the training pipeline often introduces other kinds of information. In the general case, post-training is designed to improve a model’s ability to follow instructions and better align it with human preferences; we look at how this changes the information content of a model, and how it affects the representations from pre-training. The figure above shows preference information across two different families of open weights models, Llama and Gemma, which release a checkpoint at the end of pre-training and one at the end of post-training. In the Llama case, post-trained models consistently have higher preference information than their pre- and mid-trained counterparts. This supports a framing of pre-training as imbuing the model with core semantic information, which is later augmented with task-specific and preference information. With the Gemma models the picture is more complicated, with a consistent effect for the most recent Gemma 3 release, where post-trained models have greater preference information, but no significant pattern across earlier models.

C — Predicting Performance on Individual Tasks

In the main paper, our analysis relating representational structure to performance focuses on aggregate performance across 6 benchmarks. Here we consider each benchmark individually:

IFEval (Zhou et al. 2023): A benchmark of approximately 500 prompts containing objectively verifiable constraints such as word counts, keyword inclusion, and formatting requirements. Strong performance indicates that a model can reliably adhere to precise, compositional instructions rather than merely producing plausible text.
MMLU-Pro (Wang et al. 2024): An enhanced multi-task language understanding benchmark containing over 12,000 questions across 14 domains with ten answer choices per question. It emphasizes reasoning-focused questions over pure knowledge recall. Strong performance indicates broad expert-level knowledge and robust reasoning across STEM, humanities, and social sciences.
BBH (Suzgun et al. 2022): A suite of 23 challenging tasks drawn from BIG-Bench on which prior language models failed to outperform average human raters. Tasks span algorithmic, logical, commonsense, temporal, and multi-step reasoning. Strong performance indicates the ability to carry out diverse, non-trivial reasoning that benefits from chain-of-thought prompting.
MATH Level 5 (Hendrycks et al. 2021): The hardest difficulty tier of the MATH dataset, which contains competition-level mathematics problems sourced from contests such as the AMC and AIME. Strong performance indicates the ability to solve multi-step problems requiring creative mathematical reasoning, not just computation.
GPQA (Rein et al. 2024): A set of 448 graduate-level multiple-choice questions in biology, physics, and chemistry, written by PhD-level domain experts. The questions are designed to be “Google-proof” — skilled non-experts with full web access achieve only ~34% accuracy. Strong performance indicates deep scientific reasoning beyond surface-level retrieval.
MuSR (Sprague et al. 2024): A benchmark of multistep soft reasoning tasks embedded in natural-language narratives (e.g., ~1000-word murder mysteries). It requires extracting facts, applying commonsense, and chaining multiple inference steps. Strong performance indicates robust narrative comprehension and multi-hop reasoning in realistic settings.

Analysing the individual task correlations reveals a pattern. Optimal compression of C4, a broad crawl of the internet, predicts performance across math, reasoning, and factuality benchmarks, but not IFEval (\(r=0.07, p=0.631\)). IFEval assesses a model’s ability to follow specific compositional instructions in the prompt, and performance here is predicted by the amount of preference information present in a model (\(r=0.39, p=0.004\)). This sheds light on what drives model performance: general purpose knowledge and reasoning are related to optimal compression of the training data, while instruction following is related to preference information which arises during post-training. Preference information proves predictive of math, reasoning, and factuality benchmarks as well as instruction following, but this may largely reflect the makeup of the preference data used.

**Individual Task Performance Relationships.** Shown on the vertical axis are individual task accuracies with each facet representing a different task. (Top) On the horizontal axis are optimality scores on C4 across 47 different open weights models. (Bottom) On the horizontal axes is the amount of preference information in each model based on the Tulu dataset.

D — Additional Model Timecourses

A major challenge in studying pre-training is the limited availability of checkpoints. We focus analysis in the main paper on the OLMo2 models as they offer comprehensive checkpointing and comparatively strong performance. Here we look at two other families of models which make available some pre-training checkpoints.

The Smol LM2 models (Allal et al. 2025) are models with 1.7B parameters or smaller that achieve competitive performance. The 1.7B Smol model was trained on 11 trillion tokens and performs comparably to the 1B OLMo2 model, which was trained on 4 trillion tokens. Broadly the 1.7B Smol model follows a similar training trajectory to the OLMo2 1B model, having phases of expansion and compression but failing to approach the bound like the OLMo2 7B and 32B models. This figure also includes trajectories for the smaller 100M and 400M variants; these models struggle to show much meaningful compression, though part of the issue may be that checkpointing starts comparatively late in the pre-training process.

**Smol LM2 Timecourses.** Pre-training timecourses for the Smol LM2 family across four levels of n-gram backoff.

The other family of models we analyse are the Pythia models (Biderman et al. 2023). Included are analyses of the 1.4B and 6.9B models. In terms of parametrisation these are roughly comparable to the 1B and 7B OLMo2 models analysed in the main paper. However the methodology for training these models is substantially different, and their performance is substantially lower than the OLMo2 models. Pythia models are intended for scientific analysis — as a result they use the same amount of data, batch size, and number of training steps across model sizes, and are trained on the Pile dataset (L. Gao et al. 2020). This contains roughly 300 billion tokens; by contrast the 1B OLMo2 model is trained on 4 trillion tokens — meaning the Pythia models see 7.5% of that data. Accordingly the 1.4B Pythia model appears to achieve better compression later in training than its OLMo counterpart. The 6.9B Pythia model is still expanding representations late into pre-training, suggesting it is under-trained.

**Pythia Model Timecourses.** Pre-training timecourses for the Pythia 1.4B and 6.9B models.

E — Entropy Estimation

E.1 — Estimator Hyperparameters

The estimator has two parameters that need to be set: the number of bins \(m\) and the temperature used in the softmax \(\varepsilon\). As shown in Appendix G and Conklin (2025), the estimator is generally robust with respect to the number of bins; in all experiments presented here we use \(m=100\).

Temperature Calibration

Naively we could use the same temperature across all models; however, models differ in the dimensionality of their hidden representations. Within self-attention, as the dimensionality \(d_k\) of query and key vectors grows, the variance of their dot products scales linearly with \(d_k\), pushing the softmax function into saturated regions where gradients vanish (Vaswani et al. 2017). This is a specific instance of the broader concentration of measure phenomenon in high-dimensional spaces, where distance and similarity metrics become increasingly uniform and less discriminative (Aggarwal, Hinneburg, and Keim 2001). As a result Vaswani et al. (2017) scale the dot product by \(1/\sqrt{d_k}\) to avoid saturation.

The soft entropy estimator uses dot-products on the surface of the unit hypersphere passed through a softmax in order to estimate a density. As a result, for a fixed temperature, higher-dimensional space will begin to appear more uniformly distributed. We compute a temperature which is calibrated to prevent saturation, making estimates for different dimensionalities directly comparable.

Let \(V_\varepsilon\) denote the von Mises–Fisher (vMF) distribution on the unit hypersphere \(\mathbb{S}^{d-1}\) with concentration parameter \(1/\varepsilon\); by rotational symmetry, \(D_{\mathrm{KL}}(V_\varepsilon \| U)\) depends only on \(\varepsilon\) and \(d\), and measures how far a single vMF kernel is from uniform. For an arbitrary data distribution \(P\) on the sphere, let \(P_\varepsilon\) denote the convolution of \(P\) with the vMF kernel at temperature \(\varepsilon\); the smoothed KL divergence \(D_{\mathrm{KL}}(P_\varepsilon \| U)\) is the estimation target, and satisfies \(0 \leq D_{\mathrm{KL}}(P_\varepsilon \| U) \leq D_{\mathrm{KL}}(V_\varepsilon \| U)\) by Jensen’s inequality.

The soft entropy estimate \(\widehat{D}^{(\mathrm{SQ})} = D_{\mathrm{KL}}(\hat{p} \| u_m)\) takes values in \([0, \log m]\), where \(m\) is the number of bins. To ensure this range is well-matched to the estimation target, we calibrate the temperature \(\varepsilon\) so that the maximum possible value of the target equals \(\log m\). The target is bounded above by \(D_{\mathrm{KL}}(V_\varepsilon \| U)\). Direct computation requires evaluating modified Bessel functions, which is numerically unstable at large \(d\); however, using Amos-type bounds on Bessel function ratios (Amos 1974), one can construct upper and lower envelope functions \(\Psi^\pm_{\varepsilon,d}\) satisfying \(\Psi^-_{\varepsilon,d} \leq D_{\mathrm{KL}}(V_\varepsilon \| U) \leq \Psi^+_{\varepsilon,d}\), with a gap of order \(O(d^{-1})\). It can be shown that \(D_{\mathrm{KL}}(V_\varepsilon \| U)\) is monotone decreasing in \(\varepsilon\), so the equation \(D_{\mathrm{KL}}(V_\varepsilon \| U) = \log m\) has a unique solution \(\varepsilon^\star(m,d)\). To leading order in \(d\):

\[ \varepsilon^\star(m, d) \approx \frac{1}{\sqrt{2d \log m}} \]

Throughout our experiments, we use the more exact bounds, \(\Psi_{\varepsilon, d}^{\pm}\), to calibrate temperature. These bounds are numerically stable in high dimensions, and we approximate \(\varepsilon^\star(m,d)\) by choosing the smallest temperature \(\varepsilon\) such that \(\Psi_{\varepsilon, d}^+ \leq \log m\). In practice this is computed once per model based on \(m=100\) and the model’s dimensionality. The correction here bears a strong resemblance to the default \(1/\sqrt{d}\) scaling within self-attention.
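For intuition about how the calibrated temperature depends on dimensionality, the leading-order expression \(1/\sqrt{2d \log m}\) can be evaluated directly. The sketch below implements only that approximation, not the \(\Psi^{\pm}\) bounds used in the experiments. The function name is hypothetical, and the dimensionalities are illustrative apart from 5120, the hidden size of the OLMo2 32B model.

```python
import math

def leading_order_temperature(num_bins, dim):
    """Leading-order approximation eps* ~ 1 / sqrt(2 * d * log m), with the natural log."""
    return 1.0 / math.sqrt(2.0 * dim * math.log(num_bins))

# m = 100 bins as in the experiments; the first two hidden sizes are illustrative.
for dim in (2048, 4096, 5120):
    print(dim, leading_order_temperature(100, dim))
```

Under this approximation, quadrupling the dimensionality halves the temperature, mirroring the square-root dependence of the attention scaling.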

E.2 — Approximating the Input Distribution

We estimate the mutual information between model inputs and outputs. In an auto-regressive decoder-only LLM the input to a model is the preceding context up to the current token. We view the input as n-grams of tokens, where the input at timestep \(t\) is an n-gram of width \(t+1\) containing all tokens \(x_0 \ldots x_t\). Maintaining probability distributions for every possible context proves intractable due to the combinatorial complexity of natural language. Additionally, n-grams wider than 3 tokens become sparsely distributed in the data, making reliable estimation of their probabilities a challenge. As a result we condition estimates on n-grams of fixed widths 1, 2, 3, and 4, referred to in the paper as token, bigram, trigram, and quadgram. This is related to Backoff (Katz 1987), which reduces n-gram size until the n-gram has non-zero probability in a corpus. Here, though, we do not interpolate different n-gram widths, instead maintaining separate aggregate estimates for each width, in part to be able to study how different levels of contextual information are represented in the model. Where a given n-gram, like a quadgram, has zero probability in the data it is omitted from the overall quadgram mutual information estimate.

In practice this means estimates for smaller n-gram widths are more reliable — a classical issue in language modelling (Jurafsky and Martin 2000, 32). Token, bigram, and trigram estimates can be estimated reliably from a relatively small sample of data. We judge this by looking at how estimates change as a function of the number of samples during the estimation procedure; by 5,000 samples these estimates reliably begin to converge. Quadgrams, due to their sparsity, tend to have less robust estimates. As a result our broad comparison of open-weights models uses token and bigram estimates. The pre-training model size analysis focuses on trigram estimates as the widest context that still reflects a reliable estimate. The analysis of how context is represented over pretraining includes quadgram estimates for reference.
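The fixed-width conditioning can be illustrated with a small sketch that groups sequence positions by the n-gram ending at each position. It assumes the n-gram of width \(n\) at position \(t\) consists of the current token and the \(n-1\) tokens before it. The function name and the toy token ids are hypothetical.

```python
from collections import defaultdict

def group_positions_by_ngram(token_ids, width):
    """Map each n-gram ending at position t to the list of positions where it occurs."""
    groups = defaultdict(list)
    for t in range(width - 1, len(token_ids)):
        ngram = tuple(token_ids[t - width + 1 : t + 1])
        groups[ngram].append(t)
    return dict(groups)

tokens = [5, 7, 5, 7, 9, 5, 7]  # toy token ids
for width in (1, 2, 3):
    print(width, group_positions_by_ngram(tokens, width))
```

Wider n-grams split the same positions into more, smaller groups, which is why estimates at larger widths need more samples before they stabilise.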

E.3 — Approximating the Output Distribution

During inference models predict the next token given preceding context, but this is distinct from how they are trained. During training of an auto-regressive decoder-only LLM, causal masking means a token can only attend to preceding context. However transformer decoders are trained using teacher forcing, where predictions are generated for the entire sequence in parallel by assuming predictions are made correctly — this is distinct from having training operate one token at a time. The result is that for an embedding \(e_t\) at timestep \(t\), following embeddings \(e_{t+1}\) can attend to \(e_t\). This means embeddings get gradient information from the trailing context.

Given that our analysis computes embedding mutual informations over training with respect to a model’s input and outputs, this has implications. It means that the output for \(e_t\) is not just \(y_{t+1}\) but all following output tokens \(y_{t+1} \ldots y_{n}\) where \(n\) is the sequence length. As a result we consider \(X\) to be the entire preceding context in the input, and \(Y\) to be the entire trailing context after the current point in the sequence. This means when we compute mutual informations for different n-gram widths we match the width for \(X\) and \(Y\) — conditioning the estimates on the same width of preceding and trailing context respectively.

E.4 — Estimating Mutual Informations

To compute mutual informations between the input \(X\) and representations \(Z\), we need two quantities: the entropy of representations \(\mathcal{H}(Z)\) and the conditional entropy given the input \(\mathcal{H}(Z|X)\). To compute \(\mathcal{H}(Z)\) we use the quantisation procedure described in Section 3.1 applied to all token embeddings, giving \(\hat{Z}\) — by summing over each embedding and renormalising we get a categorical distribution \(P(Z)\) that describes the embedding space. To get a conditional estimate \(P(Z|X)\) we simply take \(\hat{Z}\) and compute a subset containing the embeddings corresponding to the input \(X\), \(\hat{Z}|X\). Summing and renormalising gives us \(P(\hat{Z}|X)\).
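The procedure above can be sketched in a few lines, under stated assumptions. Section 3.1 is the authoritative description of the quantisation; here it is approximated as a softmax over cosine similarities to a set of reference points, which are drawn at random purely for illustration. All function names and the synthetic embeddings are hypothetical.

```python
import numpy as np

def soft_assign(embeddings, reference_points, temperature):
    """Softmax over cosine similarities between embeddings and reference points."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = reference_points / np.linalg.norm(reference_points, axis=1, keepdims=True)
    logits = (z @ w.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def entropy_bits(p):
    """Shannon entropy in bits of a categorical distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(assignments, labels):
    """Return (I(X;Z), H(Z)) with I(X;Z) = H(Z) - sum_x p(x) H(Z | X = x)."""
    labels = np.asarray(labels)
    p_z = assignments.sum(axis=0) / assignments.sum()  # sum and renormalise
    h_z = entropy_bits(p_z)
    h_z_given_x = 0.0
    for x in np.unique(labels):
        subset = assignments[labels == x]  # soft assignments of embeddings with input x
        p_z_given_x = subset.sum(axis=0) / subset.sum()
        h_z_given_x += (len(subset) / len(labels)) * entropy_bits(p_z_given_x)
    return h_z - h_z_given_x, h_z

# Synthetic embeddings whose position depends on a hypothetical token id.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=500)
centres = rng.normal(size=(5, 16))
embeddings = centres[labels] + 0.5 * rng.normal(size=(500, 16))
reference_points = rng.normal(size=(100, 16))  # m = 100 reference points
assignments = soft_assign(embeddings, reference_points, temperature=0.1)
mi, h_z = mutual_information(assignments, labels)
print(mi, h_z)
```

Because the marginal \(P(Z)\) is a mixture of the conditionals \(P(Z|X=x)\) weighted by how often each input occurs, the resulting estimate is non-negative and bounded above by \(\mathcal{H}(Z)\).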

This brings us to an important distinction: our analysis discusses mutual informations with respect to tokens, bigrams, trigrams, and quadgrams. These are not computed over different widths of embeddings, but rather over single token embeddings conditioned on the preceding context. It means that \(Z|\text{token}\) is a subset of \(Z\), \(Z|\text{bigram}\) is a subset of \(Z|\text{token}\), \(Z|\text{trigram}\) is a subset of \(Z|\text{bigram}\), etc. The terms token, or bigram mutual informations refer to the width of the conditioning context, not the width of the embeddings over which entropy is computed.

E.5 — Conditional Mutual Informations and the Residual

In order to compute what proportion of a model encodes each level of context we use the chain rule for mutual information. As we increase the context width used in back-off the estimates contain each other — the bigram mutual information includes the token mutual information. The chain rule means:

\[ I(Z; x_t, x_{t-1}) = I(Z; x_t) + I(Z; x_{t-1}|x_t) \]

which allows us to separate out the information explained by the current token \(I(Z;x_t)\) and the preceding one given the current token \(I(Z; x_{t-1}|x_t)\). For the source distribution \(X\) and a given n-gram width \(n\) we can get a proportion \(\phi\) of model information by normalising by the entropy of the model:

\[ \phi(x, n) = \frac{I(Z; x_{n}|x_1 \ldots x_{n-1})}{\mathcal{H}(Z)} \]

We compute this for each level of backoff, where at the token level \(\phi(x, 1) = I(Z;x_t)/\mathcal{H}(Z)\). The most granular label we have is quadgram. The residual, or unexplained information, is the information in the model left after subtracting the mutual information of the most granular category:

\[ \phi_{\text{residual}}(x, n_{\max}) = \mathcal{H}(Z) - I(Z; x_1 \ldots x_{n_{\max}}) \]
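As a worked illustration of this decomposition, the sketch below turns nested mutual information estimates (token, bigram, trigram, quadgram) into per-width proportions by differencing successive estimates, which is what the chain rule licenses. The function name and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def context_proportions(nested_mi_bits, entropy_bits):
    """Split nested MI estimates into per-width proportions of H(Z) and a residual in bits.

    nested_mi_bits: [I(Z; token), I(Z; bigram), I(Z; trigram), I(Z; quadgram)],
    where each estimate contains the one before it.
    """
    proportions = []
    previous = 0.0
    for mi in nested_mi_bits:
        # Chain rule: the conditional MI is the increase over the narrower context.
        proportions.append((mi - previous) / entropy_bits)
        previous = mi
    residual_bits = entropy_bits - nested_mi_bits[-1]
    return proportions, residual_bits

# Hypothetical values in bits, not measurements from any model.
proportions, residual_bits = context_proportions([1.0, 1.6, 2.0, 2.2], entropy_bits=4.0)
print(proportions, residual_bits)
```

Dividing the residual by \(\mathcal{H}(Z)\) expresses it as a proportion, so that the per-width proportions and the residual together account for all of the model's entropy.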

E.5.1 — Conditional Mutual Informations and Performance

Shown above — less token information, but a higher proportion of local contextual information, relates significantly to downstream performance. In the main paper we report token level back-off in the correlation results, where lower complexity is related to performance.

E.6 — On The Use of Shannon Entropy

In this paper we compute the entropy of continuous latent variables. As a result it is natural to ask why we, in line with previous work (Shwartz-Ziv and Tishby 2017; Voita, Sennrich, and Titov 2019; Sajjadi et al. 2018), opt to discretise representations and compute their Shannon entropy (Shannon 1948) rather than estimating differential entropy. There are two major reasons for this. First, differential entropy is not the true continuous analogue of Shannon entropy (Jaynes 1957). This is shown by the fact that differential entropy \(D(X)\) is unbounded, \(-\infty \le D(X) \le \infty\), and is not invariant under linear transformations such as rescaling. This is the main motivator for an information theoretic analysis to discretise and use Shannon entropy directly. A secondary consideration is that we don’t know how embeddings are distributed, so in order to get a differential entropy estimate we would first need to fit a distribution to the data. At scale this fitting step can be expensive, and can introduce topographic assumptions.
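A one-dimensional Gaussian makes the first point concrete. Its differential entropy is \(\tfrac{1}{2}\log(2\pi e \sigma^2)\), which is negative for small \(\sigma\) and shifts by \(\log a\) when the variable is rescaled by \(a\). The Shannon entropy of a categorical distribution, in contrast, is non-negative and unchanged by relabelling the outcomes. The sketch below is illustrative only and its function names are hypothetical.

```python
import math

def gaussian_differential_entropy(sigma):
    """Differential entropy in nats of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def shannon_entropy(probs):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

sigma = 0.01
narrow = gaussian_differential_entropy(sigma)
rescaled = gaussian_differential_entropy(2.0 * sigma)
print(narrow)             # negative for a narrow Gaussian
print(rescaled - narrow)  # rescaling by 2 shifts the entropy by log 2

probs = [0.5, 0.25, 0.25]
print(shannon_entropy(probs), shannon_entropy(list(reversed(probs))))
```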

E.7 — Scalability of Prior Work

Shwartz-Ziv and Tishby (2017) perform an empirical information theoretic analysis of neural-networks trained on MNIST. To do so they perform dimension-wise discretisation of model embeddings. This turns a 16-dimensional vector into a 16 character string. They then convert this to a categorical distribution over all possible strings. This approach works well on small problems, but the dimension-wise discretisation requires taking a hidden representation with dimensions \(\textit{batch}\times\textit{hidden}\) and transforming it to \(\textit{batch}\times\textit{hidden}\times\textit{n bins}\). If using 50 bins, in practice this means using 50 times the memory of not discretising. For the OLMo2 32B model used in this paper which has a hidden dimension of 5120 and 64 layers, and where we have a context window of 512 tokens, this would require holding in memory a tensor of dimensions \(\textit{batch}\times 512\times5120\times64\times\textit{n bins}\). The memory use of this approach makes it intractable to apply to contemporary models.
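The size of that tensor can be checked with a short calculation. The sketch below assumes 4 bytes per element (float32), which is an assumption made here rather than a detail from the original analysis, and uses a hypothetical function name.

```python
def discretised_tensor_bytes(batch, seq_len, hidden, layers, n_bins, bytes_per_element=4):
    """Bytes needed to hold a batch x seq_len x hidden x layers x n_bins tensor."""
    return batch * seq_len * hidden * layers * n_bins * bytes_per_element

# OLMo2 32B figures from the text: 512-token context, hidden size 5120, 64 layers, 50 bins.
per_sequence = discretised_tensor_bytes(batch=1, seq_len=512, hidden=5120, layers=64, n_bins=50)
print(per_sequence / 2**30, "GiB per sequence")
```

Under these assumptions a single sequence already needs about 31 GiB before any model weights or activations are accounted for.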

Voita, Sennrich, and Titov (2019) studied the transformer base model which has only 6 layers with a hidden dimension of 512. Despite this they note the approach from Shwartz-Ziv and Tishby (2017) was not tractable to apply to the model. They opt instead for quantising representations via clustering, based on related work from Sajjadi et al. (2018). This method runs a clustering algorithm (mini-batch k-means), then treats each cluster as an event in a categorical distribution. While this method provides robust entropy estimates and dramatically less memory usage, it still has relatively high computational complexity. It requires running a clustering algorithm to convergence before performing quantisation, prohibiting its use in an online setting. Again thinking of the OLMo2 32B model used here, this would require running a clustering algorithm on 5120 dimensional spaces, at all 64 layers separately, for each of the 150 pre-training checkpoints. This would provide the ‘bins’ for the quantisation, then embeddings would need to be assigned to bins, requiring a second forward pass.

In practice an information-theoretic analysis of an LLM requires an entropy estimation method that is memory efficient, fast to compute, and can be applied in an online setting — requiring a single forward pass and no caching of the embeddings. The only estimator we’re aware of that meets these criteria is the soft-entropy estimator (Conklin 2025). Here the quantisation requires only a cosine-similarity and a softmax, making it fast and memory efficient. Additionally the normalisation step means ‘bins’ can be computed once at the start of the analysis, rather than needing a pass through the data to fit clusters. Conklin (2025) notes that the use of cosine similarities means this method considers only angular information in the representation space. However use of cosine-based methods is standard practice in NLP (Zhang et al. 2020; Reimers and Gurevych 2019; T. Gao, Yao, and Chen 2021), with some work suggesting vector norms in LLMs predominantly encode frequency information (Oyama, Yokoi, and Shimodaira 2023).

E.8 — The Information Bottleneck Bound

The Information Bottleneck bound is the curve traced by varying the trade-off parameter \(\beta\) in:

\[ \mathcal{F}_\beta[p(Z|X)] = I(X;Z) - \beta I(Y; Z) \]

The curve this traces is where representations are optimally compressed. Along this bound \(p(Z|X)\) is an optimal encoder, preserving only the information in \(X\) relevant to \(Y\). For a given dataset this optimal encoder can be found numerically via a version of the Blahut-Arimoto (Blahut 1972; Arimoto 1972) method for computing channel capacity. Introduced in Tishby, Pereira, and Bialek (2000), the information bottleneck method relies on three equations:

\[ p_{\beta}(z|x) = \frac{p_{\beta}(z)}{Z_{\beta}(x)}\exp\!\biggl(-\beta\, D[p(y|x)\|p_{\beta}(y|z)]\biggr) \]

\[ p_{\beta}(z) = \sum_{x\in X}p(x)p_{\beta}(z|x) \]

\[ p_{\beta}(y|z) = \sum_{x\in X}p_{\beta}(x|z)p(y|x) \]

These equations are satisfied self-consistently at the bound. As these three equations rely on each other, one can learn an optimal encoder by starting with a randomly initialised one and iteratively computing each equation in turn.
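The iteration can be sketched on a toy joint distribution \(p(x,y)\). This is an illustration of the three self-consistent updates rather than the code used for the paper. The function names, the toy distribution, the number of clusters, and the small constants added for numerical safety are all assumptions.

```python
import numpy as np

def kl_rows(p, q):
    """KL(p_i || q_j) in nats for every pair of rows; shape (len(p), len(q))."""
    log_ratio = np.log(p[:, None, :] + 1e-12) - np.log(q[None, :, :] + 1e-12)
    return (p[:, None, :] * log_ratio).sum(axis=2)

def information_bottleneck(p_xy, n_clusters, beta, n_iters=200, seed=0):
    """Iterate the three IB equations from a random encoder; returns p(z|x)."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    encoder = rng.random((len(p_x), n_clusters))
    encoder /= encoder.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        p_z = np.maximum(p_x @ encoder, 1e-12)                    # p(z) = sum_x p(x) p(z|x)
        p_x_given_z = (encoder * p_x[:, None]).T / p_z[:, None]   # Bayes' rule
        p_y_given_z = p_x_given_z @ p_y_given_x                   # p(y|z) = sum_x p(x|z) p(y|x)
        logits = np.log(p_z)[None, :] - beta * kl_rows(p_y_given_x, p_y_given_z)
        logits -= logits.max(axis=1, keepdims=True)
        encoder = np.exp(logits)
        encoder /= encoder.sum(axis=1, keepdims=True)             # division by Z_beta(x)
    return encoder

def mutual_information_bits(p_joint):
    """Mutual information in bits of a joint distribution given as a 2-D array."""
    outer = p_joint.sum(axis=1, keepdims=True) @ p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / outer[mask])).sum())

# Toy joint distribution over 4 inputs and 2 outputs.
p_xy = np.array([[0.20, 0.05], [0.18, 0.07], [0.05, 0.20], [0.07, 0.18]])
encoder = information_bottleneck(p_xy, n_clusters=2, beta=5.0)
i_xz = mutual_information_bits(p_xy.sum(axis=1)[:, None] * encoder)  # complexity
i_zy = mutual_information_bits(encoder.T @ p_xy)                     # expressivity
print(i_xz, i_zy, mutual_information_bits(p_xy))
```

Sweeping `beta` and recording the resulting pairs of \(I(X;Z)\) and \(I(Y;Z)\) would trace out an approximation to the bound for this toy problem.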

In the general case the shape of this bound follows a linear relationship, until all mutual information between \(x\) and \(y\) is captured. At this point the curve saturates — additional complexity doesn’t result in additional accuracy, as there’s no more predictive information in \(x\). Numerical computation of the bound in our setting proves intractable. The optimal encoder \(p(z|x)\) needs to map all of natural language to representations that optimally predict the next token. In experiments we are able to compute a bound for tokenizers up to 50,000 tokens, however past this point convergence begins to fail. Given that we would like to have a bound for problems where numerical computation proves intractable, we leverage the observed linear pattern — the bound follows a linear relationship until the saturation point where \(I(Z;Y)=I(X;Y)\). Across all open weights models the highest token complexity converged to is 0.15, well below the saturation point. This is in line with results from Shwartz-Ziv and Tishby (2017), which shows FFNs on MNIST only converge near the saturation point when over-fitting.

F — Relating Compression to Training Loss

**Cross-Entropy Loss vs. Distance to bound.** Shown on the vertical axis is the OLMo2 7B model’s cross-entropy loss on 10,000 examples from C4. On the horizontal axis is the ratio between \(I(Y;Z)\) and \(I(X;Z)\), indicative of how close a representation is to the IB bound. Models begin to compress and approach the bound as the loss saturates.

Prior work (Shwartz-Ziv and Tishby 2017) shows that models transition from the fitting phase to the compression phase when empirical error on the training distribution saturates. Their setting is substantively different to the one studied here — the most relevant differences are that they analyse a feed-forward model trained on MNIST for multiple epochs, meaning the model’s performance can fully saturate in-distribution. In an LLM setting models are trained on orders of magnitude more data, often for a single epoch, meaning saturation is more graded.

We compute the cross-entropy loss for the OLMo2 7b model performing next token prediction on 10,000 examples from C4. C4 is a substantive component of the OLMo2 pre-training data (OLMo et al. 2025) and so gives us a proxy for in-distribution performance on the model’s training set. The loss follows a previously attested dynamic, where early training steps decrease the loss dramatically before it slowly saturates. The figure above shows this loss plotted against the ratio between expressivity \(I(Y;Z)\) and complexity \(I(X;Z)\). This ratio acts as a distance to the bound: as it approaches 1.0, the model approaches the bound. The figure shows that the model begins to approach the bound as the loss on C4 begins to saturate, broadly aligning with Shwartz-Ziv and Tishby (2017).
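The distance-to-bound ratio used on the horizontal axis can be sketched in a few lines. The function name `bound_ratio` and the checkpoint values below are hypothetical and purely illustrative; they are not measurements from the paper.

```python
def bound_ratio(expressivity: float, complexity: float) -> float:
    """Ratio I(Y;Z) / I(X;Z); values near 1.0 indicate a representation close to the bound."""
    if complexity <= 0:
        raise ValueError("complexity I(X;Z) must be positive")
    return expressivity / complexity


# Hypothetical checkpoints: (training step, I(Y;Z), I(X;Z)). Illustrative numbers only.
checkpoints = [(1_000, 0.02, 0.10), (10_000, 0.06, 0.15), (100_000, 0.09, 0.12)]
for step, i_yz, i_xz in checkpoints:
    print(step, round(bound_ratio(i_yz, i_xz), 3))
```

In this toy trajectory the ratio rises over training, which is the pattern the figure reports as the loss saturates.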

G — Estimator Robustness

Our work does not introduce the soft entropy estimator but is the first to apply it in this context. As a result we run some robustness experiments to see how the results vary under different hyper-parameters and data distributions.

**Estimator Robustness to number of bins and data distribution.** Shown are trajectories through the information plane for the OLMo2 7b model. (Top) Trajectories in the main paper use 100 reference points \(w_i\) per layer; here 50 points are used, and the trajectories show the same overall two-phase pattern. (Bottom) Estimates in the main paper are with respect to C4; here trajectories are computed with data from Tulu and MMLU Pro, and show the same two-phase pattern. Hue indicates log tokens in billions over the course of pre-training.

G.1 — Robustness to Data Distribution

Core results in the paper show information plane trajectories computed using C4, as this dataset forms a substantive part of the pre-training data for the OLMo2 models. To verify that the overall pattern of expansion and compression is robust across data distributions, we analyse the pre-training checkpoints of the OLMo2 7B model on data from C4, Tulu (Lambert et al. 2024), and MMLU (Hendrycks et al. 2020). The two-phase pattern proves consistent across all of them, with a fitting phase followed by a compression phase where models approach the bound. There are individual variations for each dataset, with Tulu and MMLU showing higher mutual information than C4. This may reflect that MMLU and Tulu are more domain-specific than C4, which is a broad crawl of the internet.

G.2 — Robustness to Number of Reference Points

The Soft Entropy estimator relies on a soft-quantisation of a model’s embedding space, whereby each representation is softly assigned to \(n\) points \(w_i\) sampled uniformly at random from the surface of the unit sphere. Experiments in the paper use \(n=100\). Here we show the core 7b model pre-training time-course computed for C4 with token backoff using \(n=100\) and \(n=50\). The results show the same overall pattern of expansion and compression, with small changes to the exact mutual information values. Given that this estimator resembles a differentiable relaxation of a binning-based estimate, it is relevant to note that in binning-based approaches increasing the number of bins reduces mutual information by assigning similar representations to an increasing number of different bins (Paninski 2003). The results seen here are consistent with this effect: 100 points yield slightly lower mutual information than 50 points.
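A minimal NumPy sketch of the soft-quantisation step may help make the role of \(n\) concrete. It is not the released implementation: the function names, the softmax temperature, and the choice to average soft assignments before taking the entropy are assumptions of this sketch, and the random array `z` merely stands in for model representations.

```python
import numpy as np


def sample_reference_points(n: int, dim: int, rng: np.random.Generator) -> np.ndarray:
    """Draw n points uniformly from the unit sphere by normalising Gaussian samples."""
    w = rng.standard_normal((n, dim))
    return w / np.linalg.norm(w, axis=1, keepdims=True)


def soft_entropy(z: np.ndarray, w: np.ndarray, temperature: float = 0.1) -> float:
    """Entropy in bits of the average soft assignment of rows of z to the points w.

    The temperature and the averaging step are assumptions of this sketch.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # project onto the unit sphere
    logits = (z @ w.T) / temperature                  # dot-product similarity, shape (N, n)
    logits -= logits.max(axis=1, keepdims=True)       # subtract row max for stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)       # softmax over reference points
    p = assign.mean(axis=0)                           # distribution over the n points
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


rng = np.random.default_rng(0)
z = rng.standard_normal((2_000, 64))  # stand-in for model representations
for n in (50, 100):
    w = sample_reference_points(n, 64, rng)
    print(n, soft_entropy(z, w))
```

The estimate is bounded above by \(\log_2 n\), so absolute values shift with the number of reference points even when the overall trajectory is unchanged.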

G.3 — Language is a Long-Tailed Distribution: Computing Mutual Information with Means

As noted in Section 3.1, the quantity estimated here is mutual information, which uses an expectation over conditional entropies:

\[ I(X; Z) := \mathcal{H}(Z) - \sum_{x \in X} P(X=x)\, \mathcal{H}(Z| X=x) \]

Here we recompute the core pre-training analyses for the 1B, 7B and 32B models using a mean in place of the expectation, to see how treating each event as equiprobable affects the analysis. Given that language is known to be Zipfian distributed, a small number of high-probability patterns likely drive the mutual information. It is worth noting that when using a mean the resulting quantity is not the true mutual information, and so the information bottleneck bound does not necessarily apply:

\[ I(X; Z) := \frac{1}{|X|}\sum_{x \in X} \left[ \mathcal{H}(Z) - \mathcal{H}(Z| X=x) \right] \]

As shown in the figure below, estimates computed with the mean and with the expectation show the same two-phase pattern, with models first expanding representations before compressing towards the bound. When taken as a mean, the quantity reflects the mean mutual information per label (for example, the mean mutual information per token) rather than weighting each label by its probability under the long-tailed token distribution.

**Pre-training Time-courses Computed with a Mean vs. Expectation.** Shown are trajectories through the information plane over pre-training for the OLMo2 1b, 7b, 32b models. These analyses use a mean (top) or expectation (bottom) in computation of mutual information. The expectation is used in the main paper as it reflects true mutual information. Hue indicates tokens in billions over the course of pre-training.
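The difference between the two quantities compared in this section can be seen in a small worked example. The function names and all numeric values are hypothetical and chosen only to illustrate the weighting; the label probabilities follow a Zipf-like \(1/\text{rank}\) pattern.

```python
def mi_expectation(h_z, p_x, h_z_given_x):
    """I(X;Z) = H(Z) - sum_x P(X=x) H(Z|X=x): labels weighted by probability."""
    return h_z - sum(p * h for p, h in zip(p_x, h_z_given_x))


def mi_mean(h_z, h_z_given_x):
    """(1/|X|) sum_x [H(Z) - H(Z|X=x)]: every label weighted equally."""
    return sum(h_z - h for h in h_z_given_x) / len(h_z_given_x)


# Illustrative numbers only: four labels with Zipf-like probabilities.
weights = [1 / rank for rank in range(1, 5)]
p_x = [w / sum(weights) for w in weights]
h_z = 3.0                          # entropy of the representations, in bits
h_z_given_x = [2.5, 2.0, 1.5, 1.0]  # conditional entropy for each label

print("expectation:", mi_expectation(h_z, p_x, h_z_given_x))
print("mean:       ", mi_mean(h_z, h_z_given_x))
```

With a uniform \(P(X)\) the two functions agree; they diverge as the label distribution becomes more skewed, because frequent labels dominate the expectation but not the mean.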

H — Datasets, Models, and Compute

H.1 — Licenses for Models and Datasets

As noted in Section 3, we use two datasets for estimation — Tulu (Lambert et al. 2024) and C4 (Raffel et al. 2020), both of which fall under the Open Data Commons Attribution License (ODC-By) v1.0. We also use MMLU Pro for behavioural evaluation (Wang et al. 2024), which falls under the Apache License (Version 2.0).

We study a wide array of models; license information grouped by model family:

OLMo: The code and models are released under Apache 2.0.
Gemma: Released under the Gemma license: https://ai.google.dev/gemma/terms
Llama: Released under the Llama license: https://www.llama.com/llama3/license/
Qwen: The code and models are released under Apache 2.0.
Aya/Command: Released under the Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0).
Pythia: The code and models are released under Apache 2.0.

H.2 — Compute Resource and Complexity

The estimation procedure used here has low complexity for an entropy estimator, requiring only a dot-product and softmax. The majority of compute expense comes from the model’s forward pass required to compute the estimate. The complexity of this depends on the size of the model. In experiments here estimates required encoding 10,000 samples from C4 and Tulu. This process takes approximately 10, 40, or 70 minutes on 2, 4, or 8 H100 GPUs respectively (number required depending on model size). Given this we estimate the total number of GPU hours required for the results in this paper at approximately 3,600 H100 hours.
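As a rough guide to the per-run cost, the wall-clock times and GPU counts quoted above convert to GPU-hours as follows. This is simple arithmetic on the stated figures; the helper name `gpu_hours` is hypothetical, and the number of runs behind the overall total is not itemised here, so the sketch does not attempt to reproduce it.

```python
def gpu_hours(minutes: float, n_gpus: int) -> float:
    """GPU-hours consumed by one estimation run."""
    return minutes / 60 * n_gpus


# (wall-clock minutes, number of H100 GPUs) for the three configurations quoted above.
configs = [(10, 2), (40, 4), (70, 8)]
for minutes, n_gpus in configs:
    print(f"{n_gpus} GPUs x {minutes} min = {gpu_hours(minutes, n_gpus):.2f} GPU-hours")
```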

I — Acknowledgements, Ethics & Reproducibility

I.1 — Acknowledgements

We would like to thank Kenny Smith for his role in developing the core ideas presented here in earlier versions of this project.

I.2 — Ethics Statement

All experiments reported here use publicly available datasets and pretrained models obtained under their original licenses; see Appendix H.1 for details. To our knowledge, these datasets contain no personally identifiable information, and we are in compliance with their terms of use. No additional data were collected. More generally, all authors have read and adhered to the ICLR Code of Ethics. To the best of our knowledge, these results and their dissemination do not raise any ethical concerns.

I.3 — Reproducibility Statement

All datasets and pre-trained models used in our experiments are publicly available (see Appendix H.1). All code has been released at https://github.com/hcoxec/soft_h. Appendix H.2 contains details of compute resources necessary to reproduce these findings.

References

Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. 2001. “On the Surprising Behavior of Distance Metrics in High Dimensional Space.” In International Conference on Database Theory (ICDT), 420–34. Springer.

Allal, Loubna Ben, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, et al. 2025. “SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model.”

Amos, Donald E. 1974. “Computation of Modified Bessel Functions and Their Ratios.” Mathematics of Computation 28 (125): 239–51.

Anderson, Philip W. 1972. “More Is Different: Broken Symmetry and the Nature of the Hierarchical Structure of Science.” Science 177 (4047): 393–96.

Arimoto, Suguru. 1972. “An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless Channels.” IEEE Transactions on Information Theory 18 (1): 14–20.

Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, et al. 2022. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” arXiv Preprint arXiv:2204.05862.

Biderman, Stella, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, et al. 2023. “Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.” In International Conference on Machine Learning, 2397–2430. PMLR.

Blahut, Richard. 1972. “Computation of Channel Capacity and Rate-Distortion Functions.” IEEE Transactions on Information Theory 18 (4): 460–73.

Bricken, Trenton, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, et al. 2023. “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.” Transformer Circuits Thread 2.

Burnham, Kenneth P, and David R Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer.

Chater, Nick. 1997. “Simplicity and the Mind.” The Psychologist.

Chater, Nick, and Paul Vitányi. 2003. “Simplicity: A Unifying Principle in Cognitive Science?” Trends in Cognitive Sciences 7 (1): 19–22. https://doi.org/10.1016/S1364-6613(02)00005-0.

Conklin, Henry Coxe. 2025. “Information Structure in Mappings: An Approach to Learning, Representation and Generalisation.” The University of Edinburgh.

Delétang, Grégoire, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, et al. 2023. “Language Modeling Is Compression.” arXiv Preprint arXiv:2309.10668.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86.

Edwards, Anthony William Fairbank. 1972. “Likelihood.” Springer.

Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, et al. 2022. “Toy Models of Superposition.” arXiv Preprint arXiv:2209.10652.

Elhage, Nelson, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, et al. 2021. “A Mathematical Framework for Transformer Circuits.” Transformer Circuits Thread 1: 1.

Feldman, Jacob. 2000. “Minimization of Boolean Complexity in Human Concept Learning.” Nature 407 (6804): 630–33.

———. 2016. “The Simplicity Principle in Perception and Cognition.” Wiley Interdisciplinary Reviews: Cognitive Science 7 (5): 330–40.

Fourrier, Clémentine, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. 2024. “Open LLM Leaderboard V2.” https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.

Frankle, Jonathan, and Michael Carbin. 2018. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” arXiv Preprint arXiv:1803.03635.

Futrell, Richard, Ethan Wilcox, Takashi Morita, and Roger Levy. 2018. “RNNs as Psycholinguistic Subjects: Syntactic State and Grammatical Dependency.” arXiv Preprint arXiv:1809.01329.

Futrell, Richard, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. “Neural Language Models as Psycholinguistic Subjects: Representations of Syntactic State.” arXiv Preprint arXiv:1903.03260.

Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” arXiv Preprint arXiv:2101.00027.

Gao, Tianyu, Xingcheng Yao, and Danqi Chen. 2021. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6894–6910. Online; Punta Cana, Dominican Republic: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.552.

Ge, Xuyang, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, and Xipeng Qiu. 2025. “Evolution of Concepts in Language Model Pre-Training.”

Geman, Stuart, Elie Bienenstock, and René Doursat. 1992. “Neural Networks and the Bias/Variance Dilemma.” Neural Computation 4 (1): 1–58.

Gibson, Edward. 1998. “Linguistic Complexity: Locality of Syntactic Dependencies.” Cognition 68 (1): 1–76.

Gibson, Edward et al. 2000. “The Dependency Locality Theory: A Distance-Based Theory of Linguistic Complexity.” Image, Language, Brain 2000: 95–126.

Goldfeld, Ziv, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, and Yury Polyanskiy. 2019. “Estimating Information Flow in Deep Neural Networks.” arXiv. http://arxiv.org/abs/1810.05728.

Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, et al. 2024. “The Llama 3 Herd of Models.” arXiv Preprint arXiv:2407.21783.

Griffiths, Thomas L, Nick Chater, and Joshua B Tenenbaum. 2024. Bayesian Models of Cognition: Reverse Engineering the Mind. MIT Press.

Hahn, Michael, Richard Futrell, Roger Levy, and Edward Gibson. 2022. “A Resource-Rational Model of Human Processing of Recursive Linguistic Structure.” Proceedings of the National Academy of Sciences 119 (43): e2122602119.

Hendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. “Measuring Massive Multitask Language Understanding.” arXiv Preprint arXiv:2009.03300.

Hendrycks, Dan, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. “Measuring Mathematical Problem Solving with the MATH Dataset.” NeurIPS.

Hu, Jennifer, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger P. Levy. 2020. “A Systematic Assessment of Syntactic Generalization in Neural Language Models.” arXiv:2005.03692 [Cs], May. http://arxiv.org/abs/2005.03692.

Jayant, N., J. Johnston, and R. Safranek. 1993. “Signal Compression Based on Models of Human Perception.” Proceedings of the IEEE 81 (10): 1385–1422. https://doi.org/10.1109/5.241504.

Jaynes, Edwin T. 1957. “Information Theory and Statistical Mechanics.” Physical Review 106: 620–30. https://api.semanticscholar.org/CorpusID:17870175.

Jeffreys, Harold. 1939. The Theory of Probability. OUP Oxford.

Jurafsky, Dan, and James H. Martin. 2000. “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd Edition.” In Prentice Hall Series in Artificial Intelligence. https://api.semanticscholar.org/CorpusID:5073927.

Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” arXiv Preprint arXiv:2001.08361.

Katz, Slava. 1987. “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer.” IEEE Transactions on Acoustics, Speech, and Signal Processing 35 (3): 400–401.

Kirby, Simon, Monica Tamariz, Hannah Cornish, and Kenny Smith. 2015. “Compression and Communication in the Cultural Evolution of Linguistic Structure.” Cognition 141: 87–102.

Lambert, Nathan, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, et al. 2024. “Tülu 3: Pushing Frontiers in Open Language Model Post-Training.”

MacKay, David JC. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Marvin, Rebecca, and Tal Linzen. 2018. “Targeted Syntactic Evaluation of Language Models.” arXiv:1808.09031 [Cs], August, 1192–202. https://doi.org/10.18653/v1/D18-1151.

Mitchell, Melanie. 2009. Complexity: A Guided Tour. Oxford University Press.

Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.”

OLMo, Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, et al. 2025. “2 OLMo 2 Furious.” arXiv. https://doi.org/10.48550/arXiv.2501.00656.

Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” In Advances in Neural Information Processing Systems, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, 35:27730–44. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.

Oyama, Momose, Sho Yokoi, and Hidetoshi Shimodaira. 2023. “Norm of Word Embedding Encodes Information Gain.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 1:2108–30. Association for Computational Linguistics (ACL).

Paninski, Liam. 2003. “Estimation of Entropy and Mutual Information.” Neural Computation 15 (6): 1191–1253. https://doi.org/10.1162/089976603321780272.

Pimentel, Tiago, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. “Information-Theoretic Probing for Linguistic Structure.” arXiv Preprint arXiv:2004.03061.

Poggio, Tomaso, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. 2004. “General Conditions for Predictivity in Learning Theory.” Nature 428 (6981): 419–22.

Pothos, Emmanuel M, and Nick Chater. 2001. “Categorization by Simplicity: A Minimum Description Length Approach to Unsupervised Clustering.”

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” In Advances in Neural Information Processing Systems, edited by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, 36:53728–41. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf.

Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” Journal of Machine Learning Research 21 (140): 1–67.

Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) and the 9th International Joint Conference on Natural Language Processing (IJCNLP), 3982–92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.

Rein, David, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2024. “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.” In COLM.

Rissanen, Jorma. 1978. “Modeling by Shortest Data Description.” Automatica 14 (5): 465–71.

Sajjadi, Mehdi SM, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. 2018. “Assessing Generative Models via Precision and Recall.” Advances in Neural Information Processing Systems 31.

Saxe, Andrew M, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D Tracey, and David D Cox. 2019. “On the Information Bottleneck Theory of Deep Learning.” Journal of Statistical Mechanics: Theory and Experiment 2019 (12): 124020. https://doi.org/10.1088/1742-5468/ab3985.

Saxe, Andrew M, James L McClelland, and Surya Ganguli. 2019. “A Mathematical Theory of Semantic Development in Deep Neural Networks.” Proceedings of the National Academy of Sciences 116 (23): 11537–46.

Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27 (3): 379–423.

Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information.” arXiv Preprint arXiv:1703.00810.

Sprague, Zayne, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. 2024. “MuSR: Testing the Limits of Chain-of-Thought with Multistep Soft Reasoning.” In ICLR.

Suzgun, Mirac, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, et al. 2022. “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them.” arXiv Preprint arXiv:2210.09261.

Tishby, Naftali, Fernando C Pereira, and William Bialek. 2000. “The Information Bottleneck Method.” arXiv Preprint Physics/0004057.

Tishby, Naftali, and Noga Zaslavsky. 2015. “Deep Learning and the Information Bottleneck Principle.” In 2015 IEEE Information Theory Workshop (ITW), 1–5. IEEE.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need,” 11.

Veldhoen, Sara, Dieuwke Hupkes, and Willem Zuidema. 2016. “Diagnostic Classifiers: Revealing How Neural Networks Process Hierarchical Structure,” 10.

Vitányi, Paul MB, and Ming Li. 2000. “Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity.” IEEE Transactions on Information Theory 46 (2): 446–64.

Voita, Elena, Rico Sennrich, and Ivan Titov. 2019. “The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives.” arXiv. http://arxiv.org/abs/1909.01380.

Voita, Elena, and Ivan Titov. 2020. “Information-Theoretic Probing with Minimum Description Length.” arXiv. http://arxiv.org/abs/2003.12298.

Wallace, Chris S, and David M Boulton. 1968. “An Information Measure for Classification.” The Computer Journal 11 (2): 185–94.

Wang, Yubo, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, et al. 2024. “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.” In The Thirty-Eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Warstadt, Alex, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2019. “BLiMP: A Benchmark of Linguistic Minimal Pairs for English.” arXiv:1912.00582 [Cs], December. http://arxiv.org/abs/1912.00582.

Wilcox, Allen R. 1967. “Indices of Qualitative Variation.” Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States).

Zaslavsky, Noga, Charles Kemp, Terry Regier, and Naftali Tishby. 2018. “Efficient Compression in Color Naming and Its Evolution.” Proceedings of the National Academy of Sciences 115 (31): 7937–42.

Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. “BERTScore: Evaluating Text Generation with BERT.” In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=SkeHuCVFDr.

Zhou, Jeffrey, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. “Instruction-Following Evaluation for Large Language Models.” arXiv Preprint arXiv:2311.07911.