so you want to estimate entropy in a vector space

machine learning · information theory

Author: Henry C. Conklin

Published: March 21, 2026

say you want to know how much information is in the representations of a large language model.

not what the model does — what information it contains.

to answer that, you need entropy estimates of the model’s representation space. that turns out to be surprisingly hard: in the general case, getting useful estimates of entropy in continuous spaces remains a challenge across disciplines.

here we walk through the soft entropy estimator (from my thesis! Conklin 2025), first visually, then mathematically, then in code. this approach gives a way of doing density and entropy estimation that is robust even at high dimensions, and is designed to be parallelised on GPU (it only really requires a dot product, a softmax, and a sum).

Note

code for the soft entropy estimator described here is available at hcoxec/soft_h, or can be tested on MNIST in this Google Colab.

[Interactive figure: embeddings z and anchors wᵢ on the unit sphere, showing the aggregate distribution P(Ẑ), the conditional P(Ẑ | z₁), the entropies H(Z) and H(Z | z₁) in bits, and their difference H(Z) − H(Z | z₁), a pseudo mutual information; sliders control the temperature ε, the number of anchors n, and the concentration of z.]

Some Quick Background Notes on Existing Approaches

A couple of years ago, when I started looking at how to estimate entropy in deep-learning models, it took a while to work out that there wasn’t a clear off-the-shelf solution, or at least not one that did exactly what I wanted. I’ll note down some of the existing approaches, and their complexities, to save anyone else starting down that rabbit hole some time. It’s worth noting, though, that the solution I ended up building works at LLM scale but can also be straightforwardly applied to smaller models, and to continuous spaces that aren’t from deep learning at all.

Ideally we want an estimator that is fast enough to apply to high-dimensional spaces with a huge number of samples. Each of the basic estimates in the LLM project here is the result of estimating entropy for 225,280,000 \(d\)-dimensional points, where \(d\) is typically \(>4000\) (assuming 10,000 samples from C4 with a context window of 512 tokens, on a model with 44 layers). In that project alone we computed estimates on data of that size for over 1,500 model checkpoints; working at that scale eliminates most approaches on the basis of memory use or compute complexity.

  • The classic approach — from Shwartz-Ziv & Tishby (2017) — is dimension-wise discretisation. Take a hidden representation with shape batch × hidden, bin each dimension independently, turning a \(d\)-dimensional embedding into a string of symbols of length \(d\), then estimate the entropy of the distribution over possible strings. This works on a 16-dimensional feedforward network trained on MNIST. At LLM scale it’s a non-starter: a single layer of OLMo2 32B has hidden dimension 5120, so the binned representation would require holding a batch × 512 × 5120 × n_bins tensor in memory; with 100 bins that’s 100× the cost of the forward pass itself, across 64 layers. It also requires knowing the support of the distribution in advance: Shwartz-Ziv & Tishby (2017) use an FFN with tanh activations, so they can safely assume the support is [-1, 1]; in a contemporary transformer, estimating the support would require an additional forward pass through the data.

  • A more tractable approach that has been applied at LLM scale is clustering (although “large” here means 65 million parameters, not 70 billion – 2019 was a different time 🫠): fit k-means to get cluster centroids, assign each embedding to its nearest cluster, and treat normalised membership counts as a distribution. This was the approach in Voita et al. (2019). By using mini-batch k-means and discretising the whole embedding rather than per-dimension, it’s more memory-efficient, but it requires two passes through the data: one to fit the clusters, one to assign embeddings. You can’t use it online, so for the OLMo2 32B model this would require running 65 instances of k-means in parallel (one per layer), doing a forward pass through all the data to learn the centroids, then passing through all the data again to assign embeddings to the learned centroids.

  • More recently there have been gradient-based approaches to estimating mutual information, like Poole et al. (2019) and Cheng et al. (2020). These are designed for estimating MI between continuous variables, and broadly work by learning a critic function that scores pairs of samples from the joint distribution higher than pairs from the product of marginals. They can be applied at scale, but they require a separate fitting step to learn the critic function, and they don’t give you Shannon entropy estimates that you can use to apply the information bottleneck bound. We talk a bit about why we want something akin to Shannon entropy further down.

What we actually want is something that:

  • runs in a single forward pass
  • needs no separate fitting step
  • scales to any model size
  • gives you “Shannon entropy”

The soft entropy estimation process requires no support estimation (it assumes the support is the surface of the unit sphere). It’s also non-parametric, requiring no gradient-based learning or clustering. It involves a contraction over the hidden dimension, making it memory-efficient, and relies on operations that are parallelisable and already have highly optimised GPU implementations.

Why Shannon Entropy?

Differential entropy would seem like the natural quantity to estimate for continuous variables, so why don’t we use it? First it’s worth noting that differential entropy is an entirely different quantity from Shannon entropy. As noted by Jaynes (1957), Shannon took the discrete formula, swapped the sum for an integral, and assumed it was the correct continuous analogue without deriving it (a rare moment of fallibility from the guy who invented the discipline of information theory with his master’s thesis).

\[-\sum_x p(x) \log p(x)\]
\(\neq\)
\[-\int f(x) \log f(x)\, dx\]

Differential entropy ends up differing from its discrete counterpart in a number of ways that make it challenging to use for interpretability. In a future post we’ll discuss this in greater detail, but for now we’ll consider two key issues:

  • Bounding: Shannon entropy of a discrete variable is always non-negative and upper-bounded by the log of the number of possible outcomes. Differential entropy can be negative, and it doesn’t have a fixed upper or lower bound — it can go to negative infinity for distributions that are very concentrated. This makes it harder to interpret as a measure of information content, since you don’t have a clear scale to compare against.

  • Invariance: Shannon entropy is invariant to reparameterisations of the variable — if you apply a bijective transformation to your variable, the entropy doesn’t change. Consider a fair six-sided die: it has the same entropy whether the faces read 1, 2, 3, 4, 5, 6 or 2, 5, 9, 10, 15, 17. Differential entropy isn’t invariant in this way — for example, dilating a distribution can change the differential entropy even if the underlying information content is the same. A Normal distribution with mean 0 and standard deviation 1 has a differential entropy of about 1.42 nats, but if you scale it to standard deviation 0.01 the differential entropy drops to about −3.19 nats, even though both are normally distributed.

Intuitively, you can think of Shannon entropy as telling you how uniformly a distribution is spread over outcomes, while differential entropy reflects some compounded notion of uniformity and scale. This is particularly problematic for embeddings, where the norm of the embedding can vary widely across layers and models. Using differential entropy to measure the information content of an embedding space, you might end up with estimates that say more about how spread out the embeddings are in space than about how much meaningful information they contain.
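The scale-dependence is easy to check numerically; a quick sketch using the closed form \(h = \tfrac{1}{2}\ln(2\pi e\sigma^2)\) for a one-dimensional Gaussian:

```python
import math

def gaussian_diff_entropy(sigma: float) -> float:
    # closed form for a 1-d Gaussian: h = 0.5 * ln(2 * pi * e * sigma^2), in nats
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

print(gaussian_diff_entropy(1.0))   # ≈ 1.42 nats
print(gaussian_diff_entropy(0.01))  # ≈ -3.19 nats: same family, just rescaled
```

Rescaling by a factor \(c\) shifts the differential entropy by \(\ln c\); the Shannon entropy of any bijective relabelling of a discrete variable, by contrast, is unchanged.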

Formally Speaking

Ok — we’ve looked at how the estimator works visually and discussed some of the motivations behind it. Now we’ll describe it formally before looking at minimal code examples. The core idea is to replace hard binning with a soft, differentiable version — and to do it on the unit sphere, where cosine similarity gives a natural notion of proximity. Doing estimation on the sphere means we don’t need to estimate the support of the distribution in advance, and the soft assignment means we can get a smooth estimate of the underlying density without clustering or discretising (which can destroy mutual information by arbitrarily assigning points that are close together to different bins, depending on where the bin boundary falls).

here’s the procedure for \(b\) samples from an embedding space \(Z\):

step 1 — normalize. Project each embedding onto the unit sphere:

\[\bar{z} = \frac{z}{\|z\|}\]

step 2 — sample anchors. draw \(n\) reference points \(\{w_i\}_{i=1}^n\) uniformly at random from the sphere surface. this is equivalent to sampling from an isotropic gaussian and normalising:

\[\tilde{w}_i \sim \mathcal{N}(0, I_{d_h}), \quad w_i = \frac{\tilde{w}_i}{\|\tilde{w}_i\|}\]

these anchors act akin to centres of soft “bins.”

step 3 — cosine similarity. for each normalized embedding \(\bar{z}\), compute its cosine similarity to every anchor. stacking the batch into a matrix \(\bar{Z} \in \mathbb{R}^{b \times d}\) and the anchors into \(W \in \mathbb{R}^{n \times d}\), the score between embedding \(b\) and anchor \(n\) is

\[S_{b,n} = \sum_i \bar{Z}_{b,i}\, W_{n,i}\]

step 4 — soft assignment. we then apply a softmax with temperature \(\varepsilon\) to each row of cosine similarities, softly assigning each embedding \(\bar{z}\) to the anchors in \(W\):

\[\check{Z}_{b,:} = \text{softmax}\!\left(\frac{\bar{Z}_{b,:}\, W^\top}{\varepsilon}\right)\]

this gives a probability vector over the \(n\) anchors — a soft assignment of each embedding to the reference points. We calibrate temperature to prevent saturation of the softmax and to make different dimensionalities comparable — more on that below.

step 5 — aggregate and compute entropy. average the soft assignments over the batch dimension to get a single categorical distribution:

\[\hat{Z} = \frac{1}{b} \sum_{k=1}^{b} \check{Z}_{k,:}\]

then shannon entropy is just:

\[H(\hat{Z}) = -\sum_{j=1}^n \hat{z}_{j} \log \hat{z}_{j}\]

in practice we often normalise this by dividing by \(\log n\) to get a value between 0 and 1, a quantity called efficiency, which can be easier to interpret (e.g. 0.8 means the distribution has 80% of the information of a uniform distribution over \(n\) bins). The estimate can be straightforwardly conditioned to get conditional entropies and mutual informations:

\[I(X; \hat{Z}) := H(\hat{Z}) - \sum_{x \in X} P(X = x)\, H(\hat{Z} \mid X = x)\]

temperature calibration

the one tricky part is temperature. different models have different hidden dimensions, and in high-dimensional spaces dot products concentrate — the softmax saturates and everything looks uniform. you need to calibrate \(\varepsilon\) per model so estimates are comparable across dimensionalities. our calibration is courtesy of the brilliant Julian Gold. we’ll discuss it at length in a future post, because it gives a good intuition for relating the quantity we estimate to differential entropy; for now we just consider the leading-order behaviour of the calibration.

the calibration sets \(\varepsilon\) so that the maximum possible KL divergence between the soft assignment distribution and uniform exactly equals \(\log n\) (the entropy of a uniform distribution over \(n\) bins). using the von Mises–Fisher distribution on the unit sphere and Amos-type bounds on Bessel function ratios, this gives a closed-form solution to leading order in \(d\):

\[\varepsilon^\star(n, d) = \frac{1}{\sqrt{2d \log n}}\]

this resembles the \(1/\sqrt{d_k}\) scaling used in attention — both are correcting for the concentration of dot products in high dimensions.
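as a quick numeric check, here is the calibration at an OLMo2-32B-sized hidden dimension (the layer size is just an illustrative assumption):

```python
import math

def calibrated_temp(n_bins: int, d: int) -> float:
    # eps* = 1 / sqrt(2 * d * log(n)): the leading-order calibration above
    return 1.0 / math.sqrt(2 * d * math.log(n_bins))

print(calibrated_temp(100, 5120))  # ≈ 0.0046 for d = 5120 with 100 anchors
```

larger \(d\) or more bins means a smaller temperature, sharpening the softmax to counteract the concentration of dot products.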

in code

the full estimator is about 20 lines of pytorch. soft_entropy maps a batch of embeddings to a scalar; soft_mutual_information wraps it to compute \(I(X; Z)\) for any labelled partition.

import torch
import torch.nn.functional as F
import math

soft_entropy

def soft_entropy(
  z: torch.Tensor, 
  n_bins: int = 100, 
  seed: int = 0
) -> torch.Tensor:

    batch, d = z.shape

    # ε* = 1 / sqrt(2d log n) — calibrated so softmax doesn't saturate
    temp = 1.0 / math.sqrt(2 * d * math.log(n_bins))

    z_norm = F.normalize(z, dim=-1)              # project onto unit sphere

    generator = torch.Generator(device=z.device).manual_seed(seed)
    w = F.normalize(                             # sample reference points
        torch.randn(n_bins, d, device=z.device, generator=generator),
        dim=-1,
    )

    scores = z_norm @ w.T                        # [batch, n_bins] cosine similarities
    p_per_sample = F.softmax(scores / temp, dim=-1)
    p = p_per_sample.mean(dim=0)                 # aggregate over batch → P(Z)

    h = -(p * p.clamp(min=1e-9).log()).sum()
    return h / math.log(n_bins)                  # normalize to [0, 1]

temperature is set automatically from d and n_bins — no tuning needed. the fixed seed means the same reference points are reused across calls, so estimates are directly comparable. the result is efficiency-normalized: 1.0 means the aggregate distribution is uniform across all bins (maximum entropy), 0.0 means all mass is on one bin.

soft_mutual_information

def soft_mutual_information(
  z: torch.Tensor, 
  labels: torch.Tensor, 
  n_bins: int = 100
) -> torch.Tensor:

    h_z = soft_entropy(z, n_bins)                # H(Z)

    conditional_h = 0.0
    for label in labels.unique():
        mask = labels == label
        p_label = mask.float().mean()            # P(X = label)
        conditional_h += p_label * soft_entropy(z[mask], n_bins)

    return h_z - conditional_h                   # I(X; Z) = H(Z) - H(Z | X)

labels can be anything expressible as integer classes — token ids, n-gram contexts, preference labels, language ids. the loop here scales linearly with the number of classes; the repo offers a version that accelerates this.
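to sanity-check the pair of functions, here is a self-contained run on hypothetical synthetic data (two Gaussian clusters; the dimensionality, separation, and sizes are arbitrary choices), restating the estimator compactly so the snippet runs on its own:

```python
import math
import torch
import torch.nn.functional as F

def soft_entropy(z, n_bins=100, seed=0):
    batch, d = z.shape
    temp = 1.0 / math.sqrt(2 * d * math.log(n_bins))            # calibrated temperature
    z_norm = F.normalize(z, dim=-1)
    g = torch.Generator(device=z.device).manual_seed(seed)
    w = F.normalize(torch.randn(n_bins, d, generator=g, device=z.device), dim=-1)
    p = F.softmax((z_norm @ w.T) / temp, dim=-1).mean(dim=0)    # aggregate P(Z)
    return -(p * p.clamp(min=1e-9).log()).sum() / math.log(n_bins)

def soft_mutual_information(z, labels, n_bins=100):
    h_z = soft_entropy(z, n_bins)
    cond = sum((labels == c).float().mean() * soft_entropy(z[labels == c], n_bins)
               for c in labels.unique())
    return h_z - cond

torch.manual_seed(0)
centers = torch.randn(2, 64) * 5                  # two well-separated cluster centres
labels = torch.arange(2).repeat_interleave(500)   # 500 points per cluster
z = centers[labels] + torch.randn(1000, 64)

mi_true = soft_mutual_information(z, labels)                        # labels follow clusters
mi_shuf = soft_mutual_information(z, labels[torch.randperm(1000)])  # labels carry no signal
```

informative labels should yield a clearly higher estimate than shuffled ones, since conditioning on the true cluster concentrates the per-class aggregate distributions.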

fin

that’s a brief overview of the estimator and how it works. In future posts we’ll look at some projects where we’ve applied it, and discuss the theoretical properties of the estimator in detail.

if you’re really quite keen, chapter 5 of my thesis is on this (entropy for continuous variables has been a quasi-neurotic vortex that has consumed an enormous amount of my time over the past two years). it continues to consume my time: this is a project under active development, so if you have a use case or domain where you’d like to apply it, please get in touch.

acknowledgements

as I note above, this grew out of my doctoral work, so I owe a thank you to my advisor Kenny for encouraging me to figure this out, even though it’s definitely outwith the purview of evolutionary linguistics.

References

Conklin, Henry Coxe. 2025. “Information Structure in Mappings: An Approach to Learning, Representation and Generalisation.” The University of Edinburgh.
Jaynes, Edwin T. 1957. “Information Theory and Statistical Mechanics.” Physical Review 106: 620–30. https://api.semanticscholar.org/CorpusID:17870175.