This framework asks: Is the information about a high-level variable linearly accessible in the model’s representations?

Linear probing trains a simple linear classifier or regressor on frozen intermediate activations to predict a target variable. High probe accuracy indicates that the representation geometrically separates the variable’s values in a linearly decodable way. This is a necessary (but not sufficient) condition for the model to use that information downstream.

Probing is the representational complement to causal intervention: it measures what information is present, while IIA (E01) measures what information is used. The gap between probe accuracy and IIA reveals “latent” information — decodable but causally inert features stored in the residual stream.

| Source | Year | Key contribution |
| --- | --- | --- |
| Alain & Bengio, “Understanding intermediate layers using linear classifier probes” | 2016 | Introduced linear probes for representation analysis |
| Hewitt & Manning, “A Structural Probe for Finding Syntax in Word Representations” | 2019 | Structural probes for tree distance |
| Voita & Titov, “Information-Theoretic Probing with MDL” | 2020 | MDL-based probe complexity as representation quality |
| Belinkov, “Probing Classifiers: Promises, Shortcomings, and Advances” | 2021 | Survey of probing methodology and pitfalls |

A linear probe learns \( W \in \mathbb{R}^{c \times d} \) and \( b \in \mathbb{R}^c \) minimizing:

\[ \mathcal{L}_{\text{probe}} = -\sum_{i} \log \text{softmax}(W h_\ell^{(i)} + b)_{y_i} \]

where \( h_\ell^{(i)} \) is the residual stream at layer \( \ell \) for input \( i \), and \( y_i \) is the target label. Probe accuracy reflects linear decodability; MDL-based approaches additionally penalize probe complexity, so that high scores cannot be explained by the probe's own capacity to memorize labels.

The critical distinction: probe accuracy is an upper bound on what the model could extract linearly, not evidence that it does. A direction may be decodable simply because the embedding preserves input features that the model never attends to.
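The objective above is ordinary multinomial logistic regression on frozen activations. A minimal sketch with synthetic stand-in activations (fabricated data for illustration; a real experiment would probe cached residual-stream activations and real labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen layer-l activations: n examples, d dimensions.
n, d = 2000, 64
y = rng.integers(0, 2, size=n)        # binary target variable
h = rng.normal(size=(n, d))
h[:, 0] += 2.0 * y                    # plant the label along one linear direction

# Linear probe: fit W, b by the softmax cross-entropy objective above.
probe = LogisticRegression(max_iter=1000).fit(h[:1500], y[:1500])
acc = probe.score(h[1500:], y[1500:])
print(f"held-out probe accuracy: {acc:.3f}")
```

High held-out accuracy here shows only that the label is linearly decodable from `h`; it says nothing about whether a model would use that direction.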

## Linear Probe (01_das_iia.py — probe mode)

DAS with identity rotation reduces to a constrained linear probe. Running without learned rotation gives probe-only baselines.

- **What it establishes:** Whether target information is linearly separable at each layer.
- **What it does not establish:** Whether the model causally relies on that linear encoding.

Usage:

```sh
uv run python 01_das_iia.py --tasks ioi sva --probe-only
```

## Selectivity Control (01_das_iia.py — --selectivity)

Compares probe accuracy on the target variable against a control task (random labels or an unrelated variable). Selectivity = target accuracy minus control accuracy.

- **What it establishes:** Whether high accuracy reflects genuine structure rather than probe memorization capacity.
- **What it does not establish:** The causal role of the decoded information.

Usage:

```sh
uv run python 01_das_iia.py --tasks ioi sva --probe-only --selectivity
```
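A selectivity check can be sketched in the same style: probe the same activations twice, once with the real labels and once with a shuffled-label control, and report the held-out gap (synthetic data; the script's actual control task may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n, d = 2000, 64
y = rng.integers(0, 2, size=n)        # target variable
h = rng.normal(size=(n, d))
h[:, 0] += 2.0 * y                    # target is genuinely encoded

def probe_accuracy(h, labels, n_train=1500):
    clf = LogisticRegression(max_iter=1000).fit(h[:n_train], labels[:n_train])
    return clf.score(h[n_train:], labels[n_train:])

target_acc = probe_accuracy(h, y)
control_acc = probe_accuracy(h, rng.permutation(y))   # random-label control
selectivity = target_acc - control_acc
print(f"target={target_acc:.3f} control={control_acc:.3f} selectivity={selectivity:.3f}")
```

A high-capacity probe can fit shuffled labels on the training split, but its held-out control accuracy stays near chance, so the held-out gap isolates genuine structure.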
| Pattern | What it means |
| --- | --- |
| Probe accuracy high, IIA high | Variable linearly encoded and causally used |
| Probe accuracy high, IIA low | Information present but not causally active at that site |
| Probe accuracy low everywhere | Variable not linearly decodable; may require nonlinear read-out |
| Accuracy jumps at layer \( \ell \) | Layer \( \ell \) computes or consolidates the variable |
| High selectivity (> 0.3) | Genuine structure, not probe capacity artifact |
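The layer-jump pattern can be illustrated with a toy sweep in which the variable only becomes linearly encoded from a stipulated layer onward (entirely synthetic; a real sweep probes cached activations from every layer of an actual model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n, d, n_layers, compute_layer = 2000, 32, 6, 3
y = rng.integers(0, 2, size=n)

accs = []
for layer in range(n_layers):
    h = rng.normal(size=(n, d))       # stand-in activations for this layer
    if layer >= compute_layer:        # variable becomes encoded at this layer
        h[:, 0] += 2.0 * y
    clf = LogisticRegression(max_iter=1000).fit(h[:1500], y[:1500])
    accs.append(clf.score(h[1500:], y[1500:]))

print("per-layer probe accuracy:", [round(a, 2) for a in accs])
```

The sharp rise at `compute_layer` is the signature in the table: earlier layers hover near chance, later layers decode the variable.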