
E01 — Distributed Alignment Search & IIA

This framework asks: Does a specific linear subspace in the model’s residual stream causally encode a particular high-level variable?

Interchange Intervention Accuracy (IIA) tests causal alignment between model representations and abstract causal variables. Rather than passively observing correlations, IIA actively intervenes: swap the representation from one input into another and check whether the model’s output changes as predicted by the abstract causal model.

DAS extends this by learning the optimal rotation of the residual stream in which to intervene. Instead of assuming the variable is axis-aligned, DAS finds the direction that maximizes IIA — making it a constrained linear probe with causal validation built in.

| Source | Year | Key contribution |
| --- | --- | --- |
| Geiger et al., “Causal Abstractions of Neural Networks” | 2021 | Formalized IIA as a causal alignment metric |
| Geiger et al., “Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations” | 2023 | Introduced DAS: a learned rotation for distributed IIA |
| Wu et al., “Interpretability at Scale” | 2023 | Scaled DAS to large language models |
| Sutter et al., “Nonlinear Causal Abstractions” | 2023 | Critiqued linear IIA; proposed nonlinear extensions |

Given a high-level causal model $\mathcal{C}$ with variable $V$, DAS learns a rotation matrix $R \in \mathbb{R}^{d \times d}$ and selects dimensions $S \subseteq \{1, \ldots, d\}$ such that intervening on $(Rh)_S$ maximizes:

$$\text{IIA} = \frac{1}{N} \sum_{(i,j)} \mathbb{1}\left[ f\!\left(\operatorname{do}\!\left(h^{(i)}, h^{(j)}, R, S\right)\right) = y^{(j)}_V \right]$$

where the sum runs over $N$ base/source pairs, $\operatorname{do}(h^{(i)}, h^{(j)}, R, S)$ replaces the $S$-dimensions of the rotated base $h^{(i)}$ with those of the rotated source $h^{(j)}$ (then rotates back), and $y^{(j)}_V$ is the output the high-level model predicts when $V$ takes its value from the source. High IIA means the subspace causally encodes $V$; low IIA means the variable is either nonlinearly encoded or distributed across layers.
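
A minimal PyTorch sketch of this interchange intervention and the IIA estimate, assuming single-vector hidden states and a placeholder `f_from_hidden` that maps an intervened residual vector to output logits (all names here are illustrative, not the repo's API):

```python
import torch

def interchange_intervene(h_base, h_src, R, S):
    """Rotate base and source, copy the S coordinates of the rotated source
    into the rotated base, then rotate back (R is a rotation, so R^-1 = R^T)."""
    rb, rs = R @ h_base, R @ h_src
    rb[S] = rs[S]
    return R.T @ rb

def iia(pairs, counterfactual_labels, f_from_hidden, R, S):
    """Fraction of base/source pairs whose post-intervention output matches
    the label the high-level causal model predicts for that intervention."""
    hits = 0
    for (h_base, h_src), y in zip(pairs, counterfactual_labels):
        h_new = interchange_intervene(h_base, h_src, R, S)
        hits += int(f_from_hidden(h_new).argmax().item() == y)
    return hits / len(pairs)
```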

The key distinction from probing: a probe can achieve high accuracy on linearly decodable but causally inert directions. IIA requires that the direction actually matters for downstream computation.
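
For contrast, the purely correlational check looks like the sketch below; a direction can score well here while failing the causal test above. Here `w` stands for any candidate direction (a probe weight vector or one row of the learned $R$), and the binary-label setup is an illustrative assumption:

```python
import torch

def probe_accuracy(hiddens, labels, w, b=0.0):
    """Correlational test only: can the direction w linearly separate the
    variable's values? Says nothing about downstream causal use."""
    preds = (hiddens @ w + b > 0).long()
    return (preds == labels).float().mean().item()
```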

`01_das_iia.py` learns the rotation $R$ via gradient descent on an IIA loss, then reports the final IIA at each layer.

**What it establishes:** whether a target variable has a clean linear causal encoding at a given site.

**What it does not establish:** whether that encoding is the only pathway, or how the encoding is used downstream.

Usage:

```bash
uv run python 01_das_iia.py --tasks ioi sva
```
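
A sketch of what such a training loop can look like, using a differentiable surrogate (cross-entropy against counterfactual labels) in place of the 0/1 IIA and PyTorch's orthogonal parametrization to keep $R$ a rotation. `get_pair_batch` and `logits_from_hidden` are placeholders, and the dimensions are arbitrary; the script's actual implementation may differ:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

d, k = 768, 64                                  # residual width, subspace size (illustrative)
rot = orthogonal(nn.Linear(d, d, bias=False))   # rot.weight is constrained to be orthogonal
opt = torch.optim.Adam(rot.parameters(), lr=1e-3)
mask = torch.zeros(d)
mask[:k] = 1.0                                  # intervene on the first k rotated coordinates

for h_base, h_src, y_cf in get_pair_batch():    # placeholder: (n, d) states + counterfactual labels
    R = rot.weight
    rb, rs = h_base @ R.T, h_src @ R.T          # rotate both batches into the learned basis
    mixed = rb * (1 - mask) + rs * mask         # interchange only the selected subspace
    h_new = mixed @ R                           # rotate back to the residual-stream basis
    loss = nn.functional.cross_entropy(logits_from_hidden(h_new), y_cf)
    opt.zero_grad()
    loss.backward()
    opt.step()
```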

`15_iia_variants.py` tests boundary conditions: multi-token variables, partial interventions, and nonlinear baselines.

**What it establishes:** robustness of the linear-encoding assumption across intervention granularities.

**What it does not establish:** optimality of the learned subspace relative to all possible encodings.

Usage:

```bash
uv run python 15_iia_variants.py --tasks ioi sva --variants multi_token partial
```
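
A rough sketch of the two intervention granularities named above. The interpolation scheme for the partial case and the treatment of multi-token variables are assumptions for illustration, not necessarily what the script implements:

```python
import torch

def multi_token_intervene(H_base, H_src, R, positions, S):
    """Apply the rotated interchange at every token position that carries
    the variable (H_base and H_src have shape (seq_len, d))."""
    H_new = H_base.clone()
    for t in positions:
        rb, rs = R @ H_base[t], R @ H_src[t]
        rb[S] = rs[S]
        H_new[t] = R.T @ rb
    return H_new

def partial_intervene(h_base, h_src, R, S, alpha=0.5):
    """Move only a fraction alpha of the way toward the source value
    inside the selected rotated subspace."""
    rb, rs = R @ h_base, R @ h_src
    rb[S] = (1 - alpha) * rb[S] + alpha * rs[S]
    return R.T @ rb
```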

`31_multi_axis_iia.py` extends DAS to align multiple causal variables simultaneously, measuring the orthogonality of their encodings.

**What it establishes:** whether multiple variables occupy orthogonal subspaces or share directions.

**What it does not establish:** causal interaction effects between the variables.

Usage:

```bash
uv run python 31_multi_axis_iia.py --tasks ioi sva --n-variables 3
```
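
One plausible way to quantify that orthogonality, assuming each variable ends up with its own orthonormal basis for its subspace (for example, the relevant columns of its learned rotation). This is a sketch only; the script's actual metric may differ:

```python
import torch

def subspace_overlap(bases):
    """Largest principal-angle cosine between each pair of variables' learned
    subspaces: 0 means fully orthogonal encodings, 1 means the subspaces share
    a direction. `bases` maps a variable name to a (d, k) matrix whose
    orthonormal columns span that variable's subspace."""
    names = list(bases)
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            cosines = torch.linalg.svdvals(bases[a].T @ bases[b])
            overlaps[(a, b)] = cosines.max().item()
    return overlaps
```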

| Pattern | What it means |
| --- | --- |
| IIA > 0.9 at a single layer | Clean linear causal encoding localized to that layer |
| IIA > 0.9 only with multi-axis | Variable requires more than one dimension for a faithful encoding |
| IIA ≈ 0.5 everywhere | Variable not linearly encoded; try nonlinear extensions |
| High IIA but low probe accuracy | Causal direction diverges from readout direction |