# E01 — Distributed Alignment Search & IIA

This framework asks: does a specific linear subspace in the model’s residual stream causally encode a particular high-level variable?
Interchange Intervention Accuracy (IIA) tests causal alignment between model representations and abstract causal variables. Rather than passively observing correlations, IIA actively intervenes: swap the representation from one input into another and check whether the model’s output changes as predicted by the abstract causal model.
DAS extends this by learning the optimal rotation of the residual stream in which to intervene. Instead of assuming the variable is axis-aligned, DAS finds the direction that maximizes IIA — making it a constrained linear probe with causal validation built in.
## Theoretical grounding

| Source | Year | Key contribution |
|---|---|---|
| Geiger et al., “Causal Abstractions of Neural Networks” | 2021 | Formalized IIA as causal alignment metric |
| Geiger et al., “Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations” | 2023 | Introduced DAS — learned rotation for distributed IIA |
| Wu et al., “Interpretability at Scale” | 2023 | Scaled DAS to large language models |
| Sutter et al., “Nonlinear Causal Abstractions” | 2023 | Critiqued linear IIA; proposed nonlinear extensions |
## Core concept

Given a high-level causal model \( \mathcal{C} \) with variable \( V \), DAS learns a rotation matrix \( R \in \mathbb{R}^{d \times d} \) and selects dimensions \( S \subseteq \{1, \ldots, d\} \) such that intervening on \( (Rh)_S \) maximizes:
\[ \text{IIA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ f\left(\text{do}\left(h^{(i)}, h^{(j)}, R, S\right)\right) = y^{(j)}_V \right] \]
where \( \text{do}(h^{(i)}, h^{(j)}, R, S) \) replaces the \( S \)-dimensions of the rotated base \( h^{(i)} \) with those of the rotated source \( h^{(j)} \), then rotates back. High IIA means the subspace causally encodes \( V \); low IIA means the variable is either nonlinearly encoded or distributed across layers.
The key distinction from probing: a probe can achieve high accuracy on linearly decodable but causally inert directions. IIA requires that the direction actually matters for downstream computation.
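The intervention and score defined above can be sketched in a few lines of PyTorch. This is a minimal toy illustration: the model `f`, the rotation `R`, and the subspace `S` are all stand-ins, not the framework's actual interfaces.

```python
import torch

def interchange(h_base, h_source, R, S):
    """Rotate both hidden states, overwrite the S-dimensions of the
    rotated base with those of the rotated source, then rotate back."""
    z_base, z_source = h_base @ R.T, h_source @ R.T
    z_base[..., S] = z_source[..., S]
    return z_base @ R                    # R is orthogonal, so @ R undoes @ R.T

def iia(f, h_base, h_source, R, S):
    """Fraction of pairs where the patched output equals the label the
    causal model predicts: the source input's value of the variable."""
    patched = interchange(h_base, h_source, R, S)
    return (f(patched) == f(h_source)).float().mean().item()

# Toy check: a "model" whose output is literally the sign of the first
# rotated coordinate, so intervening on S = [0] should give IIA = 1.0.
torch.manual_seed(0)
d = 4
R, _ = torch.linalg.qr(torch.randn(d, d))         # random orthogonal rotation
S = [0]
f = lambda h: ((h @ R.T)[..., 0] > 0).long()
base, source = torch.randn(32, d), torch.randn(32, d)
print(iia(f, base, source, R, S))                 # 1.0: subspace encodes V
print(iia(f, base, source, R, [1]))               # ~0.5: wrong subspace, chance
```

Note that the wrong subspace scores near chance even though its dimensions carry variance; this is the causal filter a passive probe lacks.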
## Instruments under E01

### DAS-IIA Core (`01_das_iia.py`)

Learns the rotation \( R \) via gradient descent on the IIA loss, then reports final IIA at each layer.
**What it establishes:** whether a target variable has a clean linear causal encoding at a given site.
**What it does not establish:** whether that encoding is the only pathway, or how the encoding is used downstream.
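A stripped-down sketch of the training loop such an instrument might run: learn a single intervention direction by gradient descent on a differentiable surrogate of IIA (cross-entropy against the counterfactual label). The toy model, dimensions, and optimizer settings here are illustrative assumptions, not the script's internals.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n = 8, 256
# Hidden ground truth: the variable is encoded along `true_dir`, and the
# toy model's binary logits read it out directly.
true_dir = F.normalize(torch.randn(d), dim=0)
model = lambda h: torch.stack([-(h @ true_dir), h @ true_dir], dim=-1)

base, source = torch.randn(n, d), torch.randn(n, d)
y_source = model(source).argmax(-1)      # counterfactual labels

def patch(h_b, h_s, u):
    """Swap the component along direction u: the base loses its
    projection onto u and receives the source's."""
    return h_b - (h_b @ u)[:, None] * u + (h_s @ u)[:, None] * u

v = torch.randn(d, requires_grad=True)   # 1-D DAS: learn one direction
opt = torch.optim.Adam([v], lr=0.1)
for _ in range(300):
    u = F.normalize(v, dim=0)
    # Cross-entropy against the counterfactual label is the usual
    # differentiable surrogate for the (discrete) IIA objective.
    loss = F.cross_entropy(model(patch(base, source, u)), y_source)
    opt.zero_grad(); loss.backward(); opt.step()

u = F.normalize(v.detach(), dim=0)
iia = (model(patch(base, source, u)).argmax(-1) == y_source).float().mean().item()
print(f"IIA: {iia:.2f}, |cos(u, true_dir)|: {abs((u @ true_dir).item()):.2f}")
```

On this toy the learned direction recovers `true_dir` (up to sign) and IIA approaches 1; on a real model the same loop runs per layer over cached activations.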
Usage:
```sh
uv run python 01_das_iia.py --tasks ioi sva
```

### IIA Variants (`15_iia_variants.py`)

Tests boundary conditions: multi-token variables, partial interventions, and nonlinear baselines.
**What it establishes:** robustness of the linear encoding assumption across intervention granularities.
**What it does not establish:** optimality of the learned subspace relative to all possible encodings.
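The partial-intervention variant can be pictured as an interpolated swap. Whether `15_iia_variants.py` implements it exactly this way is an assumption; this is a sketch of the idea only.

```python
import torch

def partial_interchange(h_base, h_source, R, S, alpha=1.0):
    """Interpolated interchange: alpha=1.0 is the full swap, alpha=0.0
    leaves the base untouched. Sweeping alpha shows how much of the
    subspace content must move before the output flips."""
    z_base, z_source = h_base @ R.T, h_source @ R.T
    z_base[..., S] = (1 - alpha) * z_base[..., S] + alpha * z_source[..., S]
    return z_base @ R

torch.manual_seed(0)
R, _ = torch.linalg.qr(torch.randn(4, 4))
h_b, h_s = torch.randn(3, 4), torch.randn(3, 4)
# alpha=0 reproduces the base state (up to floating-point error)
assert torch.allclose(partial_interchange(h_b, h_s, R, [0], 0.0), h_b, atol=1e-5)
```

A sharp transition in model output around some alpha suggests a thresholded readout of the subspace; a gradual one suggests the downstream computation uses the magnitude, not just the sign.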
Usage:
```sh
uv run python 15_iia_variants.py --tasks ioi sva --variants multi_token partial
```

### Multi-Axis IIA (`31_multi_axis_iia.py`)

Extends DAS to simultaneously align multiple causal variables, measuring the orthogonality of their encodings.
**What it establishes:** whether multiple variables occupy orthogonal subspaces or share directions.
**What it does not establish:** causal interaction effects between variables.
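One concrete way to quantify "orthogonality of encodings" is via principal angles between the learned subspaces: the singular values of \( Q_1^\top Q_2 \) are the cosines of those angles. A sketch under that assumption (not necessarily the exact metric `31_multi_axis_iia.py` reports):

```python
import torch

def subspace_cosines(A, B):
    """Cosines of the principal angles between span(A) and span(B).
    A and B are d x k matrices whose columns span each variable's subspace:
    all values near 0 means fully orthogonal, near 1 means overlapping."""
    Qa, _ = torch.linalg.qr(A)        # orthonormal basis for span(A)
    Qb, _ = torch.linalg.qr(B)        # orthonormal basis for span(B)
    return torch.linalg.svdvals(Qa.T @ Qb)

e = torch.eye(4)
print(subspace_cosines(e[:, :2], e[:, 2:]))   # cosines all 0: orthogonal
print(subspace_cosines(e[:, :2], e[:, :2]))   # cosines all 1: identical
```

Principal angles are basis-independent, so the score does not change if DAS happens to learn a rotated basis for the same subspace.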
Usage:
```sh
uv run python 31_multi_axis_iia.py --tasks ioi sva --n-variables 3
```

## Reading the scores

| Pattern | What it means |
|---|---|
| IIA > 0.9 at a single layer | Clean linear causal encoding localized to that layer |
| IIA > 0.9 only with multi-axis | Variable requires >1 dimension for faithful encoding |
| IIA ≈ 0.5 everywhere (chance, for a binary variable) | Variable not linearly encoded at any tested site; try nonlinear extensions |
| High IIA but low probe accuracy | Causal direction diverges from readout direction |