
This framework asks: how much does including a component in the circuit cause the output to change, and does this effect vary across input subpopulations?

The Rubin causal model defines causation through potential outcomes: for each unit (here, each input prompt), there is a potential outcome ( Y(1) ) under treatment (component active) and ( Y(0) ) under control (component ablated). In observational settings the individual treatment effect ( Y(1) - Y(0) ) is never observed directly, because each unit receives only one condition; this is the fundamental problem of causal inference. In a neural network both conditions can be run on the same prompt, so per-input effects are available, and the Conditional Average Treatment Effect (CATE) summarizes how they vary across subgroups. This brings the statistical machinery of heterogeneous treatment effects to mechanistic interpretability (MI), asking not just “does this component matter on average?” but “for which inputs does it matter most?”

Where activation patching (A01) gives a single average score per component, CATE estimation can reveal that a head is critical for long-distance subject-verb agreement but irrelevant for adjacent cases. This heterogeneity is invisible to aggregate faithfulness metrics, yet it is essential for understanding when circuits activate and when they remain dormant.

| Source | Year | Key contribution |
| --- | --- | --- |
| Rubin, “Estimating Causal Effects of Treatments” | 1974 | Potential outcomes framework; fundamental problem of causal inference |
| Wager & Athey, “Estimation and Inference of Heterogeneous Treatment Effects” | 2018 | Causal forests for CATE estimation with valid confidence intervals |
| Wang et al., arXiv 2211.00593 | 2022 | Activation patching as treatment/control design |
| Miller et al., arXiv 2407.08734 | 2024 | Ablation-method dependence of causal effect estimates in circuits |

Core concept: heterogeneous treatment effects


Define treatment ( T = 1 ) as “component ( c ) runs normally” and ( T = 0 ) as “component ( c ) is ablated.” For input ( x ) with covariates ( W(x) ) (e.g., token distance, syntax type, prompt length), the CATE is:

[ \tau(w) = \mathbb{E}[Y(1) - Y(0) \mid W = w] ]

When ( \tau(w) ) varies substantially across covariate strata, the component’s causal role is context-dependent. A causal forest partitions the input space into regions of homogeneous treatment effect, revealing which structural features of the input determine whether the component is load-bearing.

This directly addresses a limitation of standard activation patching: a head with average ( AP = 0.3 ) might have ( \tau = 0.9 ) on syntactically complex inputs and ( \tau = 0.0 ) on simple ones. The aggregate hides the mechanism’s activation conditions.
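As a worked check on these numbers, assume the aggregate score is the mean of the same per-input effect and that a fraction ( p ) of inputs is syntactically complex (( p ) is an illustrative symbol, not one defined in the source). The law of total expectation then gives:

[ AP = p \cdot \tau_{\text{complex}} + (1 - p) \cdot \tau_{\text{simple}} = 0.9\,p + 0.0\,(1 - p) ]

Setting ( AP = 0.3 ) yields ( p = 1/3 ). Under these assumptions, the head is strongly load-bearing on one third of the inputs and inert on the rest, a pattern the single average cannot distinguish from a uniform effect of 0.3 everywhere.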

`06_cate.py` estimates conditional average treatment effects for each circuit component, using ablation as the treatment assignment. It partitions inputs by structural covariates and reports per-stratum effects:

[ \hat{\tau}(w) = \frac{1}{|S_w|} \sum_{x \in S_w} \left[ LD(x, T=1) - LD(x, T=0) \right] ]

where ( S_w ) is the set of inputs in covariate stratum ( w ) and ( LD(x, T=t) ) is the output metric for input ( x ) under condition ( t ). The script also fits a causal forest to handle continuous covariate spaces.
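The per-stratum estimator can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `06_cate.py`: the function name `stratified_cate`, the record layout, and the toy metric values are all hypothetical, and it assumes the metric has already been computed for each prompt with the component active and ablated.

```python
from collections import defaultdict
from statistics import mean


def stratified_cate(records):
    """Average the per-input effect (active minus ablated) within each stratum.

    `records` is an iterable of (stratum, ld_active, ld_ablated) tuples.
    This layout is an assumption made for the sketch.
    """
    diffs = defaultdict(list)
    for stratum, ld_active, ld_ablated in records:
        diffs[stratum].append(ld_active - ld_ablated)
    # One mean difference per stratum: the estimate tau_hat(w).
    return {stratum: mean(values) for stratum, values in diffs.items()}


# Toy values, invented for illustration only.
records = [
    ("adjacent", 2.0, 2.0),
    ("adjacent", 1.5, 1.5),
    ("long_distance", 3.0, 1.0),
    ("long_distance", 2.5, 1.5),
]

tau_hat = stratified_cate(records)
print(tau_hat)
```

In this toy data the component has no effect on adjacent cases and a large effect on long-distance ones, which is the kind of contrast a single pooled mean would blur. Discrete strata keep the estimator transparent; a causal forest becomes useful when the covariates are continuous and the stratum boundaries are not known in advance.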

What it establishes: Which input features modulate a component’s causal importance. Identifies the activation conditions for circuit engagement.

What it does not establish: Why the effect is heterogeneous (mechanism), or whether the covariates chosen are the right partitioning variables.

Usage:

uv run python 06_cate.py --tasks ioi sva --n-prompts 40

C25 — Intervention Specificity (25_intervention_specificity.py)


Tests whether a component’s causal effect is specific to the target task or whether ablation causes collateral damage to unrelated behaviors. Measures the ratio of on-task effect to off-task disruption:

[ \text{Specificity}(c) = \frac{\tau_{\text{on-task}}(c)}{\tau_{\text{on-task}}(c) + \tau_{\text{off-task}}(c)} ]

What it establishes: Whether the component is specialized (high specificity) or polyfunctional (low specificity).

What it does not establish: The full set of tasks the component participates in.
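The specificity ratio can be computed as below. This is a sketch rather than the implementation in `25_intervention_specificity.py`: the function name `specificity` is hypothetical, both effects are assumed to be non-negative magnitudes, and returning NaN when both effects are zero is a choice made here, not something the source specifies.

```python
import math


def specificity(tau_on_task, tau_off_task):
    """Share of the total ablation effect that falls on the target task.

    Assumes both arguments are non-negative effect magnitudes.
    """
    if tau_on_task < 0 or tau_off_task < 0:
        raise ValueError("effects are assumed to be non-negative magnitudes")
    total = tau_on_task + tau_off_task
    if total == 0:
        # No measurable effect anywhere: the ratio is undefined (assumption).
        return math.nan
    return tau_on_task / total


# Toy values, invented for illustration only.
mostly_on_task = specificity(3.0, 1.0)
mostly_off_task = specificity(1.0, 3.0)
print(mostly_on_task, mostly_off_task)
```

The ratio lies in [0, 1] only under the non-negativity assumption. If ablating a component can improve an off-task metric, the sign convention needs to be settled before the ratio is interpreted.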

Usage:

uv run python 25_intervention_specificity.py --tasks ioi sva --n-prompts 40

| Pattern | What it means |
| --- | --- |
| Uniform CATE across strata | Component is always active (constitutive role) |
| High CATE in one stratum, near-zero elsewhere | Context-dependent activation; circuit engages conditionally |
| High specificity (> 0.8) | Component is task-specialized |
| Low specificity (< 0.5) | Polyfunctional component; ablation disrupts multiple behaviors |
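
The patterns above can be turned into a small labelling helper. This is a hedged sketch: the 0.8 and 0.5 specificity thresholds come from the table, but the function names, the `near_zero` and `uniform_tol` tolerances, and the "intermediate", "inactive" and "mixed" labels are assumptions introduced here for cases the table does not cover.

```python
def label_specificity(score):
    """Map a specificity score onto the table's two named bands."""
    if score > 0.8:
        return "task-specialized"
    if score < 0.5:
        return "polyfunctional"
    return "intermediate"  # band not named in the table (assumption)


def label_cate_pattern(tau_by_stratum, near_zero=0.05, uniform_tol=0.1):
    """Classify per-stratum effects; both tolerances are hypothetical."""
    effects = list(tau_by_stratum.values())
    active = [e for e in effects if abs(e) > near_zero]
    if not active:
        return "inactive"  # no stratum shows an effect (assumption)
    if max(effects) - min(effects) <= uniform_tol:
        return "uniform"  # always active: constitutive role
    if len(active) == 1:
        return "conditional"  # engages in a single stratum
    return "mixed"  # several strata differ (assumption)


# Toy values, invented for illustration only.
pattern_a = label_cate_pattern({"adjacent": 0.0, "long_distance": 1.5})
pattern_b = label_cate_pattern({"adjacent": 0.9, "long_distance": 0.95})
print(pattern_a, pattern_b, label_specificity(0.9), label_specificity(0.3))
```

Sensible tolerances depend on the scale and noise of the metric, so they are best chosen from the spread of effects across repeated runs rather than fixed in advance.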