Neuroscience — Metrics & Protocols
Section titled “Neuroscience — Metrics & Protocols”This page documents the metrics and protocols under the Neuroscience lens. These metrics adapt causal inference methods from neuroscience — mediation analysis, counterfactual interventions, path-specific effects, cross-task representation overlap, and attribution patching — to evaluate whether proposed circuits are genuine causal mechanisms rather than mere correlates.
All metrics in this page follow one principle: does the proposed circuit pass the same evidential standards that neuroscience uses to establish that a brain region is causally involved in a cognitive function? Some operate at the single-head level (mediation, PSE), some at the circuit level (AtP*, cross-task overlap), and some orchestrate multiple metrics into protocols that map onto specific causal frameworks (Pearl’s SCM, Rubin’s potential outcomes, Woodward’s interventionism).
Causal Intervention Metrics
Section titled “Causal Intervention Metrics”These metrics directly test whether circuit components are causally involved in the task via intervention experiments.
C5v2 — Mediation Analysis (v2)
Section titled “C5v2 — Mediation Analysis (v2)”Source: Pearl (2001), “Direct and Indirect Effects”; Vig et al. (2020).
Criteria: Causal / I1 Necessity
What it establishes: Whether each circuit head mediates the model’s task performance. Decomposes the total effect (TE) into natural indirect effect (NIE, the effect flowing through the head) and natural direct effect (NDE, the effect bypassing the head). A head with high NIE relative to TE is a genuine mediator — the model’s output depends on information flowing through it.
What it does not establish: Whether the head is sufficient for the task. High mediation means the head is necessary (information passes through it), but other heads may also be necessary. It also does not reveal what computation the head performs.
Method:
- Run the model on clean prompts, record the logit-diff (total effect baseline).
- For each circuit head, apply counterfactual mediation:
- NIE: intervene on the head’s input (set to corrupted/mean), let the effect propagate naturally. NIE = clean_ld - intervened_ld.
- NDE: freeze the head’s output to its clean value, corrupt everything else. NDE = TE - NIE.
- Compute
prop_mediated= NIE / TE for each head. - Report
mean_prop_mediatedacross all circuit heads.
Key quantities:
mean_prop_mediated— mean proportion of total effect mediated through circuit headsnie_per_head— natural indirect effect per headnde_per_head— natural direct effect per head
Pass condition: No explicit threshold (report-only). Higher mean_prop_mediated indicates stronger mediation.
Usage:
uv run python 05_mediation_v2.py --model gpt2 --device cpuuv run python 05_mediation_v2.py --tasks ioi --n-prompts 50Reading the scores:
| Pattern | What it means |
|---|---|
| mean_prop_mediated > 0.7 | Strong mediation — most of the task effect flows through circuit heads |
| mean_prop_mediated 0.3—0.7 | Moderate mediation — circuit heads carry some but not all information |
| mean_prop_mediated < 0.3 | Weak mediation — the circuit may be misspecified or the effect flows through other paths |
| One head dominates NIE | A single bottleneck head — circuit is effectively a serial pipeline |
C5 — Mediation Analysis (Original)
Section titled “C5 — Mediation Analysis (Original)”Source: Pearl (2001), “Direct and Indirect Effects”; Baron & Kenny (1986).
Criteria: Causal / I1 Necessity
What it establishes: Same as C5v2 but uses the original NDE/NIE decomposition via Pearl’s formula. Reports nie_fraction as the primary metric.
What it does not establish: Same limitations as C5v2. The original version is retained for backward compatibility; C5v2 is the recommended implementation.
Key quantities:
nie_fraction— fraction of total effect attributable to indirect (mediated) path
Pass condition: Report-only.
C24 — Path-Specific Effect (PSE)
Section titled “C24 — Path-Specific Effect (PSE)”Source: Pearl (2001); Vig et al. (2020).
Criteria: Causal / I1 Necessity
What it establishes: The causal contribution of each individual head in isolation. PSE patches ONE head at a time from mean to clean and measures the change in logit-diff. Unlike mediation analysis (which decomposes TE into NIE/NDE), PSE directly quantifies how much each head contributes to restoring the correct output.
What it does not establish: Whether heads interact — PSE is a marginal (one-at-a-time) measure that misses synergistic or redundant interactions between heads. If two heads are redundant, PSE will overcount their combined contribution.
Method:
- Start from a corrupted baseline (all circuit heads mean-ablated).
- For each head, patch it from mean back to clean activation.
- PSE = clean_ld - patched_ld for each head.
pse_ratio= sum of all PSEs / total effect.
Key quantities:
pse_ratio— ratio of summed path-specific effects to total effectpse_per_head— individual head PSE values
Pass condition: Report-only. pse_ratio near 1.0 indicates the circuit accounts for the full effect additively; values > 1.0 suggest redundancy.
Usage:
uv run python 24_pse.py --model gpt2 --device cpuuv run python 24_pse.py --tasks ioi --n-prompts 50Reading the scores:
| Pattern | What it means |
|---|---|
| pse_ratio near 1.0 | Circuit heads additively account for the total effect |
| pse_ratio > 1.0 | Redundancy — some heads contribute overlapping information |
| pse_ratio < 1.0 | Synergy or missing components — heads interact non-additively or circuit is incomplete |
| One head has PSE >> others | Single dominant head — the “circuit” may be mostly one component |
G2b — AtP* Attribution Patching
Section titled “G2b — AtP* Attribution Patching”Source: Kramar et al. (2024). arXiv:2403.00745.
Criteria: Causal
What it establishes: Which edges in the computational graph carry task-relevant information, with correction for cancellation artifacts. Standard attribution patching (AtP) can miss important edges when positive and negative contributions cancel. AtP* decomposes attributions to detect and correct for this cancellation, producing more reliable edge importance scores.
What it does not establish: Whether the identified edges form a sufficient circuit. AtP* ranks edges by importance but does not test whether the selected subset reproduces the model’s behavior.
Method:
- Run the model on clean and corrupted prompt pairs.
- Compute standard AtP scores: gradient of output metric with respect to each edge activation, evaluated at the corrupted baseline.
- Compute AtP* scores: decompose AtP into positive and negative contributions, flag edges where cancellation occurs (large positive and negative terms that nearly cancel).
- For flagged edges, use the uncancelled (absolute) attribution.
- Report fraction of circuit edges above the attribution threshold.
Key quantities:
frac_above_threshold— fraction of circuit edges with AtP* score above thresholdmean_atp_star— mean AtP* score across circuit edgescancellation_rate— fraction of edges where AtP and AtP* disagree substantially
Pass condition: >= 80% of circuit edges above threshold.
Usage:
uv run python G2b_atp_star.py --model gpt2 --device cpuuv run python G2b_atp_star.py --tasks ioi --n-prompts 50Reading the scores:
| Pattern | What it means |
|---|---|
| frac_above_threshold > 0.8 | Circuit edges are reliably important — AtP* confirms the circuit definition |
| High cancellation_rate | Standard AtP would miss important edges — AtP* correction is valuable |
| Low frac_above_threshold | Many circuit edges have weak attribution — circuit may include non-essential components |
Cross-Task Metrics
Section titled “Cross-Task Metrics”E63 — Cross-Task Overlap
Section titled “E63 — Cross-Task Overlap”Source: Kornblith et al. (2019), CKA; Raghu et al. (2021).
Criteria: Representational / E5 Cross-task
What it establishes: Whether the same representations are used across different tasks. Measures subspace overlap (via principal angles between task activation matrices) and CKA (Centered Kernel Alignment) at each layer. If two tasks share high CKA in early layers but diverge later, the model reuses low-level features but develops task-specific representations at higher layers.
What it does not establish: Whether shared representations serve the same function in both tasks. High overlap means similar activation patterns, not necessarily the same computation.
Method:
- Run the model on prompts from two different tasks, collecting activations at each layer.
- Compute CKA between the two task activation matrices at each layer.
- Compute principal angles between the top- principal subspaces of each task’s activations.
- Report
cka_decline(whether CKA decreases across layers) andsubspace_overlap(mean cosine of principal angles).
Key quantities:
cka_decline— whether CKA decreases from early to late layers (indicates increasing task specialization)subspace_overlap— mean subspace overlap across layers
Pass condition: Report-only. Both quantities are diagnostic.
Usage:
uv run python 63_cross_task_overlap.py --model gpt2 --device cpuuv run python 63_cross_task_overlap.py --tasks ioi,greater_than --n-prompts 50Reading the scores:
| Pattern | What it means |
|---|---|
| High CKA early, low CKA late | Model reuses low-level representations, develops task-specific features in later layers |
| High CKA throughout | Tasks use similar representations at all levels — possibly a shared mechanism |
| Low CKA throughout | Tasks use distinct representations from the start — independent circuits |
| High subspace_overlap | Activation subspaces align closely — same directions encode both tasks |
Protocols
Section titled “Protocols”Protocols orchestrate multiple metrics into a coherent evaluation that maps onto a specific causal framework. Each protocol defines required metrics, pass thresholds, and the theoretical tradition that justifies the evaluation structure.
Protocol A01 — SCM (Pearl)
Section titled “Protocol A01 — SCM (Pearl)”Source: Pearl (2000), “Causality”; Wang et al. (2022); Chan et al. (2022); Conmy et al. (2023).
Framework: Structural Causal Models. Evaluates circuits through Pearl’s causal hierarchy: Association (observational correlation), Intervention (do-calculus), and Counterfactual (what-if reasoning). Higher rungs provide stronger evidence.
Metrics and thresholds:
| Metric | Threshold | Rung |
|---|---|---|
logit_diff | > 0.0 | Association |
activation_patching | > 0.7 | Intervention |
causal_scrubbing | > 0.5 | Counterfactual |
role_ablation | Report-only | Intervention |
What it establishes: Whether the circuit satisfies the evidential requirements of Pearl’s causal hierarchy. Passing at all three rungs means the circuit is not merely correlated with the task (Association), actually does something when intervened on (Intervention), and behaves correctly under counterfactual manipulations (Counterfactual).
What it does not establish: That the SCM is the unique correct causal model. Multiple SCMs may be compatible with the same intervention data.
Protocol A02 — Counterfactual DAS
Section titled “Protocol A02 — Counterfactual DAS”Source: Geiger et al. (2021, 2023); Wu et al. (2023).
Framework: Distributed Alignment Search. Tests whether model representations can be aligned with causal variables in an interpretive theory via interchange interventions.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
das_iia | > 0.8 |
iia_variants | > 0.5 |
corrupt_restore | > 0.7 |
multi_axis_iia | > 0.5 |
counterfactual_consistency | > 0.7 |
path_patching | > 0.5 |
intermediate_state_prediction | > 0.5 |
What it establishes: Whether there exist linear subspaces in the model’s activations that correspond to interpretive variables, such that swapping those subspace components between inputs produces the counterfactually correct output. This is the strongest form of causal alignment — it tests not just “is this head important” but “does this subspace encode this specific variable.”
What it does not establish: Uniqueness of the alignment. Multiple subspaces may pass DAS, and the method assumes linearity of the representation.
Protocol A03 — Rubin CATE
Section titled “Protocol A03 — Rubin CATE”Source: Rubin (1974); Holland (1986); Imbens & Rubin (2015).
Framework: Rubin Causal Model / Potential Outcomes. Evaluates circuits using the framework of randomized experiments and average treatment effects.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
cate | > 0.5 |
intervention_specificity | > 0.7 |
What it establishes: Whether the circuit has a measurable, specific causal effect. CATE (Conditional Average Treatment Effect) measures the average effect of circuit ablation, while intervention specificity ensures the effect is targeted (ablating the circuit affects the target task more than non-target tasks).
What it does not establish: The mechanism by which the circuit produces its effect. The Rubin framework is agnostic about mechanisms — it quantifies effects without requiring a structural model.
Protocol A04 — Woodward Interventionism
Section titled “Protocol A04 — Woodward Interventionism”Source: Woodward (2003), “Making Things Happen”; Craver & Bechtel (2007).
Framework: Woodward’s interventionist account of causation. Tests three properties: stability (the causal relationship holds under different ablation methods), proportionality (the intervention is at the right level of abstraction), and invariance (the relationship holds across contexts).
Metrics and thresholds:
| Metric | Threshold |
|---|---|
sigma_ablation | > 0.5 |
resample_complement | > 0.7 |
misalignment | < 0.3 |
What it establishes: Whether the circuit is a stable, proportional, invariant cause of the task behavior. Stability (low CV across ablation methods) means the finding is not an artifact of a particular ablation technique. Proportionality (low misalignment between noising and denoising) means the circuit is at the right level of granularity. Invariance (resample complement) means non-circuit components are genuinely uninvolved.
What it does not establish: Whether the circuit is the unique mechanism. Woodward’s framework allows for multiple causes at different levels of description.
Protocol A05 — MDC/Glennan Mechanisms
Section titled “Protocol A05 — MDC/Glennan Mechanisms”Source: Machamer, Darden & Craver (2000); Glennan (2017); Bechtel & Abrahamsen (2005).
Framework: New Mechanistic Philosophy. Tests whether the circuit qualifies as a mechanism in the MDC sense: organized entities and activities producing the phenomenon.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
operation_specification | > 0.5 |
held_out_prediction | > 0.5 |
replacement_test | > 0.7 |
procedure_specification | > 0.5 |
composition_test | > 0.5 |
logic_gates | > 0.5 |
What it establishes: Whether the circuit’s components have specifiable operations (each head does a characterizable thing), the operations predict behavior on held-out data, components are not trivially replaceable, and the components compose into a coherent procedure.
What it does not establish: That we have correctly identified what each component does — only that the structural properties of a mechanism are present.
Protocol A06 — Mediation Analysis
Section titled “Protocol A06 — Mediation Analysis”Source: Pearl (2001); Baron & Kenny (1986); Vig et al. (2020).
Framework: Causal mediation. Orchestrates the mediation metrics (C5, C5v2, C24) into a unified evaluation of whether the circuit mediates the model’s task performance.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
mediation (C5) | > 0.5 |
mediation_v2 (C5v2) | > 0.5 |
pse (C24) | > 0.7 |
What it establishes: Convergent evidence from three mediation measures that the circuit is on the causal pathway between input and output.
What it does not establish: Whether the circuit is the only causal pathway. Mediation analysis identifies mediators but cannot rule out unmeasured alternative pathways.
Protocol A10 — Regularity/INUS
Section titled “Protocol A10 — Regularity/INUS”Source: Mackie (1965), “Causes and Conditions.”
Framework: INUS conditions (Insufficient but Necessary part of an Unnecessary but Sufficient condition). Tests whether circuit heads satisfy INUS conditions for the task behavior.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
minimality_class | > 0.5 |
What it establishes: Whether each head is an INUS condition for the task — necessary within some sufficient subset, even if alternative sufficient subsets exist. This is weaker than strict necessity but stronger than mere correlation.
What it does not establish: That the circuit is the only sufficient set. By definition, INUS allows for multiple sufficient sets; the metric classifies heads within the identified circuit.
Protocol A11 — Actual Causation
Section titled “Protocol A11 — Actual Causation”Source: Halpern & Pearl (2005); Shapley (1953).
Framework: Halpern-Pearl actual causation (AC1—AC3 conditions) plus Shapley value interaction analysis.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
eap | > 0.5 |
atp_star | > 0.5 |
pairwise_synergy | > 0.0 |
shapley_interactions | > 0.0 |
What it establishes: Whether circuit heads are actual causes (not just counterfactual dependencies) of the task behavior. AC conditions handle preemption and overdetermination — cases where standard counterfactual tests give misleading results. Shapley interactions reveal whether heads work synergistically or redundantly.
What it does not establish: A complete causal model. Actual causation identifies specific causes in specific contexts; it does not provide a general law-like relationship.
Protocol EX03 — Linguistic Probes
Section titled “Protocol EX03 — Linguistic Probes”Source: Linzen et al. (2016); Marvin & Linzen (2018); Futrell et al. (2019).
Framework: Psycholinguistic evaluation. Tests whether the circuit handles specific linguistic phenomena that the original task requires.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
priming | > 0.0 |
garden_path | > 0.0 |
binding_theory | > 0.5 |
animacy | > 0.5 |
What it establishes: Whether the circuit’s behavior is consistent with known psycholinguistic phenomena. If the IOI circuit correctly handles binding theory constraints and animacy distinctions, this provides content validity evidence — the circuit engages with the linguistic structure the task requires.
What it does not establish: That the circuit implements these linguistic computations in the way humans do. Behavioral consistency with psycholinguistic patterns does not imply mechanistic similarity.
Summary Table
Section titled “Summary Table”| Metric ID | Name | Criteria | Evidence Family | Pass Condition |
|---|---|---|---|---|
| C5 | Mediation Analysis | I1 Necessity | Causal | Report-only |
| C5v2 | Mediation Analysis v2 | I1 Necessity | Causal | Report-only |
| C24 | Path-Specific Effect | I1 Necessity | Causal | Report-only |
| E63.cka | Cross-Task CKA Decline | E5 Cross-task | Representational | Report-only |
| E63.sub | Cross-Task Subspace Overlap | E5 Cross-task | Representational | Report-only |
| G2b | AtP* Attribution Patching | Causal | Causal | >= 80% edges above threshold |
| p_a01 | SCM Pearl | Pearl hierarchy | Protocol | See metric thresholds |
| p_a02 | Counterfactual DAS | DAS alignment | Protocol | See metric thresholds |
| p_a03 | Rubin CATE | Potential outcomes | Protocol | cate > 0.5, specificity > 0.7 |
| p_a04 | Woodward Interventionism | Stability/proportionality | Protocol | See metric thresholds |
| p_a05 | MDC/Glennan Mechanisms | Mechanistic explanation | Protocol | See metric thresholds |
| p_a06 | Mediation Analysis | Mediation | Protocol | See metric thresholds |
| p_a10 | Regularity/INUS | INUS conditions | Protocol | minimality_class > 0.5 |
| p_a11 | Actual Causation | HP actual causation | Protocol | See metric thresholds |
| p_ex03 | Linguistic Probes | Psycholinguistic | Protocol | See metric thresholds |
Connection to Neuroscience Lens
Section titled “Connection to Neuroscience Lens”The neuroscience lens is documented at the Neuroscience lens page. The core insight is that mechanistic interpretability faces the same evidential challenges as cognitive neuroscience: establishing that a component is causally involved in a function requires multiple converging lines of evidence, not a single ablation experiment.
The metrics on this page operationalize that insight:
- Mediation metrics (C5, C5v2, C24) decompose total effects into direct and indirect paths, following Pearl’s mediation framework — the same approach used in neuroimaging to establish that a brain region mediates between stimulus and response.
- Attribution patching (G2b) adapts gradient-based attribution with cancellation correction, analogous to effective connectivity analysis in neuroscience.
- Cross-task overlap (E63) tests whether circuits share representations across tasks, paralleling cross-decoding analyses in cognitive neuroscience.
- Protocols (A01—A11, EX03) organize these individual tests into coherent evaluation strategies drawn from specific philosophical traditions of causal inference.