Internal Validity — Formal Specification

Question: Does the evidence establish that the component implements the computation, not merely participates in it?
Lens: Neuroscience
Criteria: I1–I5
Dependency: Internal validity is the workhorse: most MI evidence is internal-validity evidence. But it says nothing about whether the finding generalizes (external), whether the instrument is reliable (measurement), whether the construct is coherent (construct), or whether the narrative is correct (interpretive).
Status in MI: Best-addressed by existing methods; still routinely method-conditional
Last updated: 16 May 2026

Internal validity asks whether the causal inference from intervention to behavior is licensed within the experimental setup. The Neuroscience lens explains the intellectual background. This page gives the formal definitions, quantitative thresholds, and calibration data.

I1. Necessity

Removing the component should degrade the behavior. For circuit $C$, model $f$, input $x$, and counterfactual value $\bar{C}$:

$$\text{Necessity}(C) = \frac{f(x) - f(x \mid \text{do}(C := \bar{C}))}{f(x)}$$

Pass condition: $\text{Necessity}(C) > 0.10$ with an equal-size random-component baseline producing $\text{Necessity}(C_{\text{random}}) < 0.05$.

| $\text{Necessity}$ value | Interpretation |
| --- | --- |
| $> 0.80$ | Strong necessity: component is critical for the behavior |
| $0.30 - 0.80$ | Moderate: component contributes but is not the sole driver |
| $0.10 - 0.30$ | Weak: component participates but may be one of many |
| $< 0.10$ | Not necessary: indistinguishable from random components |
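A minimal sketch of the score, its interpretation bands, and the pass condition, assuming the behavior metric (for example, logit difference) has already been measured with and without the intervention. The function names are illustrative and not taken from any library. The table does not say which band a value exactly on a boundary belongs to, so the boundary handling below is an assumption.

```python
def necessity(clean: float, ablated: float) -> float:
    """Fractional drop in the behavior metric when the component is ablated.

    clean   = f(x), the metric on the unmodified model
    ablated = f(x | do(C := C_bar)), the metric with C set to its counterfactual
    """
    if clean == 0:
        raise ValueError("f(x) must be nonzero for the ratio to be defined")
    return (clean - ablated) / clean


def necessity_band(score: float) -> str:
    """Map a score to the interpretation table (boundary handling is assumed)."""
    if score > 0.80:
        return "strong"
    if score >= 0.30:
        return "moderate"
    if score >= 0.10:
        return "weak"
    return "not necessary"


def passes_i1(score: float, random_baseline: float) -> bool:
    """Pass condition: score > 0.10 and equal-size random baseline < 0.05."""
    return score > 0.10 and random_baseline < 0.05


# Hypothetical measurements: the metric falls from 4.0 to 1.0 under ablation.
score = necessity(clean=4.0, ablated=1.0)
```

With these hypothetical numbers the score is 0.75, which falls in the moderate band and passes only if the random-component baseline stays below 0.05.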

Ablation method is part of the claim. Necessity scores are a joint property of the component and the ablation type. Miller, Chughtai & Saunders (2024) show that the same circuit’s faithfulness varies from 87% under mean ablation to below 50% under other methods. The full claim must state the ablation method.

Common confounds:

  • Bottleneck confound. A component that many computations route through is necessary for all of them, but implements none in particular.
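  • Off-manifold confound. Zero and mean ablation push activations to values the model never encounters during training, so the measured drop may reflect the distribution shift rather than the loss of the component's computation.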
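The off-manifold confound can be made concrete with a toy example. The sketch below is illustrative only: activations are plain Python lists rather than tensors from a real model, and the helper names `zero_ablate`, `mean_ablate`, and `resample_ablate` are hypothetical. It shows that mean ablation can produce a value that no reference prompt ever produced, while resample ablation only substitutes activations that actually occurred.

```python
import random
from statistics import fmean


def zero_ablate(acts: list[list[float]]) -> list[list[float]]:
    """Replace every activation with zero."""
    return [[0.0] * len(row) for row in acts]


def mean_ablate(acts: list[list[float]], reference: list[list[float]]) -> list[list[float]]:
    """Replace every row with the per-dimension mean over a reference set."""
    dim = len(reference[0])
    mean = [fmean(row[j] for row in reference) for j in range(dim)]
    return [list(mean) for _ in acts]


def resample_ablate(
    acts: list[list[float]], reference: list[list[float]], rng: random.Random
) -> list[list[float]]:
    """Replace every row with the activation of a randomly drawn reference prompt."""
    return [list(rng.choice(reference)) for _ in acts]


# Toy reference activations: this one-dimensional component only ever emits -1 or +1.
reference = [[-1.0], [1.0], [-1.0], [1.0]]
acts = [[1.0], [-1.0]]

mean_out = mean_ablate(acts, reference)  # every row becomes [0.0]
resampled = resample_ablate(acts, reference, random.Random(0))
```

Here the mean-ablated value 0.0 never appears in the reference set, which is the toy analogue of an off-manifold activation; every resampled row is one the component really produced.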

Calibration:

| Circuit | Method | $\text{Necessity}$ | Notes |
| --- | --- | --- | --- |
| IOI name-movers (Wang et al. 2022) | Mean ablation | $\approx 0.87$ | Drops logit diff from 3.56 to 0.46 |
| IOI name-movers | Resample ablation | $< 0.50$ | Method-dependent; same circuit, weaker score |
| Induction heads (Olsson et al. 2022) | Mean ablation | High (qualitative) | Stronger on repeated sequences, weaker on non-repeated |
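As a check on the IOI name-movers entry under mean ablation, substituting the reported logit differences (3.56 before ablation, 0.46 after) into the necessity formula reproduces the tabulated score:

$$\text{Necessity}(C) = \frac{3.56 - 0.46}{3.56} = \frac{3.10}{3.56} \approx 0.87$$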

I2. Sufficiency

Isolating or restoring the component should reproduce the behavior. The recovery fraction is:

$$R = \frac{f_{\text{circuit}}(x)}{f_{\text{full}}(x)}$$

Pass condition: $R \geq 0.70$ on held-out prompts, with the complement ablation method stated.

| $R$ value | Interpretation |
| --- | --- |
| $R > 0.90$ | Strong sufficiency: circuit reproduces nearly all of the behavior in isolation |
| $0.70 \leq R \leq 0.90$ | Moderate: circuit captures most of the behavior |
| $0.50 \leq R < 0.70$ | Weak: circuit contributes substantially but something is missing |
| $R < 0.50$ | Not sufficient: circuit alone does not drive the behavior |
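A minimal sketch of the recovery fraction over a set of held-out prompts. The formula is stated per input; aggregating as a ratio of mean metrics across prompts is an assumption of this sketch, and the function names are illustrative.

```python
from statistics import fmean


def recovery_fraction(circuit_metrics: list[float], full_metrics: list[float]) -> float:
    """R = mean metric with only the circuit active / mean metric of the full model.

    Both lists hold one value per held-out prompt. Ratio-of-means aggregation
    is an assumption; the definition itself is given for a single input x.
    """
    if len(circuit_metrics) != len(full_metrics) or not full_metrics:
        raise ValueError("need one circuit and one full-model value per prompt")
    full_mean = fmean(full_metrics)
    if full_mean == 0:
        raise ValueError("full-model metric must be nonzero on average")
    return fmean(circuit_metrics) / full_mean


def sufficiency_band(r: float) -> str:
    """Map R to the interpretation table."""
    if r > 0.90:
        return "strong"
    if r >= 0.70:
        return "moderate"
    if r >= 0.50:
        return "weak"
    return "not sufficient"


def passes_i2(r: float) -> bool:
    """Pass condition: R >= 0.70 on held-out prompts."""
    return r >= 0.70


# Hypothetical logit differences on two held-out prompts.
r = recovery_fraction(circuit_metrics=[2.0, 4.0], full_metrics=[3.0, 5.0])
```

With these hypothetical values the means are 3.0 and 4.0, so R is 0.75: a moderate score that meets the pass condition.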

The asymmetry with necessity. Necessity requires ablating the circuit. Sufficiency requires ablating everything outside the circuit. Resample ablation of the complement is a stricter test than mean ablation, since mean ablation leaves systematic residual signal.

Two forms of sufficiency:

  • Isolation sufficiency: Run only the circuit; ablate the complement. This is what $R$ measures.
  • Restoration sufficiency: In a corrupted prompt where the behavior fails, restoring only the circuit restores the behavior. This is the activation-patching form and typically yields higher $R$ because the rest of the model remains intact.

Calibration:

| Circuit | Method | $R$ | Notes |
| --- | --- | --- | --- |
| IOI (Wang et al. 2022) | Mean ablation of complement | $\approx 0.87$ | 87% of logit diff recovered |
| Greater-Than (Hanna et al. 2023) | Mean ablation of complement | $\approx 0.895$ | 89.5% of probability diff recovered |

I3. Specificity

The component should be more necessary for the target behavior than for unrelated behaviors.

Pass condition: $\text{Specificity}(C, B, B') > 1.0$ against at least one related off-task behavior $B'$.

$$\text{Specificity}(C, B, B') = \frac{\text{Necessity}(C, B)}{\text{Necessity}(C, B')}$$

| Specificity value | Interpretation |
| --- | --- |
| $> 3.0$ | Strong specificity: component is much more necessary for $B$ than $B'$ |
| $1.5 - 3.0$ | Moderate specificity |
| $1.0 - 1.5$ | Weak specificity |
| $< 1.0$ | Inverted: component is more necessary for the control behavior (red flag) |
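A minimal sketch of the specificity ratio and its bands, assuming both necessity scores were computed with the same ablation method. The names are illustrative. The ratio is undefined when the off-task necessity is zero or negative; raising an error in that case is a choice made for this sketch, as is the handling of values exactly on a band boundary.

```python
def specificity(necessity_target: float, necessity_off_task: float) -> float:
    """Necessity(C, B) / Necessity(C, B') for target B and related off-task B'."""
    if necessity_off_task <= 0:
        # Assumption: treat a non-positive denominator as "ratio undefined".
        raise ValueError("off-task necessity must be positive for a ratio")
    return necessity_target / necessity_off_task


def specificity_band(ratio: float) -> str:
    """Map a ratio to the interpretation table (boundary handling is assumed)."""
    if ratio > 3.0:
        return "strong"
    if ratio >= 1.5:
        return "moderate"
    if ratio >= 1.0:
        return "weak"
    return "inverted"


def passes_i3(ratio: float) -> bool:
    """Pass condition: ratio strictly greater than 1.0."""
    return ratio > 1.0


# Hypothetical scores: 0.60 on the target task, 0.15 on a related off-task.
ratio = specificity(0.60, 0.15)
```

The hypothetical scores give a ratio of about 4, which lands in the strong band.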

Off-task selection matters. The control behavior $B'$ must be related, not trivially distinct.

| Target task | Informative off-task | Trivial off-task |
| --- | --- | --- |
| IOI | Subject-verb agreement | Modular arithmetic |
| Greater-Than | Successor | Translation |
| Gendered pronouns | IOI | Factual recall |

The double dissociation test. The strongest specificity evidence is a double dissociation: ablating circuit $A$ impairs behavior $X$ but not $Y$, and ablating circuit $B$ impairs $Y$ but not $X$.
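The double dissociation test can be sketched as a check over four necessity scores. Using the I1 threshold of 0.10 as the cutoff between "impairs" and "does not impair" is an assumption of this sketch, and the function name is hypothetical.

```python
def double_dissociation(
    a_on_x: float,
    a_on_y: float,
    b_on_x: float,
    b_on_y: float,
    threshold: float = 0.10,  # assumed cutoff, borrowed from the I1 pass condition
) -> bool:
    """True if ablating A impairs X but not Y, and ablating B impairs Y but not X.

    Each argument is a necessity score: a_on_x is Necessity(A, X), and so on.
    """
    a_selective = a_on_x > threshold and a_on_y < threshold
    b_selective = b_on_y > threshold and b_on_x < threshold
    return a_selective and b_selective


# Hypothetical scores showing a clean dissociation, then a failed one.
clean = double_dissociation(a_on_x=0.70, a_on_y=0.02, b_on_x=0.03, b_on_y=0.60)
leaky = double_dissociation(a_on_x=0.70, a_on_y=0.40, b_on_x=0.03, b_on_y=0.60)
```

In the second hypothetical case circuit A also impairs Y, so only a single dissociation holds and the check fails.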

Calibration: No published circuit paper reports a formal specificity ratio against a related task. Induction heads have implicit specificity (stronger on repeated sequences than non-repeated), but this is not quantified as a ratio.

I4. Consistency

The effect should replicate across contexts sufficient to rule out an artifact of the discovery distribution.

Pass condition: Replication across at least two of three axes, with bootstrap confidence intervals on the principal metrics.

| Axis | What it tests | Example |
| --- | --- | --- |
| Cross-prompt | Template or paraphrase robustness | IOI with varied syntactic structures |
| Cross-seed | Independence from random initialization | Same circuit found in independently trained copies |
| Cross-checkpoint | Stability across training | Circuit present at step 50k, 100k, and 200k |
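The pass condition asks for bootstrap confidence intervals on the principal metrics. The sketch below uses a percentile bootstrap over per-prompt scores, which is one common variant; the resample count, the 95% level, and the function name are assumptions, and the input values are made up for illustration.

```python
import random
from statistics import fmean


def bootstrap_ci(
    values: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = n_resamples - 1 - lo_idx
    return means[lo_idx], means[hi_idx]


# Hypothetical per-prompt necessity scores from one replication axis.
scores = [0.82, 0.79, 0.91, 0.85, 0.88, 0.76, 0.84, 0.90]
low, high = bootstrap_ci(scores)
```

A replication on another axis could then be compared by checking whether its interval overlaps this one, though the criterion itself does not prescribe a comparison rule.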

Calibration:

| Circuit | Cross-prompt | Cross-seed | Cross-checkpoint | Assessment |
| --- | --- | --- | --- | --- |
| IOI (Wang et al. 2022) | Partial (name substitutions, ABBA/BABA) | Not tested | Not tested | One axis, partially |
| Induction heads (Olsson et al. 2022) | Yes (any repeated sequence) | Yes (multiple model families) | Yes (training dynamics) | All three axes; unusually strong |
| Greater-Than (Hanna et al. 2023) | Partial (year ranges) | Not tested | Not tested | One axis, partially |

I5. Confound control

The observed effect should not be explained by collateral disruption to non-circuit components.

Pass condition: At least two ablation methods compared, with consistent results.

| Confound | Mechanism | Mitigation |
| --- | --- | --- |
| Off-manifold ablation | Zero and mean ablation push activations to out-of-distribution values | Use resample ablation against a counterfactual distribution |
| Backup suppression | Ablating one component can suppress or activate backup mechanisms | Test individual and joint ablation; report backup activation |
| Layer-norm redistribution | Ablating a component changes layer-norm statistics for all subsequent components | Compare effects with and without freezing layer-norm parameters |

Method comparison protocol: Report the same metric under at least two ablation methods. If the results diverge substantially, the finding is method-conditional — flag it as such.
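The comparison can be sketched as a small report over per-method scores. The protocol does not quantify "diverge substantially", so the 0.20 tolerance below is a placeholder assumption, as are the function name and the example numbers.

```python
def compare_methods(scores: dict[str, float], max_spread: float = 0.20) -> dict:
    """Summarize one metric measured under several ablation methods.

    `max_spread` is an assumed tolerance; the protocol gives no numeric value.
    """
    if len(scores) < 2:
        raise ValueError("the protocol requires at least two ablation methods")
    low, high = min(scores.values()), max(scores.values())
    spread = high - low
    return {
        "range": (low, high),
        "spread": spread,
        "method_conditional": spread > max_spread,
    }


# Hypothetical scores for the same circuit under two ablation methods.
report = compare_methods({"mean": 0.80, "resample": 0.40})
```

With these hypothetical scores the spread exceeds the assumed tolerance, so the finding would be reported with its range and flagged as method-conditional.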

Calibration:

| Circuit | Methods compared | Consistent? | Notes |
| --- | --- | --- | --- |
| IOI (Wang et al. 2022) | Mean ablation only | N/A (single method) | Miller et al. (2024) later showed method-dependence |
| IOI (Miller et al. 2024) | Mean vs. resample vs. others | No; substantial divergence | Faithfulness ranges 87% to below 50% depending on method |

| Evidence pattern | Criteria met | Interpretation | Recommended language |
| --- | --- | --- | --- |
| Necessary but not sufficient | I1 | Distributed or incomplete circuit | “Causally implicated, not localized” |
| Sufficient but not necessary | I2 | Redundancy or forced route | “A capable route, not shown necessary” |
| Necessary + sufficient, not specific | I1, I2 | General-capability component | “Real mechanism, not task-specific” |
| Necessary + sufficient + specific, not consistent | I1, I2, I3 | Benchmark artifact possible | “Locally established, not yet robust” |
| Strong I1 + I2, single ablation method | I1, I2 (conditional) | Method-conditional claim | “Sufficient under [method]; not tested under alternatives” |
| All five met | I1–I5 | Full internal validity | Upgrade to external validity testing |
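The evidence patterns can be encoded as a lookup from the set of criteria met to the recommended language. This is a sketch: the function name and the `single_method` argument are hypothetical, and combinations the table does not list return `None` rather than a guessed label.

```python
from typing import Optional

# Keys are the exact sets of criteria met, taken from the evidence-pattern rows.
RECOMMENDED_LANGUAGE = {
    frozenset({"I1"}): "Causally implicated, not localized",
    frozenset({"I2"}): "A capable route, not shown necessary",
    frozenset({"I1", "I2"}): "Real mechanism, not task-specific",
    frozenset({"I1", "I2", "I3"}): "Locally established, not yet robust",
    frozenset({"I1", "I2", "I3", "I4", "I5"}): "Upgrade to external validity testing",
}


def recommended_language(met: set[str], single_method: Optional[str] = None) -> Optional[str]:
    """Return the recommended claim wording for a set of criteria met.

    If I1 and I2 were established under only one ablation method, pass its
    name as `single_method` to get the method-conditional wording.
    """
    key = frozenset(met)
    if single_method is not None and key == {"I1", "I2"}:
        return f"Sufficient under {single_method}; not tested under alternatives"
    return RECOMMENDED_LANGUAGE.get(key)


wording = recommended_language({"I1", "I2"}, single_method="mean ablation")
```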

For circuit $C$ and behavior $B$:

  1. I1. Ablate $C$; record $\text{Necessity}(C)$ under at least two methods. Compare to an equal-size random baseline.
  2. I2. Ablate the complement; record $R$. Use held-out prompts not used for discovery.
  3. I3. Compute $\text{Necessity}(C, B')$ for one related off-task behavior $B'$. Report the specificity ratio.
  4. I4. Replicate across at least two of: cross-prompt, cross-seed, cross-checkpoint.
  5. I5. Compare results across ablation methods. If inconsistent, report the range and flag as method-conditional.
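The five steps above can be collected into one record that evaluates each pass condition. This is a sketch under stated assumptions: the class and field names are hypothetical, the lowest necessity score across methods is used for I1 and I3 as a conservative choice, and the I5 tolerance of 0.20 is a placeholder because no numeric definition of "inconsistent" is given.

```python
from dataclasses import dataclass


@dataclass
class InternalValidityReport:
    necessity_by_method: dict[str, float]  # I1: score under each ablation method
    random_baseline: float                 # I1: equal-size random-component score
    recovery_fraction: float               # I2: R on held-out prompts
    off_task_necessity: float              # I3: necessity for a related off-task
    axes_replicated: set[str]              # I4: e.g. {"prompt", "seed", "checkpoint"}
    max_method_spread: float = 0.20        # I5: assumed tolerance, not from the spec

    def criteria_met(self) -> dict[str, bool]:
        scores = list(self.necessity_by_method.values())
        if not scores:
            raise ValueError("at least one necessity score is required")
        worst = min(scores)  # conservative: judge by the weakest method
        spread = max(scores) - worst
        two_methods = len(scores) >= 2
        return {
            "I1": two_methods and worst > 0.10 and self.random_baseline < 0.05,
            "I2": self.recovery_fraction >= 0.70,
            "I3": self.off_task_necessity > 0 and worst / self.off_task_necessity > 1.0,
            "I4": len(self.axes_replicated) >= 2,
            "I5": two_methods and spread <= self.max_method_spread,
        }


# Hypothetical results for one circuit.
report = InternalValidityReport(
    necessity_by_method={"mean": 0.80, "resample": 0.40},
    random_baseline=0.02,
    recovery_fraction=0.85,
    off_task_necessity=0.10,
    axes_replicated={"prompt", "seed"},
)
result = report.criteria_met()
```

In this hypothetical case I1 through I4 pass, while I5 fails because the two methods disagree by more than the assumed tolerance, so the claim would be reported as method-conditional.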