Verdict Tier 3: Mechanistically Supported
Section titled “Verdict Tier 3: Mechanistically Supported”| Tier | 3 of 5 (progressive) |
| What it means | Necessity, sufficiency, and specificity all established under at least one method |
| Minimum evidence | I1 (necessity) + I2 (sufficiency) + I3 (specificity) + I4 (consistency) + M1 (reliability 0.7) |
| Upgrade to Triangulated | Multi-method convergence (C5) + external robustness (E5) + cross-procedure agreement (V2) |
| Downgrade to Causally suggestive | If specificity (I3) fails, or if sufficiency (I2) is shown to be method-conditional |
What this tier establishes
Section titled “What this tier establishes”A Mechanistically Supported claim has demonstrated that the mechanism is both necessary and sufficient for the target behavior, and that this effect is specific to the mechanism rather than reflecting general model degradation. The claim has moved from “this is involved” to “this is specifically and sufficiently responsible.”
The key transition from Tier 2 is the conjunction of sufficiency and specificity. Either alone is insufficient: a mechanism can be sufficient but non-specific (a large enough chunk of any model reproduces any behavior), or specific but not sufficient (the component does exactly one thing, but other components also contribute). Mechanistically Supported requires both.
Sufficiency is method-dependent. The complement ablation method (zero, mean, resample) is part of the claim. Miller et al. (2024) demonstrated that IOI’s recovery ratio under mean ablation drops below 0.50 under resample ablation — the same circuit changes tier depending on the method declared.
Example verdict statement
Section titled “Example verdict statement”Verdict: Mechanistically supported —
[implementational-topographic]Claim: Heads L9H9, L9H6, L10H0 are necessary and sufficient for name-mover behavior in IOI. Met: I1 (necessity, logit diff > 0.7 under zero + mean ablation), I2 (sufficiency, 87% recovery), I3 (SI = 14.2 vs. SVA task), I4 (consistent across 3 prompt templates), M1 () Open: E5 (cross-model), C5 (multi-method convergence), V2 (cross-procedure agreement) Scope: GPT-2 Small, IOI task, Wang et al. prompt distribution
Minimum reporting for this tier
Section titled “Minimum reporting for this tier”- Sufficiency metric: recovery ratio with stated threshold
- Specificity metric: selectivity index or cross-task comparison showing selective effect
- Complement ablation method named as part of the claim
- Replication across at least two of: prompt templates, ablation methods, random seeds
- Bootstrap confidence interval on the principal metric
- At least one published reference point for calibration
Upgrade and downgrade
Section titled “Upgrade and downgrade”| Direction | What’s required |
|---|---|
| → Triangulated | At least two methods with non-overlapping assumptions confirm the same mechanism (C5). External robustness across distributions or model sizes (E5). Cross-procedure agreement characterized quantitatively (V2). |
| → Causally suggestive (downgrade) | Specificity (I3) fails: the ablation equally impairs unrelated tasks. Or sufficiency (I2) is method-conditional: recovery drops below threshold under a more appropriate ablation method. |
Characteristic occupants
Section titled “Characteristic occupants”- Induction heads (Olsson et al., 2022) — necessity, sufficiency, and specificity all demonstrated for in-context copying behavior
- Greater-Than circuit (Hanna et al., 2023) — strong structural plausibility with specificity evidence across related numerical tasks
- Copy suppression heads (McDougall et al., 2023) — unusually clean specificity: the heads suppress repeated tokens specifically, with minimal off-target effects
Key references
Section titled “Key references”- Olsson et al. (2022). In-context Learning and Induction Heads. arXiv:2209.11895
- Hanna et al. (2023). How does GPT-2 compute greater-than? arXiv:2305.00586
- McDougall et al. (2023). Copy Suppression. arXiv:2310.04625
- Miller et al. (2024). Faithfulness Metrics for Circuit Discovery. arXiv:2407.08734
- Wang et al. (2022). Interpretability in the Wild. arXiv:2211.00593