# External Validity — Formal Specification

| Field | Value |
|---|---|
| Question | Does the claim generalize beyond the conditions in which it was tested? |
| Lens | Pharmacology |
| Criteria | E1–E6 |
| Dependency | External validity converts an internally valid result into a property of the model rather than a property of the experiment |
| Status in MI | Improving with recent benchmarks; still routinely overstated |
| Last updated | 16 May 2026 |
External validity asks whether the claim made on the conditions tested generalizes to other prompt distributions, other model sizes, other model families, and other intervention strengths. A result that passes all internal-validity tests on one prompt distribution at one intervention strength is a local result. External validity is what makes it a finding.
The Pharmacology lens operationalizes external validity because pharmacology has developed the most rigorous standards for exactly this set of questions: dose-response curves, therapeutic windows, selectivity ratios, and cross-population generalization. The MI analogs map cleanly onto this framework.
## Why external validity is distinct from internal consistency

External validity overlaps with the I4 (Consistency) criterion of internal validity, but they are not the same. Consistency is the internal-validity criterion that asks whether the causal pattern replicates across contexts within the same experimental setup. External validity is broader: it includes the quantitative shape of the effect across intervention strengths (the dose-response curve), the selectivity ratio between on-task and off-task effects, the absolute magnitude of the effect, and the transfer of the effect across model architectures. Consistency answers whether the result replicates; external validity answers whether the mechanism generalizes.
## E1 — Intervention Reach

The intervention must demonstrably modify the proposed target component, separately from whether the behavioral outcome changed.

Pass condition: Direct measurement of the target activation (or a weight-based proxy) confirms the intervention reached the intended component with specificity above baseline collateral disruption.

Formal requirement:

$$\|\Delta a_c\| > \|\Delta a_{c'}\|$$

where $c$ is the target component, $\Delta a$ denotes the intervention-induced activation change, and $c'$ is a same-layer adjacent component not in the circuit.
Calibration: Activation patching has high reach (intervention is localized by construction). Zero and mean ablation have lower reach (distributional shift propagates through layer norm). Weight-based interventions have perfect reach in principle but require verification that the modified weights are the ones driving the behavior.
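The E1 measurement can be sketched as follows; the function name and the choice of an L2 norm over activation differences are illustrative assumptions, not part of the spec:

```python
import numpy as np

def reach_ratio(delta_target: np.ndarray, delta_adjacent: np.ndarray) -> float:
    """Ratio of the intervention-induced activation change at the target
    component to the change at a same-layer adjacent component not in the
    circuit. A ratio well above 1 indicates the intervention reached its
    target with specificity above baseline collateral disruption."""
    target_shift = float(np.linalg.norm(delta_target))
    adjacent_shift = float(np.linalg.norm(delta_adjacent))
    # Guard against division by zero when the adjacent component is untouched
    return target_shift / max(adjacent_shift, 1e-12)
```

Here `delta_target` and `delta_adjacent` would be post-minus-pre-intervention activations for the two components on the same prompts.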
## E2 — Graded Response

The effect must scale monotonically with intervention strength, with a measurable threshold and plateau.

Pass condition: At least 5 intervention strengths tested, spanning subthreshold to saturating. The fitted dose-response curve is monotonic.

Formal requirement: Let $\alpha \in [0, 1]$ be intervention strength ($\alpha = 0$: no intervention; $\alpha = 1$: full ablation), and let $E(\alpha)$ be the effect at strength $\alpha$. The graded response is:

$$E(\alpha_2) \geq E(\alpha_1) \quad \text{for all } \alpha_2 > \alpha_1$$

A graded response requires monotonicity in $\alpha$, with a defined threshold $\alpha_{\min}$ below which the effect is negligible and a plateau $\alpha_{\text{sat}}$ above which additional strength produces no additional effect.
MI-specific threats:
- Single-dose reporting. Most MI papers report one ablation strength (usually $\alpha = 1$, full ablation). A mechanism that only appears at extreme intervention strengths — where the model is generally degraded — is less specific than one that appears at low strengths with a wide therapeutic window.
- Non-monotonic responses. Some circuits are suppressed by over-intervention (backup mechanisms activate). This would appear as a non-monotonic curve and must be reported rather than hidden.
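The graded-response check can be sketched in numpy as follows, assuming effects are normalized so that the full-ablation effect is 1.0; the tolerance defaults and function name are illustrative:

```python
import numpy as np

def graded_response(alphas, effects, negligible=0.05, plateau_tol=0.05):
    """Check a sampled dose-response curve for monotonicity, threshold,
    and plateau. Returns (is_monotonic, threshold, plateau): the first
    strength whose effect exceeds `negligible`, and the first strength
    whose effect is within `plateau_tol` of the maximum observed effect."""
    alphas = np.asarray(alphas, dtype=float)
    effects = np.asarray(effects, dtype=float)
    order = np.argsort(alphas)
    alphas, effects = alphas[order], effects[order]

    # Non-monotonic curves (e.g. backup mechanisms activating under
    # over-intervention) must be reported, not hidden.
    is_monotonic = bool(np.all(np.diff(effects) >= 0))
    above = alphas[effects > negligible]
    threshold = float(above[0]) if above.size else None
    near_max = alphas[effects >= effects.max() - plateau_tol]
    plateau = float(near_max[0]) if near_max.size else None
    return is_monotonic, threshold, plateau
```

With five strengths spanning subthreshold to saturating, the returned threshold and plateau delimit the therapeutic window.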
## E3 — Selectivity

The ratio of on-task to off-task effect at matched intervention strength must be substantial.

Pass condition: Selectivity $S = E_{\text{on}}(\alpha) / E_{\text{off}}(\alpha)$ is substantially greater than 1 at the threshold intervention strength $\alpha_{\min}$.

| Selectivity value | Interpretation |
|---|---|
| $S \gg 1$ | High selectivity — circuit is primarily task-specific |
| $S > 1$ | Moderate selectivity — acceptable for most MI claims |
| $S \approx 1$ | Low selectivity — circuit has substantial off-task effects |
| $S \leq 1$ | Insufficient — effect is not task-selective |
Selectivity-by-default failure: On-target measurements are reported without off-target measurements, and the absence of a reported off-target effect is treated as evidence of selectivity. This is the most common external-validity failure in current MI.
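The selectivity ratio itself is a one-liner, but the guard matters: a sketch (hypothetical signature) that refuses to report a ratio when the off-task effect was never actually measured:

```python
def selectivity(on_task_effect: float, off_task_effect: float) -> float:
    """Selectivity S: ratio of on-task to off-task effect at matched
    intervention strength."""
    if off_task_effect <= 0:
        # An unreported or unmeasured off-task effect is not evidence of
        # selectivity; force the off-target measurement to be explicit.
        raise ValueError("off-task effect must be measured and positive")
    return on_task_effect / off_task_effect
```

Raising on a missing off-task measurement encodes the selectivity-by-default failure mode directly in the API.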
## E4 — Effect Magnitude

The absolute effect must be large enough to support the computational story being told.

Pass condition: A substantial recovery fraction on held-out prompts (from I2); logit difference or behavioral metric at least 1 SD above the population mean effect of random same-size ablations.
Magnitude underspecification failure: A statistically reliable but small effect is reported as evidence of the mechanism without the absolute recovery fraction. Statistical significance does not establish computational relevance.
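The "SD units" comparison is an ordinary z-score against the null distribution of random same-size ablations; a minimal sketch (names assumed):

```python
import numpy as np

def effect_in_sd_units(circuit_effect: float, random_effects) -> float:
    """Express a circuit ablation's effect in standard deviations above the
    mean effect of random same-size ablations (a z-score against the null)."""
    random_effects = np.asarray(random_effects, dtype=float)
    mu = random_effects.mean()
    sd = random_effects.std(ddof=1)  # sample SD over the random-ablation draws
    return float((circuit_effect - mu) / sd)
```

A value of at least 1.0 satisfies the E4 pass condition's SD clause; the absolute recovery fraction must still be reported alongside it.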
## E5 — Robustness

The result must survive paraphrase, alternative templates, and within-family scale transfer.
Pass condition: The effect magnitude degrades by no more than 30% across at least two prompt variants, and is identifiable (possibly at reduced magnitude) in at least one other model in the same family (e.g., GPT-2 Small → GPT-2 Medium).
MI-specific threats:
- Template overfitting. A circuit discovered on “When Mary and John went to the store, John gave a drink to ___” may not transfer to paraphrased versions. The circuit may be a template-specific shortcut rather than a general IOI mechanism.
- Within-family overreach. A mechanism identified in GPT-2 Small is treated as a transformer-wide property without explicit cross-family testing.
Calibration:
| Circuit | Template variants | Cross-scale | Assessment |
|---|---|---|---|
| IOI (Wang et al. 2022) | ABBA/BABA, name substitutions | Not tested | Partial — template robustness only |
| Induction heads (Olsson et al. 2022) | Any repeated sequence | Across GPT families, Pythia | Strong — most externally robust MI result |
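The 30% degradation bound in the E5 pass condition can be checked mechanically; a sketch, assuming all effects are measured on a common scale where larger means a stronger on-task effect:

```python
def is_robust(discovery_effect: float, variant_effects, max_degradation=0.30):
    """E5 check: the effect on each prompt variant degrades by no more than
    `max_degradation` (default 30%) relative to the discovery-template effect."""
    floor = discovery_effect * (1.0 - max_degradation)
    return all(effect >= floor for effect in variant_effects)
```

The same predicate can be reused for within-family scale transfer by passing the other model's effect as a one-element list, though the spec allows reduced magnitude there.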
## E6 — Cross-Architecture Generalization

The mechanism should be identifiable in at least one other model family, against a stated matching criterion.

Pass condition: A quantitative matching criterion is stated before the match is attempted (e.g., Jaccard similarity of the matched component sets, or cosine similarity of weight-space signatures, each with a pre-stated cutoff). The match is reported as a score, not a binary yes/no.
MI-specific gap: There is no agreed criterion by which a circuit match across architectures counts. The project’s recommendation is that the matching criterion be stated before the match is attempted. Cross-model matches reported without a pre-stated criterion are analogous to underpowered clinical trials reporting significance without pre-specified endpoints.
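For component-set circuits, the Jaccard criterion is straightforward to compute and report as a score; a sketch, with (layer, head) pairs as an assumed component representation:

```python
def jaccard(components_a: set, components_b: set) -> float:
    """Jaccard similarity between two circuits' component sets, e.g. sets
    of (layer, head) pairs. Reported as a score in [0, 1], not a binary
    match; the pass cutoff must be stated before the match is attempted."""
    union = components_a | components_b
    if not union:
        return 1.0  # two empty circuits match trivially
    return len(components_a & components_b) / len(union)
```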
Calibration:
| Circuit | Matched in | Matching criterion | Assessment |
|---|---|---|---|
| Induction heads | Multiple GPT families, Pythia, LLaMA | Behavioral (prefix matching score) | Strong — criterion is pre-specified and cross-family |
| IOI | Not yet matched cross-family | None stated | No cross-architecture evidence |
| Weight-space circuit signatures | Gemma, Qwen (initial exploration) | Cosine similarity of weight directions | Preliminary — criterion partially specified |
## Partial-pass interpretation

| Evidence pattern | Criteria met | Interpretation | Recommended language |
|---|---|---|---|
| Robust, no cross-architecture test | E1–E5 | Robust within one family | "Robust in GPT-2; cross-family transfer not tested" |
| Strong graded response, low selectivity | E2, E4 | Effect is real but not task-specific | "Dose-responsive but non-selective" |
| High selectivity, single template | E3 | Template-specific selectivity | "Selective on discovery template; paraphrase robustness needed" |
| Cross-architecture match, no graded response | E6 | Cross-family presence established, mechanism unclear | "Present across architectures; dose-response not characterized" |
## Gaps in current practice

- No standard for the dose-response curve. The project recommends at least five intervention strengths spanning threshold and plateau, with off-task degradation reported on the same axes.
- No standard for cross-architecture matching. The project recommends stating the matching criterion before the match is attempted.
- Underreporting of within-family scale transfer. Cross-scale transfer is a relatively cheap test that is routinely omitted. The project recommends treating within-family scale transfer as the minimum external-validity test.
## Protocol

For circuit $C$ and behavior $B$:

- E1. Verify that the intervention reaches the target (measure the activation change at the target vs. adjacent components).
- E2. Test at least five intervention strengths. Plot the dose-response curve $E(\alpha)$, fit it, and identify the threshold and plateau.
- E3. Compute the selectivity ratio $S$ against one related off-task behavior.
- E4. Report the absolute recovery fraction from I2. Report the effect in SD units relative to random same-size ablations.
- E5. Test at least two prompt variants. Test at least one other model in the same family.
- E6. State the matching criterion in advance. Attempt the match in at least one other architecture. Report the match as a quantitative score.
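The protocol's outcomes can be summarized for partial-pass reporting; in this sketch the E1 and E3 cutoffs are illustrative placeholders (the spec leaves them study-specific), while the 1 SD cutoff for E4 is the one stated above:

```python
def criteria_met(reach_ratio, monotonic, selectivity, effect_sd, robust,
                 cross_arch_score, min_selectivity=3.0, min_sd=1.0):
    """Return the list of external-validity criteria (E1–E6) that passed,
    for use with partial-pass reporting. `min_selectivity` is a placeholder
    cutoff; `min_sd` follows E4's '1 SD above random ablations' clause."""
    met = []
    if reach_ratio > 1.0:            # E1: target moved more than adjacent
        met.append("E1")
    if monotonic:                    # E2: graded, monotone dose-response
        met.append("E2")
    if selectivity >= min_selectivity:  # E3: on/off-task ratio substantial
        met.append("E3")
    if effect_sd >= min_sd:          # E4: at least 1 SD above random nulls
        met.append("E4")
    if robust:                       # E5: survives paraphrase/scale transfer
        met.append("E5")
    if cross_arch_score is not None:  # E6: a quantitative match was reported
        met.append("E6")
    return met
```

Mapping the returned list onto the partial-pass table yields the recommended claim language directly.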