Construct Validity — Formal Specification
| Question | Is the thing being claimed a coherent theoretical entity? |
| Lens | Philosophy of Science |
| Criteria | C1–C5 |
| Dependency | Construct validity is prior to all other validity types — ambiguity here propagates downstream |
| Status in MI | Most neglected type; most circuit papers name the construct without specifying it |
| Last updated | 16 May 2026 |
Construct validity asks whether the entity being claimed exists as a well-defined theoretical object. The Philosophy of Science lens explains the intellectual background and shows the criteria applied to real cases. This page gives the formal definitions, quantitative thresholds, and calibration data.
C1 — Falsifiability
A claim is falsifiable when a disconfirming observation is specified before evidence collection. The specification must name three things: a metric, a threshold, and a dataset held out from circuit discovery.
Pass condition: All three components stated in advance. If retrospective, this is disclosed.
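Formal requirement: There exists a measurement $m$ of circuit $C$ on dataset $D$, with a threshold $\tau$ fixed in advance, such that:

$$m(C, D) < \tau \;\Longrightarrow\; \text{the claim is rejected}$$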
Examples of valid conditions:
- under resample ablation
- on template-varied prompts
Examples of invalid conditions:
- “If the circuit doesn’t work” (no metric, no threshold, no dataset)
- “If faithfulness is low” (no threshold)
- “If the ablation fails on the same prompts used for discovery” (discovery set, not held-out)
Calibration: No published circuit paper we are aware of states a quantitative falsifiability condition in advance. This criterion is aspirational but enforceable going forward.
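The three-part specification can be written down as a small record before any evidence is collected. The sketch below is illustrative only: the class name, field names, the example threshold of 0.5, and the assumption that the claim is rejected when the metric falls below the threshold are all hypothetical choices, not part of this specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FalsificationCondition:
    """Hypothetical record of a C1 disconfirming observation."""
    metric: Optional[str] = None       # e.g. "faithfulness"
    threshold: Optional[float] = None  # claim is rejected below this value (assumed direction)
    dataset: Optional[str] = None      # name of the evaluation set
    held_out: bool = False             # True if disjoint from the discovery prompts


def missing_components(cond: FalsificationCondition) -> List[str]:
    """List which parts of the specification are absent or unusable."""
    missing = []
    if not cond.metric:
        missing.append("metric")
    if cond.threshold is None:
        missing.append("threshold")
    if not cond.dataset:
        missing.append("dataset")
    elif not cond.held_out:
        missing.append("held-out dataset")
    return missing


def is_falsified(cond: FalsificationCondition, observed: float) -> bool:
    """Apply a fully specified condition to an observed metric value."""
    if missing_components(cond):
        raise ValueError("condition is underspecified")
    return observed < cond.threshold


# Illustrative values only; the threshold 0.5 is not taken from this page.
good = FalsificationCondition("faithfulness", 0.5, "template_varied_prompts", held_out=True)
vague = FalsificationCondition(metric="faithfulness")
print(missing_components(good))   # []
print(missing_components(vague))  # ['threshold', 'dataset']
print(is_falsified(good, 0.42))   # True
```

The `vague` record corresponds to the invalid condition "If faithfulness is low": a metric is named, but there is nothing to test it against.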
C2 — Structural Plausibility
A component’s weight-space signature must match its claimed computational role.
Pass condition: For every named component role, the weight-space measurement is consistent with the claim.
Formal requirements by role type:
Copying head (name-mover, induction head): The OV matrix $W_{OV}$ should approximate a copying operation. We measure the copying score (CopyScore) over a relevant token vocabulary $\mathcal{V}$, using the embedding $W_E$ and the unembedding $W_U$. A copying head should have a high CopyScore.
Ordinal head (successor, Greater-Than): The should encode monotonic ordering:
Structural plausibility requires across relevant token pairs.
Inhibition head (S-inhibition): Attention pattern should peak at the position of the repeated subject. Measured as:
Structural plausibility requires on clean IOI prompts (above uniform attention for a 15-token sequence).
Failure threshold: Any mismatch between role label and weight-space signature must be flagged. A “name-mover” with a low copying score fails C2.
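As a rough illustration of a weight-space check for a copying head, the sketch below counts how often a token's own logit is the largest after its embedding passes through the OV path and the unembedding. This is one possible operationalization, assumed for illustration, and may differ from the copying score defined on this page. The function names, matrix shapes, and the use of plain nested lists instead of a tensor library are all assumptions.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def vec_mat(vec: Sequence[float], mat: Matrix) -> List[float]:
    """Row vector times matrix: (d,) x (d, n) -> (n,)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def copy_score(W_E: Matrix, W_OV: Matrix, W_U: Matrix, token_ids: Sequence[int]) -> float:
    """Fraction of tokens whose own logit is top-1 under W_E[t] @ W_OV @ W_U.

    Assumed shapes: W_E (vocab, d_model), W_OV (d_model, d_model), W_U (d_model, vocab).
    """
    hits = 0
    for t in token_ids:
        logits = vec_mat(vec_mat(W_E[t], W_OV), W_U)
        top1 = max(range(len(logits)), key=logits.__getitem__)
        hits += int(top1 == t)
    return hits / len(token_ids)


def scaled_identity(n: int, scale: float = 1.0) -> Matrix:
    """Toy weights: an n x n identity matrix times `scale`."""
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]


# Toy check: an identity OV path copies every token, a negated one copies none.
eye = scaled_identity(4)
print(copy_score(eye, eye, eye, range(4)))                     # 1.0
print(copy_score(eye, scaled_identity(4, -1.0), eye, range(4)))  # 0.0
```

On real weights the same computation would normally be batched with a tensor library; the loop form is kept here only so the sketch has no dependencies.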
C3 — Task Specificity
The circuit should not score highly on unrelated tasks under the same evaluation.
Pass condition: The selectivity ratio is positive on at least one related off-task.
Selectivity ratio: For circuit $C$ discovered on task $T$ and evaluated on related task $T'$, the ratio compares faithfulness on $T$ with faithfulness on $T'$:
| Selectivity ratio | Interpretation |
|---|---|
| High positive | Strong task specificity: circuit is substantially more faithful on its discovery task |
| Low positive | Moderate specificity: circuit has some off-task faithfulness but favors discovery task |
| Near zero | No specificity: circuit is equally faithful on both tasks (bottleneck or general-purpose) |
| Negative | Inverted specificity: circuit is more faithful on the off-task (red flag) |
Off-task selection: The off-task must be related, not trivially distinct.
| Discovery task | Informative off-task | Trivial off-task (too easy) |
|---|---|---|
| IOI | Subject-verb agreement | Modular arithmetic |
| Greater-Than | Successor | Translation |
| Gendered pronouns | IOI | Factual recall |
Calibration: No published circuit paper reports a selectivity ratio. This is the gap C3 is designed to close.
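One way to obtain a ratio with the sign behavior described above (positive when the circuit favors its discovery task, zero when it is equally faithful on both, negative when inverted) is a normalized faithfulness difference. The formula below is an assumption chosen to match that sign behavior, not necessarily the definition used on this page; the function names and the example scores are hypothetical.

```python
from typing import Dict


def selectivity_ratio(f_discovery: float, f_off_task: float) -> float:
    """Assumed form: (F_discovery - F_off_task) / F_discovery."""
    if f_discovery <= 0:
        raise ValueError("discovery-task faithfulness must be positive")
    return (f_discovery - f_off_task) / f_discovery


def passes_c3(f_discovery: float, off_task_scores: Dict[str, float]) -> bool:
    """Pass if the ratio is positive on at least one related off-task."""
    return any(selectivity_ratio(f_discovery, f) > 0 for f in off_task_scores.values())


# Made-up faithfulness scores for illustration.
print(selectivity_ratio(0.8, 0.4))                     # 0.5
print(selectivity_ratio(0.8, 0.8))                     # 0.0
print(passes_c3(0.8, {"subject_verb_agreement": 0.9}))  # False
```

Normalizing by the discovery-task score keeps the ratio comparable across circuits with different baseline faithfulness, at the cost of being undefined when that score is zero.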
C4 — Minimality
Every component must be individually necessary given the others.
Pass condition: For every component $c \in C$, ablating $c$ while leaving all other members intact produces a performance decrease exceeding threshold $\epsilon$.
Formal definition: Circuit $C$ is minimal if and only if:

$$\forall c \in C:\quad F(C) - F(C \setminus \{c\}) > \epsilon$$

where $F$ is the faithfulness score and $\epsilon$ is the minimum meaningful effect. A reasonable default is $\epsilon = 0.02$ (2% faithfulness drop).
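Joint vs individual necessity: Two components $c_i$ and $c_j$ are jointly redundant if removing either one alone has no meaningful effect but removing both does:

$$F(C) - F(C \setminus \{c_i\}) \le \epsilon, \quad F(C) - F(C \setminus \{c_j\}) \le \epsilon, \quad F(C) - F(C \setminus \{c_i, c_j\}) > \epsilon$$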
This pattern indicates backup mechanisms. Wang et al. (2022) found this with IOI backup name-movers. Both components and their relationship should be reported.
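The leave-one-out test and the joint-redundancy pattern can be sketched as follows. The `faithfulness` callable is a stand-in for whatever evaluation produces a faithfulness score for a set of components; it, the component names, and the toy scoring rule are hypothetical. Only the default of 0.02 comes from the text above.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Faithfulness = Callable[[FrozenSet[str]], float]


def faithfulness_drops(circuit: Iterable[str], faithfulness: Faithfulness) -> Dict[str, float]:
    """Drop in faithfulness from ablating each component alone."""
    members = frozenset(circuit)
    full = faithfulness(members)
    return {c: full - faithfulness(members - {c}) for c in members}


def is_minimal(circuit: Iterable[str], faithfulness: Faithfulness, eps: float = 0.02) -> bool:
    """True if every single-component ablation costs more than eps."""
    return all(d > eps for d in faithfulness_drops(circuit, faithfulness).values())


def jointly_redundant_pairs(
    circuit: Iterable[str], faithfulness: Faithfulness, eps: float = 0.02
) -> List[Tuple[str, str]]:
    """Pairs that are individually dispensable but not dispensable together."""
    members = frozenset(circuit)
    full = faithfulness(members)
    drops = faithfulness_drops(members, faithfulness)
    pairs = []
    for a, b in combinations(sorted(members), 2):
        joint_drop = full - faithfulness(members - {a, b})
        if drops[a] <= eps and drops[b] <= eps and joint_drop > eps:
            pairs.append((a, b))
    return pairs


def toy_faithfulness(active: FrozenSet[str]) -> float:
    """Toy rule: 'backup' fully compensates for 'mover'; 'inhibit' has no substitute."""
    moved = "mover" in active or "backup" in active
    return 0.5 * moved + 0.5 * ("inhibit" in active)


circuit = {"mover", "backup", "inhibit"}
print(is_minimal(circuit, toy_faithfulness))               # False
print(jointly_redundant_pairs(circuit, toy_faithfulness))  # [('backup', 'mover')]
```

In the toy case the circuit fails strict minimality, yet the pair check shows why: the two members back each other up, which is the relationship the text asks to be reported rather than pruned silently.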
Calibration:
| Circuit | Components | After pruning | Redundant members found |
|---|---|---|---|
| IOI (Wang et al. 2022) | 26 heads | ~20 core + 6 backup | Yes — backup name-movers |
| Greater-Than (Hanna et al. 2023) | ~12 heads | Not reported | Not tested |
C5 — Convergent Validity
Multiple independent instruments should identify the same components.
Pass condition: between instruments from different evidence families.
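Jaccard similarity: For component sets $A$ and $B$ identified by the two instruments:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$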
| $J$ value | Interpretation |
|---|---|
| High | Strong convergent validity: methods agree on most components |
| Intermediate | Moderate: partial agreement, investigate discrepancies |
| Low | Weak: circuit is method-dependent |
| Near zero | Failed: methods identify different components entirely |
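Jaccard similarity between two component sets is a few lines of code. The sketch below uses the standard definition (intersection size over union size); the head labels are placeholders, not results from any paper.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """|A intersect B| / |A union B| for two sets of component labels."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        raise ValueError("both component sets are empty")
    return len(set_a & set_b) / len(union)


# Placeholder head labels for illustration.
from_patching = {"L9H6", "L9H9"}
from_weights = {"L9H6", "L9H9", "L10H0"}
print(round(jaccard(from_patching, from_weights), 2))  # 0.67
print(jaccard({"L1H1"}, {"L2H2"}))                     # 0.0
```

Because the score is computed over component labels, both instruments need to report at the same granularity (for example, both at the head level) before the sets are compared.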
Independence requirement: The two instruments must come from different evidence families with non-overlapping major assumptions.
| Valid pair | Why independent |
|---|---|
| Activation patching + weight classifier | Causal (interventionist) vs structural (static weights) |
| DAS-IIA + SVD spectral analysis | Representational (learned subspace) vs structural (spectral) |
| EAP + linear probe | Causal (gradient-based) vs representational (supervised) |
| Invalid pair | Why dependent |
|---|---|
| Zero ablation + mean ablation | Both causal, both interventionist, share confound structure |
| Activation patching + path patching | Same framework, one is a refinement of the other |
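MTMM inequality (Campbell & Fiske 1959): For trait $X$ measured by methods $M_1$ and $M_2$, and a different trait $Y$ measured by $M_1$, convergent validity requires:

$$r(X_{M_1}, X_{M_2}) > r(X_{M_1}, Y_{M_1})$$

where $r$ is the agreement between two measurements.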
Two methods should agree more about the same circuit than about different circuits measured by the same method. When this inequality fails, the method is driving the result more than the mechanism.
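A minimal check of this inequality, assuming Jaccard overlap of component sets stands in for the agreement measure (the original MTMM framework uses correlations, so this substitution is an assumption), might look like the following. All component labels are placeholders.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Overlap between two component sets; defined as 0.0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mtmm_holds(x_by_m1: Iterable[str], x_by_m2: Iterable[str], y_by_m1: Iterable[str]) -> bool:
    """Same circuit across methods must agree more than different circuits within one method."""
    same_trait_cross_method = jaccard(x_by_m1, x_by_m2)
    cross_trait_same_method = jaccard(x_by_m1, y_by_m1)
    return same_trait_cross_method > cross_trait_same_method


# Placeholder component sets for two circuits (X, Y) and two methods (M1, M2).
x_m1 = {"h1", "h2", "h3"}
x_m2 = {"h1", "h2", "h4"}
y_m1 = {"h3", "h5"}
print(mtmm_holds(x_m1, x_m2, y_m1))  # True

# If method M1 returns nearly the same heads for both circuits, the check fails.
print(mtmm_holds(x_m1, {"h9"}, {"h1", "h2", "h3", "h5"}))  # False
```

The second call shows the failure mode named above: the method returns similar components regardless of the circuit being studied.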
Calibration:
| Circuit pair | Methods | Jaccard $J$ | Interpretation |
|---|---|---|---|
| IOI: patching vs weight classifier | Causal vs structural | ~0.67 (project estimate) | Strong convergent validity |
| SVA: weight circuit vs EAP circuit | Structural vs causal | ~0.0 (observed in this project) | Failed — underdetermined |
| Induction heads: behavioral vs structural | Behavioral vs structural | High (qualitative) | Cross-model agreement supports convergent validity |
Partial-pass interpretation
| Pattern | Criteria met | Interpretation | Recommended language |
|---|---|---|---|
| Pre-registered, structurally coherent, but single-method | C1, C2 | Well-defined construct, method-dependent identification | “Coherent construct, convergence not yet tested” |
| Convergent, but not task-specific | C1, C5 | Real entity, but may be general-purpose | “Convergent but non-discriminant” |
| Minimal and specific, but no convergence | C3, C4 | Task-specific finding from one method | “Task-specific by one instrument, convergence needed” |
| All met except falsifiability | C2–C5 | Strong post-hoc case, but not pre-registered | “Retrospectively well-supported, not prospectively falsifiable” |
| None met | None | Label without construct backing | “Named but not validated as a construct” |
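The table can be applied mechanically as a lookup from the set of criteria met to the recommended wording. The helper below is a hypothetical convenience, not part of the specification. It covers only the five patterns listed above and returns `None` for any other combination rather than guessing.

```python
from typing import FrozenSet, Iterable, Optional

# Keys are the criteria-met sets from the partial-pass table.
PARTIAL_PASS_LANGUAGE = {
    frozenset({"C1", "C2"}): "Coherent construct, convergence not yet tested",
    frozenset({"C1", "C5"}): "Convergent but non-discriminant",
    frozenset({"C3", "C4"}): "Task-specific by one instrument, convergence needed",
    frozenset({"C2", "C3", "C4", "C5"}): "Retrospectively well-supported, not prospectively falsifiable",
    frozenset(): "Named but not validated as a construct",
}


def recommended_language(criteria_met: Iterable[str]) -> Optional[str]:
    """Return the table's wording for an exact pattern match, else None."""
    key: FrozenSet[str] = frozenset(criteria_met)
    return PARTIAL_PASS_LANGUAGE.get(key)


print(recommended_language({"C1", "C2"}))  # Coherent construct, convergence not yet tested
print(recommended_language([]))            # Named but not validated as a construct
print(recommended_language({"C1"}))        # None
```

Exact matching is deliberate: a pattern the table does not list calls for a judgment by the author, not for the nearest canned phrase.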
Protocol
For a proposed circuit $C$ and behavior $B$:
- C1. State the metric, threshold, and dataset before collecting evidence.
- C2. For every named role, compute the relevant weight-space metric (CopyScore, attention fraction, or effect correlation). Flag mismatches.
- C3. Evaluate the circuit on at least one related task. Compute the selectivity ratio.
- C4. Per-component leave-one-out ablation. Report the faithfulness drop for each component.
- C5. Identify one method from a different evidence family. Compute the Jaccard similarity between the two component sets.