Construct Validity — Formal Specification

Question: Is the thing being claimed a coherent theoretical entity?
Lens: Philosophy of Science
Criteria: C1–C5
Dependency: Construct validity is prior to all other validity types; ambiguity here propagates downstream
Status in MI: Most neglected type; most circuit papers name the construct without specifying it
Last updated: 16 May 2026

Construct validity asks whether the entity being claimed exists as a well-defined theoretical object. The Philosophy of Science lens explains the intellectual background and shows the criteria applied to real cases. This page gives the formal definitions, quantitative thresholds, and calibration data.

Full criterion page →

A claim is falsifiable when a disconfirming observation is specified before evidence collection. The specification must name three things:

$$\text{Falsifiability condition} = (\text{metric } m, \; \text{threshold } \tau, \; \text{dataset } D)$$

Pass condition: All three components are stated in advance. If the condition was specified retrospectively, this is disclosed.

Formal requirement: There exists a measurement $m(C, D)$ of circuit $C$ on dataset $D$ such that:

$$m(C, D) < \tau \implies \text{claim is disconfirmed}$$

Examples of valid conditions:

  • $\text{IIA}(C, D_{\text{held-out}}) < 0.10$
  • $\text{Faithfulness}(C, D_{\text{paraphrase}}) < 0.50$ under resample ablation
  • $\text{Logit diff recovery} < 0.30$ on template-varied prompts

Examples of invalid conditions:

  • “If the circuit doesn’t work” (no metric, no threshold, no dataset)
  • “If faithfulness is low” (no threshold)
  • “If the ablation fails on the same prompts used for discovery” (discovery set, not held-out)

Calibration: No published circuit paper we are aware of states a quantitative falsifiability condition in advance. This criterion is aspirational but enforceable going forward.
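The triple $(m, \tau, D)$ can be recorded as a small data structure before any evidence is collected. The sketch below is only an illustration: the class name, field names, and the `preregistered` flag are hypothetical and not part of any published tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FalsifiabilityCondition:
    """Hypothetical record of the (metric, threshold, dataset) triple."""
    metric: str                  # name of the measurement m, e.g. "IIA"
    threshold: float             # tau
    dataset: str                 # name of D, e.g. "held-out"
    preregistered: bool = True   # False means retrospective and must be disclosed

    def is_disconfirmed(self, measured: float) -> bool:
        # The claim is disconfirmed when m(C, D) < tau.
        return measured < self.threshold


cond = FalsifiabilityCondition(metric="IIA", threshold=0.10, dataset="held-out")
print(cond.is_disconfirmed(0.07))  # below the threshold: disconfirmed
print(cond.is_disconfirmed(0.42))  # above the threshold: not disconfirmed
```

Freezing the dataclass is a design choice that makes it harder to adjust the threshold silently after results are in.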

Full criterion page →

A component’s weight-space signature must match its claimed computational role.

Pass condition: For every named component role, the weight-space measurement is consistent with the claim.

Formal requirements by role type:

Copying head (name-mover, induction head): The $W_{OV}$ matrix should approximate a copying operation. We measure the copying score:

$$\text{CopyScore}(h) = \frac{1}{|V|} \sum_{t \in V} \frac{(W_U \, W_{OV}^{(h)} \, W_E)_{t,t}}{\max_j (W_U \, W_{OV}^{(h)} \, W_E)_{t,j}}$$

where $V$ is a relevant token vocabulary, $W_E$ is the embedding, and $W_U$ is the unembedding. A copying head should have $\text{CopyScore} > 0.5$.
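A minimal NumPy sketch of the copying score follows. The matrix shapes are an assumption (the text does not fix a row or column convention), and the toy matrices are made up purely to exercise the formula.

```python
import numpy as np


def copy_score(W_U: np.ndarray, W_OV: np.ndarray, W_E: np.ndarray) -> float:
    """Mean over tokens t of M[t, t] / max_j M[t, j], with M = W_U @ W_OV @ W_E.

    Assumed shapes (not fixed by the text): W_E is (d_model, |V|),
    W_OV is (d_model, d_model), W_U is (|V|, d_model), all restricted to V.
    """
    M = W_U @ W_OV @ W_E
    row_max = M.max(axis=1)
    if np.any(row_max <= 0):
        # The ratio is not meaningful when a row has no positive entry.
        raise ValueError("every row of W_U W_OV W_E needs a positive maximum")
    return float(np.mean(np.diag(M) / row_max))


eye = np.eye(4)
shift = np.roll(eye, 1, axis=0)      # permutation with an all-zero diagonal
print(copy_score(eye, eye, eye))     # perfect copying on the toy example
print(copy_score(eye, shift, eye))   # maps every token to a different token
```

On the toy inputs the identity map scores 1.0 and the cyclic shift scores 0.0, which brackets the 0.5 threshold from both sides.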

Ordinal head (successor, Greater-Than): The $W_{OV}$ should encode monotonic ordering:

$$\text{effect}(y_1, y_2) = e_{y_2}^\top \, W_U \, W_{OV}^{(h)} \, W_E \, e_{y_1}$$

Structural plausibility requires $\text{Corr}(\text{effect}(y_1, y_2), \; \text{sign}(y_2 - y_1)) > 0.7$ across relevant token pairs.
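The correlation check can be sketched as below, assuming the product $M = W_U W_{OV}^{(h)} W_E$ has already been computed and restricted to the ordinal tokens, so that $\text{effect}(y_1, y_2) = M[y_2, y_1]$. The function name and the toy matrix are illustrative only.

```python
import numpy as np


def ordinal_correlation(M: np.ndarray, values) -> float:
    """Pearson correlation between effect(y1, y2) = M[y2, y1] and sign(y2 - y1).

    values[k] is the numeric value of the token at index k (hypothetical input).
    """
    effects, signs = [], []
    n = len(values)
    for i in range(n):          # index of the input token y1
        for j in range(n):      # index of the output token y2
            if values[i] == values[j]:
                continue        # sign is zero, pair carries no ordering signal
            effects.append(M[j, i])
            signs.append(np.sign(values[j] - values[i]))
    return float(np.corrcoef(effects, signs)[0, 1])


values = [0, 1, 2, 3, 4]
ones = np.ones((5, 5))
# Toy head that boosts every larger token and suppresses every smaller one.
M_ideal = np.tril(ones, -1) - np.triu(ones, 1)
print(ordinal_correlation(M_ideal, values))
```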

Inhibition head (S-inhibition): Attention pattern should peak at the position of the repeated subject. Measured as:

$$\text{AttnFrac}(h, \text{pos}_S) = \frac{A^{(h)}_{\text{final}, \text{pos}_S}}{\sum_j A^{(h)}_{\text{final}, j}}$$

Structural plausibility requires $\text{AttnFrac} > 0.3$ on clean IOI prompts (above uniform attention $\approx 0.07$ for a 15-token sequence).
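As a rough illustration, the attention fraction for one head on one prompt can be computed from its attention pattern. The sketch assumes `A` is a `(seq, seq)` array whose last row is the final query position; the example patterns are synthetic, and in practice the value would be averaged over many clean prompts.

```python
import numpy as np


def attn_frac(A: np.ndarray, subject_pos: int) -> float:
    """Fraction of the final position's attention that lands on subject_pos."""
    final_row = A[-1]
    return float(final_row[subject_pos] / final_row.sum())


seq_len = 15
uniform = np.full((seq_len, seq_len), 1.0 / seq_len)

peaked = uniform.copy()
peaked[-1] = 0.6 / (seq_len - 1)   # spread 0.6 over the other 14 positions
peaked[-1, 4] = 0.4                # hypothetical repeated-subject position

print(round(attn_frac(uniform, 4), 3))  # uniform baseline, 1/15
print(round(attn_frac(peaked, 4), 3))
```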

Failure threshold: Any mismatch between role label and weight-space signature must be flagged. A “name-mover” with $\text{CopyScore} < 0.3$ fails C2.

Full criterion page →

The circuit should not score highly on unrelated tasks under the same evaluation.

Pass condition: The selectivity ratio is positive on at least one related off-task.

Selectivity ratio: For circuit $C$ discovered on task $T_{\text{disc}}$ and evaluated on related task $T_{\text{off}}$:

$$S(C) = \frac{F(C, T_{\text{disc}}) - F(C, T_{\text{off}})}{F(C, T_{\text{disc}})}$$

| $S$ value | Interpretation |
| --- | --- |
| $S > 0.5$ | Strong task specificity: circuit is substantially more faithful on its discovery task |
| $0 < S \leq 0.5$ | Moderate specificity: circuit has some off-task faithfulness but favors discovery task |
| $S \approx 0$ | No specificity: circuit is equally faithful on both tasks (bottleneck or general-purpose) |
| $S < 0$ | Inverted specificity: circuit is more faithful on the off-task (red flag) |
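The selectivity ratio and its interpretation bands can be sketched in a few lines. The tolerance `tol` used to decide "$S \approx 0$" is an assumption introduced here; the text does not specify how close to zero counts as no specificity.

```python
def selectivity_ratio(f_disc: float, f_off: float) -> float:
    """S(C) = (F(C, T_disc) - F(C, T_off)) / F(C, T_disc)."""
    if f_disc <= 0:
        raise ValueError("faithfulness on the discovery task must be positive")
    return (f_disc - f_off) / f_disc


def interpret_selectivity(s: float, tol: float = 0.05) -> str:
    """Map S to the bands in the table; tol for 'S ~ 0' is a hypothetical choice."""
    if s > 0.5:
        return "strong"
    if s > tol:
        return "moderate"
    if s >= -tol:
        return "none"
    return "inverted"


for f_off in (0.2, 0.6, 0.8, 0.9):
    s = selectivity_ratio(0.8, f_off)
    print(f_off, round(s, 3), interpret_selectivity(s))
```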

Off-task selection: The off-task must be related, not trivially distinct.

| Discovery task | Informative off-task | Trivial off-task (too easy) |
| --- | --- | --- |
| IOI | Subject-verb agreement | Modular arithmetic |
| Greater-Than | Successor | Translation |
| Gendered pronouns | IOI | Factual recall |

Calibration: No published circuit paper reports a selectivity ratio. This is the gap C3 is designed to close.

Full criterion page →

Every component must be individually necessary given the others.

Pass condition: For every $c_i \in C$, ablating $c_i$ while leaving all other members intact produces a performance decrease exceeding threshold $\delta$.

Formal definition: Circuit $C = \{c_1, \ldots, c_n\}$ is minimal if and only if:

$$\forall \, c_i \in C: \quad F(C) - F(C \setminus \{c_i\}) > \delta$$

where $F$ is the faithfulness score and $\delta$ is the minimum meaningful effect. A reasonable default is $\delta = 0.02$ (2% faithfulness drop).
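A leave-one-out check of this definition might look like the sketch below. The `faithfulness` callable stands in for whatever evaluation produces $F$; here it is a toy additive function with made-up contributions, used only to show the bookkeeping.

```python
def leave_one_out_drops(circuit, faithfulness):
    """Return F(C) - F(C without c_i) for each component c_i."""
    full_set = frozenset(circuit)
    full = faithfulness(full_set)
    return {c: full - faithfulness(full_set - {c}) for c in full_set}


def failing_components(drops, delta=0.02):
    """Components whose removal does not reduce faithfulness by more than delta."""
    return sorted(c for c, d in drops.items() if not d > delta)


# Toy additive faithfulness with made-up per-component contributions.
contrib = {"h1": 0.5, "h2": 0.3, "h3": 0.01}


def toy_faithfulness(components):
    return sum(contrib[c] for c in components)


drops = leave_one_out_drops(contrib.keys(), toy_faithfulness)
print(failing_components(drops))  # h3 contributes less than delta
```

The circuit is minimal under the definition exactly when `failing_components` returns an empty list.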

Joint vs individual necessity: Two components $c_i, c_j$ are jointly redundant if:

$$F(C \setminus \{c_i\}) \approx F(C) \quad \text{and} \quad F(C \setminus \{c_j\}) \approx F(C) \quad \text{but} \quad F(C \setminus \{c_i, c_j\}) \ll F(C)$$

This pattern indicates backup mechanisms. Wang et al. (2022) found this with IOI backup name-movers. Both components and their relationship should be reported.
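Detecting this pattern requires pairwise ablations in addition to leave-one-out. The sketch below treats "$\approx$" as a drop of at most `delta` and "$\ll$" as a drop of at least `big_drop`; both cut-offs, the component names, and the toy faithfulness function are assumptions made for illustration.

```python
from itertools import combinations


def jointly_redundant_pairs(circuit, faithfulness, delta=0.02, big_drop=0.2):
    """Pairs that are individually dispensable but jointly necessary."""
    full_set = frozenset(circuit)
    full = faithfulness(full_set)
    pairs = []
    for ci, cj in combinations(sorted(full_set), 2):
        drop_i = full - faithfulness(full_set - {ci})
        drop_j = full - faithfulness(full_set - {cj})
        drop_both = full - faithfulness(full_set - {ci, cj})
        if drop_i <= delta and drop_j <= delta and drop_both >= big_drop:
            pairs.append((ci, cj))
    return pairs


def toy_backup_faithfulness(components):
    # Either name-mover alone suffices; the inhibition head adds independently.
    mover = 0.6 if ("nm1" in components or "backup_nm" in components) else 0.0
    inhib = 0.3 if "s_inhib" in components else 0.0
    return mover + inhib


heads = ["nm1", "backup_nm", "s_inhib"]
print(jointly_redundant_pairs(heads, toy_backup_faithfulness))
```

Note that both members of a flagged pair would fail the leave-one-out test on their own, which is why the pair and its relationship need to be reported together.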

Calibration:

| Circuit | Components | After pruning | Redundant members found |
| --- | --- | --- | --- |
| IOI (Wang et al. 2022) | 26 heads | ~20 core + 6 backup | Yes: backup name-movers |
| Greater-Than (Hanna et al. 2023) | ~12 heads | Not reported | Not tested |

Full criterion page →

Multiple independent instruments should identify the same components.

Pass condition: $J(C_A, C_B) \geq 0.5$ between instruments from different evidence families.

Jaccard similarity:

$$J(C_A, C_B) = \frac{|C_A \cap C_B|}{|C_A \cup C_B|}$$

| $J$ value | Interpretation |
| --- | --- |
| $J > 0.6$ | Strong convergent validity: methods agree on most components |
| $0.3 \leq J \leq 0.6$ | Moderate: partial agreement, investigate discrepancies |
| $J < 0.3$ | Weak: circuit is method-dependent |
| $J \approx 0$ | Failed: methods identify different components entirely |
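A small sketch of the Jaccard computation on component sets is given below. Components are represented as `(layer, head)` tuples, which is an assumed encoding, and the two example circuits are invented rather than taken from any paper.

```python
def jaccard(circuit_a, circuit_b) -> float:
    """J(C_A, C_B) = |C_A & C_B| / |C_A | C_B|."""
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    if not union:
        raise ValueError("both circuits are empty")
    return len(a & b) / len(union)


def passes_c5(j: float) -> bool:
    # Pass condition from the text: J >= 0.5.
    return j >= 0.5


# Made-up (layer, head) components found by two hypothetical instruments.
circuit_a = {(0, 1), (1, 4), (2, 3), (3, 0)}
circuit_b = {(0, 1), (1, 4), (2, 3), (3, 5)}

j = jaccard(circuit_a, circuit_b)
print(j, passes_c5(j))  # three shared components out of five in the union
```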

Independence requirement: The two instruments must come from different evidence families with non-overlapping major assumptions.

| Valid pair | Why independent |
| --- | --- |
| Activation patching + weight classifier | Causal (interventionist) vs structural (static weights) |
| DAS-IIA + SVD spectral analysis | Representational (learned subspace) vs structural (spectral) |
| EAP + linear probe | Causal (gradient-based) vs representational (supervised) |

| Invalid pair | Why dependent |
| --- | --- |
| Zero ablation + mean ablation | Both causal, both interventionist, share confound structure |
| Activation patching + path patching | Same framework, one is a refinement of the other |

MTMM inequality (Campbell & Fiske 1959): For trait $i$ measured by methods $a$ and $b$, convergent validity requires:

$$r_{ia, ib} > r_{ia, jb} \quad \text{for all } j \neq i$$

That is, two methods should agree more about the same circuit than they do about different circuits. When this inequality fails, the agreement between methods is not specific to the claimed mechanism.
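The inequality can be checked row by row on a heteromethod block of agreement scores. In the sketch below, `R[i][j]` is assumed to hold $r_{ia, jb}$, the agreement between circuit $i$ as measured by method $a$ and circuit $j$ as measured by method $b$; how that agreement is scored is left open here, and the numbers are invented.

```python
def mtmm_convergent(R):
    """For each trait i, test r[ia, ib] > r[ia, jb] for all j != i."""
    n = len(R)
    return [
        all(R[i][i] > R[i][j] for j in range(n) if j != i)
        for i in range(n)
    ]


# Invented heteromethod block for three circuits (rows: method a, columns: method b).
R = [
    [0.8, 0.2, 0.1],
    [0.3, 0.7, 0.4],
    [0.5, 0.6, 0.4],
]
print(mtmm_convergent(R))  # the third circuit fails the inequality
```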

Calibration:

| Circuit pair | Methods | $J$ | Interpretation |
| --- | --- | --- | --- |
| IOI: patching vs weight classifier | Causal vs structural | ~0.67 (project estimate) | Strong convergent validity |
| SVA: weight circuit vs EAP circuit | Structural vs causal | ~0.0 (observed in this project) | Failed: underdetermined |
| Induction heads: behavioral vs structural | Behavioral vs structural | High (qualitative) | Cross-model agreement supports convergent validity |

| Pattern | Criteria met | Interpretation | Recommended language |
| --- | --- | --- | --- |
| Pre-registered, structurally coherent, but single-method | C1, C2 | Well-defined construct, method-dependent identification | “Coherent construct, convergence not yet tested” |
| Convergent, but not task-specific | C1, C5 | Real entity, but may be general-purpose | “Convergent but non-discriminant” |
| Minimal and specific, but no convergence | C3, C4 | Task-specific finding from one method | “Task-specific by one instrument, convergence needed” |
| All met except falsifiability | C2–C5 | Strong post-hoc case, but not pre-registered | “Retrospectively well-supported, not prospectively falsifiable” |
| None met | None | Label without construct backing | “Named but not validated as a construct” |

For a proposed circuit $C$ and behavior $B$:

  1. C1. State $(m, \tau, D)$ before collecting evidence.
  2. C2. For every named role, compute the relevant weight-space metric (CopyScore, attention fraction, or effect correlation). Flag mismatches.
  3. C3. Evaluate $F(C, T_{\text{off}})$ on at least one related task. Compute $S(C)$.
  4. C4. Per-component leave-one-out ablation. Report $F(C) - F(C \setminus \{c_i\})$ for each $c_i$.
  5. C5. Identify one method from a different evidence family. Compute $J(C_{\text{method 1}}, C_{\text{method 2}})$.