Measurement Validity — Formal Specification
| Question | Are the instruments that produced the evidence reliable, calibrated, and selective? |
|---|---|
| Lens | Measurement Theory |
| Criteria | M1–M6 |
| Source theory | Classical test theory (Lord & Novick 1968); multitrait-multimethod (Campbell & Fiske 1959); signal detection theory (Green & Swets 1966) |
| Dependency | Measurement validity evaluates the instrument, not the claim; it is prior to interpreting what the instrument measures |
| Status in MI | Most easily corrected failures; baselines often omitted |
| Last updated | 16 May 2026 |
Measurement validity asks whether the instruments used to produce evidence for a circuit claim are themselves valid measurements. It is the type whose failures are most correctable — typically because the remedy is a baseline that was always feasible to compute but was not reported.
Why measurement validity is independent of the claim’s validity
The other validity types evaluate the claim being made. Measurement validity evaluates the instruments that produced the evidence. The distinction matters because failures in the two have different remedies. A failure of construct validity is remedied by clarifying the construct. A failure of internal validity is remedied by adding interventions. A failure of measurement validity is remedied by validating the instrument — typically by adding controls or baselines — without changing the claim itself.
The pharmacology analogy is the distinction between assay validation and drug efficacy. A pharmacologist validates the assay (instrument) before drawing conclusions about the drug (claim). An assay whose precision and selectivity have not been characterized cannot support a drug-efficacy conclusion regardless of how strong the measured effect appears.
M1 — Reliability
Scores from the instrument should be stable across prompt subsamples, random seeds, and checkpoints.
Formal requirement (Lord & Novick 1968): an observed score decomposes as $X = T + E$, where $T$ is the true score and $E$ is measurement error. Reliability is:

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$$

An instrument with $\rho_{XX'} < 0.5$ contributes more noise than signal ($\sigma_E^2 > \sigma_T^2$); any validity claim built on it is attenuated.
Pass condition: test-retest agreement across two independent prompt subsamples of equal size, or a bootstrap 95% CI on the principal metric that spans only a narrow interval around the point estimate.
MI-specific threat: Single-split reliability. Scores reported from a single prompt split without bootstrap or test-retest evaluation hide instability. A faithfulness score of 0.87 computed on 50 prompts may have a bootstrap 95% CI wide enough that the score is a property of the specific sample rather than the circuit.
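A minimal sketch of the M1 checks, assuming a hypothetical scoring function `score_circuit(prompts)` that returns the principal metric (e.g., a faithfulness score) for a list of prompts:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(prompts, score_circuit, n_boot=1000, alpha=0.05):
    """Bootstrap 95% CI on the principal metric over prompt resamples."""
    pool = np.asarray(prompts, dtype=object)
    scores = [score_circuit(list(rng.choice(pool, size=len(pool), replace=True)))
              for _ in range(n_boot)]
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def test_retest(prompts, score_circuit):
    """Score two disjoint, equal-size prompt subsamples; a reliable
    instrument gives nearly identical scores on both halves."""
    order = rng.permutation(len(prompts))
    half = len(prompts) // 2
    first = [prompts[i] for i in order[:half]]
    second = [prompts[i] for i in order[half:2 * half]]
    return score_circuit(first), score_circuit(second)
```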
M2 — Invariance
The instrument should produce comparable scores across model sizes within a family, or the score difference should be attributable to a measurable difference in the model rather than to a property of the instrument.
Pass condition: Instrument scores on matched circuits from the same model family differ by no more than the empirical standard deviation across seeds. If scores differ substantially across sizes, the instrument must be recalibrated per model size.
MI-specific threat: Scale contamination. An instrument calibrated on GPT-2 Small may systematically over- or under-estimate faithfulness in GPT-2 Medium because absolute logit differences scale with model size. Reports should always state the model on which the instrument is calibrated.
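A sketch of the M2 comparison; the model names and numbers below are illustrative, and `scores_by_model` is assumed to hold the instrument's scores on matched circuits across several seeds:

```python
import numpy as np

def invariance_check(scores_by_model: dict[str, list[float]]) -> bool:
    """Pass if cross-size score differences stay within the largest
    seed-level standard deviation; otherwise recalibrate per model size."""
    means = {m: np.mean(s) for m, s in scores_by_model.items()}
    seed_std = max(np.std(s, ddof=1) for s in scores_by_model.values())
    spread = max(means.values()) - min(means.values())
    return spread <= seed_std

scores = {
    "gpt2-small": [0.84, 0.86, 0.85],
    "gpt2-medium": [0.79, 0.81, 0.80],
}
print(invariance_check(scores))  # False: recalibrate the instrument per size
```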
M3 — Baseline Separation
The instrument’s output on a real construct must be substantially above its output on a random or untrained control.
This is the measurement criterion MI most consistently fails. Three baselines are required:
Random-vector baseline: Replace the circuit’s component activations with random unit vectors of the same dimensionality. The IIA or faithfulness score under this substitution is the floor.
Pass condition: the real circuit’s score exceeds the random-vector floor by a substantial margin. Without this gap, the score is dominated by the alignment map’s capacity rather than the circuit’s structure.
Untrained-model baseline: Compute the same metric on an identically architected but randomly initialized (untrained) model. Sutter et al. (2025) proved that unconstrained nonlinear IIA achieves near-perfect scores on randomly initialized models. Without this baseline, a high IIA score says nothing about the trained model’s representations — it measures the map’s degrees of freedom.
Published reference baseline: Report the instrument’s output on published reference circuits (IOI, Greater-Than, induction heads) to establish where the claimed result sits on a common scale.
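A sketch of the two required baselines, assuming hypothetical callables `faithfulness(model, circuit)` and `faithfulness_random_vectors(model, circuit)` (the latter rescoring with the circuit’s component activations swapped for random unit vectors); only the unit-vector construction is fully concrete:

```python
import torch

def random_unit_vectors_like(acts: torch.Tensor) -> torch.Tensor:
    """Random unit vectors matching the component activations' shape."""
    v = torch.randn_like(acts)
    return v / v.norm(dim=-1, keepdim=True)

def baseline_gaps(model, untrained_model, circuit,
                  faithfulness, faithfulness_random_vectors):
    """Report the circuit's score against both required M3 floors."""
    s_real = faithfulness(model, circuit)
    s_floor = faithfulness_random_vectors(model, circuit)  # random-vector floor
    s_arch = faithfulness(untrained_model, circuit)        # untrained-model floor
    return {
        "score": s_real,
        "gap_vs_random_vectors": s_real - s_floor,
        "gap_vs_untrained_model": s_real - s_arch,
    }
```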
The canonical example:
An IIA of 0.48 at the layer 8 MLP falls squarely within the published SAE/transcoder baseline range of 0.40–0.60 for GPT-2 Small SVA (Lazo et al. 2025). Without the random-vector baseline and the published reference range, 0.48 is an uninterpretable number. With both baselines, it is a competitive and publishable result.
Three required baselines for IIA:
| Baseline | What it controls | Required? |
|---|---|---|
| Random-vector baseline | Alignment map capacity | Yes |
| Untrained-model baseline | Architecture-induced correlation | Yes |
| Published reference range | Calibration to literature | Strongly recommended |
M4 — Sensitivity
The instrument should detect real circuits at acceptable hit rates without excessive false positives.
Pass condition: AUROC and AUPRC both clear their chance levels by a wide margin (AUROC well above 0.5; AUPRC well above the positive-class prevalence) when the instrument is applied to a held-out set containing known circuits and known non-circuits.
AUROC vs AUPRC for small circuits: For circuits with few components in models with many components, AUROC can be high while precision is low. The precision-recall distinction is rarely reported but is essential for sparse discovery tasks.
Formal requirement using signal detection theory (Green & Swets 1966):

$$d' = \frac{\mu_{\text{circuit}} - \mu_{\text{non-circuit}}}{\sigma_{\text{pooled}}}$$

where the means are the instrument’s average scores on known circuits and known non-circuits. A discriminating instrument has $d' \geq 1$ (roughly equivalent to AUROC $\approx 0.76$).
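A sketch of the M4 computation with standard SciPy/scikit-learn calls; the score arrays are illustrative, not from the text. Under equal-variance Gaussian assumptions AUROC $= \Phi(d'/\sqrt{2})$, which is where the $d' = 1$ to AUROC $\approx 0.76$ correspondence comes from:

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score, average_precision_score

pos = np.array([0.81, 0.74, 0.88, 0.69])  # instrument scores on known circuits
neg = np.array([0.32, 0.45, 0.51, 0.28])  # scores on known non-circuits

# d' = (mu_pos - mu_neg) / pooled sigma
pooled = np.sqrt((pos.var(ddof=1) + neg.var(ddof=1)) / 2)
d_prime = (pos.mean() - neg.mean()) / pooled

auroc_gaussian = norm.cdf(d_prime / np.sqrt(2))  # AUROC implied by d'

labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
scores = np.concatenate([pos, neg])
print(f"d'={d_prime:.2f}  AUROC(d')={auroc_gaussian:.2f}  "
      f"AUROC={roc_auc_score(labels, scores):.2f}  "
      f"AUPRC={average_precision_score(labels, scores):.2f}")
```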
M5 — Calibration
The instrument’s score should be interpretable against published reference values for the same metric on matched tasks.
Pass condition: At least one published reference value for the same metric on a comparable task is cited, and the claimed result is positioned relative to it.
Reference calibration table for GPT-2 Small:
| Task | Metric | Full model | Best circuit | Source |
|---|---|---|---|---|
| IOI | Logit diff | 3.56 | 3.10 (87% recovery) | Wang et al. 2022 |
| Greater-Than | Prob diff | 81.7% | 72.7% (89.5% recovery) | Hanna et al. 2023 |
| SVA | Logit diff | 0.70 | 0.65 (93% recovery) | Lazo et al. 2025 |
| SVA (BLiMP behavioral) | Accuracy | 95–97% | — | Warstadt et al. 2020 |
| Gendered pronoun (BLiMP) | Accuracy | ~99% | full model | Mathwin 2023 |
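The recovery percentages in the table are the circuit’s score as a fraction of the full model’s score; a one-line check (the function is ours, purely illustrative):

```python
def recovery(circuit_score: float, full_model_score: float) -> float:
    """Fraction of full-model performance the circuit recovers."""
    return circuit_score / full_model_score

print(f"{recovery(3.10, 3.56):.0%}")  # IOI row: ~87% recovery
```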
MIB Causal Variable IIA reference:
| Method | IIA range | Notes |
|---|---|---|
| DAS (best) | 86–95% | Dominates on MIB |
| SAE features | < DAS | MIB’s headline finding: SAE features worse than raw neurons |
| Random baseline | ~10–15% | Instrument floor |
M6 — Construct Coverage
The instrument should measure what it claims to measure rather than a property of the instrument’s own degrees of freedom.
Pass condition: The instrument produces different scores on constructs that are theoretically distinguishable. If two circuits that should be distinct produce indistinguishable instrument scores, the instrument lacks construct coverage.
Discovery-evaluation overlap failure: Using the same prompts for circuit discovery and evaluation inflates apparent reliability because the circuit was optimized on those prompts. Discovery and evaluation sets must be disjoint.
Alignment map capacity failure: If IIA is high only with unconstrained nonlinear maps (MLP, large linear subspace), the finding is about map flexibility, not the circuit’s geometry. The alignment architecture must be specified and the score must be reported as a function of alignment capacity — showing how IIA degrades as alignment is constrained toward a single linear direction.
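A sketch of the capacity sweep, assuming a hypothetical `train_alignment_and_score(model, circuit, rank)` that fits an alignment map constrained to a linear subspace of the given rank and returns the resulting IIA:

```python
def iia_vs_capacity(model, circuit, train_alignment_and_score,
                    ranks=(1, 2, 4, 8, 16)):
    """Report IIA as a function of alignment-map capacity (subspace rank).

    If IIA is high only at large ranks and collapses as the map is
    constrained toward a single linear direction (rank 1), the score
    reflects the map's flexibility, not the circuit's geometry.
    """
    return {rank: train_alignment_and_score(model, circuit, rank)
            for rank in ranks}
```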
Partial-pass interpretation
| Pattern | Criteria met | Interpretation | Recommended language |
|---|---|---|---|
| Reliable but no baseline | M1 | Score is stable but uninterpretable | "Reliable; baseline gap not established" |
| Baseline gap established, not calibrated | M1, M3 | Score interpretable as above floor; not positioned in literature | "Above random; reference calibration needed" |
| Calibrated, no sensitivity test | M1, M3, M5 | Positioned in literature; false-positive rate unknown | "Calibrated; sensitivity not characterized" |
| All six met | M1–M6 | Full measurement validity | Proceed to interpret internal validity results |
Protocol
For instrument $I$ and circuit $C$ (a consolidated sketch follows the list):
- M1. Bootstrap the principal metric over at least 5 resampled prompt splits. Report 95% CI.
- M2. If comparing across model sizes, verify scale comparability or recalibrate.
- M3. Compute the random-vector baseline and the untrained-model baseline. Report the gap between the circuit’s score and each baseline floor.
- M4. If using the instrument for discovery, compute AUROC and AUPRC on a held-out set of known circuits and known non-circuits.
- M5. Cite at least one published reference value. Position the result on the calibration scale.
- M6. Ensure discovery and evaluation sets are disjoint. Report IIA as a function of alignment capacity.
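One way to assemble the protocol’s outputs into a single report, reusing the hypothetical helpers sketched in M1–M6 above; the field names are ours, not a fixed format:

```python
def measurement_validity_report(discovery_prompts, eval_prompts,
                                ci, invariance_pass, gaps,
                                auroc, auprc, reference_value, iia_by_rank):
    """Collect every M1-M6 artifact a reviewer needs in one place."""
    # M6: discovery and evaluation prompt sets must be disjoint.
    assert set(discovery_prompts).isdisjoint(eval_prompts)
    return {
        "M1_bootstrap_95_CI": ci,                   # from the M1 sketch
        "M2_invariance_pass": invariance_pass,      # from the M2 sketch
        "M3_baseline_gaps": gaps,                   # from the M3 sketch
        "M4_AUROC_AUPRC": (auroc, auprc),           # from the M4 sketch
        "M5_published_reference": reference_value,  # cited comparison point
        "M6_IIA_by_alignment_rank": iia_by_rank,    # from the M6 sketch
    }
```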