The Measurement Theory Lens
This lens asks one question: is the instrument that produced the number trustworthy?
Every circuit finding begins with a number. An IIA score of 0.48. A faithfulness recovery of 87%. A logit difference of 3.10. The other lenses evaluate the claim that number supports — whether the causal logic holds, whether the effect generalizes, whether the interpretation is licensed. This lens evaluates something more basic: whether the number itself means what it appears to mean.
Measurement validity is the step MI most consistently skips. We run the instrument, get a number, and proceed directly to interpretation. What we skip is the question a measurement theorist would ask first: is this instrument reliable enough that the number is telling us about the model rather than about our choice of prompts? Is the score calibrated to anything we can interpret? Does the instrument measure the construct it claims to measure, or is it measuring its own capacity?
The distinction is the same one pharmacology makes between assay validation and drug efficacy. You validate the assay before interpreting what it measures. A failed assay produces numbers regardless — they just don’t mean what you think.
Key Distinctions
Reliability vs validity
A measurement can be perfectly reliable (same result every time) and completely invalid (measuring the wrong thing). A probe that consistently returns 0.85 accuracy on a representation does not mean the representation encodes the claimed variable — it means the probe consistently extracts something. Reliability is necessary for validity but does not establish it.
In MI: bootstrap stability (F01) tells us our IIA score is reproducible. It does not tell us the score reflects the circuit’s representation rather than the instrument’s capacity to fit noise. A reliable instrument pointed at the wrong target produces confident wrong answers. This is why baseline separation (M3) exists as a separate criterion — it tests whether the instrument would produce similar scores on a model with no learned structure.
Sensitivity vs specificity
Signal detection theory (Green & Swets 1966) separates two properties of any detection instrument: sensitivity (can it detect a real signal when one exists?) and specificity (does it correctly reject non-signals?). Hit rate alone is meaningless without the false alarm rate. A smoke detector that rings for everything has perfect sensitivity and zero specificity.
In MI: an instrument that identifies every head as “part of the circuit” has perfect sensitivity and zero specificity — it never misses a real component but also never rejects an irrelevant one. Conversely, a very conservative threshold might miss real components (low sensitivity) but never falsely include irrelevant ones (high specificity). The d′ metric combines both into a single discriminability score. Current MI practice rarely reports false alarm rates — we report which heads are in the circuit but not how many non-circuit heads the method incorrectly flags.
True score vs observed score
Classical test theory decomposes every measurement into true score plus error: X = T + E. The observed faithfulness score of 87% is not the circuit’s true faithfulness — it is the true faithfulness plus whatever noise the prompt sample, random seed, and measurement procedure introduced. The proportion of observed variance attributable to the true score is the reliability coefficient.
In MI: when we report IIA = 0.48, we are reporting an observed score. The true score might be 0.52 (prompt sample was slightly unfavorable) or 0.44 (prompt sample was favorable). Without a confidence interval, we cannot know. Two circuits with observed scores of 0.48 and 0.52 may have overlapping true-score distributions — the apparent difference may be entirely measurement error. Reporting point estimates without confidence intervals invites over-interpretation of noise.
Convergent vs discriminant validity
Campbell and Fiske (1959) argued that validity requires two things simultaneously: instruments measuring the same construct should agree (convergent validity), AND instruments measuring different constructs should disagree (discriminant validity). Agreement alone is not enough — if all your instruments agree about everything, they may share a bias rather than measuring a real signal.
In MI: if activation patching and weight-space analysis identify the same heads as the IOI circuit (convergent validity), that is strong evidence. But if they also identify the same heads for every other task (poor discriminant validity), the agreement reflects shared methodological bias rather than a real task-specific structure. The MTMM matrix formalizes this: cross-method agreement on the same circuit should exceed same-method agreement across different circuits.
Analytical Constructs
The multitrait-multimethod matrix
The signature artifact of measurement-theoretic evaluation is the multitrait-multimethod (MTMM) matrix (Campbell & Fiske 1959): a structured correlation table crossing k traits (circuits or mechanisms) with m methods (instruments or discovery procedures).
For k circuits measured by m methods, the MTMM matrix is a correlation matrix with a specific block structure:
- Monotrait-heteromethod correlations (convergent validity) — do different methods agree about the same circuit? These should be high. If activation patching and weight-space analysis identify the same heads for the IOI circuit, that is convergent validity.
- Heterotrait-monomethod correlations (method effects) — do same-method measurements of different circuits correlate? These should be low. If activation patching gives similar scores to the IOI circuit and the Greater-Than circuit, that may reflect method bias rather than real similarity.
- Heterotrait-heteromethod correlations (discriminant validity) — do different methods measuring different circuits disagree? These should be lowest. This is the noise floor.
The validity condition: convergent > method effect > discriminant. Formally:

r(monotrait-heteromethod) > r(heterotrait-monomethod) > r(heterotrait-heteromethod)
In MI terms: the correlation between EAP-identified IOI circuit and weight-identified IOI circuit should exceed the correlation between EAP-identified IOI circuit and EAP-identified Greater-Than circuit, which should exceed the correlation between EAP-identified IOI circuit and weight-identified Greater-Than circuit.
To construct the matrix: identify k circuits and m discovery/evaluation methods. Run each method on each circuit. Compute pairwise Jaccard similarities (or correlation of attribution scores) between all km measurements. Arrange into the MTMM block structure. Check the validity ordering.
When the ordering is violated — when same-method correlations across circuits exceed cross-method correlations within circuits — the instruments share more variance with each other than with the construct they claim to measure. This is method bias, and it means the “circuit” may partly be an artifact of the discovery procedure.
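A minimal sketch of the construction, assuming each method's output is represented as a set of head labels and Jaccard similarity stands in for correlation. All head labels and set memberships below are invented for illustration; in this toy setup, activation patching shares one spurious head across tasks, producing a small method effect.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two component sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical head sets: 2 circuits (IOI, GT) x 2 methods
# (activation patching "AP", weight analysis "W").
found = {
    ("IOI", "AP"): {"L9.H6", "L9.H9", "L10.H0", "L10.H7", "L7.H2"},
    ("IOI", "W"):  {"L9.H6", "L9.H9", "L10.H0", "L10.H7", "L8.H3"},
    ("GT", "AP"):  {"L8.H11", "L9.H1", "L7.H2"},  # shares L7.H2 with IOI-AP
    ("GT", "W"):   {"L8.H11", "L9.H1", "L5.H5"},
}

# Monotrait-heteromethod block: same circuit, different methods
convergent = np.mean([jaccard(found["IOI", "AP"], found["IOI", "W"]),
                      jaccard(found["GT", "AP"], found["GT", "W"])])
# Heterotrait-monomethod block: different circuits, same method
method_effect = np.mean([jaccard(found["IOI", "AP"], found["GT", "AP"]),
                         jaccard(found["IOI", "W"], found["GT", "W"])])
# Heterotrait-heteromethod block: different circuits, different methods
discriminant = np.mean([jaccard(found["IOI", "AP"], found["GT", "W"]),
                        jaccard(found["IOI", "W"], found["GT", "AP"])])

# The validity ordering from the text
valid = convergent > method_effect > discriminant
```

With real data, the sets would come from running each discovery method on each task, and the same ordering check applies to correlations of attribution scores.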
Sources
| Source | Year | Field | Principle |
|---|---|---|---|
| Cronbach & Meehl, “Construct validity in psychological tests” | 1955 | Measurement Theory | Reliability as prerequisite — no construct validity claim is stronger than the measurement validity of the instrument supporting it |
| Campbell & Fiske, “Convergent and discriminant validation by the multitrait-multimethod matrix” | 1959 | Measurement Theory | MTMM and invariance — an instrument is valid across contexts only if it produces comparable results under systematic variation of those contexts |
| Green & Swets, Signal Detection Theory and Psychophysics | 1966 | Signal Detection | d′ and AUROC/AUPRC — separate discriminative ability from response bias; hit rate without false alarm rate is not sensitivity |
| Lord & Novick, Statistical Theories of Mental Test Scores | 1968 | Measurement Theory | Classical test theory — observed score = true score + error; reliability as the ratio of true-score variance to observed variance |
| Cronbach, Gleser, Nanda & Rajaratnam, The Dependability of Behavioral Measurements | 1972 | Measurement Theory | Generalizability theory — decompose error into identifiable sources (prompt sampling, seed variance, checkpoint) to know where measurement effort should go |
| Hewitt & Liang, “A structural probe for finding syntax in word representations” | 2019 | Natural Language Processing | Selectivity = linguistic accuracy − control accuracy — probe accuracy without a baseline measures instrument capacity, not representation structure |
| Sutter et al., “How to evaluate satisfiability of interpretability claims” | 2025 | Mechanistic Interpretability | Baseline separation — unconstrained nonlinear IIA achieves near-perfect scores on random-init models; the baseline is not optional |
Classical test theory (Lord & Novick 1968): An observed score X = T + E, where T is the true score and E is measurement error. Reliability is the proportion of observed variance attributable to the true score, ρ = Var(T) / Var(X). An instrument with ρ = 0.5 carries as much noise as signal.
The difference between measurement theory and the other lenses is scope. The neuroscience lens asks whether a component implements a computation. The pharmacology lens asks whether the effect scales and generalizes. This lens asks whether the instrument that produced the numbers to evaluate those questions is itself reliable, calibrated, and measuring what it claims to measure. Instrument validity is prior to claim validity. A perfectly designed experiment with an unreliable instrument produces nothing.
Generalizability theory, developed by Cronbach and colleagues in 1972, extends classical test theory by decomposing the error term into identifiable sources: in our context, prompt sampling variance, random seed variance, and checkpoint variance. This decomposition matters for practice. If most of the variance is from prompt sampling, the fix is a larger prompt set. If most is from seed variance, the model itself is unstable and no prompt set will help. If most is from checkpoint variance, the mechanism is still being learned at the evaluated checkpoint. Knowing which source dominates tells us where effort should go.
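The decomposition can be illustrated with a simulated crossed design (prompt samples × seeds). All variance magnitudes below are assumptions chosen so that prompt sampling dominates; comparing the variances of the marginal means is a rough method-of-moments sketch, not a full generalizability analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated faithfulness scores over 10 prompt samples x 8 seeds.
# Component sizes (0.05 prompt, 0.01 seed, 0.005 residual) are assumptions.
true_score = 0.87
prompt_fx = rng.normal(0, 0.05, size=10)   # prompt-sampling facet
seed_fx = rng.normal(0, 0.01, size=8)      # seed facet
scores = (true_score
          + prompt_fx[:, None]
          + seed_fx[None, :]
          + rng.normal(0, 0.005, size=(10, 8)))

# Which facet dominates? Compare variances of the marginal means.
var_over_prompts = scores.mean(axis=1).var(ddof=1)
var_over_seeds = scores.mean(axis=0).var(ddof=1)

# Here prompt sampling dominates: the fix is a larger prompt set, not more seeds.
```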
The criteria
Reliability
An instrument whose output changes substantially under irrelevant perturbations cannot support any validity claim. If we resample prompts from the same distribution and the IIA score swings from 0.41 to 0.58, the score is a property of the specific prompt set, not of the circuit.
The Spearman-Brown formula connects current reliability to the prompt count needed to reach a target:

ρ′ = nρ / (1 + (n − 1)ρ)

where n is the factor by which we multiply the number of prompts and ρ is the current reliability. At ρ = 0.6 on 50 prompts, for example, doubling to 100 prompts (n = 2) gives ρ′ = 0.75. This predicts whether a larger prompt set solves the problem or whether the variance is structural and a larger set won’t help.
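The formula, and its inverse for planning prompt-set size, in code (both function names are ours):

```python
def spearman_brown(rho, n):
    """Predicted reliability after multiplying the prompt count by n."""
    return n * rho / (1 + (n - 1) * rho)

def multiplier_needed(rho, target):
    """Factor by which the prompt count must grow to reach target
    reliability (the formula above solved for n)."""
    return target * (1 - rho) / (rho * (1 - target))
```

For example, `spearman_brown(0.6, 2)` gives 0.75, and `multiplier_needed(0.6, 0.9)` gives 6.0 (six times the prompts to reach 0.9). At `rho = 0.3`, doubling only reaches about 0.46 — a sign the variance is structural.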
Conventional reliability thresholds from measurement theory (Nunnally 1978): below 0.5, the instrument is too noisy for any validity inference; 0.7 is acceptable; 0.9 is needed before small differences become interpretable. These thresholds are not universal laws, but they provide orientation in the absence of domain-specific norms.
The most common reliability failure in current MI practice is discovery-evaluation overlap: the same prompts used to select the circuit are also used to evaluate it. The circuit was optimized to perform well on those prompts, so the apparent reliability is inflated. The fix is straightforward: hold out a prompt partition before running discovery and evaluate on it afterward.
What to report. Bootstrap the principal score across at least 100 prompt subsamples and report the 95% confidence interval. Compute split-half reliability: partition the prompt set, run the instrument on each half, report the Pearson correlation. Report internal consistency among circuit components if the circuit is large enough for it to be meaningful.
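The split-half computation can be sketched as follows, assuming the instrument produces a per-prompt score for each component (the attribution matrix is synthetic and the helper name is invented). The `2r/(1 + r)` step is the Spearman-Brown correction from half-length back to the full prompt set.

```python
import numpy as np

def split_half_reliability(per_prompt, n_splits=100, seed=0):
    """per_prompt: (n_prompts, n_components) attribution scores.
    Split prompts into random halves, compute per-component means on
    each half, correlate across components, and apply the Spearman-Brown
    step-up. Returns the mean over splits."""
    rng = np.random.default_rng(seed)
    n = per_prompt.shape[0]
    rs = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        a, b = perm[: n // 2], perm[n // 2:]
        r = np.corrcoef(per_prompt[a].mean(0), per_prompt[b].mean(0))[0, 1]
        rs.append(2 * r / (1 + r))
    return float(np.mean(rs))

# Synthetic check: 144 heads with fixed true attributions, per-prompt noise.
rng = np.random.default_rng(1)
true_attr = rng.normal(0, 1, 144)
clean = true_attr + rng.normal(0, 2, (200, 144))    # moderate noise
noisy = true_attr + rng.normal(0, 10, (200, 144))   # heavy noise
```

On the synthetic data, the moderate-noise instrument comes out highly reliable and the heavy-noise instrument substantially lower — the pattern the criterion is designed to expose.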
Worked example: bootstrap confidence intervals on IOI circuit faithfulness
Wang et al. (2022) report 87% faithfulness for the IOI circuit. This is the point estimate on the full evaluation set. To establish reliability, we can resample the evaluation prompts with replacement and recompute faithfulness on each bootstrap sample.
Suppose we draw 200 bootstrap samples of size 100 from the evaluation set and compute faithfulness on each. If the resulting distribution has mean 0.87 and standard deviation 0.06, the 95% confidence interval is approximately [0.75, 0.99]. That interval is wide. An instrument with σ = 0.06 on a score bounded between 0 and 1 has substantial prompt-sampling variance. Standard-error scaling (σ ∝ 1/√n) predicts that increasing from 100 to 400 prompts would reduce σ to approximately 0.03, bringing the CI to [0.81, 0.93] — more interpretable.
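The bootstrap procedure can be sketched directly. The per-prompt scores here are synthetic, drawn so the mean lands near 0.87 with a sampling spread comparable to the example; with real data, `per_prompt` would hold each prompt's measured recovery.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-prompt recovery scores (invented distribution): mean ~0.87,
# per-prompt spread chosen so the mean's SD is ~0.06.
per_prompt = rng.normal(0.87, 0.6, size=100)

# 200 bootstrap resamples of the evaluation set, with replacement
boot = np.array([
    rng.choice(per_prompt, size=per_prompt.size, replace=True).mean()
    for _ in range(200)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"faithfulness = {per_prompt.mean():.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```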
A reliability check also reveals whether different prompt templates agree. If IOI faithfulness is 0.87 on the original template (“When Mary and John went to the store, John gave a drink to”) but 0.61 on a paraphrased template, the score is template-specific and the reliability across templates is low. This is separate from the bootstrap CI, which only captures within-template prompt-sampling variance.
Invariance
An instrument should give comparable results across model sizes and families. If IIA is 0.78 on GPT-2 Small and 0.31 on Pythia-160M, the difference could mean two things: the mechanism is weaker in Pythia, or the instrument is measuring something different in the two models. Invariance testing distinguishes these cases.
The measurement theory framework for invariance comes from confirmatory factor analysis. We test three levels sequentially. Configural invariance: the same constructs are present in both models (the same instrument structure is appropriate). Metric invariance: the loadings are equal across models (a unit change in the latent construct produces the same change in the measured score in both models). Scalar invariance: the intercepts are equal (a circuit with zero true effect produces the same baseline score in both models). Comparisons across models are only valid if at least metric invariance holds.
In practice, full measurement invariance testing is a substantial undertaking for MI instruments. A practical substitute is to include the untrained-model baseline for each model separately: if the baseline is 0.44 in GPT-2 Small and 0.29 in Pythia, the gap of 0.34 (trained minus random, GPT-2 Small) vs. 0.02 (Pythia) is an apples-to-apples comparison even if the absolute scores differ.
What to report. At least two model sizes or families. The untrained-model baseline for each. Any observed differences characterized as potentially reflecting different mechanism strengths, different baseline levels, or potential instrument non-invariance.
Baseline separation
Delta over a random-vector baseline and an untrained-model baseline should be substantially above zero.
This is the criterion whose absence most often produces false findings in current MI practice.
Sutter et al. (NeurIPS 2025) formally proved that unconstrained nonlinear IIA achieves near-perfect scores on random-initialization models. The alignment map has enough degrees of freedom to find a transformation that maps the source activations onto the target variable, regardless of whether the model’s representation encodes that variable. The IIA score is a real measurement — it is a correct description of the alignment map’s behavior. But without a baseline, it is not a measurement of the circuit’s representation.
The minimum report for any IIA-based finding is three numbers: the score itself (IIA), the random-vector baseline (IIA_rand), and the untrained-model baseline (IIA_init). The interpretable findings are the deltas:

Δ_rand = IIA − IIA_rand and Δ_init = IIA − IIA_init

Δ_rand tells us how much the model’s actual representations contribute over random directions. Δ_init tells us how much the trained weights contribute over the architectural prior (initialization structure, weight geometry). A large IIA with a small Δ_rand is a large number with a small finding. A modest IIA with a large Δ_rand and a large Δ_init is a modest number with a genuine finding.
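As a reporting convention, this fits in a small helper (a sketch; the function name and dictionary keys are invented):

```python
def iia_report(iia, iia_rand, iia_init):
    """The score plus the two deltas that constitute the finding.
    iia_rand: random-vector baseline; iia_init: untrained-model baseline."""
    return {
        "iia": iia,
        "delta_rand": iia - iia_rand,  # signal over random directions
        "delta_init": iia - iia_init,  # signal over the architectural prior
    }

# A score of 0.48 with baselines 0.38 and 0.33 yields deltas of
# 0.10 and 0.15: a modest but genuine signal.
```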
Worked example: interpreting IIA = 0.48 at L8.MLP for GPT-2 Small SVA
We measure IIA at layer 8’s MLP and obtain 0.48. The published transcoder range for GPT-2 Small SVA is approximately 0.4–0.6. At first glance, 0.48 looks competitive with the literature.
Now add the baselines. Suppose we run the same alignment procedure on random unit vectors drawn from the same d-dimensional activation space, obtaining IIA_rand = 0.38. We also run it on the same model before training (randomly initialized weights), obtaining IIA_init = 0.33.
The deltas are Δ_rand = 0.48 − 0.38 = 0.10 and Δ_init = 0.48 − 0.33 = 0.15. These are the actual findings. They say: the trained model’s L8.MLP representations carry about 10 percentage points more causal information about SVA than random directions, and about 15 points more than the untrained architecture.
This is a real but modest signal. Whether it is a publishable finding depends on (a) whether the delta is stable across bootstrap resamples — if the 95% CI on Δ_rand excludes zero but is wide, the signal is real but noisy — and (b) whether the method has fewer parameters than DAS (which achieves 0.86–0.95), which would make a 0.10 delta at lower parameter cost an interesting result. Without the baselines, none of this analysis is possible.
Sensitivity
A circuit with 12 components in a model with thousands of heads and neurons is a low-prevalence signal. In low-prevalence settings, AUROC can be misleadingly high while precision is poor — the instrument ranks circuit members above most non-members, but when it calls something a member, it is wrong most of the time.
Signal detection theory measures this with d′:

d′ = Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate)

where Φ⁻¹ is the inverse normal CDF. A d′ of 0 means the instrument cannot distinguish circuit members from non-members at all. A d′ near 1 indicates moderate discriminability. A d′ of 2 or more is strong.
For circuit detection specifically, AUPRC (area under the precision-recall curve) is more informative than AUROC when the base rate is low. A circuit of 12 heads in a model with 144 total heads has a base rate of 12/144 ≈ 8.3%. At this base rate, AUROC can reach 0.9 while precision remains below 0.1.
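A simulation makes the gap concrete. The d′ = 2 separation, the decision threshold, and the 50× scale-up (for stable estimates at the same 8.3% base rate) are assumptions for illustration; AUROC is computed by the rank (Mann-Whitney) formulation and AUPRC as average precision over the ranked list.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated attribution scores at the 12-in-144 base rate, scaled 50x.
n_pos, n_neg = 12 * 50, 132 * 50
pos = rng.normal(2.0, 1.0, n_pos)   # circuit members, shifted by d' = 2
neg = rng.normal(0.0, 1.0, n_neg)   # non-members

# d' estimated from a fixed decision threshold
thr = 1.0
d_prime = norm.ppf((pos > thr).mean()) - norm.ppf((neg > thr).mean())

# AUROC: probability a random member outranks a random non-member
auroc = (pos[:, None] > neg[None, :]).mean()

# AUPRC: average precision over the score-ranked list
scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
hits = labels[np.argsort(-scores)]
precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
auprc = precision_at_k[hits == 1].mean()
```

Here AUROC lands near Φ(d′/√2) ≈ 0.92 while AUPRC sits far lower; the gap is exactly the base-rate effect the text describes.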
What to report. AUPRC alongside AUROC for any circuit with fewer than 25 components. The base rate. Whether the reference circuit used to compute these metrics was discovered by the same instrument family, in which case agreement is partly mechanical.
Calibration
A score is calibrated when we can locate it on a known scale. Without calibration, a number is a relative ranking within one experiment, not a measurement. Two papers reporting “87% faithfulness” may be measuring different quantities; calibration requires enough specificity to determine whether they are comparable.
The following table provides calibration reference points for common tasks and models:
| Task | Model | Metric | Full-model baseline | Circuit baseline | Recovery | Source |
|---|---|---|---|---|---|---|
| IOI | GPT-2 Small | Logit difference | 3.56 | 3.10 | 87% | Wang et al. 2022 |
| Greater-Than | GPT-2 Small | Prob. difference | 81.7% | 72.7% | 89.5% | Hanna et al. 2023 |
| SVA | GPT-2 Small | Logit diff / acc. | 0.70 | 0.65 | 93% | Lazo et al. 2025 |
| SVA (DAS) | GPT-2 Small | IIA | — | 0.86–0.95 | — | Mueller et al. 2025 |
| SVA (transcoders) | GPT-2 Small | IIA | — | 0.4–0.6 | — | Published range |
| SVA (SAE features) | GPT-2 Small | IIA | — | Below raw neurons | — | Mueller et al. 2025 |
All faithfulness numbers should be read as: “recovery under [ablation method] on [prompt distribution].” The IOI circuit’s 87% is under mean ablation with the Wang et al. prompt set; Miller et al. (2024) show that different choices produce substantially different numbers for the same circuit. A new IIA score of 0.52 on GPT-2 Small SVA sits in the transcoder range and well below the DAS range — whether that is good or bad depends on the method’s parameter count and the claim being made.
Construct coverage
An instrument should measure what it claims to measure rather than a correlated proxy.
Hewitt and Liang (EMNLP 2019) showed this failure mode concretely for probes: a probe achieving 90% syntactic accuracy may achieve 85% on a control task where labels are shuffled into word-type statistics. The probe is measuring its own capacity, not the representation’s structure. The selectivity — the 5 percentage point gap — is the valid measurement.
Sutter et al. (NeurIPS 2025) showed the same pattern for IIA: unconstrained nonlinear alignment maps achieve near-perfect IIA on random-initialization models. What gets measured is the map’s flexibility, not the model’s representational geometry. Linear IIA (DAS with a linear map) makes a specific, falsifiable claim about representational geometry: that the causal variable is linearly encoded. Nonlinear IIA makes a much weaker claim, one about which the alignment map architecture provides essentially no information.
The practical test is to vary the alignment map’s capacity. If IIA remains high when the map’s capacity is reduced, the finding is robust to map complexity. If IIA collapses, it was measuring map flexibility. A control task — same probe architecture, labels that require no representational information — provides a direct test in the Hewitt and Liang sense.
What to report. The alignment map architecture stated explicitly. IIA measured across at least two map capacities. A control task at matched capacity if the construct coverage claim is central.
Evidence patterns
| Evidence pattern | What it establishes | Recommended language |
|---|---|---|
| Score, no baselines | Instrument capacity | "Uncalibrated score; baselines pending" |
| Score + random baseline only | Signal over chance | "Δ_rand = [score − random baseline]" |
| Score + both baselines | Signal over chance and arch prior | "Δ_rand and Δ_init (with IIA_rand and IIA_init stated)" |
| AUROC, no AUPRC, low base rate | Ranking, not detection | "Ranks circuit members above non-members" |
| High IIA, collapsed with linear map | Map flexibility, not linear geometry | "IIA achievable; not linearly encoded" |
Verdicts
Measurement validity gates the interpretation of every other evidence type:
- Any verdict above Proposed requires at least a bootstrap CI (reliability) and a random-vector baseline (baseline separation). Without these, a score is a data point, not a finding.
- Causally suggestive → Mechanistically supported: Requires calibration against at least one published reference point.
- Mechanistically supported → Triangulated: Requires invariance across at least two models and construct coverage confirmation.
Protocol
For any reported score from a circuit evaluation instrument:
- Reliability. Bootstrap across 100+ prompt subsamples; report 95% CI. Compute split-half correlation. If reliability falls below 0.7, apply Spearman-Brown to determine whether a feasible prompt increase would bring it above threshold.
- Invariance. Test on at least two model sizes or families with separate untrained-model baselines.
- Baseline separation. Report IIA, IIA_rand, IIA_init, and the deltas Δ_rand and Δ_init. The deltas are the primary reported quantities, not IIA alone.
- Sensitivity. AUPRC alongside AUROC for circuits with fewer than 25 components; state the base rate.
- Calibration. Locate the score against at least one published baseline on the same task and model; state the ablation method and prompt distribution precisely.
- Construct coverage. State the alignment map architecture; vary its capacity; run a control task at matched capacity if the representational geometry claim is central.
A skipped step must be named in the verdict.
Case studies
For full worked examples applying all five lenses (including measurement validity) to published claims:
- IOI Circuit — reliability untested; single prompt template
- Induction Heads — multiple independent measurements converge
- SAE Features — baseline separation is the central question
- Probing Classifiers — measurement without construct coverage (Hewitt & Liang)
- Othello World Model — calibration question: linear decodability vs. world model
- Grokking — full measurement validity (toy model, exact weights known)