
This lens asks one question: is the instrument that produced the number trustworthy?

Every circuit finding begins with a number. An IIA score of 0.48. A faithfulness recovery of 87%. A logit difference of 3.10. The other lenses evaluate the claim that number supports — whether the causal logic holds, whether the effect generalizes, whether the interpretation is licensed. This lens evaluates something more basic: whether the number itself means what it appears to mean.

Measurement validity is the step MI most consistently skips. We run the instrument, get a number, and proceed directly to interpretation. What we skip is the question a measurement theorist would ask first: is this instrument reliable enough that the number is telling us about the model rather than about our choice of prompts? Is the score calibrated to anything we can interpret? Does the instrument measure the construct it claims to measure, or is it measuring its own capacity?

The distinction is the same one pharmacology makes between assay validation and drug efficacy. You validate the assay before interpreting what it measures. A failed assay produces numbers regardless — they just don’t mean what you think.

A measurement can be perfectly reliable (same result every time) and completely invalid (measuring the wrong thing). A probe that consistently returns 0.85 accuracy on a representation does not mean the representation encodes the claimed variable — it means the probe consistently extracts something. Reliability is necessary for validity but does not establish it.

In MI: bootstrap stability (F01) tells us our IIA score is reproducible. It does not tell us the score reflects the circuit’s representation rather than the instrument’s capacity to fit noise. A reliable instrument pointed at the wrong target produces confident wrong answers. This is why baseline separation (M3) exists as a separate criterion — it tests whether the instrument would produce similar scores on a model with no learned structure.

Signal detection theory (Green & Swets 1966) separates two properties of any detection instrument: sensitivity (can it detect a real signal when one exists?) and specificity (does it correctly reject non-signals?). Hit rate alone is meaningless without the false alarm rate. A smoke detector that rings for everything has perfect sensitivity and zero specificity.

In MI: an instrument that identifies every head as "part of the circuit" has perfect sensitivity and zero specificity — it never misses a real component but also never rejects an irrelevant one. Conversely, a very conservative threshold might miss real components (low sensitivity) but never falsely includes irrelevant ones (high specificity). The $d'$ metric combines both into a single discriminability score. Current MI practice rarely reports false alarm rates — we report which heads are in the circuit but not how many non-circuit heads the method incorrectly flags.

Classical test theory decomposes every measurement into true score plus error: $X = T + E$. The observed faithfulness score of 87% is not the circuit's true faithfulness — it is the true faithfulness plus whatever noise the prompt sample, random seed, and measurement procedure introduced. The proportion of variance attributable to the true score is the reliability coefficient.

In MI: when we report IIA = 0.48, we are reporting an observed score. The true score might be 0.52 (prompt sample was slightly unfavorable) or 0.44 (prompt sample was favorable). Without a confidence interval, we cannot know. Two circuits with observed scores of 0.48 and 0.52 may have overlapping true-score distributions — the apparent difference may be entirely measurement error. Reporting point estimates without confidence intervals invites over-interpretation of noise.

Campbell and Fiske (1959) argued that validity requires two things simultaneously: instruments measuring the same construct should agree (convergent validity), AND instruments measuring different constructs should disagree (discriminant validity). Agreement alone is not enough — if all your instruments agree about everything, they may share a bias rather than measuring a real signal.

In MI: if activation patching and weight-space analysis identify the same heads as the IOI circuit (convergent validity), that is strong evidence. But if they also identify the same heads for every other task (poor discriminant validity), the agreement reflects shared methodological bias rather than a real task-specific structure. The MTMM matrix formalizes this: cross-method agreement on the same circuit should exceed same-method agreement across different circuits.

The signature artifact of measurement-theoretic evaluation is the multitrait-multimethod (MTMM) matrix (Campbell & Fiske 1959): a structured correlation table crossing k traits (circuits or mechanisms) with m methods (instruments or discovery procedures).

For k circuits measured by m methods, the MTMM matrix is a $km \times km$ correlation matrix with a specific block structure:

  • Monotrait-heteromethod correlations (convergent validity) — do different methods agree about the same circuit? These should be high. If activation patching and weight-space analysis identify the same heads for the IOI circuit, that is convergent validity.
  • Heterotrait-monomethod correlations (method effects) — do same-method measurements of different circuits correlate? These should be low. If activation patching gives similar scores to the IOI circuit and the Greater-Than circuit, that may reflect method bias rather than real similarity.
  • Heterotrait-heteromethod correlations (discriminant validity) — do different methods measuring different circuits disagree? These should be lowest. This is the noise floor.

The validity condition: convergent > method effect > discriminant. Formally:

r(traiti,methoda;traiti,methodb)>r(traiti,methoda;traitj,methoda)>r(traiti,methoda;traitj,methodb)r(\text{trait}_i, \text{method}_a; \text{trait}_i, \text{method}_b) > r(\text{trait}_i, \text{method}_a; \text{trait}_j, \text{method}_a) > r(\text{trait}_i, \text{method}_a; \text{trait}_j, \text{method}_b)

In MI terms: the correlation between EAP-identified IOI circuit and weight-identified IOI circuit should exceed the correlation between EAP-identified IOI circuit and EAP-identified Greater-Than circuit, which should exceed the correlation between EAP-identified IOI circuit and weight-identified Greater-Than circuit.

To construct the matrix: identify k circuits and m discovery/evaluation methods. Run each method on each circuit. Compute pairwise Jaccard similarities (or correlation of attribution scores) between all km measurements. Arrange into the MTMM block structure. Check the validity ordering.
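
Below is a minimal sketch of that procedure, assuming each method returns the set of components it assigns to each circuit. The component sets and head names are illustrative placeholders, not results from any paper; Jaccard similarity stands in for whatever agreement measure fits the instrument's output.

```python
# Sketch of MTMM construction from circuit-discovery outputs.
import itertools
import numpy as np

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two component sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# circuits[(trait, method)] = set of components; illustrative placeholders only
circuits = {
    ("IOI", "patching"):         {"L9.H6", "L9.H9", "L10.H0", "L5.H5"},
    ("IOI", "weights"):          {"L9.H6", "L9.H9", "L10.H0", "L10.H7"},
    ("GreaterThan", "patching"): {"L8.H11", "L9.H1", "L5.H5"},
    ("GreaterThan", "weights"):  {"L8.H11", "L7.H10", "L6.H9"},
}

keys = sorted(circuits)                      # km (trait, method) pairs
n = len(keys)
mtmm = np.zeros((n, n))
for i, j in itertools.product(range(n), range(n)):
    mtmm[i, j] = jaccard(circuits[keys[i]], circuits[keys[j]])

def block_mean(pred):
    """Average similarity over off-diagonal cells selected by pred(key_i, key_j)."""
    vals = [mtmm[i, j] for i in range(n) for j in range(n)
            if i != j and pred(keys[i], keys[j])]
    return float(np.mean(vals))

convergent   = block_mean(lambda a, b: a[0] == b[0] and a[1] != b[1])  # same trait, diff method
method_bias  = block_mean(lambda a, b: a[0] != b[0] and a[1] == b[1])  # diff trait, same method
discriminant = block_mean(lambda a, b: a[0] != b[0] and a[1] != b[1])  # diff trait, diff method

print(f"convergent={convergent:.2f} > method={method_bias:.2f} > discriminant={discriminant:.2f}?",
      convergent > method_bias > discriminant)
```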

When the ordering is violated — when same-method correlations across circuits exceed cross-method correlations within circuits — the instruments share more variance with each other than with the construct they claim to measure. This is method bias, and it means the “circuit” may partly be an artifact of the discovery procedure.

| Source | Year | Field | Principle |
| --- | --- | --- | --- |
| Cronbach & Meehl, "Construct validity in psychological tests" | 1955 | Measurement Theory | Reliability as prerequisite — no construct validity claim is stronger than the measurement validity of the instrument supporting it |
| Campbell & Fiske, "Convergent and discriminant validation by the multitrait-multimethod matrix" | 1959 | Measurement Theory | MTMM and invariance — an instrument is valid across contexts only if it produces comparable results under systematic variation of those contexts |
| Green & Swets, Signal Detection Theory and Psychophysics | 1966 | Signal Detection | $d'$ and AUROC/AUPRC — separate discriminative ability from response bias; hit rate without false alarm rate is not sensitivity |
| Lord & Novick, Statistical Theories of Mental Test Scores | 1968 | Measurement Theory | Classical test theory — observed score = true score + error; reliability as the ratio of true-score variance to observed variance |
| Cronbach, Gleser, Nanda & Rajaratnam, The Dependability of Behavioral Measurements | 1972 | Measurement Theory | Generalizability theory — decompose error into identifiable sources (prompt sampling, seed variance, checkpoint) to know where measurement effort should go |
| Hewitt & Liang, "A structural probe for finding syntax in word representations" | 2019 | Natural Language Processing | Selectivity = linguistic accuracy − control accuracy — probe accuracy without a baseline measures instrument capacity, not representation structure |
| Sutter et al., "How to evaluate satisfiability of interpretability claims" | 2025 | Mechanistic Interpretability | Baseline separation — unconstrained nonlinear IIA achieves near-perfect scores on random-init models; the baseline is not optional |

Classical test theory (Lord & Novick 1968): an observed score $X = T + E$, where $T$ is the true score and $E$ is measurement error. Reliability $\rho_{XX'} = \sigma^2_T / (\sigma^2_T + \sigma^2_E)$ is the proportion of observed variance attributable to the true score. An instrument with $\rho_{XX'} = 0.5$ carries as much noise as signal.

The difference between measurement theory and the other lenses is scope. The neuroscience lens asks whether a component implements a computation. The pharmacology lens asks whether the effect scales and generalizes. This lens asks whether the instrument that produced the numbers to evaluate those questions is itself reliable, calibrated, and measuring what it claims to measure. Instrument validity is prior to claim validity. A perfectly designed experiment with an unreliable instrument produces nothing.

Generalizability theory, developed by Cronbach and colleagues in 1972, extends classical test theory by decomposing the error term $E$ into identifiable sources: in our context, prompt sampling variance, random seed variance, and checkpoint variance. This decomposition matters for practice. If most of the variance is from prompt sampling, the fix is a larger prompt set. If most is from seed variance, the model itself is unstable and no prompt set will help. If most is from checkpoint variance, the mechanism is still being learned at the evaluated checkpoint. Knowing which source dominates tells us where effort should go.
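
A minimal sketch of that decomposition for a crossed seed-by-prompt-set design, assuming one score per (seed, prompt set) cell. The formulas are the standard expected-mean-squares estimators for a two-way random-effects design without replication; the demo matrix is synthetic.

```python
import numpy as np

def variance_components(scores: np.ndarray) -> dict:
    """Estimate seed, prompt-set, and residual variance from a crossed design
    with one observation per (seed, prompt-set) cell."""
    n_seed, n_prompt = scores.shape
    grand = scores.mean()
    seed_means = scores.mean(axis=1)
    prompt_means = scores.mean(axis=0)

    ms_seed = n_prompt * ((seed_means - grand) ** 2).sum() / (n_seed - 1)
    ms_prompt = n_seed * ((prompt_means - grand) ** 2).sum() / (n_prompt - 1)
    resid = scores - seed_means[:, None] - prompt_means[None, :] + grand
    ms_resid = (resid ** 2).sum() / ((n_seed - 1) * (n_prompt - 1))

    return {
        "seed": max((ms_seed - ms_resid) / n_prompt, 0.0),
        "prompt_set": max((ms_prompt - ms_resid) / n_seed, 0.0),
        "residual": float(ms_resid),
    }

# synthetic demo: 5 seeds x 8 independently sampled prompt sets of IIA scores
rng = np.random.default_rng(0)
demo = 0.48 + 0.04 * rng.standard_normal((5, 8))
print(variance_components(demo))
```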

An instrument whose output changes substantially under irrelevant perturbations cannot support any validity claim. If we resample prompts from the same distribution and the IIA score swings from 0.41 to 0.58, the score is a property of the specific prompt set, not of the circuit.

The Spearman-Brown formula connects current reliability to the prompt count needed to reach a target:

$$\rho_{nn'} = \frac{n \cdot \rho_{XX'}}{1 + (n-1) \cdot \rho_{XX'}}$$

where $n$ is the factor by which we multiply the number of prompts. If our current reliability is $\rho_{XX'} = 0.6$ on 50 prompts, doubling to 100 prompts gives $\rho_{nn'} = 2 \times 0.6 / (1 + 0.6) = 0.75$. This predicts whether a larger prompt set solves the problem or whether the variance is structural and a larger set won't help.
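
As a small helper, the formula above in code; the two calls reproduce the arithmetic in this paragraph.

```python
def spearman_brown(rho: float, n: float) -> float:
    """Predicted reliability when the number of prompts is multiplied by n."""
    return n * rho / (1 + (n - 1) * rho)

print(spearman_brown(0.6, 2))   # 0.75 (doubling 50 prompts to 100)
print(spearman_brown(0.6, 4))   # ~0.857 (quadrupling to 200)
```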

Conventional reliability thresholds from measurement theory (Nunnally 1978): below 0.5, the instrument is too noisy for any validity inference; 0.7 is acceptable; 0.9 is sufficient for interpretable small differences. These thresholds are not universal laws, but they provide orientation in the absence of domain-specific norms.

The most common reliability failure in current MI practice is discovery-evaluation overlap: the same prompts used to select the circuit are also used to evaluate it. The circuit was optimized to perform well on those prompts, so the apparent reliability is inflated. The fix is straightforward: hold out a prompt partition before running discovery and evaluate on it afterward.
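
A minimal sketch of that partition, assuming discovery and evaluation are both functions of a prompt list. The `discover_circuit` and `evaluate` names in the comments are placeholders for whatever pipeline is in use.

```python
import numpy as np

def discovery_eval_split(prompts, eval_frac=0.5, seed=0):
    """Return (discovery_prompts, eval_prompts) from a single prompt pool,
    fixed before any circuit selection is run."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(prompts))
    cut = int(len(prompts) * (1 - eval_frac))
    return [prompts[i] for i in perm[:cut]], [prompts[i] for i in perm[cut:]]

# circuit = discover_circuit(model, discovery_prompts)    # selection uses this half only
# faithfulness = evaluate(model, circuit, eval_prompts)   # scoring uses the held-out half
```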

What to report. Bootstrap the principal score across at least 100 prompt subsamples and report the 95% confidence interval. Compute split-half reliability: partition the prompt set, run the instrument on each half, report the Pearson correlation. Report internal consistency among circuit components if the circuit is large enough for it to be meaningful.
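
A sketch of the first two reports, assuming `score_fn` maps a prompt list to the principal score and `component_score_fn` maps a prompt list to one score per circuit component. Both are stand-ins for the instrument under evaluation; the split-half estimate is corrected to full length with Spearman-Brown, which is conventional but optional.

```python
import numpy as np

def bootstrap_ci(score_fn, prompts, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the principal score over prompt resamples (with replacement)."""
    rng = np.random.default_rng(seed)
    prompts = list(prompts)
    scores = [score_fn([prompts[i] for i in rng.integers(0, len(prompts), len(prompts))])
              for _ in range(n_boot)]
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(scores)), (float(lo), float(hi))

def split_half_reliability(component_score_fn, prompts, seed=0):
    """Partition prompts in half, score every circuit component on each half,
    and correlate the two component-score vectors (Spearman-Brown corrected)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(prompts))
    mid = len(prompts) // 2
    half_a = [prompts[i] for i in perm[:mid]]
    half_b = [prompts[i] for i in perm[mid:]]
    scores_a = np.asarray(component_score_fn(half_a))   # one score per component
    scores_b = np.asarray(component_score_fn(half_b))
    r = float(np.corrcoef(scores_a, scores_b)[0, 1])
    return 2 * r / (1 + r)
```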

Worked example: bootstrap confidence intervals on IOI circuit faithfulness

Wang et al. (2022) report 87% faithfulness for the IOI circuit. This is the point estimate on the full evaluation set. To establish reliability, we can resample the evaluation prompts with replacement and recompute faithfulness on each bootstrap sample.

Suppose we draw 200 bootstrap samples of size 100 from the evaluation set and compute faithfulness on each. If the resulting distribution has mean 0.87 and standard deviation 0.06, the 95% confidence interval is approximately [0.75, 0.99]. That interval is wide. An instrument with $\sigma = 0.06$ on a score bounded between 0 and 1 has substantial prompt-sampling variance. The Spearman-Brown formula predicts that increasing from 100 to 400 prompts would reduce $\sigma$ to approximately 0.03, bringing the CI to [0.81, 0.93] — more interpretable.

A reliability check also reveals whether different prompt templates agree. If IOI faithfulness is 0.87 on the original template (“When Mary and John went to the store, John gave a drink to”) but 0.61 on a paraphrased template, the score is template-specific and the reliability across templates is low. This is separate from the bootstrap CI, which only captures within-template prompt-sampling variance.

An instrument should give comparable results across model sizes and families. If IIA is 0.78 on GPT-2 Small and 0.31 on Pythia-160M, the difference could mean two things: the mechanism is weaker in Pythia, or the instrument is measuring something different in the two models. Invariance testing distinguishes these cases.

The measurement theory framework for invariance comes from confirmatory factor analysis. We test three levels sequentially. Configural invariance: the same constructs are present in both models (the same instrument structure is appropriate). Metric invariance: the loadings are equal across models (a unit change in the latent construct produces the same change in the measured score in both models). Scalar invariance: the intercepts are equal (a circuit with zero true effect produces the same baseline score in both models). Comparisons across models are only valid if at least metric invariance holds.

In practice, full measurement invariance testing is a substantial undertaking for MI instruments. A practical substitute is to include the untrained-model baseline for each model separately: if the baseline is 0.44 in GPT-2 Small and 0.29 in Pythia, the gap of 0.34 (trained minus untrained, GPT-2) vs. 0.02 (Pythia) is an apples-to-apples comparison even if the absolute scores differ.

What to report. At least two model sizes or families. The untrained-model baseline for each. Any observed differences characterized as potentially reflecting different mechanism strengths, different baseline levels, or potential instrument non-invariance.

The deltas over a random-vector baseline and an untrained-model baseline should both be substantially above zero.

This is the criterion whose absence most often produces false findings in current MI practice.

Sutter et al. (NeurIPS 2025) formally proved that unconstrained nonlinear IIA achieves near-perfect scores on random-initialization models. The alignment map has enough degrees of freedom to find a transformation that maps the source activations onto the target variable, regardless of whether the model’s representation encodes that variable. The IIA score is a real measurement — it is a correct description of the alignment map’s behavior. But without a baseline, it is not a measurement of the circuit’s representation.

The minimum report for any IIA-based finding is three numbers: the score itself ($S_{\text{circuit}}$), the random-vector baseline ($S_{\text{random}}$), and the untrained-model baseline ($S_{\text{untrained}}$). The interpretable findings are:

$$\Delta_{\text{random}} = S_{\text{circuit}} - S_{\text{random}}, \qquad \Delta_{\text{arch}} = S_{\text{circuit}} - S_{\text{untrained}}$$

$\Delta_{\text{random}}$ tells us how much the model's actual representations contribute, over random directions. $\Delta_{\text{arch}}$ tells us how much the trained weights contribute, over the architectural prior (initialization structure, weight geometry). A large $S_{\text{circuit}}$ with a small $\Delta_{\text{random}}$ is a large number with a small finding. A modest $S_{\text{circuit}}$ with a large $\Delta_{\text{random}}$ and a large $\Delta_{\text{arch}}$ is a modest number with a genuine finding.
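
A small reporting helper for the three numbers and the two deltas; the values in the call are the hypothetical ones used in the worked example below.

```python
def baseline_report(s_circuit: float, s_random: float, s_untrained: float) -> dict:
    """Bundle the three scores and the two deltas defined above."""
    return {
        "S_circuit": s_circuit,
        "S_random": s_random,
        "S_untrained": s_untrained,
        "delta_random": s_circuit - s_random,    # signal over chance directions
        "delta_arch": s_circuit - s_untrained,   # signal over the architectural prior
    }

print(baseline_report(0.48, 0.38, 0.33))  # delta_random ~ 0.10, delta_arch ~ 0.15
```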

Worked example: interpreting IIA = 0.48 at L8.MLP for GPT-2 Small SVA

We measure IIA at layer 8’s MLP and obtain 0.48. The published transcoder range for GPT-2 Small SVA is approximately 0.4–0.6. At first glance, 0.48 looks competitive with the literature.

Now add the baselines. Suppose we run the same alignment procedure on random unit vectors drawn from the same $d_{\text{model}}$-dimensional space, obtaining $S_{\text{random}} = 0.38$. We also run it on the same model before training (randomly initialized weights), obtaining $S_{\text{untrained}} = 0.33$.

The deltas are $\Delta_{\text{random}} = 0.48 - 0.38 = 0.10$ and $\Delta_{\text{arch}} = 0.48 - 0.33 = 0.15$. These are the actual findings. They say: the trained model's L8.MLP representations carry about 10 percentage points more causal information about SVA than random directions, and about 15 points more than the untrained architecture.

This is a real but modest signal. Whether it is a publishable finding depends on (a) whether the delta is stable across bootstrap resamples — if the CI on $\Delta_{\text{random}}$ is $[0.02, 0.18]$, the signal is real but noisy — and (b) whether the method has fewer parameters than DAS (which achieves 0.86–0.95), which would make a 0.10 delta at lower parameter cost an interesting result. Without the baselines, none of this analysis is possible.

A circuit with 12 components in a model with thousands of heads and neurons is a low-prevalence signal. In low-prevalence settings, AUROC can be misleadingly high while precision is poor — the instrument ranks circuit members above most non-members, but when it calls something a member, it is wrong most of the time.

Signal detection theory measures this with $d'$:

$$d' = z(\text{hit rate}) - z(\text{false alarm rate})$$

where $z$ is the inverse normal CDF. A $d' = 0$ means the instrument cannot distinguish circuit members from non-members at all. A $d' > 1$ indicates moderate discriminability. A $d' > 2$ is strong.

For circuit detection specifically, AUPRC (area under the precision-recall curve) is more informative than AUROC when the base rate is low. A circuit of 12 heads in a model with 144 total heads has a base rate of $12/144 \approx 0.08$. At this base rate, AUROC can reach 0.9 while precision is below 0.1 — the instrument correctly ranks circuit members above non-members most of the time, but when it calls something a member, it is almost always wrong.
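
A sketch of the sensitivity report, using synthetic attribution scores as a stand-in for the instrument's per-head output and a toy 12-of-144 reference circuit. The threshold and the one-standard-deviation separation are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score, average_precision_score

def d_prime(hit_rate: float, false_alarm_rate: float, eps: float = 1e-3) -> float:
    """d' = z(hit rate) - z(false alarm rate), with rates clipped away from 0 and 1."""
    h = float(np.clip(hit_rate, eps, 1 - eps))
    f = float(np.clip(false_alarm_rate, eps, 1 - eps))
    return float(norm.ppf(h) - norm.ppf(f))

rng = np.random.default_rng(0)
n_heads, n_circuit = 144, 12
is_member = np.zeros(n_heads, dtype=bool)
is_member[:n_circuit] = True
# synthetic attribution scores: circuit members shifted up by one noise SD
attribution = rng.standard_normal(n_heads) + is_member

flagged = attribution > 1.0                               # example inclusion threshold
hit_rate = (flagged & is_member).sum() / is_member.sum()
fa_rate = (flagged & ~is_member).sum() / (~is_member).sum()

print("base rate:", n_circuit / n_heads)                  # ~0.08
print("d':", d_prime(hit_rate, fa_rate))
print("AUROC:", roc_auc_score(is_member, attribution))
print("AUPRC:", average_precision_score(is_member, attribution))
```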

What to report. AUPRC alongside AUROC for any circuit with fewer than 25 components. The base rate. Whether the reference circuit used to compute these metrics was discovered by the same instrument family, in which case agreement is partly mechanical.

[Figure: Signal Detection Framework — two-panel d-prime comparison, standard vs. high random baseline]

A score is calibrated when we can locate it on a known scale. Without calibration, a number is a relative ranking within one experiment, not a measurement. Two papers reporting “87% faithfulness” may be measuring different quantities; calibration requires enough specificity to determine whether they are comparable.

The following table provides calibration reference points for common tasks and models:

| Task | Model | Metric | Full-model baseline | Circuit baseline | Recovery | Source |
| --- | --- | --- | --- | --- | --- | --- |
| IOI | GPT-2 Small | Logit difference | 3.56 | 3.10 | 87% | Wang et al. 2022 |
| Greater-Than | GPT-2 Small | Prob. difference | 81.7% | 72.7% | 89.5% | Hanna et al. 2023 |
| SVA | GPT-2 Small | Logit diff / acc. | 0.70 | 0.65 | 93% | Lazo et al. 2025 |
| SVA (DAS) | GPT-2 Small | IIA | | 0.86–0.95 | | Mueller et al. 2025 |
| SVA (transcoders) | GPT-2 Small | IIA | | 0.4–0.6 | | Published range |
| SVA (SAE features) | GPT-2 Small | IIA | | Below raw neurons | | Mueller et al. 2025 |

All faithfulness numbers should be read as: “recovery under [ablation method] on [prompt distribution].” The IOI circuit’s 87% is under mean ablation with the Wang et al. prompt set; Miller et al. (2024) show that different choices produce substantially different numbers for the same circuit. A new IIA score of 0.52 on GPT-2 Small SVA sits in the transcoder range and well below the DAS range — whether that is good or bad depends on the method’s parameter count and the claim being made.

An instrument should measure what it claims to measure rather than a correlated proxy.

Hewitt and Liang (EMNLP 2019) showed this failure mode concretely for probes: a probe achieving 90% syntactic accuracy may achieve 85% on a control task in which labels are assigned randomly by word type and carry no syntactic information. The probe is measuring its own capacity, not the representation's structure. The selectivity — the 5 percentage point gap — is the valid measurement.

Sutter et al. (NeurIPS 2025) showed the same pattern for IIA: unconstrained nonlinear alignment maps achieve near-perfect IIA on random-initialization models. What gets measured is the map's flexibility, not the model's representational geometry. Linear IIA (DAS with a linear map) makes a specific, falsifiable claim about representational geometry: that the causal variable is linearly encoded. Nonlinear IIA makes a much weaker claim, one that the alignment map architecture does essentially nothing to constrain.

The practical test is to vary the alignment map's capacity. If IIA remains high when the map dimension is reduced from $d_{\text{model}}$ to $d_{\text{model}}/4$, the finding is robust to map complexity. If IIA collapses, it was measuring map flexibility. A control task — same probe architecture, labels that require no representational information — provides a direct test in the Hewitt and Liang sense.
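
A probe-based sketch of that test, assuming the same probe architecture is trained at several widths on the real labels and on control labels in the Hewitt and Liang sense. The activations and labels below are random placeholders for the model's own; with real data, the selectivity column is the quantity of interest.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def probe_accuracy(X, y, width, seed=0):
    """Held-out accuracy of a one-hidden-layer probe of the given width."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(width,), max_iter=1000, random_state=seed)
    clf.fit(Xtr, ytr)
    return clf.score(Xte, yte)

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 64))      # placeholder for d_model-dimensional activations
y_real = rng.integers(0, 2, 2000)        # placeholder for the real task labels
y_control = rng.permutation(y_real)      # control labels: same marginals, no signal about X

for width in (64, 16, 4):                # e.g. d_model, d_model/4, d_model/16
    acc = probe_accuracy(X, y_real, width)
    ctrl = probe_accuracy(X, y_control, width)
    print(f"width={width}: accuracy={acc:.2f}  control={ctrl:.2f}  selectivity={acc - ctrl:.2f}")
```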

What to report. The alignment map architecture stated explicitly. IIA measured across at least two map capacities. A control task at matched capacity if the construct coverage claim is central.

| Evidence pattern | What it establishes | Recommended language |
| --- | --- | --- |
| Score, no baselines | Instrument capacity | "Uncalibrated score; baselines pending" |
| Score + random baseline only | Signal over chance | "$\Delta_{\text{random}} = X$" |
| Score + both baselines | Signal over chance and arch prior | "$S = X$ ($\Delta_{\text{random}} = X_1$, $\Delta_{\text{arch}} = X_2$)" |
| AUROC, no AUPRC, low base rate | Ranking, not detection | "Ranks circuit members above non-members" |
| High IIA, collapsed with linear map | Map flexibility, not linear geometry | "IIA achievable; not linearly encoded" |

Measurement validity gates the interpretation of every other evidence type:

  • Any verdict above Proposed requires at least a bootstrap CI (reliability) and a random-vector baseline (baseline separation). Without these, a score is a data point, not a finding.
  • Causally suggestive → Mechanistically supported: Requires calibration against at least one published reference point.
  • Mechanistically supported → Triangulated: Requires invariance across at least two models and construct coverage confirmation.

For any reported score from a circuit evaluation instrument:

  1. Reliability. Bootstrap across 100+ prompt subsamples; report 95% CI. Compute split-half correlation. If $\rho_{XX'} < 0.7$, apply Spearman-Brown to determine whether a feasible prompt increase would bring reliability above threshold.
  2. Invariance. Test on at least two model sizes or families with separate untrained-model baselines.
  3. Baseline separation. Report $S_{\text{random}}$, $S_{\text{untrained}}$, $\Delta_{\text{random}}$, and $\Delta_{\text{arch}}$. These are the primary reported quantities, not $S_{\text{circuit}}$ alone.
  4. Sensitivity. AUPRC alongside AUROC for circuits with fewer than 25 components; state the base rate.
  5. Calibration. Locate the score against at least one published baseline on the same task and model; state the ablation method and prompt distribution precisely.
  6. Construct coverage. State the alignment map architecture; vary its capacity; run a control task at matched capacity if the representational geometry claim is central.

A skipped step must be named in the verdict.

For full worked examples applying all five lenses (including measurement validity) to published claims:

  • IOI Circuit — reliability untested; single prompt template
  • Induction Heads — multiple independent measurements converge
  • SAE Features — baseline separation is the central question
  • Probing Classifiers — measurement without construct coverage (Hewitt & Liang)
  • Othello World Model — calibration question: linear decodability vs. world model
  • Grokking — full measurement validity (toy model, exact weights known)