Skip to content

The framework has five layers plus a verdict annotation. Each layer asks a different question and occupies a different epistemic role. The layers are summarized here in bottom-up order — the order in which a researcher builds a claim.

Layer A is the concrete runnable test: the thing you actually execute on the model. Instruments include activation patching, resample ablation, DAS-IIA, weight classifiers, bootstrap stability checks, and spectral SVD of weight matrices. An instrument produces a number or a set of numbers. By itself, a number from Layer A is not a finding — it is data that must be interpreted upward through the hierarchy. The same instrument (for example, causal scrubbing) can contribute evidence for criteria in multiple validity types depending on what question it is being used to answer.

Layer B classifies an instrument’s output by the kind of signal it produces, independent of the specific tool used. There are six families: causal (interventions and counterfactuals), structural (weight-space analysis with no forward pass), representational (latent geometry), behavioral (held-out task and generalization tests), information-theoretic (coding properties), and measurement (reliability and calibration audits). The evidence-family classification is what allows the framework to check convergent validity: two instruments from different families that agree on the same components constitute stronger evidence than two instruments from the same family, because they have structurally different failure modes.

Layer C specifies the exact condition that must be met for a claim to count as satisfying a given dimension of validity. There are approximately 27 criteria across the five validity types. Each criterion is a falsifiable, operationally defined condition with a minimum-reporting rule. For example, baseline separation (in measurement validity) requires that the IIA score exceed both the random-vector baseline and the untrained-model baseline by a practically meaningful margin — it is not satisfied by pointing at a high absolute score.

Layer D is the level at which the framework’s evaluative logic lives. The five validity types — construct, internal, external, measurement, interpretive — are the five abstract questions a circuit claim must answer. A validity type is satisfied when all its criteria are met; partial satisfaction is reported as partial validity. The types are not equivalent: a claim can be internally valid (the causal inference is licensed within the experiment) without being externally valid (the result may not generalize), and a claim can pass all four traditional validity types while failing interpretive validity (the verdict may be stated at the wrong description level).

Layer E is the claim itself, stated with explicit scope. A verdict names: (1) the component or set of components, (2) the computation or behavior attributed to them, (3) the model and task, (4) the verdict-strength tier (Proposed, Causally suggestive, Mechanistically supported, Triangulated, Validated, Underdetermined, or Disconfirmed), and (5) the description-mode tag. The verdict is not a score. It is a structured statement about which validity types have been addressed, which criteria within those types have been met, and what remains open.

The mode tag is an annotation on Layer E, not a separate layer. It names the type of claim the verdict is making: [computational], [algorithmic], [representational], [implementational], [architectural], [structural], or [transportable]. The tag is applied last, after the verdict has been assembled from Layers A–D, because description level is a property of the claim rather than of the evidence that supports it. Interpretive validity exists specifically to check whether the declared mode tag is licensed by the evidence.