Mechanistic Validity is a framework for evaluating claims about circuits in neural networks. It defines five validity types, the criteria within each type, and the instruments that produce evidence for each criterion. Its purpose is to make explicit which part of a mechanistic claim a given measurement supports and which parts remain unaddressed.

The framework does not introduce new measurement methods. It organizes existing methods, namely ablation, activation patching, interchange intervention accuracy (IIA), causal scrubbing, weight analysis, and baseline calibration, under a common evaluative vocabulary drawn from the standard typology of validity in philosophy of science and adapted to the conditions of mechanistic interpretability.

The framework applies to claims of the form "component C implements computation T in model M". Such claims are made routinely in circuit-discovery papers and are typically supported by a small number of instruments, most often activation patching plus one form of ablation. The framework specifies what additional evidence is required for the claim to be considered validated under each of the five validity types, and what verdict is licensed when only a subset of that evidence is present.

The framework’s central commitment is that a single high score does not validate a circuit claim. Validation is a pattern of evidence across multiple dimensions, and a claim is only as strong as the dimension on which it has the weakest support.
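The weakest-link rule above can be sketched in a few lines of Python. This is only an illustration: the three-level `Support` scale and the function name `weakest_dimension` are assumptions made for the example, not part of the framework, which does not prescribe a numeric encoding.

```python
from enum import IntEnum


class Support(IntEnum):
    """Hypothetical ordinal scale for how well a dimension is evidenced."""
    NONE = 0
    PARTIAL = 1
    STRONG = 2


# The five validity types named by the framework.
VALIDITY_TYPES = ("construct", "internal", "external", "measurement", "interpretive")


def weakest_dimension(evidence: dict[str, Support]) -> tuple[str, Support]:
    """Return the validity type with the least support.

    Types missing from `evidence` are treated as Support.NONE, so an
    unaddressed dimension always bounds the strength of the claim.
    Ties resolve to the first type in VALIDITY_TYPES order.
    """
    full = {v: evidence.get(v, Support.NONE) for v in VALIDITY_TYPES}
    name = min(full, key=lambda v: full[v])
    return name, full[name]


# A claim with strong internal evidence but nothing on three other dimensions.
claim = {"internal": Support.STRONG, "measurement": Support.PARTIAL}
print(weakest_dimension(claim))
```

Taking the minimum rather than a mean is the point of the sketch: a strong result on one dimension cannot compensate for a dimension with no evidence.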

The framework has two layers. The upper layer is the five validity types — construct, internal, external, measurement, interpretive. These are the abstract questions a claim must answer. The lower layer is the five foundations — Philosophy of Science, Neuroscience, Pharmacology, Measurement Theory, Mechanistic Interpretability. These are the operational toolkits, one per validity type, that translate the abstract question into criteria, instruments, and reporting rules.

| Validity type | Foundation | What the foundation provides |
| --- | --- | --- |
| Construct | Philosophy of Science | Falsifiability, structural plausibility, task specificity, minimality, convergent validity |
| Internal | Neuroscience | Necessity, sufficiency, specificity, consistency |
| External | Pharmacology | Intervention reach, graded response, selectivity, effect magnitude, robustness, cross-architecture generalization |
| Measurement | Measurement Theory | Reliability, invariance, baseline separation, sensitivity, calibration, construct coverage |
| Interpretive | Mechanistic Interpretability | Level declaration, level-evidence match, narrative coherence, alternative exclusion, scope honesty |
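As a rough illustration of how the table can be used in an audit, the sketch below encodes the criteria per validity type and lists those for which a claim has reported no evidence. The criterion names come from the table; the `CRITERIA` dictionary, the `unaddressed` helper, and the example set of reported criteria are hypothetical and exist only for this example.

```python
# Criteria per validity type, transcribed from the table above.
CRITERIA: dict[str, list[str]] = {
    "construct": ["falsifiability", "structural plausibility", "task specificity",
                  "minimality", "convergent validity"],
    "internal": ["necessity", "sufficiency", "specificity", "consistency"],
    "external": ["intervention reach", "graded response", "selectivity",
                 "effect magnitude", "robustness",
                 "cross-architecture generalization"],
    "measurement": ["reliability", "invariance", "baseline separation",
                    "sensitivity", "calibration", "construct coverage"],
    "interpretive": ["level declaration", "level-evidence match",
                     "narrative coherence", "alternative exclusion",
                     "scope honesty"],
}


def unaddressed(evidenced: set[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each validity type to the criteria that have no reported evidence.

    `evidenced` holds (validity_type, criterion) pairs a paper reports on.
    """
    return {
        vtype: [c for c in crits if (vtype, c) not in evidenced]
        for vtype, crits in CRITERIA.items()
    }


# Hypothetical paper that reports evidence for two internal criteria only.
reported = {("internal", "necessity"), ("internal", "sufficiency")}
gaps = unaddressed(reported)
for vtype, missing in gaps.items():
    print(f"{vtype}: {len(missing)} of {len(CRITERIA[vtype])} criteria unaddressed")
```

Which instrument counts as evidence for which criterion is deliberately left out of the sketch; that mapping is what the foundation pages specify.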

The validity-type pages explain what each type asks of a claim and where current mechanistic interpretability (MI) practice falls short. The foundation pages give the operational criteria, the instruments that produce evidence for each criterion, the failure modes that appear in practice, and a minimum reporting protocol.

The framework does not rank circuits. It produces a structured verdict — a pattern of which dimensions have evidence and which do not — rather than a scalar score. Two circuits with the same scalar faithfulness can have very different verdict structures under the framework, and the framework’s value is in making that difference visible.
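A small sketch can make the contrast between a scalar score and a structured verdict concrete. The `Verdict` class and the numbers below are invented for illustration and assume a verdict can be reduced to the set of validity types that have some evidence; the framework's actual verdicts are richer than this.

```python
from dataclasses import dataclass

VALIDITY_TYPES = ("construct", "internal", "external", "measurement", "interpretive")


@dataclass(frozen=True)
class Verdict:
    """Hypothetical verdict record: a scalar score plus an evidence pattern."""
    faithfulness: float        # scalar score reported for the circuit
    evidenced: frozenset[str]  # validity types with at least some evidence

    def gaps(self) -> list[str]:
        """Validity types with no evidence, in the framework's listed order."""
        return [v for v in VALIDITY_TYPES if v not in self.evidenced]


# Two made-up circuits with identical scalar faithfulness.
circuit_a = Verdict(0.90, frozenset({"construct", "internal", "measurement"}))
circuit_b = Verdict(0.90, frozenset({"internal"}))

print(circuit_a.gaps())
print(circuit_b.gaps())
```

The two records compare equal on `faithfulness` but differ in `gaps()`, and that difference is the information a scalar ranking would discard.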

The framework also does not assume that any particular discovery method is correct. Activation patching, edge attribution patching (EAP), IIA, weight classifiers, and causal scrubbing all appear in the foundation pages as instruments that produce evidence for one or more criteria. None is privileged. The framework’s role is to specify what each instrument actually establishes.

A reader new to the framework should begin with Reading this site, which explains how the pages link together and how to use the framework to audit a circuit claim. Readers who already know which validity type they are concerned with can skip to that type’s page; readers who already know which method they are using can skip to the corresponding foundation.