Each case study below takes a published mechanistic claim and evaluates it through all five validity lenses — construct, internal, external, measurement, and interpretive. The goal is not to rank papers but to show what the framework looks like in practice: where evidence is strong, where it is absent, and what the composite verdict means.

The case studies are ordered roughly by overall verdict strength, from the strongest claims to the weakest.


Claims with complete evidence across all lenses — limited only by scope.

| Case Study | Claim | Key insight |
| --- | --- | --- |
| Grokking / Modular Addition | Fourier algorithm in toy transformer | The ceiling — what “fully understood” looks like. Every weight matrix explained. |
| Superposition | Features packed as near-orthogonal directions | Validated theory awaiting real-model confirmation. Toy → real gap is the open question. |

Evidence converges across multiple independent lenses.

| Case Study | Claim | Key insight |
| --- | --- | --- |
| Induction Heads | Two-head composition for in-context copying | The gold standard in real models. Simple mechanism, broad replication, thick nomological network. |

Strong causal evidence with identifiable gaps preventing advancement.

| Case Study | Claim | Key insight |
| --- | --- | --- |
| IOI Circuit | 26-head indirect object identification mechanism | Most thoroughly analyzed circuit. Strong I1/I2, but method-conditional and specificity untested. |
| Greater-Than | Successor heads encoding ordinal year comparison | Best structural plausibility in MI. W_OV ordering evidence is the model for C2. |
| Successor Heads | General-purpose ordinal mechanism across domains | Cross-domain generalization as convergent evidence. Stronger “natural kind” case than single-task circuits. |
| Copy Suppression | Heads that actively suppress incorrect token copying | Unusually clean specificity — ablation produces a specific error type, not general degradation. |
| Docstring Circuit | Variable binding in Python docstrings | Illustrates label risk: “variable binding” vs. simpler “positional copying” not distinguished. |
| Knowledge Neurons / ROME | Factual knowledge localized in MLP layers | A tool can work for the wrong reasons. Strong intervention, weak mechanistic story. |
| Othello World Model | Linear board-state representation | Interpretive inflation: “world model” carries implications beyond “linearly decodable.” |

Claims where evidence has not yet established validity beyond initial identification.

| Case Study | Claim | Key insight |
| --- | --- | --- |
| SAE Features | Dictionary directions as computational units | Thin nomological network. Features may be properties of the dictionary, not the model. |
| Probing Classifiers | Linear decodability implies representation | Measurement without intervention = no internal validity. Decodable ≠ encoded. |
| Gender Bias Circuits | Bias localized in removable components | Construct incoherence: bias and knowledge share circuits. The construct itself may not be separable. |

Each case study follows the same structure:

  1. Introduction — what the claim is and why it matters
  2. Five lens evaluations — each with per-criterion verdicts (Pass / Partial / Not tested / Weak) and a summary table
  3. Composite verdict — a table showing the strongest and weakest criterion per lens, plus the overall verdict

The per-criterion verdicts use consistent language:

  • Pass — evidence is present and sufficient
  • Partial — some evidence exists but with gaps
  • Not tested — this criterion was not evaluated in the published work
  • Weak — evidence exists but is inadequate or contradicted
  • N/A — the criterion does not apply to this type of claim
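The verdict scheme above can be sketched as a small data structure. This is a hypothetical encoding, not the framework's actual tooling; in particular, the numeric ordering (with Weak below Not tested) is an assumption made for illustration.

```python
from enum import Enum

class Verdict(Enum):
    """Per-criterion verdicts; the numeric ordering is an illustrative assumption."""
    PASS = 4        # evidence is present and sufficient
    PARTIAL = 3     # some evidence exists but with gaps
    NOT_TESTED = 2  # the criterion was not evaluated in the published work
    WEAK = 1        # evidence exists but is inadequate or contradicted
    NA = None       # the criterion does not apply to this type of claim

def strongest_and_weakest(criteria: dict[str, Verdict]) -> tuple[str, str]:
    """Return the names of the strongest and weakest applicable criteria
    for one lens, as reported in a composite-verdict table."""
    applicable = {k: v for k, v in criteria.items() if v is not Verdict.NA}
    strongest = max(applicable, key=lambda k: applicable[k].value)
    weakest = min(applicable, key=lambda k: applicable[k].value)
    return strongest, weakest
```

For example, a lens evaluated as `{"I1": PASS, "I2": WEAK, "I3": NA}` would report I1 as its strongest criterion and I2 as its weakest, with I3 excluded as inapplicable.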

Several patterns emerge from evaluating these claims side by side:

The sufficiency gap. Most circuits demonstrate necessity (I1) but not sufficiency (I2). Only induction heads and grokking demonstrate path-level or full sufficiency.
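The distinction can be made concrete with a deliberately simplified toy: a linear "model" whose output is a sum of per-component contributions. Everything here is invented for illustration — the contributions, the circuit membership, and the linearity are all assumptions, not properties of any real network.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linear model: output is the sum of per-component contributions.
contributions = rng.normal(size=10)   # one scalar contribution per component
circuit = {0, 1, 2}                   # hypothesized circuit (assumed, not discovered)

def output(active):
    """Model output with only the components in `active` left un-ablated."""
    return float(sum(contributions[i] for i in active))

full = output(range(10))
# Necessity (I1): ablate the circuit and measure the performance drop.
drop = full - output(set(range(10)) - circuit)
# Sufficiency (I2): ablate everything *except* the circuit and measure
# how much of the full output the circuit alone recovers.
recovered = output(circuit)
```

In this linear toy the two numbers coincide exactly; in a real network with nonlinear interactions they can diverge sharply, which is why demonstrating I1 does not establish I2.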

Method-conditional results. IOI’s headline numbers (87% faithfulness) are specific to mean ablation. Miller et al. (2024) show these drop below 50% under other methods. Ablation type is part of the claim.
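A minimal numpy sketch of why the ablation method is part of the claim: when a component's activations have a nonzero mean, zero ablation and mean ablation move a downstream readout by very different amounts. The activations and the readout below are invented stand-ins, not any real model's internals.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical head activations over a batch of prompts (batch, d_head),
# given a nonzero mean on purpose.
acts = rng.normal(loc=2.0, size=(100, 8))

def downstream(a):
    """Stand-in downstream readout: projection onto a fixed direction."""
    w = np.ones(a.shape[-1]) / np.sqrt(a.shape[-1])
    return a @ w

clean = downstream(acts).mean()
# Zero ablation: replace the head's output with zeros.
zero_ablated = downstream(np.zeros_like(acts)).mean()
# Mean ablation: replace it with its average activation over the batch.
mean_ablated = downstream(np.broadcast_to(acts.mean(axis=0), acts.shape)).mean()
# Here zero ablation shifts the readout to 0 while mean ablation leaves it
# unchanged — so any "faithfulness" number is conditional on which ablation
# was used to compute it.
```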

The toy-model ceiling. Grokking and superposition reach Validated — but only within toy scope. The gap between toy-model proof-of-concept and real-model confirmation is the field’s central challenge.

Interpretive inflation. “World model,” “deception feature,” “knowledge neuron” — labels that carry theoretical implications beyond what the evidence supports. The framework systematically identifies where labels exceed evidence (V5 scope honesty).

Construct incoherence. Gender bias circuits fail not because evidence is lacking but because the construct itself cannot be separated from legitimate gender processing. Sometimes the right answer is “this question is not well-posed,” not “we need more data.”