# Case Studies
Each case study below takes a published mechanistic claim and evaluates it through all five validity lenses — construct, internal, external, measurement, and interpretive. The goal is not to rank papers but to show what the framework looks like in practice: where evidence is strong, where it is absent, and what the composite verdict means.
The case studies are ordered roughly by overall verdict strength, from the strongest claims to the weakest.
## Validated (within scope)
Claims with complete evidence across all lenses — limited only by scope.
| Case Study | Claim | Key insight |
|---|---|---|
| Grokking / Modular Addition | Fourier algorithm in toy transformer | The ceiling — what “fully understood” looks like. Every weight matrix explained. |
| Superposition | Features packed as near-orthogonal directions | Validated theory awaiting real-model confirmation. Toy → real gap is the open question. |
## Triangulated
Evidence converges across multiple independent lenses.
| Case Study | Claim | Key insight |
|---|---|---|
| Induction Heads | Two-head composition for in-context copying | The gold standard in real models. Simple mechanism, broad replication, thick nomological network. |
## Causally suggestive
Strong causal evidence with identifiable gaps preventing advancement.
| Case Study | Claim | Key insight |
|---|---|---|
| IOI Circuit | 26-head indirect object identification mechanism | Most thoroughly analyzed circuit. Strong I1/I2, but method-conditional and specificity untested. |
| Greater-Than | Successor heads encoding ordinal year comparison | Best structural plausibility in MI. Its ordering evidence is the model for C2. |
| Successor Heads | General-purpose ordinal mechanism across domains | Cross-domain generalization as convergent evidence. Stronger “natural kind” case than single-task circuits. |
| Copy Suppression | Heads that actively suppress incorrect token copying | Unusually clean specificity — ablation produces a specific error type, not general degradation. |
| Docstring Circuit | Variable binding in Python docstrings | Illustrates label risk: “variable binding” vs. simpler “positional copying” not distinguished. |
| Knowledge Neurons / ROME | Factual knowledge localized in MLP layers | A tool can work for the wrong reasons. Strong intervention, weak mechanistic story. |
| Othello World Model | Linear board-state representation | Interpretive inflation: “world model” carries implications beyond “linearly decodable.” |
## Proposed
Claims where evidence has not yet established validity beyond initial identification.
| Case Study | Claim | Key insight |
|---|---|---|
| SAE Features | Dictionary directions as computational units | Thin nomological network. Features may be properties of the dictionary, not the model. |
| Probing Classifiers | Linear decodability implies representation | Measurement without intervention = no internal validity. Decodable ≠ encoded. |
| Gender Bias Circuits | Bias localized in removable components | Construct incoherence: bias and knowledge share circuits. The construct itself may not be separable. |
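The probing gap above can be made concrete. The sketch below is purely illustrative (the activations, dimensions, and seed are all invented): a simple least-squares probe decodes a label from "activations" with high accuracy, but that accuracy by itself says nothing about whether the model computes or uses the decoded direction downstream.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "activations": 200 samples, 8 dims. The label happens to be a
# function of dimension 0 only (all numbers here are made up for illustration).
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)

# A linear probe fit by least squares decodes the label well...
w, *_ = np.linalg.lstsq(X, y - 0.5, rcond=None)
preds = (X @ w > 0).astype(float)
accuracy = (preds == y).mean()

# ...but high probe accuracy shows only that the information is linearly
# present in the activations, not that the model uses it. Measurement
# without intervention establishes no internal validity.
```

Closing the gap requires an intervention, e.g. editing activations along `w` and checking that behavior changes accordingly.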
## Reading the case studies
Each case study follows the same structure:
- Introduction — what the claim is and why it matters
- Five lens evaluations — each with per-criterion verdicts (Pass / Partial / Not tested / Weak / N/A) and a summary table
- Composite verdict — a table showing the strongest and weakest criterion per lens, plus the overall verdict
The per-criterion verdicts use consistent language:
- Pass — evidence is present and sufficient
- Partial — some evidence exists but with gaps
- Not tested — this criterion was not evaluated in the published work
- Weak — evidence exists but is inadequate or contradicted
- N/A — the criterion does not apply to this type of claim
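The verdict vocabulary and the per-lens "strongest/weakest criterion" summary can be sketched as a small data structure. This is an illustrative Python sketch, not code from the case studies; the names (`Verdict`, `LensEvaluation`, `STRENGTH`) are invented here.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Per-criterion verdict vocabulary used across the case studies."""
    PASS = "Pass"              # evidence is present and sufficient
    PARTIAL = "Partial"        # some evidence exists but with gaps
    NOT_TESTED = "Not tested"  # criterion not evaluated in the published work
    WEAK = "Weak"              # evidence exists but is inadequate or contradicted
    NA = "N/A"                 # criterion does not apply to this type of claim


# Assumed strength ordering, best to worst, for composite-verdict tables.
STRENGTH = [Verdict.PASS, Verdict.PARTIAL, Verdict.WEAK, Verdict.NOT_TESTED]


@dataclass
class LensEvaluation:
    lens: str                     # e.g. "internal"
    criteria: dict[str, Verdict]  # e.g. {"I1": Verdict.PASS, "I2": Verdict.NOT_TESTED}

    def _applicable(self) -> dict[str, Verdict]:
        # N/A criteria are excluded from strongest/weakest comparisons
        return {k: v for k, v in self.criteria.items() if v is not Verdict.NA}

    def strongest(self) -> str:
        return min(self._applicable(), key=lambda k: STRENGTH.index(self.criteria[k]))

    def weakest(self) -> str:
        return max(self._applicable(), key=lambda k: STRENGTH.index(self.criteria[k]))
```

For example, an internal-validity lens with `{"I1": Verdict.PASS, "I2": Verdict.NOT_TESTED}` reports `I1` as strongest and `I2` as weakest, matching the necessity-without-sufficiency pattern discussed below.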
## Patterns across case studies
Several patterns emerge from evaluating these claims side by side:
The sufficiency gap. Most circuits demonstrate necessity (I1) but not sufficiency (I2). Only induction heads and grokking demonstrate path-level or full sufficiency.
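The necessity/sufficiency distinction can be made concrete with a toy sketch. Everything here is invented for illustration (the components and performance scores do not come from any real circuit):

```python
def task_performance(active: set[str]) -> float:
    """Hypothetical model: task performance as a function of which components run."""
    if {"A", "B"} <= active:
        return 0.9   # full circuit present
    if "A" in active:
        return 0.5   # partial degradation
    return 0.1       # near-baseline

full_model = {"A", "B", "C", "D"}

# Necessity (I1): ablating B from the full model degrades performance.
necessity_drop = task_performance(full_model) - task_performance(full_model - {"B"})

# Sufficiency (I2): running *only* the proposed circuit {A, B} recovers full
# performance. This is a much stronger claim, and the one most papers skip.
sufficient = task_performance({"A", "B"}) == task_performance(full_model)
```

Demonstrating `necessity_drop > 0` is the common I1 result; demonstrating `sufficient` is the rarer I2 result that only induction heads and grokking reach.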
Method-conditional results. IOI’s headline numbers (87% faithfulness) are specific to mean ablation. Miller et al. (2024) show these drop below 50% under other methods. Ablation type is part of the claim.
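Why ablation type matters is visible even at the level of a single activation vector: zero ablation and mean ablation feed different values downstream, so any faithfulness metric computed under them differs. A minimal numpy sketch with made-up activations:

```python
import numpy as np

# Hypothetical head activations across a batch (made-up numbers)
acts = np.array([1.0, 3.0, 2.0, 4.0])

zero_ablated = np.zeros_like(acts)              # replace with zeros
mean_ablated = np.full_like(acts, acts.mean())  # replace with the batch mean

# The two "ablated" inputs are different vectors, so downstream effects,
# and therefore faithfulness numbers, are conditional on the method chosen.
```

Reporting "87% faithful" without specifying the ablation method under-specifies the claim.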
The toy-model ceiling. Grokking and superposition reach Validated — but only within toy scope. The gap between toy-model proof-of-concept and real-model confirmation is the field’s central challenge.
Interpretive inflation. “World model,” “deception feature,” “knowledge neuron” — labels that carry theoretical implications beyond what the evidence supports. The framework systematically identifies where labels exceed evidence (V5 scope honesty).
Construct incoherence. Gender bias circuits fail not because evidence is lacking but because the construct itself cannot be separated from legitimate gender processing. Sometimes the right answer is “this question is not well-posed,” not “we need more data.”