Case Study: Successor Heads
Building on the Greater-Than analysis, subsequent work (Gould et al. 2023) identifies successor heads as a general-purpose mechanism — attention heads whose weight matrices encode ordinal succession across multiple domains: days of the week (Monday → Tuesday), months (January → February), numbers (1 → 2), and alphabetical sequences (A → B). The claim is that these heads do not merely implement year comparison but encode a general ordinal-successor function that the model reuses across sequence types.
This extends the Greater-Than claim from a task-specific circuit to a general computational primitive — a reusable building block. The construct validity question shifts accordingly: is “successor head” a natural kind (a real computational unit) or a family resemblance (a label applied to heads that happen to do ordinal things)?
Composite Verdict
| Lens | Strongest | Weakest | Overall |
|---|---|---|---|
| Construct | C2 Structural plausibility | C5 Convergent | Strong |
| Internal | I1 Necessity | I5 Confound control | Causally suggestive |
| External | E5 Robustness | E1/E6 | Partial |
| Measurement | M2/M3 Invariance + Separation | M1 Reliability | Strong |
| Interpretive | V2/V3 Match + Coherence | V4 Alternative exclusion | Strong |
Overall verdict: Causally suggestive, approaching Mechanistically supported. Successor heads benefit from the same structural clarity as the Greater-Than circuit, with the additional strength of cross-domain generalization. The multi-domain pattern makes the “general computational primitive” claim more convincing than a single-task circuit claim. The case for successor heads as a natural kind is stronger than for most circuits because the same structural signature appears across unrelated domains — this is convergent evidence from the phenomenon itself, even without formal C5 convergent validity from multiple discovery methods.
Philosophy of Science Lens — Construct Validity
Is “successor head” a coherent construct?
Criteria
C1 — Falsifiability: Pass. The claim predicts that successor heads should encode ordinal structure across multiple domains in their weight matrices. A head that encodes year ordering but not month ordering is a year-specific head, not a general successor head. This is a discriminating prediction.
C2 — Structural plausibility: Pass. The heads’ weight matrices are inspected and shown to encode ordinal structure across domains. The same heads that boost “32 → 33” also boost “Monday → Tuesday” and “B → C.” The structural evidence spans domains.
C3 — Task specificity: N/A (honest scope — general purpose). Successor heads are claimed to be general-purpose, not task-specific. The evidence confirms this — they fire across domains. This is the same honest-scope pattern as induction heads: a general mechanism, honestly described as general.
C4 — Minimality: Pass. A small number of heads show the multi-domain successor pattern. Not every head in the model does this — the set is selective.
C5 — Convergent validity: Partial. Evidence from structural analysis (weight inspection) and behavioral analysis (ablation effects on successor tasks) converges. A third method (e.g., probing for ordinal features, EAP discovery) would strengthen convergence.
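To make the C5 gap concrete, convergence between two discovery methods can be quantified as overlap between their top-ranked heads. A minimal sketch, with invented head labels and scores (nothing below is from the paper):

```python
# Hypothetical illustration: quantify C5 convergence as top-k overlap
# between two discovery methods. All head labels and scores are invented.

def top_k(scores, k):
    """Return the k head labels with the highest scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

# Method 1: structural score (strength of ordinal structure in the weights).
structural = {"L4H7": 0.91, "L6H1": 0.84, "L9H3": 0.12, "L2H5": 0.08}
# Method 2: ablation effect on successor tasks (logit-diff drop).
ablation = {"L4H7": 1.30, "L6H1": 0.95, "L2H5": 0.20, "L9H3": 0.05}

overlap = jaccard(top_k(structural, 2), top_k(ablation, 2))
print(overlap)  # both methods pick L4H7 and L6H1 -> 1.0
```

A third row (e.g., probing scores) would extend the same overlap computation to the missing method.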
| Criterion | Verdict | Key evidence |
|---|---|---|
| C1 Falsifiability | Pass | Cross-domain structural predictions |
| C2 Structural plausibility | Pass | Multi-domain ordering |
| C3 Task specificity | N/A (general) | Multi-domain by design |
| C4 Minimality | Pass | Small selective set |
| C5 Convergent validity | Partial | Structural + behavioral |
Key Distinctions
- Confirmation vs corroboration: The multi-domain pattern provides genuine corroboration — the same structural signature independently discovered in years, months, days, and letters constitutes multiple independent tests of the “general successor” hypothesis, not a single confirmation replayed across domains.
- Natural kind vs family resemblance: The structural signature (weight matrices encoding ordinal succession) is consistent across domains, suggesting “successor head” is a natural kind rather than a loose family resemblance label. The same mechanism, not just the same behavior.
- Operationalism vs realism: “Successor head” is defined by both observable behavior (cross-domain ordinal effects) and structural properties (weight geometry). This dual grounding makes the label more realist than purely operationalist — the mechanism exists in the weights, not just in the measurements.
Nomological Network
The successor head construct connects to:
- Weight structure — encodes ordinal ordering across multiple domains (structural, confirmed)
- Cross-domain behavior — same heads produce successor effects for years, months, days, letters (behavioral, confirmed)
- Ablation effects — removing successor heads degrades ordinal predictions (causal, confirmed)
- Non-ordinal control — successor heads do not drive non-ordinal tasks (specificity, partially confirmed)
- Training dynamics — when does the successor structure emerge during training? (untested)
- Cross-model prediction — do other architectures develop the same mechanism? (untested)
- Probing convergence — do probes for ordinal features align with successor head directions? (untested)
Three nodes confirmed, one partially confirmed, three unconnected. A moderately thick network — the cross-domain confirmation is particularly strong because it represents multiple independent tests of the same structural prediction.
Neuroscience Lens — Internal Validity
Does the evidence establish that successor heads implement ordinal computation?
Criteria
I1 — Necessity: Pass. Ablating successor heads degrades performance on ordinal/successor tasks across domains. The effect is measurable and domain-general (not just years).
I2 — Sufficiency: Partial. The weight structure implies the heads can compute succession. But full isolation (can these heads alone drive successor behavior with everything else ablated?) is not reported.
I3 — Specificity: Partial. The cross-domain pattern provides implicit specificity: successor heads are specific to ordinal tasks. They should not fire on tasks without ordinal structure (sentiment, syntax). This is partially verified — ablation on non-ordinal tasks shows smaller effects.
I4 — Consistency: Partial. Cross-domain consistency is strong (the same heads work across years, months, letters). Cross-model consistency is limited — are the same heads successors in GPT-2 Medium? In Pythia?
I5 — Confound control: Not tested. Only a single ablation method is used, so method-specific artifacts cannot be ruled out.
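The I1 logic can be sketched as a measurement procedure. The simulation below is entirely synthetic (invented baselines and head contribution); it only illustrates what “measurable and domain-general” means operationally:

```python
# Toy sketch of an I1 necessity measurement (all numbers invented): the
# "model" prefers the true successor via a per-domain baseline plus a
# hypothetical successor-head contribution; ablation zeroes that
# contribution, and we record the per-domain drop in successor preference.

BASELINE = {"years": 0.5, "months": 0.4, "letters": 0.3}  # invented baselines
HEAD_BOOST = 1.2                                          # invented head effect

def successor_score(domain, head_active):
    """Successor-token preference with the head intact or ablated."""
    return BASELINE[domain] + (HEAD_BOOST if head_active else 0.0)

# Necessity evidence: the ablation-induced drop is large in every domain.
drops = {d: successor_score(d, True) - successor_score(d, False) for d in BASELINE}
print(drops)  # a ~1.2-point drop in every domain -> domain-general necessity
```

A real run would replace the invented scores with measured logit differences per domain, but the per-domain comparison is the same.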
| Criterion | Verdict | Key evidence |
|---|---|---|
| I1 Necessity | Pass | Cross-domain ablation effects |
| I2 Sufficiency | Partial | Structural implication, not isolation |
| I3 Specificity | Partial | Ordinal vs. non-ordinal contrast |
| I4 Consistency | Partial | Cross-domain strong; cross-model limited |
| I5 Confound control | Not tested | Single method |
Key Distinctions
- Single vs double dissociation: Cross-domain ablation provides partial double dissociation — ablating successor heads impairs ordinal tasks while producing smaller effects on non-ordinal tasks. This is stronger than a pure single dissociation but not a formal double-dissociation design with a matched control circuit.
- Lesion vs stimulation: Only lesion-style evidence (ablation) is reported. Stimulation (amplifying successor head signals to force ordinal predictions) would test whether the mechanism is steerable, not just necessary.
Dissociation Matrix
| | Year succession | Month succession | Letter succession | Non-ordinal control |
|---|---|---|---|---|
| Ablate successor heads | ↓↓ | ↓↓ | ↓↓ | ↓ (small) |
| Ablate non-successor heads | ? | ? | ? | ? |
The top row is well-filled across domains — a strength. The contrast between ordinal (large effect) and non-ordinal (small effect) provides implicit specificity. However, the converse (ablating non-successor heads and measuring ordinal task impact) is not tested, leaving the formal double-dissociation incomplete.
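One way to see what completing the matrix would buy: with the bottom row filled in (here with invented effect sizes), a double dissociation reduces to a positive interaction contrast:

```python
# Hypothetical completion of the dissociation matrix (all effect sizes
# invented): performance drop for each lesion x task cell. A formal double
# dissociation requires a crossover: each lesion hurts "its own" task class
# far more than the other.

effects = {
    ("ablate_successor_heads", "ordinal"):     0.80,
    ("ablate_successor_heads", "non_ordinal"): 0.10,
    ("ablate_control_heads",   "ordinal"):     0.05,  # the untested bottom row
    ("ablate_control_heads",   "non_ordinal"): 0.60,  # the untested bottom row
}

# Interaction contrast: large and positive -> double dissociation.
interaction = (effects[("ablate_successor_heads", "ordinal")]
               - effects[("ablate_successor_heads", "non_ordinal")]
               - effects[("ablate_control_heads", "ordinal")]
               + effects[("ablate_control_heads", "non_ordinal")])
print(round(interaction, 2))  # 1.25
```

Only the top row of `effects` has empirical support; the bottom row is exactly the experiment the text says is missing.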
Pharmacology Lens — External Validity
Does intervening on successor heads produce predictable downstream effects?
Criteria
E1 — Intervention reach: Not tested. Whether stimulating successor heads can steer the model toward successor behavior (making it always predict the next item in any sequence) remains an open question.
E2 — Graded response: Implicit. The weight structure implies graded effects — items further from the reference should receive proportionally stronger boosts. Not directly measured as a dose-response.
E3 — Selectivity: Partial. The cross-domain generality is both a strength and a limitation: the mechanism is selective for ordinal tasks but not selective for any particular ordinal domain.
E4 — Effect magnitude: Moderate. Successor heads contribute meaningfully to ordinal predictions but are not the sole mechanism.
E5 — Robustness: Strong (within scope). Works across years, months, days, letters, numbers. The robustness across domains is the primary evidence.
E6 — Cross-architecture: Not tested. Is the successor mechanism a universal attention-head computation or specific to GPT-2’s architecture?
| Criterion | Verdict | Key evidence |
|---|---|---|
| E1 Intervention reach | Not tested | — |
| E2 Graded response | Implicit | Structural prediction |
| E3 Selectivity | Partial | Selective for ordinal class |
| E4 Effect magnitude | Moderate | Contributing, not sole mechanism |
| E5 Robustness | Strong | Multi-domain generalization |
| E6 Cross-architecture | Not tested | — |
Key Distinctions
- Affinity vs efficacy: Successor heads show both affinity (they activate on ordinal sequences) and efficacy (ablation degrades ordinal predictions). The cross-domain evidence means this affinity-efficacy pairing is confirmed across multiple independent test cases.
- Therapeutic window: Since successor heads are claimed as general-purpose primitives (not task-specific), the concept of a “therapeutic window” shifts — any intervention affects all ordinal tasks simultaneously. There is no selective dosing for one domain.
- Receptor reserve: Whether backup mechanisms compensate when successor heads are ablated is not characterized. The partial (non-total) effect of ablation suggests some redundancy exists.
Dose-Response Curve
The successor heads’ dose-response is largely uncharacterized. We have:
- Complete ablation: measurable degradation across ordinal domains
- Cross-domain confirmation: the same intervention degrades multiple domains (consistent direction)
- Non-ordinal control: smaller effect on non-ordinal tasks (selectivity boundary exists)
What’s missing:
- No parametric sweep — no graded ablation between 0% and 100%
- No stimulation experiment — amplifying successor signals to test whether predictions shift toward “next item”
- No EC₅₀ characterization — at what intervention strength does the ordinal effect become detectable?
The structural prediction (items further from reference get proportionally larger boosts) implies a dose-response exists in the weights, but this has not been measured as a behavioral curve.
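The missing parametric sweep could look like the following sketch, which uses a synthetic logistic response curve (not measured data) to show how an EC₅₀-style threshold would be read off a graded-ablation experiment:

```python
import numpy as np

# Sketch of the missing parametric sweep (synthetic response curve): scale
# the successor-head ablation by alpha in [0, 1] and locate the strength
# giving half the maximal behavioral effect (an EC50 analogue).

def ordinal_effect(alpha, k=10.0, mid=0.5):
    """Invented logistic response: behavioral degradation vs. ablation strength."""
    return 1.0 / (1.0 + np.exp(-k * (alpha - mid)))

alphas = np.linspace(0.0, 1.0, 101)
effects = ordinal_effect(alphas)
effects = (effects - effects.min()) / (effects.max() - effects.min())  # to [0, 1]

ec50 = alphas[np.argmin(np.abs(effects - 0.5))]
print(round(float(ec50), 2))  # 0.5 for this symmetric synthetic curve
```

In a real experiment, `ordinal_effect` would be replaced by measured task degradation at each interpolation strength between the clean and fully ablated head output.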
Measurement Theory Lens — Measurement Validity
Are the instruments reliable and well-calibrated?
Criteria
M1 — Reliability: Not reported. No confidence intervals on ordinal structure measurements.
M2 — Invariance: Pass (within model). The same heads show successor structure across domains — strong within-model invariance.
M3 — Baseline separation: Pass. Non-successor heads do not show multi-domain ordinal structure. The measurement cleanly separates.
M4 — Sensitivity: Good. The multi-domain criterion is more sensitive than a single-domain criterion — it distinguishes general successor heads from domain-specific ordinal heads.
M5 — Calibration: Not reported.
M6 — Construct coverage: Good. Both structural (weight analysis) and behavioral (multi-domain ablation) evidence contribute.
| Criterion | Verdict | Key evidence |
|---|---|---|
| M1 Reliability | Not reported | — |
| M2 Invariance | Pass | Cross-domain consistency |
| M3 Baseline separation | Pass | Clear successor/non-successor distinction |
| M4 Sensitivity | Good | Multi-domain criterion is discriminating |
| M5 Calibration | Not reported | — |
| M6 Construct coverage | Good | Structural + behavioral |
Key Distinctions
- Reliability vs validity: The multi-domain criterion provides strong face validity (the measurement captures something real about ordinal computation). But without confidence intervals or test-retest measurements, reliability is assumed rather than demonstrated.
- Convergent vs discriminant validity: The same measurement (ordinal structure in the weights) converges across domains — this is implicit convergent validity from the phenomenon. Discriminant validity (do these heads score low on non-ordinal structure metrics?) is partially demonstrated through the non-ordinal control.
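A test-retest or resampling check for M1 could be as simple as a percentile bootstrap over per-prompt scores. The sketch below uses invented per-prompt scores:

```python
import random

# Sketch of the missing M1 reliability estimate (per-prompt scores are
# invented placeholders): bootstrap a 95% confidence interval for a head's
# mean ordinal-structure score over resampled prompts.

random.seed(0)  # deterministic resampling for the illustration
scores = [0.82, 0.79, 0.85, 0.90, 0.76, 0.88, 0.81, 0.84]

def bootstrap_ci(xs, n_boot=2000, lo=0.025, hi=0.975):
    """Percentile bootstrap CI for the mean of xs."""
    means = sorted(
        sum(random.choices(xs, k=len(xs))) / len(xs) for _ in range(n_boot)
    )
    return means[int(lo * n_boot)], means[int(hi * n_boot)]

low, high = bootstrap_ci(scores)
print(round(low, 3), round(high, 3))  # an interval around the ~0.83 sample mean
```

Reporting such an interval per head would turn “Not reported” into a quantified reliability claim.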
MTMM Matrix
| | Weight analysis (years) | Weight analysis (months) | Ablation (years) | Ablation (months) |
|---|---|---|---|---|
| Weight analysis (years) | — | high (same heads) | moderate | ? |
| Weight analysis (months) | high | — | ? | moderate |
| Ablation (years) | moderate | ? | — | high (same heads) |
| Ablation (months) | ? | moderate | high | — |
Cross-domain convergence (off-diagonal same-method cells) is high — the same heads identified structurally in one domain appear in another. Cross-method convergence (structural vs. ablation for the same domain) is moderate — structural identification and causal effects point to overlapping head sets. This is an unusually well-filled MTMM for an MI result, though formal correlation values are not reported.
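If per-head scores were reported, the MTMM cells could be filled with actual correlations. A sketch with invented scores for five hypothetical heads:

```python
import numpy as np

# Sketch of a quantitative MTMM fill-in (all per-head scores invented):
# each row is one measurement (method x domain) over the same five heads.
# High correlations between different methods on the same heads would
# turn the "moderate"/"?" cells into numbers.

measures = {
    "weight_years":  [0.9, 0.8, 0.1, 0.1, 0.2],
    "weight_months": [0.9, 0.7, 0.2, 0.1, 0.1],
    "ablate_years":  [1.2, 0.9, 0.1, 0.2, 0.1],
    "ablate_months": [1.1, 0.8, 0.2, 0.1, 0.2],
}
names = list(measures)
corr = np.corrcoef(np.array(list(measures.values())))  # 4x4 MTMM matrix

# Cross-method cell for one domain: weight analysis vs. ablation (years).
cross_method = corr[names.index("weight_years"), names.index("ablate_years")]
print(round(float(cross_method), 2))  # 0.98 for these invented scores
```

The formal version of the table is exactly this `corr` matrix, computed from reported per-head measurements instead of placeholders.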
MI Lens — Interpretive Validity
Is the “general successor primitive” interpretation warranted?
Criteria
V1 — Level declaration: Pass. Structural + algorithmic — names what the heads compute (ordinal succession) and how (the weight matrix encodes ordering).
V2 — Level-evidence match: Strong. Structural evidence (weight analysis) directly supports a structural/algorithmic claim.
V3 — Narrative coherence: Strong. “Some heads are reusable successor-computing primitives” is a clean, falsifiable story that explains cross-domain generalization.
V4 — Alternative exclusion: Partial. Could these heads be doing something more general (attention to “related items”) that happens to include succession? The structural evidence constrains this — the weight matrix specifically encodes ordering, not general similarity. But whether “successor” is exactly right versus “ordinal proximity” is debatable.
V5 — Scope honesty: Good. “General-purpose ordinal mechanism” matches the evidence scope.
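The “successor” versus “ordinal proximity” ambiguity suggests a concrete discriminating measurement: the shape of the boost profile over ordinal offsets. A sketch with invented profiles:

```python
# Sketch of a test that could sharpen V4 (both profiles invented): inspect
# the head's boost as a function of ordinal offset from the current token.
# A true successor mechanism peaks sharply at +1; an "ordinal proximity"
# mechanism boosts nearby items on both sides.

OFFSETS = [-3, -2, -1, 1, 2, 3]
successor_profile = [0.0, 0.0, 0.1, 1.0, 0.2, 0.1]  # peaked at +1
proximity_profile = [0.2, 0.5, 0.9, 0.9, 0.5, 0.2]  # symmetric decay

def is_successor_like(profile, margin=2.0):
    """Boost at offset +1 must dominate every other offset by `margin`x."""
    peak = profile[OFFSETS.index(1)]
    rest = max(v for o, v in zip(OFFSETS, profile) if o != 1)
    return peak >= margin * rest

print(is_successor_like(successor_profile), is_successor_like(proximity_profile))
# -> True False
```

Measuring the actual profile from the weights would settle which label fits.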
| Criterion | Verdict | Key evidence |
|---|---|---|
| V1 Level declaration | Pass | Structural + algorithmic |
| V2 Level-evidence match | Strong | Direct structural support |
| V3 Narrative coherence | Strong | Cross-domain generalization explained |
| V4 Alternative exclusion | Partial | “Successor” vs. “ordinal proximity” |
| V5 Scope honesty | Good | Matches evidence |
Key Distinctions
- Description vs explanation: The “general successor primitive” account is genuinely explanatory — it explains why the same heads appear across ordinal domains (shared mechanism) and predicts where they should appear (any ordinal task). This goes beyond mere description of which heads are active.
- Component identity vs component role: The role label “successor head” is well-grounded: the structural signature (ordinal ordering in the weights) independently confirms the functional label (ordinal succession behavior). The label is not just based on behavioral observation but has architectural backing.
- Faithfulness vs understanding: The evidence supports both — the identified heads are causally important (faithfulness via ablation) AND the mechanism is understood (ordinal structure in the weights). This combination is rare in MI.
Evidence Convergence Map
- Implementational → Interpretation: Strong. Weight analysis directly shows ordinal encoding; ablation confirms causal role. Multiple implementational sub-modes converge.
- Algorithmic → Interpretation: Strong. “Compute the next item in an ordinal sequence” is a specified algorithm that the structural evidence directly supports. The cross-domain pattern confirms it is a general algorithm, not a task-specific shortcut.
- Computational → Interpretation: Moderate. “Ordinal succession” is well-defined as a computational goal. Whether it is exactly “successor” (next item) or “ordinal proximity” (nearby items) is the remaining ambiguity.
Intervention-Interpretation Matrix
| | Necessity | Sufficiency | Representational | Algorithmic | Computational |
|---|---|---|---|---|---|
| Ablation | ✓ (cross-domain) | — | ∅ | ∅ | ∅ |
| Weight analysis | — | — | ✓ | ✓ (partial) | — |
| Steering | — | — | — | — | — |
| Cross-domain generalization | — | implicit | — | ✓ | ✓ |
The filled cells span both rows and columns more broadly than most MI results. Ablation provides necessity; weight analysis provides representational and partial algorithmic evidence; cross-domain generalization provides algorithmic and computational support. The main gap is sufficiency (no isolation experiment) and steering (no stimulation test).
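The missing steering test has a natural minimal form: inject a scaled copy of the head’s output direction and check for a monotone logit response. The sketch below uses toy random weights, not the actual model:

```python
import numpy as np

# Sketch of the missing stimulation test (toy vectors throughout): add a
# scaled copy of a "successor head" output direction to a residual stream
# and check that the next-item logit rises with steering strength.

rng = np.random.default_rng(0)
d_model, vocab = 16, 5
W_U = rng.normal(size=(d_model, vocab))  # toy unembedding matrix
resid = rng.normal(size=d_model)         # toy residual stream state
next_item = 3                            # vocabulary index of the "next item"
head_dir = W_U[:, next_item]             # direction that writes to that logit

def next_item_logit(alpha):
    """Next-item logit after steering with strength alpha."""
    return float((resid + alpha * head_dir) @ W_U[:, next_item])

logits = [next_item_logit(a) for a in (0.0, 0.5, 1.0, 2.0)]
print(logits == sorted(logits))  # True: steering monotonically raises the logit
```

On a real model, `head_dir` would be the successor head’s output (or an OV-derived direction) and the check would be whether ordinal completions win across steering strengths, filling the empty Steering row above.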
Causal Sufficiency Graph
- Ordinal input → successor head attention: solid (heads attend to ordinal tokens, confirmed across domains)
- Successor head → boosted next-item logit: solid (structural analysis confirms the weight pathway)
- Successor heads → output prediction: solid (ablation confirms causal contribution)
- Input encoding → successor head selection: dashed (how the model identifies that a token is part of an ordinal sequence is not characterized)
The output pathway (successor head → prediction) is solid and multi-domain confirmed. The input pathway (how tokens are identified as ordinal) is the main uncharacterized link — the heads clearly compute succession, but the upstream mechanism that routes ordinal inputs to them is not described.