
Building on the Greater-Than analysis, subsequent work (Gould et al. 2023) identifies successor heads as a general-purpose mechanism — attention heads whose W_OV matrices encode ordinal succession across multiple domains: days of the week (Monday → Tuesday), months (January → February), numbers (1 → 2), and alphabetical sequences (A → B). The claim is that these heads do not merely implement year comparison but encode a general ordinal-successor function that the model reuses across sequence types.

This extends the Greater-Than claim from a task-specific circuit to a general computational primitive — a reusable building block. The construct validity question shifts accordingly: is “successor head” a natural kind (a real computational unit) or a family resemblance (a label applied to heads that happen to do ordinal things)?
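The structural signature behind this claim can be illustrated with a toy sketch (synthetic matrices, not the paper's code or GPT-2's weights): if a head's W_OV maps each item's embedding to its successor's embedding, then the effective token-to-token circuit W_E · W_OV · W_U should maximally boost the *next* item for every item in the sequence.

```python
import numpy as np

# Toy illustration of the successor-head structural signature.
# All matrices are synthetic; shapes and values are assumptions for the sketch.

rng = np.random.default_rng(0)
n_items, d_model = 12, 32                    # e.g. 12 months, toy embedding width

W_E = rng.normal(size=(n_items, d_model))    # toy embedding matrix
W_U = np.linalg.pinv(W_E)                    # toy unembedding (right-inverse of W_E)

# A "successor" OV matrix: send item i's embedding to item (i+1)'s embedding.
P = np.roll(np.eye(n_items), -1, axis=0)     # permutation i -> i+1 (wraps around)
W_OV = np.linalg.pinv(W_E) @ P @ W_E         # d_model x d_model

# Effective token -> boosted-token circuit, as in W_OV weight analysis.
effective = W_E @ W_OV @ W_U                 # n_items x n_items score matrix
pred = effective.argmax(axis=1)
print(pred)                                  # each item's most-boosted token
```

The check is that `pred[i] == (i + 1) % n_items` for every item — the structural prediction a genuine successor head's W_OV should satisfy in each ordinal domain.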

| Lens | Strongest | Weakest | Overall |
|---|---|---|---|
| Construct | C2 Structural plausibility | C5 Convergent | Strong |
| Internal | I1 Necessity | I5 Confound control | Causally suggestive |
| External | E5 Robustness | E1/E6 | Partial |
| Measurement | M2/M3 Invariance + Separation | M1 Reliability | Strong |
| Interpretive | V2/V3 Match + Coherence | V4 Alternative exclusion | Strong |

Overall verdict: Causally suggestive, approaching Mechanistically supported. Successor heads benefit from the same structural clarity as the Greater-Than circuit, with the additional strength of cross-domain generalization. The multi-domain pattern makes the “general computational primitive” claim more convincing than a single-task circuit claim. The case for successor heads as a natural kind is stronger than for most circuits because the same structural signature appears across unrelated domains — this is convergent evidence from the phenomenon itself, even without formal C5 convergent validity from multiple discovery methods.


Philosophy of Science Lens — Construct Validity


Is “successor head” a coherent construct?

C1 — Falsifiability: Pass. The claim predicts that successor heads should encode ordinal structure across multiple domains in their W_OV matrices. A head that encodes year ordering but not month ordering is a year-specific head, not a general successor head. This is a discriminating prediction.

C2 — Structural plausibility: Pass. W_OV matrices are inspected and shown to encode ordinal structure across domains. The same heads that boost “32 → 33” also boost “Monday → Tuesday” and “B → C.” The structural evidence spans domains.

C3 — Task specificity: N/A (honest scope — general purpose). Successor heads are claimed to be general-purpose, not task-specific. The evidence confirms this — they fire across domains. This is the same honest-scope pattern as induction heads: a general mechanism, honestly described as general.

C4 — Minimality: Pass. A small number of heads show the multi-domain successor pattern. Not every head in the model does this — the set is selective.

C5 — Convergent validity: Partial. Evidence from structural analysis (W_OV) and behavioral analysis (ablation effects on successor tasks) converges. A third method (e.g., probing for ordinal features, EAP discovery) would strengthen convergence.

| Criterion | Verdict | Key evidence |
|---|---|---|
| C1 Falsifiability | Pass | Cross-domain structural predictions |
| C2 Structural plausibility | Pass | Multi-domain W_OV ordering |
| C3 Task specificity | N/A (general) | Multi-domain by design |
| C4 Minimality | Pass | Small selective set |
| C5 Convergent validity | Partial | Structural + behavioral |
  • Confirmation vs corroboration: The multi-domain pattern provides genuine corroboration — the same structural signature independently discovered in years, months, days, and letters constitutes multiple independent tests of the “general successor” hypothesis, not a single confirmation replayed across domains.
  • Natural kind vs family resemblance: The structural signature (W_OV encoding ordinal succession) is consistent across domains, suggesting “successor head” is a natural kind rather than a loose family resemblance label. The same mechanism, not just the same behavior.
  • Operationalism vs realism: “Successor head” is defined by both observable behavior (cross-domain ordinal effects) and structural properties (W_OV geometry). This dual grounding makes the label more realist than purely operationalist — the mechanism exists in the weights, not just in the measurements.

The successor head construct connects to:

  • Weight structure — W_OV encodes ordinal ordering across multiple domains (structural, confirmed)
  • Cross-domain behavior — same heads produce successor effects for years, months, days, letters (behavioral, confirmed)
  • Ablation effects — removing successor heads degrades ordinal predictions (causal, confirmed)
  • Non-ordinal control — successor heads do not drive non-ordinal tasks (specificity, partially confirmed)
  • Training dynamics — when does the successor structure emerge during training? (untested)
  • Cross-model prediction — do other architectures develop the same mechanism? (untested)
  • Probing convergence — do probes for ordinal features align with successor head directions? (untested)

Four nodes confirmed, three unconnected. A moderately thick network — the cross-domain confirmation is particularly strong as it represents multiple independent tests of the same structural prediction.


Does the evidence establish that successor heads implement ordinal computation?

I1 — Necessity: Pass. Ablating successor heads degrades performance on ordinal/successor tasks across domains. The effect is measurable and domain-general (not just years).

I2 — Sufficiency: Partial. The W_OV structure implies the heads can compute succession. But full isolation (can these heads alone drive successor behavior with everything else ablated?) is not reported.

I3 — Specificity: Partial. The cross-domain pattern provides implicit specificity: successor heads are specific to ordinal tasks. They should not fire on tasks without ordinal structure (sentiment, syntax). This is partially verified — ablation on non-ordinal tasks shows smaller effects.

I4 — Consistency: Partial. Cross-domain consistency is strong (the same heads work across years, months, letters). Cross-model consistency is limited — are the same heads successors in GPT-2 Medium? In Pythia?

I5 — Confound control: Not tested. Single ablation method.

| Criterion | Verdict | Key evidence |
|---|---|---|
| I1 Necessity | Pass | Cross-domain ablation effects |
| I2 Sufficiency | Partial | Structural implication, not isolation |
| I3 Specificity | Partial | Ordinal vs. non-ordinal contrast |
| I4 Consistency | Partial | Cross-domain strong; cross-model limited |
| I5 Confound control | Not tested | Single method |
  • Single vs double dissociation: Cross-domain ablation provides partial double dissociation — successor heads impair ordinal tasks but show smaller effects on non-ordinal tasks. This is stronger than pure single dissociation but not a formal double-dissociation design with a matched control circuit.
  • Lesion vs stimulation: Only lesion-style evidence (ablation) is reported. Stimulation (amplifying successor head signals to force ordinal predictions) would test whether the mechanism is steerable, not just necessary.
| | Year succession | Month succession | Letter succession | Non-ordinal control |
|---|---|---|---|---|
| Ablate successor heads | ↓↓ | ↓↓ | ↓↓ | ↓ (small) |
| Ablate non-successor heads | ? | ? | ? | ? |

The top row is well-filled across domains — a strength. The contrast between ordinal (large effect) and non-ordinal (small effect) provides implicit specificity. However, the converse (ablating non-successor heads and measuring ordinal task impact) is not tested, leaving the formal double-dissociation incomplete.
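The missing converse row amounts to a 2×2 design. A toy sketch of that design (illustrative numbers, not GPT-2 measurements): treat each task score as a sum of per-head contributions, ablate each head set in turn, and check that the drops dissociate across tasks.

```python
import numpy as np

# Toy double-dissociation design. Contribution values are assumptions for
# illustration; a real experiment would measure these via ablation on GPT-2.
# Each head contributes (ordinal-task score, non-ordinal-task score).
contrib = {
    "successor_heads":     np.array([0.80, 0.05]),   # mostly ordinal
    "non_successor_heads": np.array([0.05, 0.70]),   # mostly non-ordinal
    "background":          np.array([0.15, 0.25]),   # everything else
}

def task_scores(ablated=()):
    """Sum contributions of all non-ablated head sets: (ordinal, non-ordinal)."""
    return sum(v for k, v in contrib.items() if k not in ablated)

baseline = task_scores()
for head_set in ("successor_heads", "non_successor_heads"):
    drop = baseline - task_scores(ablated=(head_set,))
    print(f"ablate {head_set}: drop (ordinal, non-ordinal) = {drop}")
```

The formal double dissociation holds when each head set's ablation produces a large drop on its own task and a small drop on the other — i.e., both off-diagonal cells of the 2×2 are small.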


Does intervening on successor heads produce predictable downstream effects?

E1 — Intervention reach: Not tested. Can you steer the model toward successor behavior (make it always predict the next item in any sequence) by stimulating successor heads? Untested.

E2 — Graded response: Implicit. The W_OV structure implies graded effects — items further from the reference should receive proportionally stronger boosts. Not directly measured as a dose-response.

E3 — Selectivity: Partial. The cross-domain generality is both a strength and a limitation: the mechanism is selective for ordinal tasks but not selective for any particular ordinal domain.

E4 — Effect magnitude: Moderate. Successor heads contribute meaningfully to ordinal predictions but are not the sole mechanism.

E5 — Robustness: Strong (within scope). Works across years, months, days, letters, numbers. The robustness across domains is the primary evidence.

E6 — Cross-architecture: Not tested. Is the successor mechanism a universal attention-head computation or specific to GPT-2’s architecture?

| Criterion | Verdict | Key evidence |
|---|---|---|
| E1 Intervention reach | Not tested | — |
| E2 Graded response | Implicit | Structural prediction |
| E3 Selectivity | Partial | Selective for ordinal class |
| E4 Effect magnitude | Moderate | Contributing, not sole mechanism |
| E5 Robustness | Strong | Multi-domain generalization |
| E6 Cross-architecture | Not tested | — |
  • Affinity vs efficacy: Successor heads show both affinity (they activate on ordinal sequences) and efficacy (ablation degrades ordinal predictions). The cross-domain evidence means this affinity-efficacy pairing is confirmed across multiple independent test cases.
  • Therapeutic window: Since successor heads are claimed as general-purpose primitives (not task-specific), the concept of a “therapeutic window” shifts — any intervention affects all ordinal tasks simultaneously. There is no selective dosing for one domain.
  • Receptor reserve: Whether backup mechanisms compensate when successor heads are ablated is not characterized. The partial (non-total) effect of ablation suggests some redundancy exists.

The successor heads’ dose-response is largely uncharacterized. We have:

  • Complete ablation: measurable degradation across ordinal domains
  • Cross-domain confirmation: the same intervention degrades multiple domains (consistent direction)
  • Non-ordinal control: smaller effect on non-ordinal tasks (selectivity boundary exists)

What’s missing:

  • No parametric sweep — no graded ablation between 0% and 100%
  • No stimulation experiment — amplifying successor signals to test whether predictions shift toward “next item”
  • No EC₅₀ characterization — at what intervention strength does the ordinal effect become detectable?

The structural prediction (items further from reference get proportionally larger W_OV boosts) implies a dose-response exists in the weights, but this has not been measured as a behavioral curve.
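What a parametric sweep would look like can be sketched with toy numbers (assumed logit boost and softmax readout — a hypothetical setup, not the paper's experiment): scale the head's contribution by α ∈ [0, 1], read off the probability assigned to the correct next item, and locate a crude EC₅₀-style midpoint on the resulting curve.

```python
import numpy as np

# Toy dose-response sweep for a graded-ablation experiment. The logit boost
# and candidate count are assumptions chosen purely for illustration.
head_logit_boost = 4.0          # assumed boost the head gives the next item
n_candidates = 12               # correct next item competes with 11 distractors

def next_item_prob(alpha):
    """Softmax probability of the next item when the head is scaled by alpha."""
    logits = np.zeros(n_candidates)
    logits[0] = alpha * head_logit_boost    # index 0 = correct next item
    return np.exp(logits[0]) / np.exp(logits).sum()

alphas = np.linspace(0.0, 1.0, 21)
curve = np.array([next_item_prob(a) for a in alphas])

# Crude EC50-style readout: alpha where the effect crosses half its total rise.
half = curve[0] + 0.5 * (curve[-1] - curve[0])
ec50 = alphas[np.argmax(curve >= half)]
print(f"p(next) at alpha=0: {curve[0]:.3f}, at alpha=1: {curve[-1]:.3f}, EC50 ~ {ec50:.2f}")
```

The real version would replace the toy readout with graded activation patching on the successor heads and the model's actual next-token probabilities; the point of the sketch is only the shape of the measurement.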


Measurement Theory Lens — Measurement Validity


Are the instruments reliable and well-calibrated?

M1 — Reliability: Not reported. No confidence intervals on ordinal structure measurements.

M2 — Invariance: Pass (within model). The same heads show successor structure across domains — strong within-model invariance.

M3 — Baseline separation: Pass. Non-successor heads do not show multi-domain ordinal structure. The measurement cleanly separates.

M4 — Sensitivity: Good. The multi-domain criterion is more sensitive than a single-domain criterion — it distinguishes general successor heads from domain-specific ordinal heads.

M5 — Calibration: Not reported.

M6 — Construct coverage: Good. Both structural (W_OV) and behavioral (multi-domain ablation) evidence. Good coverage.

| Criterion | Verdict | Key evidence |
|---|---|---|
| M1 Reliability | Not reported | — |
| M2 Invariance | Pass | Cross-domain consistency |
| M3 Baseline separation | Pass | Clear successor/non-successor distinction |
| M4 Sensitivity | Good | Multi-domain criterion is discriminating |
| M5 Calibration | Not reported | — |
| M6 Construct coverage | Good | Structural + behavioral |
  • Reliability vs validity: The multi-domain criterion provides strong face validity (the measurement captures something real about ordinal computation). But without confidence intervals or test-retest measurements, reliability is assumed rather than demonstrated.
  • Convergent vs discriminant validity: The same measurement (ordinal structure in W_OV) converges across domains — this is implicit convergent validity from the phenomenon. Discriminant validity (do these heads score low on non-ordinal structure metrics?) is partially demonstrated through the non-ordinal control.
| | W_OV analysis (years) | W_OV analysis (months) | Ablation (years) | Ablation (months) |
|---|---|---|---|---|
| W_OV analysis (years) | — | high (same heads) | moderate | ? |
| W_OV analysis (months) | high | — | ? | moderate |
| Ablation (years) | moderate | ? | — | high (same heads) |
| Ablation (months) | ? | moderate | high | — |

Cross-domain convergence (off-diagonal same-method cells) is high — the same heads identified structurally in one domain appear in another. Cross-method convergence (structural vs. ablation for the same domain) is moderate — structural identification and causal effects point to overlapping head sets. This is an unusually well-filled MTMM for an MI result, though formal correlation values are not reported.
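One way to make these qualitative cells quantitative (hypothetical head indices, not values from the paper): represent each method–domain cell as a set of implicated (layer, head) pairs and fill the matrix with Jaccard overlaps between the sets.

```python
# Sketch of quantifying MTMM cells as head-set overlap. The head indices
# below are made up for illustration; a real analysis would use the sets
# each method actually identifies.

def jaccard(a, b):
    """Intersection-over-union of two head sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical discovered head sets per (method, domain) cell.
heads = {
    ("wov", "years"):      {(9, 1), (10, 7)},
    ("wov", "months"):     {(9, 1), (10, 7)},           # identical -> "high"
    ("ablation", "years"): {(9, 1), (10, 7), (11, 3)},  # overlapping -> "moderate"
}

print(jaccard(heads[("wov", "years")], heads[("wov", "months")]))      # -> 1.0
print(jaccard(heads[("wov", "years")], heads[("ablation", "years")]))  # -> 2/3
```

Reporting these overlap values would turn the "high/moderate/?" entries into the formal correlation-style MTMM the lens calls for.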


Is the “general successor primitive” interpretation warranted?

V1 — Level declaration: Pass. Structural + algorithmic — names what the heads compute (ordinal succession) and how (W_OV encodes ordering).

V2 — Level-evidence match: Strong. Structural evidence (W_OV analysis) directly supports a structural/algorithmic claim.

V3 — Narrative coherence: Strong. “Some heads are reusable successor-computing primitives” is a clean, falsifiable story that explains cross-domain generalization.

V4 — Alternative exclusion: Partial. Could these heads be doing something more general (attention to “related items”) that happens to include succession? The structural evidence constrains this — W_OV specifically encodes ordering, not general similarity. But whether “successor” is exactly right versus “ordinal proximity” is debatable.

V5 — Scope honesty: Good. “General-purpose ordinal mechanism” matches the evidence scope.

| Criterion | Verdict | Key evidence |
|---|---|---|
| V1 Level declaration | Pass | Structural + algorithmic |
| V2 Level-evidence match | Strong | Direct structural support |
| V3 Narrative coherence | Strong | Cross-domain generalization explained |
| V4 Alternative exclusion | Partial | “Successor” vs. “ordinal proximity” |
| V5 Scope honesty | Good | Matches evidence |
  • Description vs explanation: The “general successor primitive” account is genuinely explanatory — it explains why the same heads appear across ordinal domains (shared mechanism) and predicts where they should appear (any ordinal task). This goes beyond mere description of which heads are active.
  • Component identity vs component role: The role label “successor head” is well-grounded: the structural signature (W_OV ordering) independently confirms the functional label (ordinal succession behavior). The label is not just based on behavioral observation but has architectural backing.
  • Faithfulness vs understanding: The evidence supports both — the identified heads are causally important (faithfulness via ablation) AND the mechanism is understood (ordinal structure in W_OV). This combination is rare in MI.
  • Implementational → Interpretation: Strong. W_OV weight analysis directly shows ordinal encoding; ablation confirms causal role. Multiple implementational sub-modes converge.
  • Algorithmic → Interpretation: Strong. “Compute the next item in an ordinal sequence” is a specified algorithm that the structural evidence directly supports. The cross-domain pattern confirms it is a general algorithm, not a task-specific shortcut.
  • Computational → Interpretation: Moderate. “Ordinal succession” is well-defined as a computational goal. Whether it is exactly “successor” (next item) or “ordinal proximity” (nearby items) is the remaining ambiguity.
| | Necessity | Sufficiency | Representational | Algorithmic | Computational |
|---|---|---|---|---|---|
| Ablation | ✓ (cross-domain) | | | | |
| W_OV analysis | | | ✓ | ✓ (partial) | |
| Steering | | | | | |
| Cross-domain generalization | | | | ✓ | implicit |

The filled cells span both rows and columns more broadly than most MI results. Ablation provides necessity; weight analysis provides representational and partial algorithmic evidence; cross-domain generalization provides algorithmic and computational support. The main gap is sufficiency (no isolation experiment) and steering (no stimulation test).

  • Ordinal input → successor head attention: solid (heads attend to ordinal tokens, confirmed across domains)
  • Successor head W_OV → boosted next-item logit: solid (structural analysis confirms the weight pathway)
  • Successor heads → output prediction: solid (ablation confirms causal contribution)
  • Input encoding → successor head selection: dashed (how the model identifies that a token is part of an ordinal sequence is not characterized)

The output pathway (successor head → prediction) is solid and multi-domain confirmed. The input pathway (how tokens are identified as ordinal) is the main uncharacterized link — the heads clearly compute succession, but the upstream mechanism that routes ordinal inputs to them is not described.