Criterion C2 — Structural Plausibility
Section titled “Criterion C2 — Structural Plausibility”| Validity type | Construct |
| Pass condition | Circuit components are at the layers and positions theory predicts, with weight-space signatures consistent with the claimed computational role |
| Evidence family | Structural (weight-space) |
| Minimum reporting | Layer/position of each component; relevant weight-space metric with value; comparison to null (random same-layer heads) |
| Common failure mode | Reporting only ablation results; never checking whether nominated heads have the right structural properties |
What this criterion requires
Section titled “What this criterion requires”Structural plausibility asks: if this component were doing what the circuit claim says, what would its weights look like — and do they?
Satisfied when:
- Layer and position are theoretically expected. Name-mover heads should be late-layer (layers 9–11 in GPT-2 Small). S-inhibition heads should attend to the subject token. An SVA MLP should be downstream of primary syntactic integration heads.
- Weight-space signatures are consistent. A copying head should have low-rank W_OV (near rank-1, singular value gap ≥ 0.8). An induction head should have predictable W_QK structure. An SVA head should have W_in directions geometrically close to the number-information subspace.
- The signature is specific to nominated components — not just any same-layer head.
This project’s results
Section titled “This project’s results”- An SAE direction aligned at cos = 0.82 with L8H6 W_OV — consistent with an output-writing role in IOI. This is a structural plausibility result, not a causal result.
- Another direction aligned at cos = 0.72 with L3H0 — consistent with an early syntactic role.
These findings satisfy structural plausibility for the IOI circuit claim (Finding 25). They do not establish causation; they make the construct coherent.
Primary instruments
Section titled “Primary instruments”- B01 SpectralSVD: spectral gap of W_OV for copying/inhibition heads
- B02 OVQK analysis: eigenvalue analysis for attention pattern detection
- B03 Weight alignment: cosine similarity of W_dec directions to known-circuit templates
- B07 Effective rank: participation ratio of W matrices per head
- B08 Template distance: Frobenius distance to known-circuit W_OV templates
Minimum reporting rule
Section titled “Minimum reporting rule”For each component: layer/position + why theoretically expected + one weight-space metric + comparison to null. If a component is at an unexpected layer or has an unexpected signature, this must be noted — not silently omitted.
Common failures
Section titled “Common failures”Ablation-only circuit discovery: Component added because ablating it degrades performance, with no check of whether its weights are consistent with the claimed role.
Layer anomalies ignored: A circuit method nominates a very early-layer head for a late-stage role without explanation.