B04 — Weight Alignment
This framework asks: do circuit heads share privileged directions in weight space, and are those directions distinct from those of non-circuit heads?
Weight alignment measures the cosine similarity between the top SVD directions of different attention heads’ weight matrices. If multiple heads in a claimed circuit share aligned principal directions, this suggests they operate on a common subspace — potentially implementing a compositional pipeline where one head’s output lies in another head’s input space. Conversely, high alignment between circuit and non-circuit heads would undermine a claim of structural specialization.
This instrument bridges single-head structural analysis (B01-B03) to multi-head circuit topology. It tests whether the weight-level structure supports the compositional claims that circuit narratives make (e.g., “induction head Q aligns with previous-token head OV output”).
Theoretical grounding
| Source | Year | Key contribution |
|---|---|---|
| Elhage et al., “A Mathematical Framework for Transformer Circuits” | 2021 | Composition via shared subspaces in the residual stream |
| Merullo et al., arXiv 2305.16130 | 2023 | Linear representations and subspace alignment across components |
| Bricken et al., “Towards Monosemanticity” | 2023 | Feature directions and alignment in representation space |
| Hanna et al., arXiv 2305.00586 | 2023 | Compositional structure in GPT-2 circuits via weight analysis |
Core concept
For two heads \( h_1 \) and \( h_2 \), let \( u_1^{(h_1)} \) denote the top left singular vector of \( W_{OV}^{(h_1)} \) and \( v_1^{(h_2)} \) denote the top right singular vector of \( W_{QK}^{(h_2)} \). The alignment score is:
\[ \text{align}(h_1 \to h_2) = \left| \cos\left( u_1^{(h_1)}, v_1^{(h_2)} \right) \right| = \frac{\left| u_1^{(h_1)} \cdot v_1^{(h_2)} \right|}{\left\| u_1^{(h_1)} \right\| \left\| v_1^{(h_2)} \right\|} \]
Because singular vectors have unit norm, the denominator equals 1 and the score reduces to the absolute dot product. High alignment indicates that head \( h_1 \)'s primary output direction is head \( h_2 \)'s primary input direction, a necessary condition for sequential composition. The metric generalizes to subspace alignment using the top-k singular vectors and principal angles.
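The score and its top-k generalization can be sketched as follows. This is a minimal illustration, not the code in 18_weight_extended.py. It assumes each head's W_OV and W_QK is available as a square d_model × d_model NumPy array under a column-vector convention, so that left singular vectors span the output side and right singular vectors span the input side. The function names and the default `k` are hypothetical.

```python
import numpy as np


def top_direction_alignment(w_ov_h1, w_qk_h2):
    """|cos| between the top left singular vector of W_OV (head 1)
    and the top right singular vector of W_QK (head 2)."""
    u, _, _ = np.linalg.svd(w_ov_h1)
    _, _, vt = np.linalg.svd(w_qk_h2)
    u1 = u[:, 0]   # dominant output direction of head 1 (assumed convention)
    v1 = vt[0]     # dominant input direction of head 2 (assumed convention)
    # Singular vectors are unit norm, so the denominator is 1 up to rounding;
    # it is kept to mirror the formula.
    return float(abs(u1 @ v1) / (np.linalg.norm(u1) * np.linalg.norm(v1)))


def subspace_alignment(w_ov_h1, w_qk_h2, k=4):
    """Cosines of the principal angles between the top-k output subspace
    of W_OV (head 1) and the top-k input subspace of W_QK (head 2)."""
    u, _, _ = np.linalg.svd(w_ov_h1)
    _, _, vt = np.linalg.svd(w_qk_h2)
    basis_out = u[:, :k]    # (d_model, k), orthonormal columns
    basis_in = vt[:k].T     # (d_model, k), orthonormal columns
    # For orthonormal bases, the singular values of the cross-Gram matrix
    # are the cosines of the principal angles, sorted in descending order.
    cosines = np.linalg.svd(basis_out.T @ basis_in, compute_uv=False)
    return np.clip(cosines, 0.0, 1.0)
```

With `k=1` the subspace version returns a single cosine equal to the top-direction score. How the k cosines are summarized (mean, minimum, or the full spectrum) is a design choice the text above leaves open.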
Within a circuit, we compute the mean pairwise alignment among circuit heads and compare it to the mean alignment between circuit and non-circuit heads. A significantly higher within-circuit alignment supports the claim that the circuit forms a coherent computational unit.
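A sketch of this comparison, assuming the pairwise scores are already collected in an `(n_heads, n_heads)` array where entry `[i, j]` holds align(h_i → h_j). The function name is hypothetical. Excluding self-pairs and counting circuit-to-non-circuit pairs in both directions are assumptions made here, not details stated by the source. The circuit needs at least two heads and at least one head must lie outside it.

```python
import numpy as np


def mean_alignment_contrast(align, circuit_idx):
    """Return (within_circuit_mean, circuit_vs_non_circuit_mean).

    align[i, j] is the alignment score from head i to head j.
    """
    n = align.shape[0]
    in_circuit = np.zeros(n, dtype=bool)
    in_circuit[np.asarray(circuit_idx)] = True

    # Within-circuit pairs: both heads in the circuit, self-pairs excluded.
    off_diag = ~np.eye(n, dtype=bool)
    within_mask = np.outer(in_circuit, in_circuit) & off_diag

    # Cross pairs: exactly one head in the circuit, either direction.
    cross_mask = np.outer(in_circuit, ~in_circuit) | np.outer(~in_circuit, in_circuit)

    return float(align[within_mask].mean()), float(align[cross_mask].mean())
```

A within-circuit mean well above the cross mean is the pattern the paragraph above describes. Whether the gap is significant still needs a null distribution.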
Instruments under B04
Cosine Alignment of Top SVD Directions (18_weight_extended.py)
Computes pairwise \( |\cos(\theta)| \) between the top SVD directions of W_OV and W_QK for all head pairs. Reports: (1) within-circuit mean alignment, (2) between-circuit mean alignment (circuit heads paired with non-circuit heads), (3) alignment z-score relative to random head subsets.
What it establishes: Whether the identified circuit has structurally coherent weight directions that distinguish it from arbitrary head groupings.
What it does not establish: Causal dependence — aligned directions may exist but never be activated together on task-relevant inputs.
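One way the z-score against random head subsets could be computed is sketched below. The source does not specify the null model, so the choices here are assumptions rather than the script's actual behavior: uniform sampling of same-sized subsets without replacement, 1000 samples, and the sample standard deviation. The function names are hypothetical.

```python
import numpy as np


def within_mean(align, idx):
    """Mean pairwise alignment among the heads in idx, self-pairs excluded."""
    idx = np.asarray(idx)
    sub = align[np.ix_(idx, idx)]
    off_diag = ~np.eye(len(idx), dtype=bool)
    return float(sub[off_diag].mean())


def alignment_z_score(align, circuit_idx, n_samples=1000, seed=0):
    """Z-score of the circuit's within-group mean alignment against
    random head subsets of the same size (assumed null model)."""
    rng = np.random.default_rng(seed)
    n_heads = align.shape[0]
    k = len(circuit_idx)
    observed = within_mean(align, circuit_idx)
    null = np.array([
        within_mean(align, rng.choice(n_heads, size=k, replace=False))
        for _ in range(n_samples)
    ])
    std = null.std(ddof=1)
    if std == 0:
        # Every subset scores the same, so the z-score is undefined.
        return float("nan")
    return float((observed - null.mean()) / std)
```

Random subsets may overlap the circuit itself, which makes the null slightly conservative. Sampling only from non-circuit heads is an alternative that answers a somewhat different question.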
Usage:
uv run python 18_weight_extended.py --tasks ioi sva
Reading the scores
| Pattern | What it means |
|---|---|
| High within-circuit alignment, low between-circuit | Circuit heads form a structurally distinct group |
| Alignment between early-layer OV and late-layer QK | Evidence of sequential composition (Q- or K-composition) |
| Uniform alignment across all heads | No structural differentiation — circuit boundary may be arbitrary |
| High alignment z-score (> 2.0) | Within-circuit coherence is unlikely under random grouping |