This framework asks: do circuit heads share privileged directions in weight space, and are those directions distinct from non-circuit heads?

Weight alignment measures the cosine similarity between the top SVD directions of different attention heads’ weight matrices. If multiple heads in a claimed circuit share aligned principal directions, this suggests they operate on a common subspace — potentially implementing a compositional pipeline where one head’s output lies in another head’s input space. Conversely, high alignment between circuit and non-circuit heads would undermine a claim of structural specialization.

This instrument bridges single-head structural analysis (B01-B03) to multi-head circuit topology. It tests whether the weight-level structure supports the compositional claims that circuit narratives make (e.g., “induction head Q aligns with previous-token head OV output”).

| Source | Year | Key contribution |
| --- | --- | --- |
| Elhage et al., “A Mathematical Framework for Transformer Circuits” | 2021 | Composition via shared subspaces in the residual stream |
| Merullo et al., arXiv 2305.16130 | 2023 | Linear representations and subspace alignment across components |
| Bricken et al., “Towards Monosemanticity” | 2023 | Feature directions and alignment in representation space |
| Hanna et al., arXiv 2305.00586 | 2023 | Compositional structure in GPT-2 circuits via weight analysis |

For two heads \( h_1 \) and \( h_2 \), let \( u_1^{(h_1)} \) denote the top left singular vector of \( W_{OV}^{(h_1)} \) and \( v_1^{(h_2)} \) denote the top right singular vector of \( W_{QK}^{(h_2)} \). The alignment score is:

\[ \text{align}(h_1 \to h_2) = \left| \cos\left( u_1^{(h_1)}, v_1^{(h_2)} \right) \right| = \frac{\left| u_1^{(h_1)} \cdot v_1^{(h_2)} \right|}{\left\| u_1^{(h_1)} \right\| \left\| v_1^{(h_2)} \right\|} \]

High alignment indicates that head \( h_1 \)’s primary output direction coincides with head \( h_2 \)’s primary input direction, which is what sequential composition through these top directions requires. The metric generalizes to subspace alignment by comparing the spans of the top-\( k \) singular vectors via principal angles.
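The score and its subspace generalization can be sketched in NumPy as follows. This is a minimal illustration, not the code in 18_weight_extended.py: the function names `top_direction_alignment` and `subspace_alignment` are hypothetical, and it assumes both matrices are square, act on the same residual-stream basis, and follow the column-vector convention (output = W @ x), so that left singular vectors span the output space and right singular vectors span the input space.

```python
import numpy as np


def top_direction_alignment(w_ov: np.ndarray, w_qk: np.ndarray) -> float:
    """|cos| between the top left singular vector of w_ov
    and the top right singular vector of w_qk."""
    u, _, _ = np.linalg.svd(w_ov, full_matrices=False)
    _, _, vt = np.linalg.svd(w_qk, full_matrices=False)
    u1 = u[:, 0]   # top output direction of the OV matrix
    v1 = vt[0]     # top input direction of the QK matrix
    return float(abs(u1 @ v1) / (np.linalg.norm(u1) * np.linalg.norm(v1)))


def subspace_alignment(w_ov: np.ndarray, w_qk: np.ndarray, k: int = 4) -> np.ndarray:
    """Cosines of the principal angles between the top-k left singular
    subspace of w_ov and the top-k right singular subspace of w_qk."""
    u, _, _ = np.linalg.svd(w_ov, full_matrices=False)
    _, _, vt = np.linalg.svd(w_qk, full_matrices=False)
    basis_out = u[:, :k]    # (d, k), orthonormal columns
    basis_in = vt[:k].T     # (d, k), orthonormal columns
    # Singular values of the cross-Gram matrix are the principal-angle cosines.
    cosines = np.linalg.svd(basis_out.T @ basis_in, compute_uv=False)
    return np.clip(cosines, 0.0, 1.0)
```

With k = 1 the single principal-angle cosine equals the top-direction score. For larger k, the returned vector can be summarized by its mean or its smallest entry, depending on how strict a notion of subspace overlap is wanted.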

Within a circuit, we compute the mean pairwise alignment among circuit heads and compare it to the mean alignment between circuit and non-circuit heads. A significantly higher within-circuit alignment supports the claim that the circuit forms a coherent computational unit.
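The comparison described above might look like the following sketch. It assumes a precomputed square matrix `align` whose entry (i, j) holds align(h_i → h_j), and a boolean `circuit_mask` marking circuit heads; both names, and the function `within_between_means`, are illustrative assumptions rather than part of the original script. Self-alignment entries on the diagonal are excluded from the within-circuit mean.

```python
import numpy as np


def within_between_means(align: np.ndarray, circuit_mask: np.ndarray) -> tuple[float, float]:
    """Mean alignment among circuit heads, and mean alignment between
    circuit and non-circuit heads (both directions pooled)."""
    n = align.shape[0]
    in_circuit = np.asarray(circuit_mask, dtype=bool)
    off_diagonal = ~np.eye(n, dtype=bool)

    # Ordered pairs (i, j) with both heads in the circuit, i != j.
    within = np.outer(in_circuit, in_circuit) & off_diagonal
    # Ordered pairs with exactly one head in the circuit.
    between = np.outer(in_circuit, ~in_circuit) | np.outer(~in_circuit, in_circuit)

    return float(align[within].mean()), float(align[between].mean())
```

The function needs at least two circuit heads and one non-circuit head; otherwise one of the selections is empty and the corresponding mean is undefined.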

Cosine Alignment of Top SVD Directions (18_weight_extended.py)

Computes pairwise \( |\cos(\theta)| \) between the top SVD directions of \( W_{OV} \) and \( W_{QK} \) for all head pairs. Reports: (1) within-circuit mean alignment, (2) mean alignment between circuit and non-circuit heads, (3) an alignment z-score relative to random head subsets.

What it establishes: Whether the identified circuit has structurally coherent weight directions that distinguish it from arbitrary head groupings.

What it does not establish: Causal dependence — aligned directions may exist but never be activated together on task-relevant inputs.
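One way to obtain a z-score relative to random head subsets is to compare the circuit's within-group mean against a null distribution built from random subsets of the same size. The sketch below makes several assumptions that the source does not specify: subsets are drawn uniformly without replacement, the null uses 1000 samples, and the names `within_mean` and `alignment_zscore` are hypothetical.

```python
import numpy as np


def within_mean(align: np.ndarray, idx: np.ndarray) -> float:
    """Mean off-diagonal alignment among the heads listed in idx."""
    sub = align[np.ix_(idx, idx)]
    off_diagonal = ~np.eye(len(idx), dtype=bool)
    return float(sub[off_diagonal].mean())


def alignment_zscore(align: np.ndarray, circuit_idx, n_samples: int = 1000, seed: int = 0) -> float:
    """Z-score of the circuit's within-group mean alignment against
    random head subsets of the same size."""
    rng = np.random.default_rng(seed)
    circuit_idx = np.asarray(circuit_idx)
    n_heads = align.shape[0]
    k = len(circuit_idx)

    observed = within_mean(align, circuit_idx)
    # Null distribution: within-group mean for random size-k head subsets.
    null = np.array([
        within_mean(align, rng.choice(n_heads, size=k, replace=False))
        for _ in range(n_samples)
    ])
    return float((observed - null.mean()) / null.std(ddof=1))
```

The z-score is only meaningful when the null distribution has nonzero spread. If every head pair has the same alignment, the standard deviation is zero and the ratio is undefined.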

Usage:

uv run python 18_weight_extended.py --tasks ioi sva

| Pattern | What it means |
| --- | --- |
| High within-circuit alignment, low alignment between circuit and non-circuit heads | Circuit heads form a structurally distinct group |
| Alignment between early-layer OV and late-layer QK | Evidence of sequential composition (Q- or K-composition) |
| Uniform alignment across all heads | No structural differentiation; the circuit boundary may be arbitrary |
| High alignment z-score (> 2.0) | Within-circuit coherence is unlikely under random grouping |