This framework asks: what does each attention head copy (OV) and where does it look (QK), and do these decomposed circuits align with the claimed mechanism?

Every attention head performs two logically distinct operations: the QK circuit determines the attention pattern (which positions attend to which), and the OV circuit determines what information is moved once attention is allocated. The composed matrices ( W_E^T W_Q W_K^T W_E ) (full QK) and ( W_U W_O W_V^T W_E ) (full OV) can be analyzed independently to characterize a head’s structural role. A complete circuit explanation must account for both halves.

This decomposition is the foundation of the “mathematical framework for transformer circuits” and remains the most productive structural lens in mechanistic interpretability. It converts the question “what does this head do?” into two simpler questions with independently testable answers.

| Source | Year | Key contribution |
| --- | --- | --- |
| Elhage et al., “A Mathematical Framework for Transformer Circuits” | 2021 | OV/QK decomposition as the fundamental unit of circuit analysis |
| Wang et al., arXiv 2211.00593 | 2022 | Applied OV/QK analysis to the IOI circuit (name movers, induction heads, S-inhibition heads) |
| Olsson et al., arXiv 2209.11895 | 2022 | Induction heads: the QK circuit attends to the token following an earlier occurrence of the current token |
| Nanda et al., arXiv 2301.05217 | 2023 | Progress measures for grokking: weight-level reverse-engineering of a learned modular-addition algorithm |

For a single attention head with parameters ( W_Q, W_K, W_V, W_O \in \mathbb{R}^{d_m \times d_h} ), the two composed circuits are:

[ W_{QK} = W_Q W_K^T \in \mathbb{R}^{d_m \times d_m} ]

[ W_{OV} = W_O W_V^T \in \mathbb{R}^{d_m \times d_m} ]

The QK matrix determines attention scores: the score for position ( i ) attending to position ( j ) is ( x_i^T W_{QK} x_j ) (before scaling and softmax). The OV matrix determines what is written to the residual stream: attending to position ( j ) contributes ( W_{OV} x_j ), weighted by the attention probability. By composing with the embedding and unembedding:

[ W_{QK}^{\text{full}} = W_E^T W_{QK} W_E, \quad W_{OV}^{\text{full}} = W_U W_{OV} W_E ]

we can read off token-to-token attention preferences and token-to-logit contributions directly.
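
A minimal sketch of this composition in NumPy, assuming per-head weights have already been extracted as arrays with the shapes above (the sizes and weight values here are random placeholders for illustration, not any particular model’s parameters):

```python
import numpy as np

# Illustrative sizes and random placeholder weights (not a real model).
d_m, d_h, n_vocab = 768, 64, 1000
rng = np.random.default_rng(0)
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_m, d_h)) for _ in range(4))
W_E = rng.normal(size=(d_m, n_vocab))   # embedding: token -> residual stream
W_U = rng.normal(size=(n_vocab, d_m))   # unembedding: residual stream -> logits

# Per-head circuits in residual-stream space.
W_QK = W_Q @ W_K.T   # (d_m, d_m): score for i attending to j is x_i^T W_QK x_j
W_OV = W_O @ W_V.T   # (d_m, d_m): attending to j writes W_OV x_j (times the attention weight)

# Full token-space circuits: token-to-token attention preference, token-to-logit effect.
full_QK = W_E.T @ W_QK @ W_E   # (n_vocab, n_vocab)
full_OV = W_U @ W_OV @ W_E     # (n_vocab, n_vocab)
```

For a real vocabulary, the full vocab-by-vocab matrices are usually too large to materialize; in practice one restricts ( W_E ) to the token columns of interest before composing.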

W_QK Spectral Analysis (18_weight_extended.py)

Computes the eigendecomposition of ( W_{QK} ) for circuit heads and reports: (1) the top eigenvalues and their associated input/output directions, (2) whether the QK circuit is symmetric (positional) or asymmetric (token-content), and (3) alignment between top QK directions and the task’s causal variables.

What it establishes: The structural attention pattern preference encoded in the head’s weights, independent of any specific input.

What it does not establish: Whether this pattern is actually realized on task-relevant inputs (requires activation-level verification from Pillar A).
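
A minimal sketch of this kind of spectral summary, assuming the per-head ( W_{QK} ) has already been computed; the function name and the optional `task_dirs` argument are hypothetical, not the script’s actual interface:

```python
import numpy as np

def qk_spectral_summary(W_QK, task_dirs=None, k=5):
    """Summarize the attention rule structurally encoded in a (d_m, d_m) W_QK matrix."""
    # (1) Top eigenvalues/eigenvectors (may be complex for an asymmetric matrix).
    eigvals, eigvecs = np.linalg.eig(W_QK)
    order = np.argsort(-np.abs(eigvals))
    top_vals, top_vecs = eigvals[order[:k]], eigvecs[:, order[:k]]

    # (2) Symmetry score: fraction of norm in the symmetric part (1.0 = fully symmetric).
    sym, asym = 0.5 * (W_QK + W_QK.T), 0.5 * (W_QK - W_QK.T)
    symmetry = np.linalg.norm(sym) / (np.linalg.norm(sym) + np.linalg.norm(asym))

    # (3) |cosine| between top eigendirections and unit-norm task directions, if provided.
    alignment = None
    if task_dirs is not None:
        v = np.real(top_vecs)
        v = v / np.linalg.norm(v, axis=0, keepdims=True)
        alignment = np.abs(v.T @ task_dirs)   # (k, n_task_dirs)

    return {"top_eigenvalues": top_vals, "symmetry": symmetry, "alignment": alignment}
```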

Usage:

uv run python 18_weight_extended.py --tasks ioi sva
| Pattern | What it means |
| --- | --- |
| OV top singular vector aligns with task-relevant token directions | Head structurally encodes the copy/suppress operation the circuit claims |
| QK has low rank with interpretable eigenvectors | Head implements a simple, characterizable attention rule |
| OV eigenspectrum is flat | Head performs a distributed transformation; harder to assign a single role |
| QK and OV both align with the circuit narrative | Strong structural corroboration of the mechanistic claim |
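
The first row of the table can be checked concretely with an SVD of the OV circuit. A minimal sketch, assuming ( W_{OV} ) and ( W_E ) are available as arrays and the task-relevant token ids are supplied by the analyst (the function name and arguments are hypothetical):

```python
import numpy as np

def ov_alignment(W_OV, W_E, task_token_ids, k=3):
    """|cosine| between the top-k input (read) directions of W_OV and the
    embedding directions of task-relevant tokens."""
    U, S, Vt = np.linalg.svd(W_OV)
    reads = Vt[:k]                                # (k, d_m) right singular vectors (unit norm)
    emb = W_E[:, task_token_ids]                  # (d_m, n_tokens)
    emb = emb / np.linalg.norm(emb, axis=0, keepdims=True)
    return np.abs(reads @ emb)                    # values near 1 support the circuit claim
```

Checking the left singular vectors against unembedding directions (what the head writes toward) is the complementary test for the copy/suppress half of the claim.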