Skip to content

This framework asks: can the circuit be described as entities and activities organized such that they produce the phenomenon — and can we verify this organization empirically?

The “new mechanism” philosophy (Machamer, Darden & Craver 2000; Glennan 2002) defines a mechanism as entities and activities organized in such a way that they are responsible for the phenomenon. This goes beyond mere causal sufficiency (A01) to require that the organization — the spatial, temporal, and functional arrangement of parts — be explanatorily relevant. In MI terms: it is not enough to show that a set of heads causally mediates IOI behavior; you must show that those heads are organized into a specific information-processing pipeline (name movers receive from S-inhibition heads, which receive from duplicate-token detectors) and that this organization is what produces the behavior.

This framework motivates two distinctive instruments: weight-space circuit verification (checking that the organizational structure exists in the weights independent of any particular input) and logic-gate analysis (verifying that individual components implement specific Boolean or arithmetic operations as the mechanism hypothesis predicts).

SourceYearKey contribution
Machamer, Darden & Craver, “Thinking About Mechanisms”2000Mechanisms as entities + activities organized to produce phenomena
Glennan, “Rethinking Mechanistic Explanation”2002Mechanisms as complex systems with interacting parts
Craver, Explaining the Brain2007Levels of mechanism; constitutive vs. etiological explanation
Elhage et al., “A Mathematical Framework for Transformer Circuits”2021QK/OV circuits as organized entities producing attention phenomena
Olsson et al., “In-context Learning and Induction Heads”2022Induction heads as a multi-component mechanism with specific organization

The MDC/Glennan framework distinguishes between two kinds of causal claim: (1) “these components are causally relevant to the output” (etiological) and (2) “these components are organized in this specific way to produce the output” (constitutive/mechanistic). Standard activation patching (A01) provides type (1). The mechanistic framework demands type (2): you must specify the organizational structure and verify it.

For transformer circuits, organization means: which heads write information that other heads read (via the residual stream), what computation each head performs on its inputs, and how the pipeline’s stages depend on each other. Weight-space analysis can verify organization independently of activation measurements — if head A’s OV circuit projects into the subspace that head B’s QK circuit reads from, the organizational link exists in the weights regardless of whether any particular input activates it.

C18 — Weight-Extended Circuit (18_weight_extended.py)

Section titled “C18 — Weight-Extended Circuit (18_weight_extended.py)”

Verifies circuit organization by checking weight-space connections between components. For each edge in the hypothesized mechanism, measures the alignment between the writing head’s output subspace and the reading head’s input subspace:

[ \text{Connection}(A \to B) = | W_O^A \cdot W_Q^B |_F / (|W_O^A|_F \cdot |W_Q^B|_F) ]

This establishes that the organizational structure hypothesized by the mechanism exists in the weights — the “wiring” is physically present, not just functionally inferred from activations.

What it establishes: Structural organization in weight space; that components are “wired” to communicate.

What it does not establish: That the wiring is used on any particular input (requires activation-level verification from A01/A02).

Usage:

uv run python 18_weight_extended.py --tasks ioi sva --n-prompts 40

Tests whether individual components implement specific Boolean or arithmetic operations as the mechanism hypothesis predicts. For example, verifying that an “AND-gate” head only fires when both its input conditions are satisfied, or that a “copy” head’s OV circuit has high rank-1 structure aligned with the identity.

What it establishes: That individual entities in the mechanism perform the activities the hypothesis attributes to them.

What it does not establish: That these activities are necessary for the overall phenomenon (requires ablation from A01).

Usage:

uv run python 19_logic_gates.py --tasks ioi --n-prompts 40
PatternWhat it means
High weight-space connection along hypothesized edgesOrganization exists structurally; mechanism is “wired”
Logic gates pass for all componentsEach entity performs its attributed activity
Weight connection exists but logic gate failsStructure is present but the component’s operation is mis-characterized
Logic gate passes but weight connection is weakComponent performs the operation but is not strongly connected to the next stage