
This framework asks: does a circuit component serve a single interpretable role, or does it superpose multiple unrelated computations?

Polysemanticity — a single neuron or direction encoding multiple unrelated features — is the central obstacle to clean circuit explanations. If a head identified as “the name mover” also implements unrelated computations for other tasks, then its role in the circuit is not cleanly separable from its other functions. Measuring polysemanticity quantifies this threat: highly monosemantic components support clean circuit narratives; highly polysemantic components suggest the circuit boundary cuts through superposed representations.

This instrument connects structural analysis (what directions exist in a component) to intervention specificity (does intervening on one claimed function affect others). It operationalizes the question: is the circuit description a faithful decomposition, or is it projecting clean narratives onto a messier underlying computation?

| Source | Year | Key contribution |
| --- | --- | --- |
| Elhage et al., "Toy Models of Superposition" | 2022 | Mathematical framework for feature superposition |
| Bricken et al., "Towards Monosemanticity" | 2023 | SAE decomposition revealing polysemantic neurons |
| Cunningham et al., arXiv 2309.08600 | 2023 | Sparse autoencoders for monosemantic feature extraction |
| Templeton et al., "Scaling Monosemanticity" | 2024 | Feature splitting and polysemanticity at scale |
| Marks et al., arXiv 2403.19647 | 2024 | Sparse feature circuits and intervention specificity |

A component is polysemantic if it activates for multiple unrelated input classes. Formally, let $a(x)$ be the activation of component $c$ on input $x$, and let $T_1, T_2$ be two unrelated task distributions. Component $c$ is polysemantic if

$$\mathbb{E}_{x \sim T_1}[|a(x)|] > \epsilon \quad \text{and} \quad \mathbb{E}_{x \sim T_2}[|a(x)|] > \epsilon$$

while $T_1$ and $T_2$ share no semantic overlap. Intervention specificity provides a causal test: intervene on the component for task $T_1$ and measure the effect on task $T_2$:

$$\text{specificity}(c, T_1, T_2) = 1 - \frac{|\Delta \text{perf}(T_2)|}{|\Delta \text{perf}(T_1)|}$$

High specificity (near 1) means the component is monosemantic for $T_1$; low specificity means the intervention leaks into $T_2$.
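The two quantities above can be sketched directly. This is a minimal illustration, assuming activations and performance deltas have already been collected as plain arrays and floats; the function names here are illustrative, not from the script:

```python
import numpy as np

def is_polysemantic(acts_t1, acts_t2, eps=0.1):
    """Threshold test: mean |activation| exceeds eps on both unrelated task distributions."""
    return np.mean(np.abs(acts_t1)) > eps and np.mean(np.abs(acts_t2)) > eps

def specificity(delta_perf_t1, delta_perf_t2):
    """specificity = 1 - |Δperf(T2)| / |Δperf(T1)|.

    Near 1: the intervention is task-specific.
    Near 0 (or negative): it leaks into the control task T2.
    """
    return 1.0 - abs(delta_perf_t2) / abs(delta_perf_t1)

# Ablating a component drops target-task performance by 0.30
# but control-task performance by only 0.03:
print(specificity(-0.30, -0.03))  # ≈ 0.9, strongly monosemantic for T1
```

Note that specificity can go below zero when the intervention affects the control task more than the target task, which is itself a strong superposition signal.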

Intervention Specificity (25_intervention_specificity.py)


For each circuit component, the script ablates it while measuring performance on both the target task and a set of control tasks. It reports: (1) per-component specificity scores, (2) mean circuit specificity, and (3) the most polysemantic components.

What it establishes: Whether circuit components are functionally dedicated to the target task or serve multiple tasks simultaneously.

What it does not establish: The identity of superposed features — that requires SAE decomposition or other feature-finding methods.
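The sweep can be sketched as follows. This is a toy stand-in rather than the script's actual code: the additive-contribution "model" here would be replaced by real ablation hooks and task evaluations, and every name and number is illustrative.

```python
import numpy as np

# Toy stand-in: each component contributes additively to each task's performance.
# Real usage would replace perf_with_ablation with an ablation hook + evaluation.
CONTRIB = {
    "head_9.6": {"ioi": 0.30, "sva": 0.02},   # mostly dedicated to IOI
    "head_7.3": {"ioi": 0.15, "sva": 0.12},   # shared across tasks -> polysemantic
}
BASELINE = {"ioi": 0.90, "sva": 0.85}

def perf_with_ablation(component, task):
    """Performance on `task` after zero-ablating `component` (toy model)."""
    return BASELINE[task] - CONTRIB[component][task]

def specificity_sweep(components, target, controls):
    """Per-component specificity: 1 - mean(|Δ control|) / |Δ target|."""
    scores = {}
    for c in components:
        d_target = perf_with_ablation(c, target) - BASELINE[target]
        d_controls = [perf_with_ablation(c, t) - BASELINE[t] for t in controls]
        scores[c] = 1.0 - np.mean([abs(d) for d in d_controls]) / abs(d_target)
    return scores

scores = specificity_sweep(["head_9.6", "head_7.3"], target="ioi", controls=["sva"])
# head_9.6 ≈ 0.93 (dedicated); head_7.3 ≈ 0.2 (shared -> flagged as polysemantic)
```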

Usage:

```sh
uv run python 25_intervention_specificity.py --tasks ioi sva
```
| Pattern | What it means |
| --- | --- |
| High specificity (> 0.8) for all circuit components | Circuit is cleanly separable — monosemantic for this task |
| Low specificity (< 0.5) for key components | Critical components serve multiple functions — superposition threat |
| Specificity varies across circuit components | Mixed circuit — some components dedicated, others shared |
| Low specificity correlates with high effective rank (B02) | Structural evidence of superposition confirmed by functional test |
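The last pattern can be checked quantitatively by correlating per-component specificity scores with B02's effective-rank values. A minimal sketch with made-up illustrative numbers:

```python
import numpy as np

# Illustrative per-component values (not real measurements):
# specificity from this instrument, effective rank from B02.
specificity = np.array([0.92, 0.88, 0.41, 0.35, 0.75])
effective_rank = np.array([2.1, 2.4, 7.8, 9.2, 4.0])

# A strongly negative correlation means the structurally high-rank components
# are also the ones whose interventions leak into control tasks.
r = np.corrcoef(specificity, effective_rank)[0, 1]
print(f"Pearson r = {r:.2f}")  # negative: structural and functional evidence agree
```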

Polysemanticity connects B02 (effective rank) to causal findings: high effective rank (B02) suggests capacity for multiple features, and B07 tests whether that capacity is actually used for multiple tasks. It also connects to B08 (ICA/NMF): source separation methods attempt to resolve polysemanticity by finding independent components within a superposed representation. The causal instruments in A01 provide the intervention; B07 adds the cross-task measurement to assess specificity.