B07 — Polysemanticity
This framework asks: does a circuit component serve a single interpretable role, or does it superpose multiple unrelated computations?
Polysemanticity — a single neuron or direction encoding multiple unrelated features — is the central obstacle to clean circuit explanations. If a head identified as “the name mover” also implements unrelated computations for other tasks, then its role in the circuit is not cleanly separable from its other functions. Measuring polysemanticity quantifies this threat: highly monosemantic components support clean circuit narratives; highly polysemantic components suggest the circuit boundary cuts through superposed representations.
This instrument connects structural analysis (what directions exist in a component) to intervention specificity (does intervening on one claimed function affect others). It operationalizes the question: is the circuit description a faithful decomposition, or is it projecting clean narratives onto a messier underlying computation?
Theoretical grounding
| Source | Year | Key contribution |
|---|---|---|
| Elhage et al., “Toy Models of Superposition” | 2022 | Mathematical framework for feature superposition |
| Bricken et al., “Towards Monosemanticity” | 2023 | SAE decomposition revealing polysemantic neurons |
| Cunningham et al., arXiv 2309.08600 | 2023 | Sparse autoencoders for monosemantic feature extraction |
| Templeton et al., “Scaling Monosemanticity” | 2024 | Feature splitting and polysemanticity at scale |
| Marks et al., arXiv 2403.19647 | 2024 | Sparse feature circuits and intervention specificity |
Core concept
A component is polysemantic if it activates for multiple unrelated input classes. Formally, let \( a(x) \) be the activation of component \( c \) on input \( x \), and let \( T_1, T_2 \) be two unrelated task distributions. Component \( c \) is polysemantic if, for some activation threshold \( \epsilon > 0 \):
\[ \mathbb{E}_{x \sim T_1}[|a(x)|] > \epsilon \quad \text{and} \quad \mathbb{E}_{x \sim T_2}[|a(x)|] > \epsilon \]
while \( T_1 \) and \( T_2 \) share no semantic overlap. Intervention specificity provides a causal test: intervene on the component for task \( T_1 \) and measure the effect on task \( T_2 \):
\[ \text{specificity}(c, T_1, T_2) = 1 - \frac{|\Delta \text{perf}(T_2)|}{|\Delta \text{perf}(T_1)|} \]
Here \( \Delta \text{perf}(T) \) is the change in performance on task \( T \) caused by the intervention; the ratio is defined only when \( \Delta \text{perf}(T_1) \neq 0 \). High specificity (near 1) means the intervention leaves \( T_2 \) largely unaffected, which is consistent with the component being monosemantic for \( T_1 \); low specificity means the intervention leaks into \( T_2 \).
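The two quantities above can be sketched in a few lines of Python. This is a minimal illustration, assuming activations and performance changes have already been collected as plain lists and floats. The function names, the default `eps`, and the choice to return `None` when the target-task effect is zero are assumptions made for the example, not part of the B07 instrument.

```python
from statistics import fmean


def is_polysemantic(acts_t1, acts_t2, eps=0.1):
    """Activation criterion: mean |a(x)| exceeds eps on both task distributions.

    acts_t1 / acts_t2 are activations of one component sampled from T1 and T2.
    The default eps is an arbitrary illustrative threshold.
    """
    mean_t1 = fmean(abs(a) for a in acts_t1)
    mean_t2 = fmean(abs(a) for a in acts_t2)
    return mean_t1 > eps and mean_t2 > eps


def specificity(delta_perf_t1, delta_perf_t2):
    """Compute 1 - |dperf(T2)| / |dperf(T1)| for a single component."""
    if delta_perf_t1 == 0:
        return None  # ratio undefined; returning None is an assumed convention
    return 1.0 - abs(delta_perf_t2) / abs(delta_perf_t1)
```

For instance, an ablation that changes target-task performance by 0.40 and control-task performance by 0.02 gives a specificity of about 0.95. Note that the score becomes negative when the control task is hit harder than the target task.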
Instruments under B07
Section titled “Instruments under B07”Intervention Specificity (25_intervention_specificity.py)
Section titled “Intervention Specificity (25_intervention_specificity.py)”For each circuit component, ablates it while measuring performance on both the target task and a set of control tasks. Reports: (1) per-component specificity scores, (2) mean circuit specificity, (3) identification of the most polysemantic components.
What it establishes: Whether circuit components are functionally dedicated to the target task or serve multiple tasks simultaneously.
What it does not establish: The identity of superposed features — that requires SAE decomposition or other feature-finding methods.
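The procedure described above can be sketched as follows. This is not the code in `25_intervention_specificity.py`; it is a hedged illustration under stated assumptions. `evaluate` is a hypothetical callable that returns a scalar performance metric for a task with one component ablated (or `None` for no ablation), control tasks are aggregated by their mean absolute change, at least one control task is supplied, and the component names and numbers are made up.

```python
from statistics import fmean


def intervention_specificity(components, evaluate, target_task, control_tasks):
    """Ablate each component and compare target-task and control-task effects."""
    # Baseline performance with nothing ablated.
    baseline = {t: evaluate(t, None) for t in [target_task, *control_tasks]}
    scores = {}
    for comp in components:
        d_target = abs(evaluate(target_task, comp) - baseline[target_task])
        if d_target == 0:
            continue  # no target-task effect: ratio undefined, skipped (assumption)
        # Assumption: summarise control tasks by their mean absolute change.
        d_control = fmean(abs(evaluate(t, comp) - baseline[t]) for t in control_tasks)
        scores[comp] = 1.0 - d_control / d_target
    mean_score = fmean(scores.values()) if scores else None
    ranking = sorted(scores, key=scores.get)  # lowest specificity first
    return scores, mean_score, ranking


# Toy lookup table standing in for real model evaluations (made-up numbers).
perf = {
    ("ioi", None): 0.90, ("sva", None): 0.80,
    ("ioi", "head_a"): 0.50, ("sva", "head_a"): 0.78,
    ("ioi", "head_b"): 0.60, ("sva", "head_b"): 0.60,
}
scores, mean_score, ranking = intervention_specificity(
    ["head_a", "head_b"], lambda task, comp: perf[(task, comp)], "ioi", ["sva"]
)
```

In this toy setup `head_a` barely affects the control task while `head_b` degrades it substantially, so `ranking` lists `head_b` first as the more polysemantic component.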
Usage:
uv run python 25_intervention_specificity.py --tasks ioi sva
Reading the scores
| Pattern | What it means |
|---|---|
| High specificity (> 0.8) for all circuit components | Circuit is cleanly separable — monosemantic for this task |
| Low specificity (< 0.5) for key components | Critical components serve multiple functions — superposition threat |
| Specificity varies across circuit components | Mixed circuit — some components dedicated, others shared |
| Low specificity correlates with high effective rank (B02) | Structural evidence of superposition confirmed by functional test |
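The first three rows of the table can be turned into a simple decision rule. The sketch below is illustrative only: the function name, the returned labels, and the treatment of scores between 0.5 and 0.8 as "mixed" are assumptions, and the fourth row (correlation with B02 effective rank) is not covered because it needs data from another framework.

```python
def read_scores(scores, key_components, high=0.8, low=0.5):
    """Map per-component specificity scores to the patterns in the table.

    scores: dict of component name -> specificity.
    key_components: names the analyst considers critical to the circuit.
    """
    if not scores:
        raise ValueError("no specificity scores to interpret")
    if all(s > high for s in scores.values()):
        return "cleanly separable"
    if any(scores[c] < low for c in key_components if c in scores):
        return "superposition threat"
    return "mixed"  # assumed fallback for everything in between


label = read_scores({"head_a": 0.95, "head_b": 0.33}, key_components=["head_b"])
```

Here `label` is `"superposition threat"` because the key component `head_b` falls below the 0.5 threshold.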
Connection to other frameworks
Polysemanticity connects B02 (effective rank) to causal findings: high effective rank (B02) suggests capacity for multiple features, and B07 tests whether that capacity is actually used for multiple tasks. It also connects to B08 (ICA/NMF): source separation methods attempt to resolve polysemanticity by finding independent components within a superposed representation. The causal instruments in A01 provide the intervention; B07 adds the cross-task measurement to assess specificity.