Case Study: SAE Features
Sparse autoencoder features (Bricken et al. 2023, Templeton et al. 2024) are directions in activation space extracted by training an overcomplete dictionary. Each feature is given a label — “Golden Gate Bridge,” “deception,” “code syntax” — based on the inputs that maximally activate it. The claim is that these features are real computational units: representational-level entities that the model uses during inference.
This case study evaluates SAE features as a class. Individual strong features (those that replicate and steer) score higher; the bulk of the dictionary scores lower. The evaluations below reflect the typical case.
Composite Verdict
| Lens | Strongest criterion | Weakest criterion | Overall |
|---|---|---|---|
| Construct (Phil. Sci.) | C2 Structural plausibility (partial) | C5 Convergent validity | Weak |
| Internal (Neuroscience) | I2 Sufficiency (strong features) | I3/I4/I5 Most criteria | Weak |
| External (Pharmacology) | E1 Intervention reach (partial) | E3/E5/E6 Most criteria | Weak |
| Measurement (Measurement Theory) | M3 Baseline separation (partial) | M1 Reliability | Weak |
| Interpretive (MI) | V1 Level declaration | V4/V5 Alternatives + scope | Weak |
Overall verdict: Proposed to Causally suggestive. SAE features as a class sit at the lowest verdict tiers. The strongest individual features (those that replicate across seeds, respond to steering, and have coherent decoder vectors) approach Causally suggestive. The bulk of any SAE dictionary remains at Proposed — the features have been identified and labeled, but the evidence for their reality as model-intrinsic computational units is thin across all five lenses.
This is not a claim that SAE features are wrong — many may be real. It is a claim that the evidence for their validity, measured against the same standards applied to circuits, has not been marshaled. The primary gaps are convergent validity (C5 — does a different method find the same features?), measurement reliability (M1 — does a different training run find the same features?), and alternative exclusion (V4 — is the label the right one?). These three gaps share a common theme: the features may be properties of the dictionary rather than properties of the model.
Philosophy of Science Lens — Construct Validity
Is “SAE feature” a coherent construct?
Criteria
C1 — Falsifiability: Unclear. What observation would disconfirm the claim that a feature represents “deception”? If the disconfirming condition is “the feature does not activate on deceptive text,” this is circular — the feature was defined by its activations. A genuine falsifiability condition would be: “if steering along the feature’s direction does not increase deceptive outputs, or if a different SAE trained with a different random seed fails to produce a feature with substantial overlap.” Most papers do not state such conditions.
C2 — Structural plausibility: Partial. The decoder vector should project onto semantically coherent tokens through the unembedding matrix. Some features pass this check (“Golden Gate Bridge” projects onto bridge-related tokens). Many features lack this structural verification.
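The C2 check can be sketched numerically: project a decoder vector through the unembedding matrix and inspect the top-scoring tokens. This is a minimal numpy sketch with random stand-in weights; the matrix names and shapes are hypothetical, not any particular model's.

```python
import numpy as np

# Toy version of the structural-plausibility check: all weights below are
# random stand-ins for a real model's unembedding and a real SAE decoder row.
rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000

W_U = rng.normal(size=(d_model, vocab_size))  # stand-in unembedding matrix
decoder_vec = rng.normal(size=d_model)        # stand-in SAE decoder vector

logits = decoder_vec @ W_U                    # score over the vocabulary
top_k = np.argsort(logits)[::-1][:10]         # indices of the top-10 tokens

# A feature passes the check if these indices map to semantically coherent
# tokens (e.g. bridge-related words for a "Golden Gate Bridge" feature).
print(top_k)
```

With real weights, the interesting step is the last one: reading off whether the top tokens cohere, which is a human judgment the numbers alone do not settle.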
C3 — Task specificity: Not tested. Features are evaluated on their maximally activating examples — a discovery-set evaluation. Specificity would require showing that a “deception” feature activates on deception and does not activate on closely related non-deception (sarcasm, fiction, hypotheticals). This discriminant testing is rarely performed.
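The missing discriminant test has a simple shape: compare the feature's activations on label-matching inputs against near-miss inputs and summarize the separation. This sketch uses synthetic activation values; a real test would run the model and SAE encoder over curated datasets.

```python
import numpy as np

# Synthetic stand-ins for a feature's activations on a "deception" set
# and a near-miss set (e.g. sarcasm). Real values would come from the model.
rng = np.random.default_rng(1)
acts_target = rng.normal(loc=2.0, size=200).clip(min=0)
acts_nearmiss = rng.normal(loc=1.5, size=200).clip(min=0)

def auc(pos, neg):
    """Probability that a random target activation exceeds a near-miss one."""
    return (pos[:, None] > neg[None, :]).mean()

score = auc(acts_target, acts_nearmiss)
# Specificity requires AUC well above 0.5 against *near-misses*,
# not just against random text.
print(f"target-vs-near-miss AUC: {score:.2f}")
```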
C4 — Minimality: Open question. Does a feature correspond to one computational role, or is it a blend of multiple roles that co-occur in training data? Polysemantic features — those that activate on apparently unrelated concepts — fail this criterion. The extent of polysemanticity in typical SAE dictionaries is debated.
C5 — Convergent validity: Weak. SAE features are identified by one method. A different SAE with different hyperparameters or random seed may produce a different feature set. Cross-seed consistency is partially reported for strong features, but overlap (e.g., as a Jaccard-style similarity) is not systematically measured across the full dictionary.
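A basic cross-seed convergence check is straightforward to sketch: match decoder vectors from two independently trained SAEs by cosine similarity and report the fraction of features with a close partner. The dictionaries below are random stand-ins (which share essentially nothing, so the matched fraction will be near zero); real SAE pairs reportedly show low-to-moderate overlap.

```python
import numpy as np

# Stand-in decoder dictionaries from two hypothetical SAE training seeds.
rng = np.random.default_rng(2)
n_feat, d_model = 128, 64
D_a = rng.normal(size=(n_feat, d_model))
D_b = rng.normal(size=(n_feat, d_model))

def normalize(D):
    return D / np.linalg.norm(D, axis=1, keepdims=True)

sims = normalize(D_a) @ normalize(D_b).T  # pairwise decoder cosines
best = sims.max(axis=1)                   # best seed-B match per seed-A feature
matched = (best > 0.9).mean()             # the 0.9 threshold is a judgment call
print(f"fraction of seed-A features with a close seed-B match: {matched:.2f}")
```

A stricter version would also require the matched pair to fire on overlapping example sets, not just point in similar directions.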
| Criterion | Verdict | Key evidence |
|---|---|---|
| C1 Falsifiability | Unclear | No pre-registered disconfirming conditions |
| C2 Structural plausibility | Partial | Some decoder vectors project coherently |
| C3 Task specificity | Not tested | No discriminant evaluation |
| C4 Minimality | Open question | Polysemanticity unresolved |
| C5 Convergent validity | Weak | Single method, partial cross-seed |
Key Distinctions
- Confirmation vs corroboration: Max-activating examples confirm the label (the feature activates on things matching the label), but this is circular — the label was derived from those same examples. Genuine corroboration would require an independent method (weight-space analysis, causal intervention) predicting the same concept before observing activations.
- Natural kind vs family resemblance: A polysemantic feature that activates on “Golden Gate Bridge” and “suspension bridges” and “orange paint” may be a natural kind (bridge-related concepts) or a family resemblance (co-occurring tokens in training data). Without structural grounding, the distinction is underdetermined.
- Operationalism vs realism: Feature labels like “deception” imply realism (the model has a deception concept). The evidence supports only operationalism (this direction activates on texts labeled deceptive by humans). The gap between these is the core validity question.
Nomological Network
| Prediction the construct makes | How you test it | Confirmed? |
|---|---|---|
| Activates on inputs matching the label | Max-activating examples | Circular |
| Steering along the feature changes outputs | Activation steering / clamping | Sometimes |
| Decoder vector projects onto coherent tokens | Projection through the unembedding matrix | Sometimes |
| Does not activate on related non-matches | Discriminant testing | Rarely tested |
| Same feature appears under different SAE seeds | Cross-seed Jaccard comparison | Partially |
| Corresponds to one role, not co-occurrence | Polysemanticity analysis | Open |
A thin nomological network. Two rows partially confirmed, several untested. The circularity of the first row (the primary evidence) weakens the network further — one confirmed node is methodologically dependent on the discovery procedure rather than being an independent test.
Neuroscience Lens — Internal Validity
Does ablating/restoring a feature change behavior in the expected way?
Criteria
I1 — Necessity: Sometimes. Ablating (zeroing) strong features degrades behavior on their associated inputs. But “necessity” for an individual SAE feature is a weaker claim than circuit necessity — many features contribute small amounts, and removing one may be compensated by others. Necessity is established for a few strong features; for the bulk of the dictionary, it is untested.
I2 — Sufficiency: Sometimes (via steering). Clamping a feature to a high activation value can steer model outputs — the “Golden Gate Bridge” feature reliably produces bridge-related text. This is a form of sufficiency: the feature direction alone drives the behavior. But steering is blunt (high-magnitude clamping may go off-manifold), and many features do not produce coherent effects when steered.
I3 — Specificity: Not tested. Does ablating a “deception” feature selectively impair deception-related outputs without affecting other capabilities? This requires measuring collateral damage, which is rarely done for individual features.
I4 — Consistency: Weak. Cross-seed replication shows that strong features (high-frequency, high-magnitude) are relatively stable. Weaker features may not replicate. No systematic cross-checkpoint or cross-model consistency has been reported.
I5 — Confound control: Not tested. Steering typically uses a single method (activation addition at a fixed scale). Multi-method comparison (clamping at different layers, steering via different feature dictionaries) is not performed.
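The clamping intervention behind I2 can be sketched in a few lines: encode a residual-stream vector with a toy SAE, force one feature to a target value, and add the corresponding decoder direction back in. All weights here are random stand-ins; a real experiment would apply this edit inside the model's forward pass.

```python
import numpy as np

# Stand-in SAE weights (hypothetical shapes, random values).
rng = np.random.default_rng(3)
d_model, n_feat = 64, 256
W_enc = rng.normal(size=(n_feat, d_model)) * 0.1
W_dec = rng.normal(size=(d_model, n_feat)) * 0.1

def clamp_feature(h, idx, value):
    """Return h with feature `idx` forced to `value` in SAE coordinates."""
    f = np.maximum(W_enc @ h, 0.0)       # ReLU feature activations
    delta = value - f[idx]
    # Add the decoder direction scaled by the shortfall, leaving the rest
    # of h (including any reconstruction error) untouched.
    return h + delta * W_dec[:, idx]

h = rng.normal(size=d_model)
h_steered = clamp_feature(h, idx=7, value=5.0)  # a "5x"-style clamp
print(np.linalg.norm(h_steered - h))
```

Note that the clamp is expressed relative to the feature's current activation; clamping to values far above the feature's typical range is exactly the supraphysiological regime the criteria above flag.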
| Criterion | Verdict | Key evidence |
|---|---|---|
| I1 Necessity | Sometimes | Strong features: yes. Bulk: untested |
| I2 Sufficiency | Sometimes | Steering works for strong features |
| I3 Specificity | Not tested | No collateral damage measured |
| I4 Consistency | Weak | Strong features partially stable |
| I5 Confound control | Not tested | Single steering method |
Key Distinctions
- Single vs double dissociation: Steering demonstrates single dissociation (activating the feature produces the expected behavior). Double dissociation (activating this feature does NOT produce a different behavior, and activating a different feature does NOT produce this behavior) is untested for the vast majority of features.
- Lesion vs stimulation: SAE features uniquely have both lesion (zeroing) and stimulation (clamping) evidence for strong features. However, the stimulation is at supraphysiological magnitudes (5-10x typical activation), making it unclear whether the observed effects reflect normal computation or off-manifold forcing.
Dissociation Matrix
| | Feature-labeled task | Related but distinct task | Unrelated task |
|---|---|---|---|
| Ablate feature | ↓ (sometimes) | ? | ? |
| Clamp feature | ↑↑ (strong features) | ? | ? |
| Ablate neighboring feature | ? | ? | ? |
The matrix is extremely sparse. Even for the best-characterized features, only two cells (ablate → own task, clamp → own task) have data. Without the off-diagonal cells, we cannot distinguish “this feature specifically implements this computation” from “this direction in activation space correlates with this behavior when artificially amplified.”
Pharmacology Lens — External Validity
Does intervening on a feature produce predictable downstream effects?
Criteria
E1 — Intervention reach: Partial. Steering experiments show that clamping features can shift model behavior. The “Golden Gate Bridge” feature produces bridge-related responses across varied prompts. But the reach of most features (especially abstract or behavioral ones like “deception”) is not well-characterized.
E2 — Graded response: Partial. Clamping at different multipliers (1x, 5x, 10x the typical activation magnitude) produces graded effects — stronger clamping produces more extreme outputs. But the dose-response is often nonlinear and poorly characterized. At high multipliers, outputs become incoherent rather than showing more of the feature.
E3 — Selectivity: Not tested. Does steering along “deception” selectively increase deception without affecting fluency, factuality, or other behaviors? Off-target effects are rarely measured. The intervention may be producing general distributional shift rather than targeted behavioral change.
E4 — Effect magnitude: Variable. Some features produce large, clear effects (Golden Gate Bridge). Others produce weak or incoherent effects. The distribution of effect magnitudes across the dictionary is not systematically reported.
E5 — Robustness: Unknown. Does the Golden Gate Bridge feature work equally well on questions, stories, code prompts, and multilingual inputs? Robustness across prompt distributions is not systematically tested.
E6 — Cross-architecture: Not tested. SAE features are model-specific by construction — each dictionary is trained on one model’s activations. Whether “the same feature” exists across models requires a separate alignment step that is not standard.
| Criterion | Verdict | Key evidence |
|---|---|---|
| E1 Intervention reach | Partial | Works for some features, untested for most |
| E2 Graded response | Partial | Nonlinear, breaks at high magnitudes |
| E3 Selectivity | Not tested | Off-target effects unmeasured |
| E4 Effect magnitude | Variable | Some strong, most unknown |
| E5 Robustness | Unknown | No cross-distribution testing |
| E6 Cross-architecture | Not tested | Model-specific by construction |
Key Distinctions
- Affinity vs efficacy: SAE features demonstrate affinity (they activate on relevant inputs) but efficacy (causal contribution to behavior) is demonstrated only for strong features under supraphysiological clamping. At normal activation magnitudes, most features have unmeasured efficacy.
- Therapeutic window: The dose-response breakdown at high clamping magnitudes (coherent output → feature-saturated output → incoherent output) implies a narrow therapeutic window. The useful range for steering is bounded above by off-manifold effects, but its lower bound (minimum effective dose) is uncharacterized.
- Off-target effects as the core problem: The pharmacology lens reveals the fundamental gap — steering interventions change outputs, but whether they change only the intended behavior is almost never measured. A drug that cures the disease but causes ten side effects is not well-understood.
Dose-Response Curve
For a typical strong SAE feature (e.g., “Golden Gate Bridge”):
- 0x activation: baseline behavior
- 1x clamping: subtle shift toward feature-related content
- 5x clamping: clear feature-related output (the “demo” regime)
- 10x+ clamping: incoherent, repetitive, or degenerate output
What’s missing:
- No systematic EC₅₀ — at what magnitude does the behavioral shift become reliably detectable?
- No off-target measurement at each dose — fluency, factuality, and other capabilities are not tracked alongside the feature effect
- No comparison across features — do all features have similar dose-response shapes, or do concrete features (Golden Gate Bridge) behave differently from abstract features (deception)?
- No characterization for weak features — the dose-response for the bulk of the dictionary is entirely unknown
The dose-response evidence shows that something happens when you intervene, but the curve’s shape, selectivity boundary, and generality are uncharacterized.
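The missing EC₅₀ estimate could be sketched as follows: sweep clamp multipliers, measure how often a detector flags the steered output as feature-related, and interpolate the 50% crossing. The detection rates below are illustrative numbers chosen to match the qualitative ladder above, not measured data.

```python
import numpy as np

# Hypothetical dose-response data: clamp multiplier vs. fraction of steered
# outputs a (hypothetical) detector flags as feature-related.
multipliers = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
detect_rate = np.array([0.02, 0.15, 0.45, 0.90, 0.95])  # illustrative only

# EC50: the multiplier at which the detection rate crosses 50%.
# np.interp requires the xp argument (detect_rate) to be increasing.
ec50 = np.interp(0.5, detect_rate, multipliers)
print(f"EC50 estimate: {ec50:.1f}x typical activation")

# A full characterization would also track off-target metrics
# (fluency, factuality) at each dose, per criterion E3.
```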
Measurement Theory Lens — Measurement Validity
Is the SAE decomposition a reliable instrument?
Criteria
M1 — Reliability: Weak. Different SAE training runs (different seeds, hyperparameters) produce different dictionaries. The Jaccard overlap between features identified by two independent SAEs is low for most of the dictionary. The instrument’s test-retest reliability is poor.
M2 — Invariance: Not tested. Do SAE features show the same properties when the dictionary is trained on different data subsets? When applied to different layers? Measurement invariance across conditions is not reported.
M3 — Baseline separation: Partial. Strong features (high activation, clear semantic coherence) are clearly separated from noise. But the boundary between “real features” and “dictionary artifacts” is not well-defined. How many of the 16,384 features in a typical SAE are real?
M4 — Sensitivity: Unknown. Can the instrument distinguish between a genuine “deception” feature and a “formal language” feature that happens to co-occur with deception in the training data? The sensitivity to genuine semantic distinctions versus statistical co-occurrence is not characterized.
M5 — Calibration: Not reported. What activation level constitutes “the feature is on”? Thresholds are typically chosen post-hoc. Without calibration, activation magnitudes are hard to interpret.
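A calibrated threshold, by contrast, would be fixed in advance from the feature's activation distribution on a neutral corpus. This is a minimal sketch of that step with synthetic activations; the 99.9th-percentile cutoff is one plausible convention, not a standard.

```python
import numpy as np

# Synthetic stand-in for a feature's activations over a neutral corpus
# (ReLU features are zero on most inputs, positive on a minority).
rng = np.random.default_rng(5)
baseline_acts = np.maximum(rng.normal(size=100_000), 0.0)

# Declare the feature "on" only above the 99.9th percentile of baseline,
# chosen before looking at any labeled examples.
threshold = np.quantile(baseline_acts, 0.999)
firing_rate = (baseline_acts > threshold).mean()
print(f"threshold={threshold:.2f}, baseline firing rate={firing_rate:.4f}")
```

The point is procedural: the threshold is set from the baseline distribution before evaluation, so "the feature is on" has a fixed, reportable meaning.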
M6 — Construct coverage: Weak. Max-activating examples capture the top-activating tail. They do not capture: the feature’s behavior at moderate activations, its interactions with other features, its role in downstream computation, or its boundary cases (what it almost fires on but doesn’t).
| Criterion | Verdict | Key evidence |
|---|---|---|
| M1 Reliability | Weak | Low cross-seed Jaccard for most features |
| M2 Invariance | Not tested | No cross-condition comparison |
| M3 Baseline separation | Partial | Strong features separated; boundary unclear |
| M4 Sensitivity | Unknown | Co-occurrence vs. semantics not distinguished |
| M5 Calibration | Not reported | Post-hoc thresholds |
| M6 Construct coverage | Weak | Max-activating tail only |
Key Distinctions
- Reliability vs validity: Low cross-seed reliability (M1) places a ceiling on validity — if the instrument does not produce the same result twice, the result cannot be valid regardless of how compelling any single run appears. For SAE features, the reliability ceiling is low for most of the dictionary.
- Convergent vs discriminant validity: SAE features lack both. Convergent: does a different decomposition method (NMF, ICA, probing) find the same features? Discriminant: do features that should be distinct (deception vs. sarcasm) actually have low overlap? Neither is systematically tested.
- The instrument creates the object: Unlike probes or circuits (which measure pre-existing model properties), SAEs construct the feature set. The measurement and the measured object are not independent — a core measurement-theoretic concern.
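The reliability-ceiling claim above has a standard quantitative form in classical test theory (stated here as a sketch, assuming the classical true-score model applies to feature measurements):

```latex
% Attenuation bound: the observed correlation between two measures x and y
% is capped by the geometric mean of their reliabilities. If cross-seed
% test-retest reliability r_{xx} is low, any validity coefficient r_{xy}
% is bounded low as well, no matter how good the criterion measure y is.
r_{xy} \le \sqrt{r_{xx}\, r_{yy}}
```

Under this reading, reporting cross-seed reliability is not optional bookkeeping; it directly bounds every downstream validity claim.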
MTMM Matrix
| | SAE seed A | SAE seed B | Probing | Weight analysis |
|---|---|---|---|---|
| SAE seed A | — | low-moderate Jaccard | ? | ? |
| SAE seed B | low-moderate | — | ? | ? |
| Probing | ? | ? | — | ? |
| Weight analysis | ? | ? | ? | — |
The only filled cell (cross-seed SAE comparison) shows low-moderate agreement for most features, with higher agreement for strong features. No cross-method comparisons exist — we do not know if SAE features, probing directions, and weight-space analyses converge on the same representational structure. Without this cross-method comparison, SAE features cannot be validated as model-intrinsic rather than method-specific.
MI Lens — Interpretive Validity
Are the feature labels warranted by the evidence?
Criteria
V1 — Level declaration: Pass. The claim is at the representational level — features are directions in activation space that encode information about inputs.
V2 — Level-evidence match: Weak. The primary evidence for feature identity is behavioral (max-activating examples, steering). But the claim is representational — it asserts that the model encodes this information, not just that manipulating the direction changes behavior. Behavioral evidence (steering) underdetermines representational claims: a direction can produce deceptive outputs when steered without being “the deception representation.”
V3 — Narrative coherence: Variable. “Golden Gate Bridge” is narratively coherent — the feature fires on bridge-related content and steers toward bridge-related output. “Deception” is less coherent — what exactly is the model encoding? Intent to deceive? Surface patterns associated with deceptive text? The narrative coherence varies by feature.
V4 — Alternative exclusion: Not done. For most features, alternative explanations are not considered. A “deception” feature might equally be a “formal language + negation” feature, a “long-sentence” feature, or a “training-data-artifact” feature. Without discriminant testing (C3), alternatives are not excluded.
V5 — Scope honesty: Often missing. Feature labels like “deception” imply a broad, abstract semantic concept. The evidence (max-activating examples from one model, one layer) supports only a narrow scope — “this direction in this layer activates on these inputs.” The label exceeds the evidence.
| Criterion | Verdict | Key evidence |
|---|---|---|
| V1 Level declaration | Pass | Representational level stated |
| V2 Level-evidence match | Weak | Behavioral evidence for representational claim |
| V3 Narrative coherence | Variable | Strong for concrete, weak for abstract features |
| V4 Alternative exclusion | Not done | No discriminant testing |
| V5 Scope honesty | Often missing | Labels exceed evidence scope |
Key Distinctions
- Description vs explanation: SAE features are descriptive (they identify directions that correlate with concepts) but not explanatory (they do not specify the algorithm that produces or uses the representation). The label names the content but not the computation.
- Component identity vs component role: A feature’s identity (its decoder direction) is precisely specified. Its role (how the model uses this direction during inference) is almost entirely uncharacterized. We know what the feature “looks like” but not what it “does.”
- Faithfulness vs understanding: Even features with high steering faithfulness (Golden Gate Bridge) may not represent genuine understanding of the model’s computation — the direction may be exploitable without being the model’s actual representational strategy.
Evidence Convergence Map
- Implementational → Interpretation: Weak. No weight-space evidence identifies features independently. The decoder vectors are products of the SAE training, not independent structural analysis.
- Algorithmic → Interpretation: Very weak. How features interact during inference — which features compose with which, what algorithm they jointly implement — is almost entirely uncharacterized.
- Computational → Interpretation: Moderate for strong features. Steering shows that the direction is functionally relevant (it can shift computation). But “functionally relevant when artificially amplified” is weaker than “used by the model during normal inference.”
Intervention-Interpretation Matrix
| | Necessity | Sufficiency | Representational | Algorithmic | Computational |
|---|---|---|---|---|---|
| Zeroing (ablation) | partial (strong feat.) | — | ∅ | ∅ | ∅ |
| Clamping (steering) | — | partial (strong feat.) | ∅ | ∅ | partial |
| Cross-seed comparison | — | — | partial | — | — |
| Max-activating examples | — | — | circular | — | — |
| Decoder projection | — | — | partial | — | — |
Most cells empty or structurally invalid (∅). The two interventional rows (zeroing, clamping) provide partial evidence for strong features only. The observational rows (max-activating, decoder projection) provide representational evidence that is either circular or partial. No algorithmic evidence exists for any feature.
Causal Sufficiency Graph
- Input → feature activation: dashed (correlation observed via max-activating examples; causal direction not established)
- Feature activation → model behavior: dashed (demonstrated only under supraphysiological clamping; normal-regime causal contribution uncharacterized)
- Feature → downstream features: absent (feature interaction and composition is not mapped)
- Feature → output logits: dashed (decoder vector projects onto logits, but whether this pathway is causally active during normal inference is untested)
No solid edges. The entire causal graph for SAE features operates in the “suggestive but unconfirmed” regime. This is the fundamental interpretive gap: features are identified and labeled, but their causal role in the model’s computation is inferred rather than demonstrated.