Skip to content
Tier1 of 5 (progressive)
What it meansA claim is falsifiable and measured, but not yet causally tested
Minimum evidenceDefined construct + falsifiable prediction + at least one measurement
Upgrade to Causally suggestiveAt least one well-controlled causal experiment demonstrating necessity (I1)
DowngradeN/A (lowest progressive tier)

A Proposed claim has crossed the threshold from speculation into science: the entity is named, its boundaries are stated, and at least one quantitative measurement exists. What it has not done is demonstrate causal relevance. The evidence is correlational, structural, or statistical — probing accuracy, cosine similarity, activation patterns — but no intervention has been performed.

This is not a criticism. Many important findings begin here, and many remain here because the relevant causal experiments are expensive or technically difficult. The tier exists to distinguish “well-posed but untested” from “causally established,” preventing the conflation of correlation with mechanism.

A claim can remain at Proposed indefinitely without being wrong or uninteresting. What it cannot do is claim mechanistic status without causal evidence.

Verdict: Proposed — [representational-statistical] Claim: SAE feature f42f_{42} in GPT-2 Small layer 8 represents noun-hood. Met: Defined construct (noun-hood), falsifiable prediction (feature activates selectively on nouns), measurement (cosine similarity = 0.82 with probing direction, top-20 contexts are 18/20 nouns) Open: I1 (necessity), I2 (sufficiency), I3 (specificity vs. word frequency) Scope: GPT-2 Small, residual stream layer 8, Pile-10k distribution

  • Name of the construct and its operational definition
  • The specific measurement instrument (probe architecture, SAE variant, similarity metric)
  • At least one quantitative result with confidence interval or equivalent
  • Statement of what causal experiment would move the claim to Tier 2
DirectionWhat’s required
→ Causally suggestiveAt least one well-controlled ablation or patching experiment showing the mechanism is necessary (I1), with random-component control and named method
→ DisconfirmedThe measurement is shown to be artifactual (e.g., the probe achieves equal accuracy on a random feature direction)
  • SAE feature descriptions (Bricken et al., 2023) — monosemantic features identified by activation patterns, prior to causal intervention
  • Linear probing claims for Othello board state, syntactic number, or sentiment — high accuracy without interchange intervention
  • Weight-space structural analyses — SVD-based role identification, OV/QK decompositions — without behavioral confirmation
  • Superposition geometry (Elhage et al., 2022) in real models — geometric structure identified but causal role untested
Worked example: SAE features at Tier 1

A sparse autoencoder trained on GPT-2 Small residual stream activations produces a feature f42f_{42} whose decoder direction has high cosine similarity with the “is_noun” probing direction, whose top-activating contexts are predominantly nouns, and whose activation magnitude correlates with the model’s confidence on syntactic tasks.

This is a Proposed claim. The evidence is correlational and structural: the feature looks like it represents noun-hood. But no intervention has been performed. We do not know whether the feature is causally relevant to noun-related computation (I1), whether it is sufficient (I2), or whether it is specific to noun-hood rather than a correlated property like word frequency (I3).

The claim is well-posed (falsifiable, with a defined construct and quantitative measurements). It simply hasn’t been causally tested. Moving to Tier 2 requires ablating or patching the feature and demonstrating a noun-specific behavioral change.