Verdict Tier 1: Proposed
| Tier | 1 of 5 (progressive) |
| What it means | A claim is falsifiable and measured, but not yet causally tested |
| Minimum evidence | Defined construct + falsifiable prediction + at least one measurement |
| Upgrade to Causally suggestive | At least one well-controlled causal experiment demonstrating necessity (I1) |
| Downgrade | N/A (lowest progressive tier) |
What this tier establishes
A Proposed claim has crossed the threshold from speculation into science: the entity is named, its boundaries are stated, and at least one quantitative measurement exists. What it has not done is demonstrate causal relevance. The evidence is correlational, structural, or statistical (probing accuracy, cosine similarity, activation patterns), but no intervention has been performed.
This is not a criticism. Many important findings begin here, and many remain here because the relevant causal experiments are expensive or technically difficult. The tier exists to distinguish “well-posed but untested” from “causally established,” preventing the conflation of correlation with mechanism.
A claim can remain at Proposed indefinitely without being wrong or uninteresting. What it cannot do is claim mechanistic status without causal evidence.
Example verdict statement
Verdict: Proposed [representational-statistical]
- Claim: SAE feature in GPT-2 Small layer 8 represents noun-hood.
- Met: Defined construct (noun-hood), falsifiable prediction (feature activates selectively on nouns), measurement (cosine similarity = 0.82 with probing direction, top-20 contexts are 18/20 nouns)
- Open: I1 (necessity), I2 (sufficiency), I3 (specificity vs. word frequency)
- Scope: GPT-2 Small, residual stream layer 8, Pile-10k distribution
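A verdict statement like the one above can be held in a small record so that the Tier 1 minimum (defined construct, falsifiable prediction, at least one measurement) is checked mechanically. The following is a minimal sketch only: the `Verdict` class, its field names, and `meets_proposed_minimum` are hypothetical and are not defined anywhere in this framework.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Hypothetical record mirroring the fields of a verdict statement."""

    tier: str
    evidence_type: str
    claim: str
    construct: str = ""
    prediction: str = ""
    measurements: list[str] = field(default_factory=list)
    open_criteria: list[str] = field(default_factory=list)
    scope: str = ""

    def meets_proposed_minimum(self) -> bool:
        # Tier 1 minimum: defined construct + falsifiable prediction
        # + at least one measurement.
        return bool(self.construct and self.prediction and self.measurements)


noun_verdict = Verdict(
    tier="Proposed",
    evidence_type="representational-statistical",
    claim="SAE feature in GPT-2 Small layer 8 represents noun-hood",
    construct="noun-hood",
    prediction="feature activates selectively on nouns",
    measurements=[
        "cosine similarity = 0.82 with probing direction",
        "top-20 contexts are 18/20 nouns",
    ],
    open_criteria=[
        "I1 (necessity)",
        "I2 (sufficiency)",
        "I3 (specificity vs. word frequency)",
    ],
    scope="GPT-2 Small, residual stream layer 8, Pile-10k distribution",
)

print(noun_verdict.meets_proposed_minimum())
```

A record with no construct, prediction, or measurement fails the check, which matches the idea that such a claim has not yet reached Tier 1.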
Minimum reporting for this tier
- Name of the construct and its operational definition
- The specific measurement instrument (probe architecture, SAE variant, similarity metric)
- At least one quantitative result with confidence interval or equivalent
- Statement of what causal experiment would move the claim to Tier 2
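As an illustration of the "confidence interval or equivalent" item, the sketch below attaches a Wilson score interval to a count such as the 18/20 noun contexts in the example verdict. The choice of the Wilson interval and the function name `wilson_interval` are assumptions made for illustration; the framework does not prescribe a particular interval. Top-activating contexts are also not an independent random sample, so the result should be read as a rough indication of uncertainty rather than a calibrated one.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(18, 20)
print(f"18/20 nouns: point estimate {18 / 20:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With only 20 contexts the interval is wide (roughly 0.70 to 0.97), which is a reason to report it alongside the point estimate.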
Upgrade and downgrade
| Direction | What’s required |
|---|---|
| → Causally suggestive | At least one well-controlled ablation or patching experiment showing the mechanism is necessary (I1), with random-component control and named method |
| → Disconfirmed | The measurement is shown to be artifactual (e.g., the probe achieves equal accuracy on a random feature direction) |
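The artifact check in the Disconfirmed row can be sketched as a comparison between the candidate direction and a population of random directions. The code below is an illustrative assumption, not the framework's procedure: it uses synthetic Gaussian vectors in place of real activations, a sign-of-projection classifier in place of a trained probe, and hypothetical helper names throughout.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def direction_accuracy(direction, activations, labels):
    """Accuracy when classifying by the sign of the projection onto `direction`.

    The better of the two sign conventions is reported, so a random
    direction is not penalized for pointing the "wrong" way.
    """
    hits = sum((dot(a, direction) > 0) == y for a, y in zip(activations, labels))
    acc = hits / len(labels)
    return max(acc, 1 - acc)


def random_direction_accuracies(activations, labels, n_random=200, seed=0):
    """Accuracies of `n_random` unit directions drawn uniformly at random."""
    rng = random.Random(seed)
    dim = len(activations[0])
    return [
        direction_accuracy(unit([rng.gauss(0, 1) for _ in range(dim)]), activations, labels)
        for _ in range(n_random)
    ]


# Synthetic stand-in data (an assumption for illustration): 16-dimensional
# Gaussian "activations" whose label is the sign of coordinate 0.
rng = random.Random(1)
activations = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(400)]
labels = [a[0] > 0 for a in activations]

candidate = [1.0] + [0.0] * 15  # the direction under test
candidate_acc = direction_accuracy(candidate, activations, labels)
baseline = random_direction_accuracies(activations, labels)

# Empirical p-value: share of random directions doing at least as well.
p_value = (1 + sum(b >= candidate_acc for b in baseline)) / (1 + len(baseline))
print(f"candidate={candidate_acc:.2f}  best random={max(baseline):.2f}  p={p_value:.3f}")
```

If random directions routinely matched the candidate's accuracy, the measurement would look artifactual in the sense of the table above. A candidate that clearly beats the random population keeps the claim at Proposed; it does not by itself supply the causal evidence needed for the next tier.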
Characteristic occupants
- SAE feature descriptions (Bricken et al., 2023): monosemantic features identified by activation patterns, prior to causal intervention
- Linear probing claims for Othello board state, syntactic number, or sentiment: high accuracy without interchange intervention
- Weight-space structural analyses (SVD-based role identification, OV/QK decompositions) without behavioral confirmation
- Superposition geometry (Elhage et al., 2022) in real models: geometric structure identified but causal role untested
Key references
- Bricken et al. (2023). Towards Monosemanticity. Transformer Circuits
- Elhage et al. (2022). Toy Models of Superposition. arXiv:2209.10652
- Hill, A. B. (1965). The Environment and Disease: Association or Causation? doi:10.1177/003591576505800503
- GRADE Working Group (2004). Grading quality of evidence. doi:10.1136/bmj.328.7454.1490
Worked example: SAE features at Tier 1
A sparse autoencoder trained on GPT-2 Small residual stream activations produces a feature whose decoder direction has high cosine similarity with the “is_noun” probing direction, whose top-activating contexts are predominantly nouns, and whose activation magnitude correlates with the model’s confidence on syntactic tasks.
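The cosine-similarity measurement in this example amounts to a normalized dot product between two direction vectors. The sketch below uses toy 4-dimensional vectors as stand-ins; the names `decoder_direction` and `probe_direction` and their values are made up for illustration and do not come from any real SAE or probe.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Toy stand-ins for an SAE decoder column and a probe weight vector.
decoder_direction = [0.9, 0.1, 0.3, 0.0]
probe_direction = [1.0, 0.0, 0.2, 0.1]

similarity = cosine_similarity(decoder_direction, probe_direction)
print(f"cosine similarity: {similarity:.3f}")
```

A high value says only that the two directions are geometrically aligned. It is a structural measurement of the kind this tier accepts, not evidence that the model uses the direction.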
This is a Proposed claim. The evidence is correlational and structural: the feature looks like it represents noun-hood. But no intervention has been performed. We do not know whether the feature is causally relevant to noun-related computation (I1), whether it is sufficient (I2), or whether it is specific to noun-hood rather than a correlated property like word frequency (I3).
The claim is well-posed (falsifiable, with a defined construct and quantitative measurements). It simply hasn’t been causally tested. Moving to Tier 2 requires ablating or patching the feature and demonstrating a noun-specific behavioral change.
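One possible form of the ablation mentioned above is directional (projection) ablation: removing the component of an activation vector along the feature's direction. The sketch below shows only that vector operation on toy data. It is an assumption about how the intervention might be implemented, since the text does not prescribe a method, and alternatives such as zeroing the SAE feature's activation, mean ablation, or activation patching are equally compatible with it. A real Tier 2 experiment would also need a hooked model, a noun-specific behavioral metric, and a random-component control, none of which are shown here.

```python
import math


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy residual-stream vector and feature direction (illustrative values).
activation = [2.0, -1.0, 0.5]
feature_direction = [1.0, 1.0, 0.0]

ablated = ablate_direction(activation, feature_direction)
residual = sum(a * d for a, d in zip(ablated, feature_direction))
print(ablated, residual)
```

After ablation the vector has no remaining component along the feature direction, while components orthogonal to it are left unchanged.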