# Case Study: Probing Classifiers
Linear probing (Alain & Bengio 2017; Belinkov 2022) trains a linear classifier on model activations to test whether a concept (part-of-speech, syntax tree depth, sentiment, factual knowledge) is linearly decodable from the representation. The claim is representational: if a probe succeeds, the model “encodes” or “represents” the probed concept.
This case study evaluates probing as a methodology rather than a specific circuit. The question is not “does probe X work?” but “what does probe success actually establish?”
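To ground the terminology, here is a minimal sketch of the standard setup. Everything in it is a synthetic stand-in (random arrays in place of real cached activations and concept labels), so only the procedure, not the numbers, is meaningful:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins for cached layer-L activations and binary concept labels.
# In practice the activations come from a forward pass with hidden-state capture.
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 768))
labels = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The "probe" is nothing more than a linear classifier on frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")

# The fitted coefficients define the "probe direction" referred to throughout.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```

Every question this case study raises lives downstream of the last two steps: what, if anything, does a high `probe.score(...)` and the recovered `direction` establish about the model?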
## Composite Verdict
| Lens | Strongest | Weakest | Overall |
|---|---|---|---|
| Construct | C1 Falsifiability (partial) | C5 Convergent validity | Weak |
| Internal | I4 Consistency (partial) | I1/I2/I3 (all untested) | Very weak |
| External | — | All criteria N/A or untested | N/A |
| Measurement | M4 (variable) | M3 Baseline separation | Weak |
| Interpretive | V1 Level declaration | V4/V5 Alternative + Scope | Weak |
Overall verdict: Proposed (without causal follow-up). Standard linear probing, without causal intervention or control tasks, does not advance a claim beyond Proposed. The evidence establishes decodability but not encoding, representation, or use.
Probing with causal follow-up (DAS, causal abstraction, intervention along probe direction) can advance to Causally suggestive. Probing with control tasks + causal intervention + cross-method convergence can reach Mechanistically supported.
This case study illustrates a fundamental principle of the framework: a measurement without an intervention is a measurement without internal validity. Probing measures a property of the activation space. Whether that property is causally relevant to the model’s computation requires a different kind of evidence entirely.
## Philosophy of Science Lens — Construct Validity
Is “the model represents X” a coherent construct when supported only by probe success?
### Criteria
C1 — Falsifiability: Partial. A probe failure (low accuracy) would disconfirm the representation claim. But probe success is ambiguous — it could reflect genuine encoding or could reflect that the concept is linearly separable in any high-dimensional space, even without genuine representation. The falsifiability is asymmetric: failure is informative, success is not clearly so.
C2 — Structural plausibility: Weak. Probes operate on activations, not weights. They do not identify which parameters encode the concept or how the encoding is implemented in the model’s architecture. A probe success is consistent with intentional encoding, accidental encoding, and encoding-as-artifact.
C3 — Task specificity: Variable. Some probes test specific concepts (syntax tree depth). Others test broad concepts (sentiment). The discriminant question — does the probe fail on closely related non-matches? — is often not tested. A “syntax depth” probe might succeed because the activation space also linearly encodes sentence length, which correlates with depth.
C4 — Minimality: Not applicable. Probes identify a direction/subspace, not a circuit. The analog of minimality (is this the minimal subspace encoding the concept?) is rarely tested — probes operate at the layer level and do not investigate whether a lower-dimensional subspace would suffice.
C5 — Convergent validity: Weak. Probing is typically the only method used. Convergent validity would require confirming the representation claim via an independent method — causal intervention along the probe direction, weight-space analysis, or cross-method agreement. When probing alone is the evidence, convergent validity is absent by definition.
| Criterion | Verdict | Key evidence |
|---|---|---|
| C1 Falsifiability | Partial | Failure informative; success ambiguous |
| C2 Structural plausibility | Weak | Activation-level, not weight-level |
| C3 Task specificity | Variable | Confounds often untested |
| C4 Minimality | N/A | Not a circuit-level claim |
| C5 Convergent validity | Weak | Usually single-method |
### Key Distinctions
- Confirmation vs corroboration: Probe success is confirmation (the probe finds what you looked for) without corroboration (no independent method verifies the same representation claim). A concept that is decodable by a probe AND causally manipulable by intervention along the probe direction would constitute genuine corroboration.
- Underdetermination: High probe accuracy is consistent with multiple explanations — genuine encoding, incidental linear separability, confound encoding (correlated feature rather than target feature). The probe accuracy underdetermines the representational claim.
- Operationalism vs realism: “The model represents syntax depth” is a realist claim. “A linear classifier achieves 85% accuracy on syntax depth labels from layer 6 activations” is an operationalist statement. Probing provides the operationalist evidence; the realist interpretation requires additional justification.
### Nomological Network
The probing-based representation claim connects to:
- Linear decodability — a probe can extract the concept from activations (observational, confirmed by definition)
- Causal use — the model reads from this direction during inference (causal, almost always untested)
- Weight-space grounding — specific parameters implement the encoding (structural, untested)
- Cross-layer consistency — the representation persists or transforms predictably across layers (partial, sometimes tested)
- Control task separation — accuracy exceeds random-label baseline (confound control, often untested)
- Cross-distribution transfer — probe generalizes to new text domains (robustness, sometimes tested)
- Intervention effect — patching along probe direction changes behavior (causal, tested only in DAS-style follow-ups)
One node confirmed by construction (decodability), one or two partially tested in good papers, four typically untested. The nomological network for standard probing is extremely thin — most of the predictive power of the “model represents X” claim is never tested.
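Several of the observational nodes are cheap to test directly. A sketch of the cross-distribution transfer check, again with synthetic stand-ins for the two domains' activations and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
d_model = 768

# Hypothetical activations/labels from two text domains (e.g. news vs. fiction).
X_domain_a, y_domain_a = rng.normal(size=(1500, d_model)), rng.integers(0, 2, 1500)
X_domain_b, y_domain_b = rng.normal(size=(500, d_model)), rng.integers(0, 2, 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_domain_a, y_domain_a, test_size=0.3, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(f"held-out in-domain: {probe.score(X_te, y_te):.3f}")
print(f"cross-domain:       {probe.score(X_domain_b, y_domain_b):.3f}")
# A large in-domain/cross-domain gap suggests the probe latched onto
# distribution-specific structure rather than a model-intrinsic encoding.
```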
## Neuroscience Lens — Internal Validity
Does the evidence establish that the probed representation is causally used?
### Criteria
I1 — Necessity: Not tested (typically). Standard probing does not ablate the identified direction and measure behavioral change. Without this, we do not know if the probed representation is causally used by the model. It could be a byproduct that the model never reads from.
I2 — Sufficiency: Not tested (typically). Patching along the probe direction to shift behavior is the sufficiency test. DAS (Geiger et al. 2023) does this — it is a probe + causal intervention combined. Standard probing without intervention tests neither necessity nor sufficiency.
I3 — Specificity: Not tested. Does intervening on the probed direction only change the probed concept’s behavior? Without the intervention, this cannot be assessed.
I4 — Consistency: Partial. Probes are typically tested on a single distribution. Cross-distribution consistency (different text domains, different model checkpoints) is sometimes reported but not standard.
I5 — Confound control: Weak — the critical gap. The fundamental confound: in high-dimensional activation spaces, many concepts are linearly decodable even from random representations (Hewitt & Liang 2019, “control tasks”). A probe may succeed not because the model encodes the concept but because the concept is linearly separable in any space with sufficient dimensionality. Without a control task (probing on random labels or on a related-but-different concept), probe accuracy is uninterpretable.
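The control-task check is mechanically simple, which makes its frequent omission notable. A sketch in the spirit of Hewitt & Liang (2019), where each word type receives a fixed random label so that a probe can succeed on the control only by memorization (synthetic stand-ins throughout):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_tokens, d_model, vocab = 4000, 768, 200

X = rng.normal(size=(n_tokens, d_model))       # stand-in for token activations
token_type = rng.integers(0, vocab, n_tokens)  # which word type each token is
y_task = rng.integers(0, 2, n_tokens)          # stand-in for real concept labels

# Control task (Hewitt & Liang 2019): each word TYPE gets a fixed random label,
# so a probe can only succeed by memorizing type -> label, not by reading a concept.
type_to_random_label = rng.integers(0, 2, vocab)
y_control = type_to_random_label[token_type]

def held_out_accuracy(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

acc_task = held_out_accuracy(X, y_task)
acc_control = held_out_accuracy(X, y_control)
print(f"task: {acc_task:.3f}  control: {acc_control:.3f}  "
      f"selectivity: {acc_task - acc_control:.3f}")
# On this pure-noise stand-in both numbers sit near chance; with real
# activations, which encode word identity, control accuracy is often far
# above chance, which is exactly why the baseline is needed.
```

Selectivity, the task-minus-control gap, is what licenses interpreting probe accuracy at all; accuracy alone does not.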
| Criterion | Verdict | Key evidence |
|---|---|---|
| I1 Necessity | Not tested | No ablation of probed direction |
| I2 Sufficiency | Not tested | No intervention along probe |
| I3 Specificity | Not tested | No off-target measurement |
| I4 Consistency | Partial | Single distribution typical |
| I5 Confound control | Weak | High-dimensional separability confound |
### Key Distinctions
- Single vs double dissociation: Standard probing provides no dissociation evidence at all — it is purely observational. DAS-style extensions provide single dissociation (intervening on the direction changes the target behavior). Double dissociation (intervening on direction A does NOT change behavior B, and intervening on direction B does NOT change behavior A) is almost never tested in the probing literature.
- Lesion vs stimulation: Probing uses neither. It is a passive measurement — the neural analog of recording without intervening. The transition from “recordable” to “causally relevant” requires an intervention that probing does not provide.
### Dissociation Matrix
| | Probed concept (behavior) | Related concept (behavior) | Unrelated concept (behavior) |
|---|---|---|---|
| Ablate probed direction | ? | ? | ? |
| Ablate related direction | ? | ? | ? |
| Patch along probed direction | ? (DAS only) | ? | ? |
Entirely empty for standard probing. Even DAS-style extensions fill at most one cell (patch along probed direction → probed concept changes). The matrix makes visible the complete absence of causal evidence in standard probing — every cell is a “?” that would need to be filled to establish that the probed representation is causally real.
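For concreteness, a sketch of what filling the top-left cell involves: project the probe direction out of the activations (a rank-one ablation) and re-measure the probed behavior. Here `model_behavior` is a hypothetical placeholder for re-running the model from the probed layer and scoring the behavior; it is not a real API:

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project out a unit direction from each activation vector (rank-one ablation)."""
    d = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d, d)

# Hypothetical pieces: `acts` captured at the probed layer and `direction`
# taken from a fitted probe (see the earlier sketch).
rng = np.random.default_rng(3)
acts = rng.normal(size=(512, 768))
direction = rng.normal(size=768)

def model_behavior(activations):   # placeholder, not a real API: stands in for
    return float(activations.mean())  # resuming the forward pass and scoring the task

baseline = model_behavior(acts)
ablated = model_behavior(ablate_direction(acts, direction))
print(f"behavior change under ablation: {ablated - baseline:+.4f}")
# A near-zero change is evidence the model does not read from this direction:
# the probed representation may be decodable but causally inert.
```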
## Pharmacology Lens — External Validity
Does intervening on the probed direction produce predictable downstream effects?
### Criteria
E1 — Intervention reach: Not tested. Standard probes do not intervene. DAS-style extensions do, and they sometimes find that intervention along the probe direction does not produce the expected behavioral change — the representation is present but not causally used.
E2 — Graded response: N/A. No intervention = no dose-response.
E3 — Selectivity: N/A. No intervention = no selectivity assessment.
E4 — Effect magnitude: N/A. Probe accuracy is a measurement, not an intervention effect.
E5 — Robustness: Variable. Some probes generalize across text types. Many do not — performance degrades on out-of-distribution text.
E6 — Cross-architecture: Partial. Probing has been applied across many architectures. Cross-model comparison of probe results is informative but rarely done as a systematic convergent validity check.
| Criterion | Verdict | Key evidence |
|---|---|---|
| E1 Intervention reach | Not tested | Probes don’t intervene |
| E2 Graded response | N/A | — |
| E3 Selectivity | N/A | — |
| E4 Effect magnitude | N/A | — |
| E5 Robustness | Variable | Domain-dependent |
| E6 Cross-architecture | Partial | Applied broadly, rarely compared |
### Key Distinctions
- Affinity vs efficacy: Probing measures affinity only (the concept is present/decodable in the activation space). Efficacy (whether the model uses this information to drive behavior) is entirely untested. A representation with high affinity and zero efficacy is a byproduct, not a functional encoding.
- The fundamental pharmacological gap: A drug that binds to a receptor (affinity) but produces no physiological effect (no efficacy) is not a therapeutic agent. Similarly, a probe that detects a concept (affinity) but provides no evidence of causal use (no efficacy) does not establish functional representation. Probing is a binding assay, not a clinical trial.
### Dose-Response Curve
Standard probing produces no dose-response data — there is no intervention to dose. The closest analogs:
- Probe accuracy as a function of layer — sometimes shows a curve (accuracy increases, peaks, then decreases across layers). This is informative about where information is available but says nothing about causal use.
- DAS intervention strength — when DAS-style follow-ups are performed, patching at varying strengths can produce a dose-response. But this is no longer standard probing; it is a different methodology.
For standard probing, the dose-response section is N/A — no intervention means no curve to characterize.
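To make the contrast concrete, a sketch of the DAS-style dose-response that standard probing lacks: add the probe direction at graded strengths and record a behavioral readout at each strength. This is a steering-style simplification of DAS, and `behavior_score` is a placeholder, not a real API:

```python
import numpy as np

rng = np.random.default_rng(4)
acts = rng.normal(size=(512, 768))      # stand-in for layer activations
direction = rng.normal(size=768)
direction /= np.linalg.norm(direction)

def behavior_score(activations):        # placeholder: a real readout would be a
    return float((activations @ direction).mean())  # task metric, e.g. a logit gap

# Sweep intervention strength alpha: the analog of dose in a dose-response curve.
for alpha in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
    patched = acts + alpha * direction  # add alpha * direction to every vector
    print(f"alpha={alpha:4.1f}  behavior={behavior_score(patched):+.3f}")
# (With this placeholder the curve is linear by construction.) In a real
# experiment, a graded, monotone response supports causal use of the direction;
# a flat curve suggests the decodable signal is behaviorally inert.
```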
## Measurement Theory Lens — Measurement Validity
Is the probe a reliable and well-calibrated instrument?
### Criteria
M1 — Reliability: Partial. Probe accuracy varies with probe architecture, training hyperparameters, layer choice, and dataset. The same concept probed with different settings gives different accuracy numbers. Reliability is conditional on methodological choices.
M2 — Invariance: Weak. A probe trained on one dataset may not transfer to another. The measurement is distribution-specific rather than model-intrinsic.
M3 — Baseline separation: Critical gap (without controls). Without a control task (probing random labels, or probing a selectional baseline), probe accuracy is uninterpretable. Hewitt & Liang (2019) show that probes on random labels can achieve non-trivial accuracy in high-dimensional spaces. The baseline is essential and often missing.
M4 — Sensitivity: Variable. Linear probes may miss nonlinear representations. This is a known limitation — choosing a linear probe is both a strength (constraining the hypothesis) and a weakness (potentially missing genuine encodings that are nonlinear).
M5 — Calibration: Poorly understood. What does 75% probe accuracy mean? Without calibration against models of known representational capacity, the number is hard to interpret. Is 75% “the model represents this concept weakly” or “the probe is underpowered”?
M6 — Construct coverage: Partial. Probes measure decodability at a single layer. They do not measure: how the representation is formed (across layers), how it is used (downstream effects), or its role in computation.
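The minimal reliability check M1 asks for is a spread, not a point estimate: re-fit the probe under varied splits and regularization strengths and report the range (synthetic stand-ins as before):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 768))       # stand-in for activations
y = rng.integers(0, 2, 2000)           # stand-in for concept labels

scores = []
for seed in range(5):                  # resample the train/test split
    for C in [0.01, 0.1, 1.0, 10.0]:   # vary probe regularization
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        acc = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
        scores.append(acc)

# Report a range, not a point estimate: a wide spread means single-number
# probe accuracies are not a reliable measurement.
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f} "
      f"(min {min(scores):.3f}, max {max(scores):.3f})")
```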
| Criterion | Verdict | Key evidence |
|---|---|---|
| M1 Reliability | Partial | Sensitive to methodological choices |
| M2 Invariance | Weak | Distribution-specific |
| M3 Baseline separation | Critical gap | Without control tasks, uninterpretable |
| M4 Sensitivity | Variable | Linear constraint: feature and limitation |
| M5 Calibration | Poorly understood | Accuracy numbers hard to interpret |
| M6 Construct coverage | Partial | Single-layer decodability only |
### Key Distinctions
- Reliability vs validity: Probe results are unreliable (vary with hyperparameters, architecture, dataset) and of uncertain validity (decodability does not equal representation). The combination is particularly concerning — we cannot even confirm that the instrument produces stable measurements, let alone that those measurements reflect something real about the model.
- Convergent vs discriminant validity: Standard probing provides neither. Convergent: does a different method (causal intervention, weight analysis) confirm the same representation? Discriminant: does the probe for concept A give low scores when applied to concept B? Both are almost always absent.
- The instrument may create the signal: A sufficiently powerful probe can decode many concepts from any high-dimensional space. The “measurement” may be a property of the probe (its capacity to find linear separability) rather than a property of the model (its representational content). This is the Hewitt & Liang insight formalized as a measurement-theoretic concern.
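The point can be demonstrated in a few lines: on pure noise, where there is by construction nothing to represent, a linear probe fits arbitrary labels once the dimensionality is large relative to the number of examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

# Pure noise "activations": there is no representational content to find.
for n, d in [(200, 64), (200, 512), (200, 4096)]:
    X = rng.normal(size=(n, d))
    y = rng.integers(0, 2, n)
    acc = LogisticRegression(max_iter=2000).fit(X, y).score(X, y)
    print(f"n={n:4d} d={d:5d}  training accuracy on random labels: {acc:.3f}")
# As d grows past n, training accuracy on random labels approaches 1.0.
# Without held-out evaluation and control tasks, "the probe succeeds" can be
# a fact about the probe's capacity, not about the model's content.
```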
### MTMM Matrix
| | Linear probe (concept A) | Linear probe (concept B) | DAS (concept A) | Weight analysis (concept A) |
|---|---|---|---|---|
| Linear probe (concept A) | — | ? (discriminant) | ? (convergent) | ? (convergent) |
| Linear probe (concept B) | ? | — | ? | ? |
| DAS (concept A) | ? | ? | — | ? |
| Weight analysis (concept A) | ? | ? | ? | — |
Entirely empty. Standard probing does not produce MTMM data because it uses one method on one concept at a time. The absence is not incidental — it reflects the fundamental single-method nature of probing. Filling even one convergent cell (does DAS confirm what the probe found?) would substantially strengthen the representational claim; filling a discriminant cell (does the probe for A score low on B?) would address confound concerns. Neither is standard practice.
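Filling the discriminant cell is nearly free once both label sets exist: evaluate the probe trained for concept A against concept B's labels (synthetic stand-ins here; with real data, concept B would be a plausible confound such as sentence length):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(2000, 768))        # stand-in for activations
y_concept_a = rng.integers(0, 2, 2000)  # e.g. syntax depth above a threshold
y_concept_b = rng.integers(0, 2, 2000)  # e.g. sentence length (related confound)

probe_a = LogisticRegression(max_iter=1000).fit(X, y_concept_a)

# Discriminant check: the probe trained for concept A should NOT score well
# when its predictions are evaluated against concept B's labels.
print(f"probe A vs A labels: {probe_a.score(X, y_concept_a):.3f}")
print(f"probe A vs B labels: {probe_a.score(X, y_concept_b):.3f}")
# If both scores are high, the probe may be reading a shared confound
# rather than the target concept.
```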
## MI Lens — Interpretive Validity
Is “the model represents X” warranted by probe success?
### Criteria
V1 — Level declaration: Pass. Probing makes a representational claim — the model encodes a concept.
V2 — Level-evidence match: Partial. Probing provides representational evidence (the concept is decodable). But “decodable” does not equal “encoded.” The model may not use this information — it could be a byproduct of other computations. The evidence is necessary but not sufficient for the representational claim.
V3 — Narrative coherence: Moderate. “The model represents X because a probe can decode X” is a coherent story, but the inference is fragile. A better narrative requires showing that the model uses the representation — which probing alone cannot do.
V4 — Alternative exclusion: Weak. The primary alternative — “the concept is linearly separable as a geometric artifact of the embedding, not because the model intentionally encodes it” — is not excluded by standard probing. Control tasks partially address this but do not fully resolve it.
V5 — Scope honesty: Often violated. Probe success is often reported as “the model represents X” when the evidence supports only “X is linearly decodable from layer L on this dataset.” The gap between these statements is substantial.
| Criterion | Verdict | Key evidence |
|---|---|---|
| V1 Level declaration | Pass | Representational |
| V2 Level-evidence match | Partial | Decodable does not equal encoded |
| V3 Narrative coherence | Moderate | Story is coherent but underdetermined |
| V4 Alternative exclusion | Weak | Geometric artifact not excluded |
| V5 Scope honesty | Often violated | “Represents” exceeds “decodable” |
### Key Distinctions
- Description vs explanation: Probing is purely descriptive — it identifies that a concept is decodable but does not explain the mechanism that produces the encoding or the computation that uses it. The gap from description to explanation requires causal evidence that probing does not provide.
- Component identity vs component role: Probing identifies a direction (component identity) but not its role in the model’s computation. The probe direction may be a real computational axis or a geometric artifact — probing alone cannot distinguish these.
- Faithfulness vs understanding: Probe accuracy measures faithfulness of the external classifier, not understanding of the model. High probe accuracy means the external classifier faithfully decodes the concept — it does not mean we understand how or why the model represents it.
### Evidence Convergence Map
- Implementational → Interpretation: Absent. Probing does not identify implementing parameters or structural mechanisms. The direction found by the probe is not linked to specific weights.
- Algorithmic → Interpretation: Absent. Probing does not specify the algorithm that produces or reads the representation. The computational steps are entirely uncharacterized.
- Computational → Interpretation: Weak. Probing establishes that the concept is available in the representation (a computational-level observation). But availability does not imply use — the model may have access to information it never reads.
### Intervention-Interpretation Matrix
| | Necessity | Sufficiency | Representational | Algorithmic | Computational |
|---|---|---|---|---|---|
| Standard probing | — | — | observational only | — | — |
| Probe + control task | — | — | partial (excludes artifact) | — | — |
| DAS (probe + intervention) | partial | partial | ✓ | — | partial |
| Probe + weight analysis | — | — | convergent | — | — |
Standard probing fills zero interventional cells. It provides observational representational evidence only. Each methodological extension (control tasks, DAS, weight-space convergence) fills additional cells, progressively strengthening the claim. The matrix makes visible that standard probing is a starting point, not an endpoint — the interpretive claim requires evidence that probing alone cannot provide.
### Causal Sufficiency Graph
- Input properties → activation pattern: solid (by construction — the model processes the input)
- Activation pattern → linear decodability: solid (the probe demonstrates this)
- Linear decodability → model representation: dashed (the inferential leap — decodable does not imply represented)
- Model representation → downstream computation: absent (probing provides no evidence about use)
- Probe direction → model’s actual encoding direction: dashed (the probe finds a direction; whether it is the direction the model uses is unverified)
Two solid edges (trivial: inputs produce activations, probes can decode them) and three dashed-or-absent edges (substantive: does decodability imply representation? does the model use this direction?). The causal sufficiency graph makes visible that the interesting claims — those that go beyond “a classifier works” — have no solid causal support from probing alone.