
This framework asks: Do methods that measure different things actually produce different answers?

If a circuit importance score for IOI correlates just as highly with a circuit importance score for SVA as it does with another IOI method, then the instrument is not discriminating between tasks — it may be capturing a generic property like head norm rather than task-specific circuit membership. Discriminant validity is the “different trait, same method” check.

A good measurement framework produces high convergent validity (F03) and high discriminant validity: instruments agree when they should, and disagree when they should. Without discriminant validity, a high score might simply mean “large weights” rather than “important for this specific task.”

| Source | Year | Key contribution |
| --- | --- | --- |
| Campbell & Fiske, “Convergent and discriminant validation by the MTMM matrix” | 1959 | Defined the MTMM framework, including discriminant cells |
| Messick, “Validity of Psychological Assessment” | 1995 | Unified construct validity, including discrimination |
| Geiger et al., “Causal Abstractions of Neural Networks” | 2021 | Task-specific causal structure in neural networks |
| Conmy et al., “Towards Automated Circuit Discovery” | 2023 | ACDC — method-specific circuits per task |

Given importance scores from the same method applied to two different tasks ( T_1, T_2 ):

[ r_{\text{disc}} = \text{Spearman}(\mathbf{a}^{T_1}, \mathbf{a}^{T_2}) ]

Discriminant validity requires ( r_{\text{disc}} < r_{\text{conv}} ). We compute the discriminant ratio:

[ D = 1 - \frac{r_{\text{disc}}}{r_{\text{conv}}} ]

Values of ( D > 0.5 ) indicate good discrimination: the instrument captures task-specific structure rather than generic head properties. When ( D \approx 0 ), the same heads are flagged regardless of task, suggesting the method is insensitive to the construct.
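The two formulas above can be sketched directly. This is a minimal, self-contained illustration: the score vectors are made-up per-head importance values (one method on tasks ( T_1 ) and ( T_2 ), plus a second instrument on ( T_1 ) for the convergent pair), and the Spearman correlation is implemented by hand to keep the sketch dependency-free.

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # mean rank over the tie group
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def discriminant_ratio(a_t1, a_t2, a_t1_alt):
    """D = 1 - r_disc / r_conv, per the definitions above."""
    r_disc = spearman(a_t1, a_t2)      # same method, different tasks
    r_conv = spearman(a_t1, a_t1_alt)  # different methods, same task
    return 1 - r_disc / r_conv

# Toy per-head scores (hypothetical, for illustration only):
a_ioi = [0.9, 0.1, 0.8, 0.2, 0.7]    # method A on IOI
b_ioi = [0.85, 0.2, 0.75, 0.15, 0.6] # method B on IOI (convergent pair)
a_sva = [0.2, 0.3, 0.9, 0.1, 0.5]    # method A on SVA (discriminant pair)
print(discriminant_ratio(a_ioi, a_sva, b_ioi))  # D ≈ 0.78 here
```

With these toy vectors, ( r_{\text{conv}} = 0.9 ) and ( r_{\text{disc}} = 0.2 ), so ( D \approx 0.78 ): the instrument agrees with itself across methods on the same task far more than with itself across tasks.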

Discriminant Validity (17_discriminant_validity.py)


Computes cross-task correlations for each circuit-identification method and compares them against within-task convergent correlations. Outputs the MTMM matrix with convergent (diagonal) and discriminant (off-diagonal) cells highlighted.

What it establishes: That circuit scores are task-specific — the method captures distinct constructs for distinct tasks. What it does not establish: Which task decomposition is correct — only that the method differentiates between the tasks provided.
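The MTMM bookkeeping can be sketched independently of the correlation values: each row and column of the matrix is a (method, task) pair, and each cell is classified by whether it shares the task, the method, or neither. The method and task names below come from the usage example; the cell labels are illustrative, not the script's actual output format.

```python
from itertools import combinations

methods = ["weight", "activation"]
tasks = ["ioi", "sva", "greater_than"]

# Each MTMM row/column is one (method, task) instrument.
units = [(m, t) for m in methods for t in tasks]

def cell_kind(u, v):
    """Classify an MTMM cell by what the two instruments share."""
    (m1, t1), (m2, t2) = u, v
    if t1 == t2 and m1 != m2:
        return "convergent"      # same task, different methods
    if m1 == m2 and t1 != t2:
        return "discriminant"    # same method, different tasks
    return "hetero-hetero"       # different method and different task

for u, v in combinations(units, 2):
    print(f"{u} x {v}: {cell_kind(u, v)}")
```

Discriminant validity asks that every "convergent" cell outrank every "discriminant" cell sharing an instrument with it, which is exactly the ( r_{\text{disc}} < r_{\text{conv}} ) condition.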

Usage:

```shell
uv run python 17_discriminant_validity.py --tasks ioi sva greater_than --methods weight activation
```
| Pattern | What it means |
| --- | --- |
| D > 0.6 | Strong discrimination — circuits are task-specific |
| D 0.3–0.6 | Moderate — some shared structure but meaningful differentiation |
| D < 0.3 | Weak — method may be capturing generic properties (norm, layer depth) |
| Cross-task kappa > within-task kappa | Method failure — “different” tasks produce more agreement than “same”-task methods |
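The interpretation bands above translate directly into a small classifier. The label strings are hypothetical shorthand, not part of the script's interface.

```python
def discrimination_label(d):
    """Map a discriminant ratio D to the interpretation bands above."""
    if d > 0.6:
        return "strong"    # circuits are task-specific
    if d >= 0.3:
        return "moderate"  # shared structure, but meaningful differentiation
    return "weak"          # may be capturing generic head properties

print(discrimination_label(0.78))  # "strong"
```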