# F03 — Convergent Validity
This framework asks: Do independent methods that claim to measure the same thing actually agree?
If weight-based circuit importance and activation-based circuit importance both purport to identify which heads matter for a task, their rankings should correlate. High convergent validity means the construct (circuit importance) is real and measurable — not an artifact of one particular method. Low convergence signals that at least one method is measuring something else entirely.
This is the “same trait, different method” cell of the classic multitrait-multimethod (MTMM) matrix. In circuit discovery, the “trait” is circuit membership and the “methods” are weight analysis, activation patching, edge attribution, and knockout experiments.
## Theoretical grounding

| Source | Year | Key contribution |
|---|---|---|
| Campbell & Fiske, “Convergent and discriminant validation by the MTMM matrix” | 1959 | Defined convergent/discriminant validity framework |
| Conley et al., “Replication and robustness in developmental research” | 2018 | Multi-method agreement as evidence for construct validity |
| Goldstein et al., “Are Neural Network Interpretations Robust?” | 2023 | Testing agreement between attribution methods |
| Wang et al., “Interpretability in the Wild” | 2022 | IOI circuit validation with multiple methods |
## Core concept

Given two instruments $A$ and $B$ that each assign importance scores $a_i, b_i$ to circuit heads $i = 1, \ldots, H$, convergent validity is the rank correlation:
$$ r_{\text{conv}} = \text{Spearman}(\mathbf{a}, \mathbf{b}) $$
For binary circuit membership decisions, we use Cohen’s kappa:
$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$
where $p_o$ is observed agreement and $p_e$ is chance agreement. Values above 0.6 indicate substantial convergence; values above 0.8 indicate near-perfect agreement between methods.
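Both statistics can be computed directly from the definitions above. A minimal sketch (pure Python, no external dependencies; the head scores are made-up illustrative values, not output from any real instrument):

```python
def average_ranks(xs):
    """1-based ranks of xs, with tied values receiving the average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def cohen_kappa(x, y):
    """Cohen's kappa for binary circuit-membership decisions (0/1 per head)."""
    n = len(x)
    p_o = sum(1 for u, v in zip(x, y) if u == v) / n       # observed agreement
    p1x, p1y = sum(x) / n, sum(y) / n
    p_e = p1x * p1y + (1 - p1x) * (1 - p1y)                # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical importance scores for four heads under two methods:
r = spearman([0.9, 0.7, 0.3, 0.1], [0.8, 0.6, 0.4, 0.2])   # identical ordering -> 1.0
k = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])                 # disagree on one head
```

A production implementation would typically use `scipy.stats.spearmanr` and `sklearn.metrics.cohen_kappa_score` instead; the sketch exists to make the formulas concrete.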
## Instruments under F03

### Convergent Validity (12_convergent_validity.py)

Computes pairwise Spearman correlations and Cohen’s kappa between all available circuit-identification methods (weight-based ranking, activation patching, edge attribution patching) on the same task. Reports an MTMM-style matrix.
What it establishes: That independent methods recover the same circuit structure — the construct is real, not method-specific. What it does not establish: That the shared construct is faithfulness specifically — convergence could reflect a shared bias.
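The MTMM-style matrix is just all pairwise rank correlations over the same set of heads. A minimal sketch of that loop, using the classic no-ties Spearman formula $r = 1 - 6\sum d_i^2 / (n(n^2-1))$ and hypothetical per-head scores (the method names mirror the CLI flags; the numbers are invented):

```python
def spearman_no_ties(a, b):
    """Spearman correlation via the rank-difference formula (assumes no tied scores)."""
    ra = {i: r for r, i in enumerate(sorted(range(len(a)), key=lambda i: a[i]))}
    rb = {i: r for r, i in enumerate(sorted(range(len(b)), key=lambda i: b[i]))}
    n = len(a)
    d2 = sum((ra[i] - rb[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical importance scores for five heads under three methods:
methods = {
    "weight":     [0.9, 0.7, 0.3, 0.1, 0.5],
    "activation": [0.8, 0.6, 0.2, 0.1, 0.4],
    "eap":        [0.7, 0.9, 0.1, 0.2, 0.3],
}

# Pairwise matrix: diagonal is 1.0; off-diagonal cells are the convergence scores.
names = list(methods)
matrix = {m: {n2: spearman_no_ties(methods[m], methods[n2]) for n2 in names}
          for m in names}
```

Each off-diagonal cell is one "same trait, different method" comparison; a matrix with uniformly high off-diagonal values is the convergent-validity signature.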
Usage:

```sh
uv run python 12_convergent_validity.py --tasks ioi sva --methods weight activation eap
```

## Reading the scores

| Pattern | What it means |
|---|---|
| Spearman > 0.8 | Strong convergence — methods agree on head importance ordering |
| Spearman 0.5–0.8 | Moderate — methods partially agree; investigate divergent heads |
| Spearman < 0.5 | Weak — methods may be measuring different constructs |
| Kappa > 0.6 | Substantial agreement on binary circuit membership |
| Kappa < 0.4 | Poor agreement — circuit boundaries are method-dependent |
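The bands above translate directly into a triage helper. A minimal sketch (function name and the "borderline" label for the 0.4–0.6 kappa gap, which the table leaves unspecified, are my additions):

```python
def interpret(spearman_r, kappa):
    """Map a (Spearman, kappa) pair to the qualitative bands in the table above."""
    if spearman_r > 0.8:
        rank_msg = "strong convergence"
    elif spearman_r >= 0.5:
        rank_msg = "moderate agreement; investigate divergent heads"
    else:
        rank_msg = "weak; methods may be measuring different constructs"

    if kappa > 0.6:
        member_msg = "substantial agreement on circuit membership"
    elif kappa < 0.4:
        member_msg = "poor agreement; circuit boundaries are method-dependent"
    else:
        member_msg = "borderline agreement"  # band not covered by the table
    return rank_msg, member_msg

result = interpret(0.9, 0.7)
```

Note the two scores can disagree: a high Spearman with a low kappa means the methods order heads similarly but draw the in/out-of-circuit threshold differently.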