This framework asks: When two independent methods identify circuit boundaries, how often do they agree on which edges belong?

In measurement theory, inter-rater reliability quantifies whether different raters assign the same scores to the same subjects. In circuit discovery, the “raters” are different algorithms — weight-based identification, EAP (Edge Attribution Patching), activation patching, ACDC — and the “subjects” are model edges. High inter-rater agreement means the circuit boundary is objective, not method-dependent.

This differs from convergent validity (F03) in granularity: F03 correlates continuous importance rankings, while F06 measures agreement on the binary decision “is this edge in the circuit or not?” using set-overlap and chance-corrected agreement coefficients.

| Source | Year | Key contribution |
| --- | --- | --- |
| Cohen, “A coefficient of agreement for nominal scales” | 1960 | Cohen’s kappa for two raters |
| Shrout & Fleiss, “Intraclass correlations: uses in assessing rater reliability” | 1979 | ICC framework for multiple raters |
| Jaccard, “The distribution of the flora in the alpine zone” | 1912 | Jaccard index for set similarity |
| Conmy et al., “Towards Automated Circuit Discovery” | 2023 | ACDC as an independent circuit-discovery rater |
| Syed et al., “Attribution Patching Outperforms Automated Circuit Discovery” | 2023 | EAP as an alternative rater for circuit edges |

Given two methods that each produce a circuit edge set \( C_A, C_B \subseteq E \), the Jaccard index is:

\[ J(C_A, C_B) = \frac{|C_A \cap C_B|}{|C_A \cup C_B|} \]

Cohen’s kappa treats “in-circuit vs. not” as a binary classification over all possible edges:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \( p_o = \frac{|C_A \cap C_B| + |\overline{C_A} \cap \overline{C_B}|}{|E|} \) is the observed agreement and \( p_e \) is the agreement expected by chance given each method’s circuit density.
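For concreteness, here is a minimal sketch of both coefficients over plain Python sets of edge identifiers. The names `jaccard`, `cohens_kappa`, `edges_a`, `edges_b`, and `all_edges` are illustrative placeholders, not names from the repository.

```python
# Illustrative sketch, not the repository's implementation.
# Edges can be any hashable identifiers, e.g. ("a2.h3", "mlp5") tuples.

def jaccard(edges_a: set, edges_b: set) -> float:
    """Set overlap of the two circuits: |A ∩ B| / |A ∪ B|."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

def cohens_kappa(edges_a: set, edges_b: set, all_edges: set) -> float:
    """Chance-corrected agreement on the binary in-circuit decision."""
    n = len(all_edges)
    both_in = len(edges_a & edges_b)
    both_out = len(all_edges - edges_a - edges_b)
    p_o = (both_in + both_out) / n                 # observed agreement
    p_a, p_b = len(edges_a) / n, len(edges_b) / n  # each method's circuit density
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)        # agreement expected by chance
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```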

Edge Jaccard Agreement (27_edge_jaccard.py)


Computes the Jaccard index and Cohen’s kappa between the weight-circuit edge set and the EAP-derived edge set at matched sparsity levels. Reports agreement at multiple threshold points to show how inter-rater reliability varies with circuit density.

What it establishes: That the circuit boundary is reproducible across methods — the identified edges are not artifacts of one algorithm. What it does not establish: That the agreed-upon edges are causally important — agreement could reflect shared biases (e.g., both methods favor high-norm edges).
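A hedged sketch of how such a comparison might be driven at matched sparsity levels, reusing the `jaccard` and `cohens_kappa` helpers above. Here `weight_scores` and `eap_scores` are hypothetical dicts mapping each edge to a continuous importance score; the 10%, 20%, and 30% levels mirror the usage example below.

```python
# Hypothetical driver: threshold each method's continuous importance scores
# to the same sparsity, then score agreement on the resulting edge sets.
# Reuses jaccard() and cohens_kappa() from the sketch above.

def top_fraction(scores: dict, sparsity: float) -> set:
    """Keep the highest-scoring `sparsity` fraction of edges."""
    k = max(1, int(round(sparsity * len(scores))))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def agreement_by_sparsity(weight_scores: dict, eap_scores: dict,
                          sparsities=(0.1, 0.2, 0.3)) -> dict:
    # Compare only edges scored by both methods.
    all_edges = set(weight_scores) & set(eap_scores)
    w = {e: weight_scores[e] for e in all_edges}
    g = {e: eap_scores[e] for e in all_edges}
    out = {}
    for s in sparsities:
        c_a, c_b = top_fraction(w, s), top_fraction(g, s)
        out[s] = {"jaccard": jaccard(c_a, c_b),
                  "kappa": cohens_kappa(c_a, c_b, all_edges)}
    return out
```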

Usage:

```
uv run python 27_edge_jaccard.py --tasks ioi sva --methods weight eap --sparsity 0.1 0.2 0.3
```

| Pattern | What it means |
| --- | --- |
| Kappa > 0.7 | Strong agreement — circuit boundary is method-independent |
| Kappa 0.4–0.7 | Moderate — core edges agree, periphery diverges |
| Kappa < 0.4 | Weak — methods identify substantially different circuits |
| Jaccard decreasing with sparsity | Methods agree on top edges but diverge on marginal ones |