
This page documents metrics that evaluate whether proposed circuits or model-level mechanisms are faithful, necessary, sufficient, and robust. It covers validation of attribution graphs produced by Circuit Tracing with cross-layer transcoders (CLTs), as well as circuit-level faithfulness tests.


These metrics evaluate the validity of attribution graphs produced by Anthropic’s Circuit Tracing / CLT framework, testing faithfulness, reliability, completeness, and sensitivity to pruning decisions.


FH01 — CLT Attribution Graph Faithfulness

ID. FH01.clt_graph_faithfulness | File. EX29_clt_graph_faithfulness.py

What it computes. Measures how well a pruned attribution graph (keeping only the top-k highest-attribution heads) preserves the model’s behavior, computed as the ratio of pruned-model logit difference to full-model logit difference.

Evidence family. Internal (I2 Compositional Sufficiency), Construct (C2 Structural Plausibility)

Pass threshold. graph_faithfulness > 0.8

Reference. Ameisen, Lindsey et al. (2025), Anthropic.
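The ratio described for FH01 can be sketched in a few lines. This is a minimal illustration, not the contents of EX29_clt_graph_faithfulness.py; the function name, argument names, and the numbers are all hypothetical.

```python
def graph_faithfulness(full_logit_diff: float, pruned_logit_diff: float) -> float:
    """Ratio of the pruned graph's logit difference to the full model's."""
    if full_logit_diff == 0:
        raise ValueError("full-model logit difference is zero; ratio is undefined")
    return pruned_logit_diff / full_logit_diff


# Hypothetical values for illustration only.
full = 4.0    # logit(correct) - logit(incorrect) for the full model
pruned = 3.4  # same quantity when only the top-k heads are kept

score = graph_faithfulness(full, pruned)
passes = score > 0.8  # pass threshold stated for FH01
```

A score near 1.0 means the pruned graph reproduces almost all of the full model's preference for the correct answer. Values above 1.0 or below 0 are possible with this ratio and would need separate interpretation.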


ID. FH02.clt_cross_prompt_consistency | File. EX30_clt_cross_prompt_consistency.py

What it computes. Tests M1 reliability of circuit identification by computing pairwise Jaccard similarity of top-k causally important head sets across semantically equivalent prompts (paraphrases targeting the same answer).

Evidence family. Measurement (M1 Reliability, M2 Invariance)

Pass threshold. cross_prompt_consistency > 0.4

Reference. Ameisen, Lindsey et al. (2025), Anthropic.
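The pairwise Jaccard computation for FH02 might look like the sketch below. Representing heads as (layer, head) tuples and averaging over all prompt pairs are assumptions made for illustration; the source does not state how the pairwise values are aggregated.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cross_prompt_consistency(head_sets: list) -> float:
    """Mean pairwise Jaccard similarity across per-prompt top-k head sets."""
    pairs = list(combinations(head_sets, 2))
    if not pairs:
        raise ValueError("need at least two prompts to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical top-3 head sets for three paraphrases of the same question.
top_heads = [
    {(9, 6), (9, 9), (10, 0)},
    {(9, 6), (9, 9), (10, 7)},
    {(9, 6), (10, 0), (10, 7)},
]

consistency = cross_prompt_consistency(top_heads)
passes = consistency > 0.4  # pass threshold stated for FH02
```

In this example every pair shares two heads out of four distinct ones, so each pairwise similarity is 0.5.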


ID. FH03.clt_error_fraction | File. EX31_clt_error_fraction.py

What it computes. Quantifies the replacement model gap by measuring what fraction of the model’s behavior is NOT captured by individually attributable head contributions, approximating the “error node” concept from the CLT framework.

Evidence family. Measurement (M6 Construct Coverage), Internal (I5 Confound Control)

Pass threshold. error_fraction < 0.2

Reference. Ameisen, Lindsey et al. (2025), Anthropic.
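One way to read the FH03 description is as the share of the total effect left over after summing per-head attributions. The sketch below assumes that reading; the names, the unclamped subtraction, and the numbers are hypothetical.

```python
def error_fraction(total_effect: float, head_contributions: list) -> float:
    """Fraction of the total effect not covered by summed per-head attributions."""
    if total_effect == 0:
        raise ValueError("total effect is zero; fraction is undefined")
    explained = sum(head_contributions)
    # Not clamped: overlapping attributions can push this below zero.
    return 1.0 - explained / total_effect


# Hypothetical values for illustration only.
total = 5.0                  # full-model logit difference on the task
per_head = [2.0, 1.5, 0.75]  # individually attributed head contributions

err = error_fraction(total, per_head)
passes = err < 0.2  # pass threshold stated for FH03
```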


FH04 — CLT Missing Attention Quantification

ID. FH04.clt_missing_attention | File. EX32_clt_missing_attention.py

What it computes. Quantifies the systematic explanatory gap from CLT attribution graphs’ exclusion of attention mechanisms (QK circuits) by measuring the fraction of a task’s causal effect attributable to attention heads versus MLP layers.

Evidence family. Internal (I5 Confound Control), Measurement (M6 Construct Coverage)

Pass threshold. attention_gap_fraction < 0.3

Reference. Ameisen, Lindsey et al. (2025), Anthropic.
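The attention share in FH04 can be illustrated as a simple fraction of causal effect. The sketch assumes that the attention and MLP effects have already been measured separately (for example by ablation) and that magnitudes are compared; both choices, and all names, are hypothetical.

```python
def attention_gap_fraction(attention_effect: float, mlp_effect: float) -> float:
    """Share of the combined causal effect that flows through attention heads."""
    attn = abs(attention_effect)
    mlp = abs(mlp_effect)
    if attn + mlp == 0:
        raise ValueError("both effects are zero; fraction is undefined")
    return attn / (attn + mlp)


# Hypothetical ablation effects on the task's logit difference.
gap = attention_gap_fraction(attention_effect=1.0, mlp_effect=3.0)
passes = gap < 0.3  # pass threshold stated for FH04
```

A low fraction suggests that a graph omitting attention mechanisms still covers most of the task's causal effect.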


ID. FH05.clt_minimality_sensitivity | File. EX33_clt_minimality_sensitivity.py

What it computes. Tests whether the pruned attribution graph’s size is stable across pruning thresholds, detecting fragile minimality where small threshold changes cause large jumps in graph size.

Evidence family. Construct (C4 Minimality), Measurement (M4 Sensitivity)

Pass threshold. minimality_stability > 0.5

Reference. Ameisen, Lindsey et al. (2025), Anthropic.
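The source does not give a formula for minimality_stability. As one possible illustration, the sketch below scores stability as one minus the largest relative jump in graph size between adjacent pruning thresholds. This definition, the function name, and the sizes are assumptions, not the method in EX33_clt_minimality_sensitivity.py.

```python
def minimality_stability(graph_sizes: list) -> float:
    """1 minus the largest relative size jump between adjacent thresholds.

    graph_sizes holds the node count of the pruned graph at each threshold,
    ordered by threshold. Hypothetical definition for illustration only.
    """
    if len(graph_sizes) < 2:
        raise ValueError("need sizes for at least two thresholds")
    jumps = []
    for a, b in zip(graph_sizes, graph_sizes[1:]):
        larger = max(a, b)
        jumps.append(abs(a - b) / larger if larger else 0.0)
    return 1.0 - max(jumps)


# Hypothetical graph sizes at four increasingly strict thresholds.
stable_score = minimality_stability([40, 36, 30, 27])  # gradual shrinkage
fragile_score = minimality_stability([40, 38, 8, 7])   # one abrupt collapse

stable_passes = stable_score > 0.5    # pass threshold stated for FH05
fragile_passes = fragile_score > 0.5
```

Under this definition, the second sequence fails because a single threshold step removes most of the graph.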


These metrics evaluate whether proposed circuits are faithful, actionable, and robust at the circuit level.


ID. FH06.cot_faithfulness | File. 108_cot_faithfulness.py

What it computes. Detects unfaithful Chain-of-Thought reasoning by presenting paired contradictory comparison questions (A > B? and B > A?) and measuring the rate of logical contradictions, revealing post-hoc rationalization.

Evidence family. Internal (I2 Compositional Sufficiency)

Pass threshold. contradiction_rate < 0.05

Reference. Arcuschin et al. (2025), Google DeepMind, ICLR 2025.
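The contradiction rate for FH06 can be sketched as follows, assuming each question pair has already been reduced to two yes/no answers. A pair counts as contradictory when the model gives the same answer to "A > B?" and "B > A?", which presumes A and B are never equal. The names and data are hypothetical.

```python
def contradiction_rate(answer_pairs: list) -> float:
    """Fraction of pairs where 'A > B?' and 'B > A?' received the same answer."""
    if not answer_pairs:
        raise ValueError("no answer pairs supplied")
    contradictions = sum(1 for a_gt_b, b_gt_a in answer_pairs if a_gt_b == b_gt_a)
    return contradictions / len(answer_pairs)


# Hypothetical results: (answer to "A > B?", answer to "B > A?") per pair.
pairs = [(True, False)] * 30 + [(False, True)] * 9 + [(True, True)]

rate = contradiction_rate(pairs)
passes = rate < 0.05  # pass threshold stated for FH06
```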


ID. FH07.modcirc_modularity | File. 122_modcirc.py

What it computes. For each pair of tasks with defined circuits, evaluates whether a circuit discovered on task A remains faithful when evaluated on task B’s prompts, and reports the Jaccard overlap of the two circuits’ head sets.

Evidence family. Construct (C5 Convergent Validity), External (E2 Causal Sufficiency)

Pass threshold. mean_cross_task_faithfulness > 0.4

Reference. He et al. (2025), ICML 2025.


ID. FH08.actionability | File. 128_actionability.py

What it computes. Measures whether circuit-level insights translate into actionable steering interventions, combining concreteness (norm ratio of circuit-derived vs. full-model steering vectors) with validation (fraction of prompts where circuit-derived steering shifts output toward the correct answer).

Evidence family. External (E1 Downstream Utility)

Pass threshold. actionability > 0.1

Reference. Orgad, Barez et al. (2026), ICML 2026.


ID. FH09.surprise_reduction | File. 129_surprise_reduction.py

What it computes. Measures how much knowing the circuit reduces uncertainty about model outputs by comparing output entropy with and without the circuit’s contribution (entropy increase upon circuit ablation divided by ablated entropy).

Evidence family. Measurement (M6 Construct Coverage)

Pass threshold. surprise_reduction > 0.05

Reference. ARC (2026); Hilton et al. (2026), AlignmentForum.
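The FH09 ratio (entropy increase upon ablation divided by ablated entropy) can be written out directly. The sketch assumes the inputs are next-token probability distributions from the intact and circuit-ablated model; the names and distributions are hypothetical.

```python
import math


def entropy(probs: list) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def surprise_reduction(full_probs: list, ablated_probs: list) -> float:
    """(H_ablated - H_full) / H_ablated."""
    h_full = entropy(full_probs)
    h_ablated = entropy(ablated_probs)
    if h_ablated == 0:
        raise ValueError("ablated entropy is zero; ratio is undefined")
    return (h_ablated - h_full) / h_ablated


# Hypothetical output distributions over three candidate tokens.
full_probs = [0.9, 0.05, 0.05]     # intact model: confident
ablated_probs = [0.5, 0.25, 0.25]  # circuit ablated: less certain

reduction = surprise_reduction(full_probs, ablated_probs)
passes = reduction > 0.05  # pass threshold stated for FH09
```

The ratio is positive when ablating the circuit makes the output distribution more uncertain, and negative when ablation makes the model more confident.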


ID. FH10.behavior_capability_gap | File. 130_behavior_capability_gap.py

What it computes. Compares circuit faithfulness on prompts where the model gets the correct answer (capability) against prompts where the model errs (behavior), detecting circuits that explain the model’s successes but not its failures.

Evidence family. External (E3 Generalizability)

Pass threshold. behavior_gap < 0.3

Reference. Steinhardt (2026), AlignmentForum.
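The source does not spell out how behavior_gap is computed. One plausible reading, sketched below, is the absolute difference between mean circuit faithfulness on correctly and incorrectly answered prompts. That definition, the names, and the scores are assumptions for illustration.

```python
from statistics import mean


def behavior_gap(faith_on_correct: list, faith_on_incorrect: list) -> float:
    """Absolute difference in mean faithfulness between the two prompt groups."""
    if not faith_on_correct or not faith_on_incorrect:
        raise ValueError("both prompt groups must be non-empty")
    return abs(mean(faith_on_correct) - mean(faith_on_incorrect))


# Hypothetical per-prompt faithfulness scores.
correct_prompts = [0.9, 0.85, 0.8]   # model answered correctly
incorrect_prompts = [0.7, 0.6, 0.65]  # model answered incorrectly

gap = behavior_gap(correct_prompts, incorrect_prompts)
passes = gap < 0.3  # pass threshold stated for FH10
```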


| ID | Name | File | Evidence Family | Threshold |
| --- | --- | --- | --- | --- |
| FH01 | CLT Graph Faithfulness | EX29_clt_graph_faithfulness.py | Internal | > 0.8 |
| FH02 | CLT Cross-Prompt Consistency | EX30_clt_cross_prompt_consistency.py | Measurement | > 0.4 |
| FH03 | CLT Error Fraction | EX31_clt_error_fraction.py | Measurement | < 0.2 |
| FH04 | CLT Missing Attention | EX32_clt_missing_attention.py | Internal | < 0.3 |
| FH05 | CLT Minimality Sensitivity | EX33_clt_minimality_sensitivity.py | Construct | > 0.5 |
| FH06 | CoT Faithfulness | 108_cot_faithfulness.py | Internal | < 0.05 |
| FH07 | ModCirc Modularity | 122_modcirc.py | Construct | > 0.4 |
| FH08 | Actionability | 128_actionability.py | External | > 0.1 |
| FH09 | Surprise Reduction | 129_surprise_reduction.py | Measurement | > 0.05 |
| FH10 | Behavior vs. Capability Gap | 130_behavior_capability_gap.py | External | < 0.3 |
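The thresholds in the summary table can be applied mechanically to a set of computed scores. The sketch below copies the threshold values and directions from the table; the `THRESHOLDS` mapping, the `evaluate` helper, and the example scores are hypothetical and not part of the documented files.

```python
import operator

# (comparison, threshold) per metric ID, copied from the summary table.
THRESHOLDS = {
    "FH01": (operator.gt, 0.8),
    "FH02": (operator.gt, 0.4),
    "FH03": (operator.lt, 0.2),
    "FH04": (operator.lt, 0.3),
    "FH05": (operator.gt, 0.5),
    "FH06": (operator.lt, 0.05),
    "FH07": (operator.gt, 0.4),
    "FH08": (operator.gt, 0.1),
    "FH09": (operator.gt, 0.05),
    "FH10": (operator.lt, 0.3),
}


def evaluate(scores: dict) -> dict:
    """Map each metric ID in scores to True (pass) or False (fail)."""
    results = {}
    for metric_id, value in scores.items():
        compare, threshold = THRESHOLDS[metric_id]
        results[metric_id] = compare(value, threshold)
    return results


# Hypothetical scores for three of the metrics.
example_scores = {"FH01": 0.85, "FH03": 0.25, "FH06": 0.05}
results = evaluate(example_scores)
```

All comparisons are strict, so a score exactly equal to its threshold (FH06 at 0.05 in this example) counts as a fail.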

For artifact quality metrics (SAE features, transcoders, crosscoders), see MI Artifact Quality Metrics. For safety-relevant construct validation, see MI Safety Metrics. For the overall MI lens framework, see Mechanistic Interpretability.