
These are mechanistic interpretability (MI) metrics that extend the Core Metrics (the A-F evidence-family frameworks). They measure specific properties of circuits, features, and decompositions using MI tools. The metrics are organized by the type of evidence they produce, not by the tool or technique used: a single technique like activation patching appears in the causal category regardless of which framework invokes it.

Category                  Metrics  What it covers                                               Page
Causal (incl. discovery)  30       Ablation, patching, scrubbing, causal discovery, ACDC, EAP   MI Causal
Structural                22       Weight decomposition, graph analysis, motifs, composition    MI Structural
Behavioral                15       Faithfulness variants, generalization, calibration           MI Behavioral
Information-Theoretic     11       Mutual information, PID, transfer entropy, Granger, NOTEARS  MI Information
Representational          9        Probes, RSA, CKA, attention entropy                          MI Representational
Artifact Quality          15       SAE eval, transcoder, crosscoder validation                  MI Artifact Quality
Faithfulness              10       CLT graph fidelity, circuit faithfulness                     MI Faithfulness
Steering                  5        CAA, LEACE, RepE, cross-model transfer                       MI Steering
Safety                    7        Safety subspaces, adversarial ablation, claim reliability    MI Safety
Benchmarks                6        AxBench, SAEBench, CE-Bench, MIB                             MI Benchmarks

In addition, the MI Evaluation Metrics page documents 43 evaluation metrics that cut across these categories, testing circuit faithfulness, feature quality, safety constructs, and decomposition completeness. The MI Methods Index provides technique-based lookup into the same metrics.

The Core Metrics (A01 through F08) document frameworks, each bundling metrics, calibrations, and a theoretical interpretation. Every framework draws on a scientific tradition (Pearl’s SCM, Rubin’s potential outcomes, Granger causality, etc.) and packages a curated set of metrics into a protocol with domain-specific pass/fail criteria.

The extended MI metrics listed here are individual metric implementations. Many of them are used within those frameworks. For example:

  • MET-activation-patching is both a standalone extended metric (C2 on the MI Causal page) and a component of protocols A01 (Pearl SCM), A03 (Rubin CATE), A04 (Woodward), A10 (Regularity/INUS), and A11 (Actual Cause).
  • MET-das-iia is both a standalone metric (C1 on MI Causal) and the central instrument of protocol A02 (Counterfactual DAS/IIA).
  • MET-mutual-information appears on MI Information as C01 and feeds into protocol A08 (PID).

The relationship is one-to-many: a single extended metric can appear in multiple protocols. The protocols add curation (which metrics to run together) and interpretation (what the pattern of results means through a specific theoretical lens). Running the extended metrics individually produces the same raw numbers; the protocols add structure and context.
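One way to picture this one-to-many relationship is a small registry keyed by metric ID, inverted to give each protocol's view. The sketch below is hypothetical: the dictionary layout and the `invert` helper are illustrative assumptions, not the framework's actual data model. It contains only the three examples listed on this page, so the inverted view is partial; each real protocol bundles more metrics.

```python
from collections import defaultdict

# Hypothetical registry: extended metric ID -> protocols that use it.
# Entries mirror the examples on this page only; the real framework may
# store this relationship differently and contains many more metrics.
METRIC_TO_PROTOCOLS: dict[str, list[str]] = {
    "MET-activation-patching": ["A01", "A03", "A04", "A10", "A11"],
    "MET-das-iia": ["A02"],
    "MET-mutual-information": ["A08"],
}


def invert(mapping: dict[str, list[str]]) -> dict[str, list[str]]:
    """Build a protocol -> metrics view from a metric -> protocols registry."""
    protocol_to_metrics: defaultdict[str, list[str]] = defaultdict(list)
    for metric, protocols in mapping.items():
        for protocol in protocols:
            protocol_to_metrics[protocol].append(metric)
    return dict(protocol_to_metrics)


PROTOCOL_TO_METRICS = invert(METRIC_TO_PROTOCOLS)

if __name__ == "__main__":
    # One metric feeds many protocols.
    print(len(METRIC_TO_PROTOCOLS["MET-activation-patching"]))  # 5
    # A protocol's (partial) bundle, seen from the protocol side.
    print(PROTOCOL_TO_METRICS["A02"])  # ['MET-das-iia']
```

Keeping the metric as the primary key matches the point above: a metric produces the same raw numbers wherever it runs, and a protocol layers curation and interpretation on top of a selection of metrics.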

Related pages:

  • Methods Index — technique-based lookup. If you know the method (ACDC, EAP, DAS, CAA) and want to find which metrics use it, start there.
  • Calibrations — quality gates. Before trusting any extended metric’s output, check which calibrations apply (bootstrap stability, convergent validity, measurement invariance, etc.).
  • Protocols — curated bundles. If you want structured depth on a specific validity question rather than individual metric scores, find the relevant protocol.
  • Naming Convention — how entity IDs (CRIT, MET, CAL, PROT, SYN) prevent namespace collisions across the framework.
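The Calibrations entry names bootstrap stability as one quality gate. As a rough illustration of the idea, not the framework's own calibration definition, the sketch below computes a percentile bootstrap confidence interval over per-example metric scores and passes the metric only if that interval is narrow. The function names, the 0.1 width threshold, and the default resample count are hypothetical choices.

```python
import random
import statistics
from typing import Callable, Sequence


def bootstrap_ci(
    samples: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = statistics.mean,
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for `statistic` over `samples`."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)  # fixed seed so the gate is reproducible
    n = len(samples)
    # Resample with replacement, recompute the statistic, and sort the estimates.
    estimates = sorted(
        statistic([samples[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    low_idx = int((alpha / 2) * n_resamples)
    high_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[low_idx], estimates[high_idx]


def passes_stability_gate(
    samples: Sequence[float], max_ci_width: float = 0.1, **kwargs
) -> bool:
    """Hypothetical gate: trust a metric only if its bootstrap CI is narrow."""
    low, high = bootstrap_ci(samples, **kwargs)
    return (high - low) <= max_ci_width


if __name__ == "__main__":
    print(passes_stability_gate([0.5] * 50))        # True: no spread at all
    print(passes_stability_gate([0.0, 1.0] * 5))   # False: scores swing widely
```

A width threshold is only one possible stability criterion; a real calibration might instead compare rankings or sign consistency across resamples.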