
Measurement Theory — Metrics & Protocols


This page documents the extended metrics under the Measurement Theory lens. These metrics go beyond the original F01-F08 suite (documented at their existing pages) to cover SAE-specific validity diagnostics, psychometric extensions from cognitive science, safety representation analysis, and benchmark meta-diagnostics.

All metrics on this page follow the same principle as the core measurement theory lens: is the metric that produced the number trustworthy? Some operate at the decomposition level (is the SAE itself a valid instrument?), some at the evaluation level (are our benchmarks reliable?), and some import constructs from psychophysics and psychometrics to formalize properties that MI evaluates informally.


These metrics test whether the decomposition and the evaluation pipeline produce stable, reproducible results.

Source: Martin-Linares & Ling (2025). arXiv:2512.24975.

Criteria: M1 Reliability

What it establishes: Which SAE features are reliably recovered across iterative distillation cycles. Only a small fraction of features form a stable “core”: the paper found that 197 of the 65,000 features in a 65k SAE are stable. This provides a reliability diagnostic for the decomposition itself: features outside the core may be fitting noise or arbitrary local optima rather than stable structure.

What it does not establish: Whether the stable core features are interpretable, causally important, or correspond to “true” features. Core membership is a necessary condition for reliability but not sufficient for validity.

Method:

  1. Train a small SAE on activations at a hook point.
  2. Identify high gradient-times-activation features, mark as “core” (top fraction by importance).
  3. Reinitialize non-core features, retrain.
  4. After $n$ cycles, record which features converged into the stable core.
  5. Core membership rate = reliability score.

Key quantities:

  • core_fraction — fraction of features in the stable core after all cycles
  • mean_overlap — mean Jaccard overlap of core sets between consecutive cycles
  • is_stable — whether mean overlap exceeds the stability threshold (default 0.8)
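The three quantities above can be sketched from the core sets recorded at each cycle. This is a minimal illustration, not the script's implementation: the function names, the choice to take core_fraction from the final cycle, and the sample data are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def core_stability(core_sets: list, n_features: int, threshold: float = 0.8) -> dict:
    """Summarize core sets recorded after each distillation cycle (needs >= 2 cycles)."""
    overlaps = [jaccard(x, y) for x, y in zip(core_sets, core_sets[1:])]
    mean_overlap = sum(overlaps) / len(overlaps)
    return {
        # Assumption: the "stable core" is the core set after the last cycle.
        "core_fraction": len(core_sets[-1]) / n_features,
        "mean_overlap": mean_overlap,
        "is_stable": mean_overlap > threshold,
    }


# Hypothetical core sets (feature indices) from four cycles of a 100-feature SAE.
cycles = [{0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 2, 4}, {0, 1, 2, 4}]
stability = core_stability(cycles, n_features=100)
```

Here the first transition swaps one feature (overlap 0.6) and the later ones are unchanged, so the mean overlap lands above the 0.8 default.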

Pass condition: Report-only (diagnostic). Any nonzero core fraction is informative.

Usage:

uv run python 115_core_stability.py --model gpt2 --device cpu
uv run python 115_core_stability.py --hook blocks.6.hook_resid_pre --n-cycles 5

Reading the scores:

| Pattern | What it means |
| --- | --- |
| core_fraction > 0.05 | Reasonable core: most features are unstable but a meaningful subset persists |
| core_fraction < 0.01 | Very small core: the decomposition is highly sensitive to initialization |
| mean_overlap > 0.8 | Core membership converges: distillation is reaching a fixed point |
| mean_overlap < 0.5 | Core membership drifts: even “important” features change across cycles |

Source: Bai, Baumgartner, Sun, Holtzman, Tan (2026). “The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research.” arXiv:2602.18458.

Criteria: M1 Reliability (test-retest)

What it establishes: Whether a metric computation pipeline produces reproducible results across runs with different random seeds. Inspired by MechEvalAgent’s finding that 93% of MI research outputs fail reproducibility when code is actually executed.

What it does not establish: Whether the metric measures the right thing — only that it measures the same thing each time. A perfectly reproducible metric can still be invalid if it measures a confound.

Method:

  1. Select a base metric (logit-diff, probe accuracy, or ablation recovery).
  2. Run it $N$ times on the same model and prompts with different random seeds (controlling subsample selection and ordering).
  3. Compute:
    • deviation_rate — fraction of run pairs differing by more than the max-deviation threshold
    • max_deviation — largest relative deviation from the mean across runs
    • coherence_score — mean pairwise Spearman rank correlation between per-prompt rankings across runs
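A rough sketch of the three quantities, using only the standard library. Several details are assumptions because the page does not pin them down: pairwise differences are taken between run means and normalized by the grand mean, and the Spearman formula used here is the untied-data shortcut.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Rank positions (0 = smallest); assumes no ties to keep the sketch short."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rho for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def reproducibility(runs, threshold=0.08):
    """runs[i][j] is the metric value for prompt j in run i."""
    run_means = [mean(r) for r in runs]
    grand = mean(run_means)
    pairs = list(combinations(range(len(runs)), 2))
    deviating = sum(
        abs(run_means[i] - run_means[j]) / abs(grand) > threshold for i, j in pairs
    )
    return {
        "deviation_rate": deviating / len(pairs),
        "max_deviation": max(abs(m - grand) / abs(grand) for m in run_means),
        "coherence_score": mean(spearman(runs[i], runs[j]) for i, j in pairs),
    }


# Hypothetical per-prompt logit-diffs from three seeded runs over four prompts.
runs = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.1], [1.0, 2.2, 3.1, 3.9]]
repro = reproducibility(runs)
```

In this toy data the per-prompt ordering is identical in every run, so the coherence score is 1.0 even though the absolute values move slightly.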

Pass condition:

  • deviation_rate < 0.05
  • max_deviation < 0.08
  • coherence_score > 0.9

Usage:

uv run python 132_reproducibility.py --model gpt2 --device cpu
uv run python 132_reproducibility.py --n-runs 10 --n-prompts 50

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Low deviation, high coherence | Pipeline is reproducible; results can be trusted |
| High deviation, high coherence | Rankings are stable but absolute values shift; report rankings, not point estimates |
| High deviation, low coherence | Pipeline is unreliable; results should not be interpreted |

Source: Anonymous (2026). “Are Sparse Autoencoder Benchmarks Reliable?” arXiv:2605.18229.

Criteria: M1 Reliability, M2 Measurement Invariance

What it establishes: Whether evaluation metrics used for SAE comparison are themselves reliable (low reseed noise) and discriminative (can distinguish meaningfully different SAEs). The SAEBench audit independently found that TPP and SCR fail comprehensively (CV of 16-39%), while sae-probes is the most reliable.

What it does not establish: Whether any particular SAE is good — only whether the metrics used to evaluate SAEs produce stable, discriminating numbers.

Method:

  1. Select an evaluation metric (e.g., probe accuracy, logit-diff recovery).
  2. Run the metric $N$ times on the same model and prompts with different random seeds.
  3. Compute coefficient of variation: $\text{CV} = \sigma / |\mu|$.
  4. Compute discriminability: run metric on two configurations differing by a known quality dimension, compute Cohen’s $d$.
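The reseed-noise half of this check reduces to a one-line computation. The sketch below assumes the sample standard deviation (the page does not say whether sample or population is used), and the scores are made-up values.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = sigma / |mu|, using the sample standard deviation (an assumption)."""
    return stdev(values) / abs(mean(values))


# Hypothetical probe-accuracy scores from five reseeded runs.
scores = [0.80, 0.82, 0.78, 0.81, 0.79]
cv = coefficient_of_variation(scores)
passes_cv = cv < 0.05  # pass threshold taken from this page
```

A CV of roughly 2% here means reseed noise is small relative to the mean, so differences between SAEs larger than that are less likely to be measurement noise.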

Pass condition:

  • CV < 0.05
  • Discriminability $d$ > 0.8

Usage:

uv run python 131_saebench_audit.py --model gpt2 --device cpu
uv run python 131_saebench_audit.py --n-reseeds 10 --n-prompts 50

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Low CV, high discriminability | Metric is both stable and sensitive; suitable for SAE comparison |
| Low CV, low discriminability | Metric is stable but cannot distinguish quality differences; not useful for comparison |
| High CV, any discriminability | Metric is noisy; differences between SAEs may reflect measurement noise |

These metrics test whether the SAE decomposition itself is a valid measurement instrument, independent of any downstream circuit claim.

Source: Lindsey et al. (2025). NeurIPS 2025.

Criteria: M2 Hyperparameter Sensitivity, M6 Artifact Quality

What it establishes: Whether two different SAE architectures trained on the same model and hook point agree on what features exist. This is a construct validity test for the decomposition method itself: if TopK-SAE and JumpReLU-SAE discover completely different features, the “features” are partly determined by the architecture rather than being properties of the model.

What it does not establish: Which architecture’s features are “correct” — the metric measures agreement, not accuracy. High agreement is necessary for construct validity but two architectures could agree on an artifact.

Method:

  1. Collect activations at a shared hook point from the model.
  2. Encode activations through both artifact adapters.
  3. Compute feature_overlap: Jaccard similarity of active feature sets at a threshold.
  4. Compute direction_agreement: mean max cosine similarity between encoder directions of the two artifacts (symmetric: A-to-B and B-to-A averaged).
  5. architecture_agreement = mean(feature_overlap, direction_agreement).
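Steps 4 and 5 can be illustrated with plain Python lists standing in for encoder directions. This is a sketch under assumptions: the helper names are hypothetical, the directions are toy 2-D vectors, and feature_overlap is passed in as a given number because the page does not specify how active feature sets are matched across the two SAEs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mean_max_cosine(src, dst):
    """For each direction in src, take its best match in dst, then average."""
    return sum(max(cosine(u, v) for v in dst) for u in src) / len(src)


def direction_agreement(dirs_a, dirs_b):
    """Symmetric version: average of the A-to-B and B-to-A scores."""
    return 0.5 * (mean_max_cosine(dirs_a, dirs_b) + mean_max_cosine(dirs_b, dirs_a))


# Toy encoder directions for two hypothetical SAEs.
dirs_a = [[1.0, 0.0], [0.0, 1.0]]
dirs_b = [[1.0, 0.0], [1.0, 1.0]]
direction = direction_agreement(dirs_a, dirs_b)

feature_overlap = 0.4  # hypothetical Jaccard value, supplied rather than computed
architecture_agreement = (feature_overlap + direction) / 2
```

Averaging both directions matters because a small dictionary can be fully covered by a large one without the reverse being true.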

Pass condition: architecture_agreement > 0.3

Usage:

uv run python 110_architecture_duality.py \
--artifact-a-path <release_a> --artifact-b-path <release_b>
uv run python 110_architecture_duality.py --device cpu

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Agreement > 0.5 | Architectures substantially agree: features reflect model structure more than architecture choice |
| Agreement 0.3 to 0.5 | Partial agreement: some features are robust but many are architecture-dependent |
| Agreement < 0.3 | Low agreement: the decomposition is largely determined by architecture, not model structure |

Source: Golimblevskaia, Jain, Puri, Ibrahim, Samek, Lapuschkin (2026). ICLR 2026. arXiv:2510.14936.

Criteria: C5 Convergent Validity

What it establishes: Whether weight-based and activation-based feature descriptions agree. A feature’s structural identity (what it promotes in logit space via W_dec @ W_U) should match its functional identity (what inputs it fires on). Divergence means the feature’s “meaning” depends on whether you look at its weights or its activations — a construct validity failure.

What it does not establish: Whether either description is “correct” in isolation. The metric tests convergence between two independent characterization methods, not ground truth.

Method:

  1. Compute weight-based descriptions: for each feature, project its decoder direction through the model’s unembedding (W_dec @ W_U) to get top-$k$ promoted tokens.
  2. Compute activation-based descriptions: run prompts through the model, encode at the hook point, and for each feature track which tokens produce the highest activations.
  3. Measure agreement: Jaccard overlap of the two top-$k$ token sets, averaged over features.

Pass condition: weight_activation_agreement > 0.3

Usage:

uv run python 114_weightlens.py --artifact-path <release> --sae-id <id>
uv run python 114_weightlens.py --device cpu --top-k 50

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Agreement > 0.5 | Strong weight-activation convergence: feature identity is robust to description method |
| Agreement 0.3 to 0.5 | Moderate convergence: some features have consistent identity, others diverge |
| Agreement < 0.3 | Low convergence: weight-based and activation-based descriptions measure different constructs |
| High frac_above_threshold | Most active features individually converge, even if the mean is pulled down by dead features |

Source: Kopf, Feldhus, Bykov, Bommer, Hedstrom, Hohne, Eberle (2025). NeurIPS 2025. arXiv:2506.15538.

Criteria: M6 Artifact Quality, E1 Predictive Validity

What it establishes: What fraction of SAE features are polysemantic — activating on multiple semantically distinct clusters of contexts. Standard autointerp pipelines are architecturally incapable of reliably describing polysemantic features (they assign a single label), so the polysemanticity rate directly bounds the fraction of features whose automated descriptions can be trusted.

What it does not establish: Whether polysemantic features are “bad” — some may represent genuine multifaceted concepts. The metric quantifies polysemanticity, not whether it is a problem.

Method:

  1. Collect feature activations across prompts via the artifact adapter.
  2. For each sampled feature, find the top-activating contexts.
  3. Embed those contexts using the model’s residual stream (mean-pooled token embeddings).
  4. Compute pairwise cosine similarity among context embeddings.
  5. Apply agglomerative clustering with a cosine distance threshold (default 0.5).
  6. A feature is polysemantic if it has > 1 cluster.
  7. polysemanticity_rate = fraction of sampled alive features that are polysemantic.
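Steps 4 to 7 can be sketched as follows. The page does not state which linkage the agglomerative clustering uses; this illustration assumes single linkage, which is equivalent to taking connected components of the graph that links any two contexts within the distance threshold. The embeddings are toy vectors and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def count_clusters(embeddings, threshold=0.5):
    """Single-linkage clusters via union-find (linkage choice is an assumption)."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_distance(embeddings[i], embeddings[j]) <= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(embeddings))})


def polysemanticity_rate(contexts_by_feature, threshold=0.5):
    """Fraction of features whose top contexts fall into more than one cluster."""
    flags = [count_clusters(c, threshold) > 1 for c in contexts_by_feature.values()]
    return sum(flags) / len(flags)


# Toy context embeddings for two hypothetical alive features.
contexts = {
    "f0": [[1.0, 0.0], [0.9, 0.1]],  # nearly parallel contexts: one cluster
    "f1": [[1.0, 0.0], [0.0, 1.0]],  # orthogonal contexts: two clusters
}
rate = polysemanticity_rate(contexts)
```

A different linkage (average or complete) would generally report more clusters at the same threshold, so the rate is only comparable across runs that share the clustering settings.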

Pass condition: Report-only (diagnostic). polysemanticity_rate >= 0 trivially passes.

Usage:

uv run python 117_prism.py --artifact-path <release> --sae-id <id>
uv run python 117_prism.py --device cpu --n-features 100 --cluster-threshold 0.5

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Rate < 0.1 | Most features are monosemantic: autointerp descriptions likely reliable |
| Rate 0.1 to 0.4 | Moderate polysemanticity: autointerp descriptions should be cross-checked |
| Rate > 0.4 | High polysemanticity: single-label descriptions unreliable for most features |
| Many dead features | The SAE has unused capacity; polysemanticity rate computed only over alive features |

M11 — Matryoshka Cross-Scale Consistency


Source: arXiv:2503.17547 (NeurIPS 2025).

Criteria: M1 Reliability, M2 Hyperparameter Sensitivity

What it establishes: Whether features at SAE dictionary width $k$ correspond to coherent feature clusters at width $2k$. This is a measurement consistency check across SAE scales: a feature that exists at width 16k should either remain as-is or cleanly split into semantically related sub-features at width 32k. Incoherent splitting or many-to-one absorption are reliability failures.

What it does not establish: The “correct” dictionary width. The metric tests consistency between scales, not which scale is optimal.

Method:

  1. Collect activations at a shared hook point from the model.
  2. Encode through both artifact adapters (small and large dictionary).
  3. Compute per-feature correspondence via activation correlation.
  4. splitting_rate: fraction of small features whose top-$k$ correlated large features have low pairwise cosine similarity (incoherent cluster).
  5. absorption_rate: fraction of large features that are the top match for multiple small features (many-to-one collapse).
  6. cross_scale_consistency = $1 - (\text{splitting\_rate} + \text{absorption\_rate}) / 2$
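Steps 5 and 6 amount to a count followed by simple arithmetic. In this sketch the denominator of absorption_rate is assumed to be all large-dictionary features, the splitting rate is supplied as a given number, and the data is invented.

```python
from collections import Counter


def absorption_rate(top_match, n_large):
    """top_match[i] is the index of the large feature best matching small feature i."""
    counts = Counter(top_match)
    absorbed = sum(c > 1 for c in counts.values())
    return absorbed / n_large  # assumption: normalize by all large features


def cross_scale_consistency(splitting_rate, absorption):
    return 1 - (splitting_rate + absorption) / 2


# Four hypothetical small features; two of them share large feature 0 as top match.
absorption = absorption_rate([0, 0, 1, 2], n_large=8)
consistency = cross_scale_consistency(splitting_rate=0.25, absorption=absorption)
```

With one absorbing feature out of eight and a splitting rate of 0.25, the score is 0.8125, which clears the 0.7 pass threshold.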

Pass condition: cross_scale_consistency > 0.7

Usage:

uv run python 118_matryoshka.py \
--artifact-small-path <release_small> --artifact-large-path <release_large>
uv run python 118_matryoshka.py --device cpu --top-k 20

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Consistency > 0.7 | Features are stable across scales: dictionary width is not distorting the decomposition |
| High splitting, low absorption | Small features break into unrelated pieces at larger width: small dictionary over-compresses |
| Low splitting, high absorption | Large dictionary collapses distinct small features: large dictionary under-differentiates |
| Both rates high | Decomposition is fundamentally unstable across scales |

Source: Convergent evidence from three papers: Bussmann, Leask, Nanda (NeurIPS 2024, BatchTopK); Yao & Du (arXiv:2508.17320, AdaptiveK); SoftSAE (arXiv:2605.06610).

Criteria: E1 Content Validity, M6 Artifact Quality

What it establishes: Whether fixed-$k$ SAE sparsity matches input complexity. Fixed-$k$ architectures activate exactly $k$ features per input regardless of the input’s actual complexity. For simple inputs, this means spurious features are activated to fill the quota; for complex inputs, real concepts are truncated. Three independent papers converge on this as a systematic content validity failure.

What it does not establish: Whether adaptive-$k$ architectures solve the problem; it shows only that fixed-$k$ exhibits systematic mismatch. The metric diagnoses the problem without prescribing a solution.

Method:

  1. Collect activations at the hook point from the model.
  2. Encode through the artifact adapter to get active feature counts per position.
  3. Estimate input complexity via residual stream embedding norm (L2 norm as proxy for information content).
  4. Fit a linear relationship: $\text{expected\_k} \sim \text{complexity}$.
  5. Flag examples where $|\text{k\_active} - \text{k\_expected}| / \text{k\_expected} > \text{threshold}$.
  6. k_mismatch_rate = fraction of flagged examples.
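Steps 4 to 6 can be sketched with an ordinary least-squares line. The fitting method, the function name, and the data are assumptions; the sketch also assumes the fitted k_expected stays positive so that the relative deviation is well defined.

```python
from statistics import mean


def k_mismatch_rate(complexity, k_active, threshold):
    """Fit k ~ complexity by least squares, then flag large relative deviations."""
    cx, ck = mean(complexity), mean(k_active)
    sxy = sum((x - cx) * (k - ck) for x, k in zip(complexity, k_active))
    sxx = sum((x - cx) ** 2 for x in complexity)
    slope = sxy / sxx
    intercept = ck - slope * cx

    flagged = 0
    for x, k in zip(complexity, k_active):
        k_expected = slope * x + intercept  # assumed > 0 for this sketch
        if abs(k - k_expected) / k_expected > threshold:
            flagged += 1
    return flagged / len(k_active)


# Hypothetical residual norms and active-feature counts for five positions.
complexity = [1, 2, 3, 4, 5]
k_active = [10, 20, 30, 40, 90]
mismatch = k_mismatch_rate(complexity, k_active, threshold=0.5)
```

The fitted line here is k = 18 * complexity - 16, so only the first position (10 active against an expected 2) exceeds a 50% relative deviation. Note that the outlier at the high end pulls the line and shifts the flag onto a different point, which is one reason to read this rate alongside complexity_k_correlation.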

Key quantities:

  • k_mismatch_rate — fraction of inputs where active count deviates from expected
  • complexity_k_correlation — Pearson correlation between input complexity and active feature count (high = SAE adapts naturally; low = fixed behavior)
  • over_activation_rate — fraction with spurious features
  • under_activation_rate — fraction with truncated concepts

Pass condition: k_mismatch_rate < 0.2

Usage:

uv run python 120_adaptive_sparsity.py --artifact-path <release> --sae-id <id>
uv run python 120_adaptive_sparsity.py --device cpu --mismatch-threshold 2.0

Source: Liu, Liu, Gore (2025). “Superposition Yields Robust Neural Scaling.” NeurIPS 2025 Oral, Best Paper Runner-Up. arXiv:2505.10465.

Criteria: M6 Construct Coverage

What it establishes: Whether a model layer operates in the weak or strong superposition regime. In the strong regime (packing ratio >> 1), models pack more features than dimensions with irreducible interference, meaning no decomposition — SAE or otherwise — can recover unique “true features.” They are one of many valid decompositions. In the weak regime, feature recovery is feasible.

What it does not establish: Whether any particular SAE’s features are valid — only the theoretical upper bound on what recovery is possible. A model in strong superposition may still have useful (but non-unique) decompositions.

Method:

  1. Run model on diverse text, capturing residual stream at each layer.
  2. Compute effective rank via participation ratio of singular values: $\text{PR} = \frac{(\sum s_i^2)^2}{\sum s_i^4}$
  3. Packing ratio = effective_rank / $d_{\text{model}}$. Values >> 1 indicate strong superposition.
  4. Interference score = mean absolute pairwise cosine similarity between top-$k$ principal components.
  5. Classify regime per layer: weak (packing $\leq 0.8$, low interference), transition ($0.8 <$ packing $\leq 1.2$), strong (packing $> 1.2$ or high interference).
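The participation ratio in step 2 is easy to check by hand on small inputs. The sketch below takes singular values as a plain list; the values and the d_model figure are hypothetical.

```python
def participation_ratio(singular_values):
    """PR = (sum s_i^2)^2 / sum s_i^4."""
    squared = [s * s for s in singular_values]
    return sum(squared) ** 2 / sum(v * v for v in squared)


# Equal singular values spread variance evenly: PR equals their count.
pr_flat = participation_ratio([1.0, 1.0, 1.0, 1.0])

# One dominant direction: PR collapses to 1.
pr_peaked = participation_ratio([2.0, 0.0, 0.0, 0.0])

# Hypothetical layer with d_model = 4 and an uneven spectrum.
d_model = 4
packing_ratio = participation_ratio([3.0, 1.0, 1.0, 0.0]) / d_model
```

PR ranges from 1 (a single dominant direction) up to the number of singular values (all equal), which gives a quick sanity check on any reported effective rank.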

Pass condition: Diagnostic (no pass/fail). Reports regime classification and quantitative indicators per layer plus aggregate.

Usage:

uv run python 126_superposition_regime.py --model gpt2 --device cpu
uv run python 126_superposition_regime.py --n-samples 200

Reading the scores:

| Pattern | What it means |
| --- | --- |
| All layers weak | Feature recovery is feasible: SAE decomposition can in principle find unique features |
| Mixed weak/strong | Early layers typically weak, later layers stronger: validity claims should be qualified by layer |
| All layers strong | Model packs more features than dimensions: any decomposition is one of many valid ones; claims about “the true features” are not licensed |

Source: Derived from Anthropic (2026). “Natural Language Autoencoders.” transformer-circuits.pub/2026/nla/, cross-referenced with SAE-based feature descriptions via SAELens.

Criteria: C5 Convergent Validity

What it establishes: Whether two independent feature description methods — NLA-style (activation-based pattern reconstruction via PCA) and SAE-style (weight-based decoder direction projected through unembedding) — converge on the same feature characterization. This is a multitrait-multimethod (MTMM) test: agreement between two independent methods is C5 Convergent Validity evidence; divergence indicates the feature’s meaning is method-dependent.

What it does not establish: Which method is “correct.” Like all convergent validity tests, it measures inter-method agreement, not ground truth.

Method:

  1. For each feature direction at a hook point, compute two independent characterizations:
    • Activation-based (NLA proxy): identify top-$k$ activating tokens, compute PCA direction from their activation patterns.
    • Weight-based (SAE proxy): project the feature direction through the unembedding matrix to get top promoted/suppressed tokens.
  2. Compute agreement:
    • token_overlap: Jaccard similarity of top promoted tokens.
    • direction_cosine: cosine similarity between the PCA-reconstructed direction and the original feature direction.

Pass condition: mean_token_overlap > 0.3; mean_direction_cosine > 0.5

Usage:

uv run python 133_nla_sae_convergence.py --model gpt2 --device cpu
uv run python 133_nla_sae_convergence.py --n-features 30 --top-k 20

These metrics import established constructs from psychophysics and psychometrics to formalize properties that MI evaluates informally.

Source: Green & Swets (1966), “Signal Detection Theory and Psychophysics”; Macmillan & Creelman (2005), “Detection Theory: A User’s Guide.”

Criteria: Signal Detection, Causal

What it establishes: Separates a circuit’s sensitivity ($d'$) from its criterion ($\beta$). Standard circuit evaluations (ablation accuracy, logit-diff) conflate these: a circuit might have high sensitivity but conservative criterion (it CAN detect the pattern but only fires when very confident), or vice versa. $d'$ isolates pure discriminability.

What it does not establish: Whether the circuit is the unique mechanism for the task. High $d'$ means the circuit discriminates signal from noise; it does not mean other components cannot also discriminate.

Method:

  1. Run model on task prompts with full circuit: count hits (correct predictions where logit_diff > 0).
  2. Mean-ablate all circuit heads: count “false alarms” (still correct despite circuit removal).
  3. Compute:

$$d' = Z(\text{hit\_rate}) - Z(\text{false\_alarm\_rate})$$

where $Z$ is the inverse normal CDF.

  4. Compute criterion: $\beta = -0.5 \times (Z(\text{hit\_rate}) + Z(\text{false\_alarm\_rate}))$
  5. Compute AUC from an ROC curve by sweeping the logit-diff threshold.
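The sensitivity and criterion formulas can be evaluated with the standard library's inverse normal CDF. This is a sketch, not the script's code: the clipping of rates away from 0 and 1 (a common 1/(2N) correction) is an assumption, as are the counts. The quantity this page calls $\beta$ is often written $c$ in the signal detection literature.

```python
from statistics import NormalDist


def dprime_and_criterion(hits, n_signal, false_alarms, n_noise):
    """Return (d', criterion) from hit and false-alarm counts."""
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal

    # inv_cdf is undefined at exactly 0 or 1, so clip rates by 1/(2N).
    # This correction is an assumption, not something the page specifies.
    hit_rate = min(max(hits / n_signal, 0.5 / n_signal), 1 - 0.5 / n_signal)
    fa_rate = min(max(false_alarms / n_noise, 0.5 / n_noise), 1 - 0.5 / n_noise)

    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# Hypothetical counts: 36/40 correct with the circuit, 8/40 correct after ablation.
d_prime, criterion = dprime_and_criterion(36, 40, 8, 40)
```

These counts give a $d'$ a little above 2 with a slightly negative criterion, which falls in the sensitive-but-liberal pattern.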

Pass condition: $d'$ > 1.0 (meaningful discrimination above chance) AND AUC > 0.7.

Usage:

uv run python EX1_dprime.py --tasks ioi --n-prompts 40
uv run python EX1_dprime.py --device cpu

Reading the scores:

| Pattern | What it means |
| --- | --- |
| $d'$ > 2.0, high AUC | Strong discriminability: the circuit is a reliable detector |
| $d'$ 1.0 to 2.0 | Moderate discriminability: circuit contributes but does not dominate |
| $d'$ < 1.0 | Weak discriminability: circuit barely distinguishes signal from noise |
| High $d'$, negative $\beta$ | Sensitive but liberal criterion: circuit fires broadly |
| High $d'$, positive $\beta$ | Sensitive but conservative criterion: circuit fires selectively |

EX2 — Differential Item Functioning (DIF)


Source: Holland & Wainer (1993), “Differential Item Functioning”; Zumbo (1999), “A Handbook on the Theory and Methods of DIF.”

Criteria: Behavioral, Measurement Equivalence

What it establishes: Whether the circuit performs equivalently across different name types (common, uncommon, diverse-origin names), controlling for overall circuit ability. If the IOI circuit performs differently on “John and Mary” versus “Hiroshi and Priya” at matched model confidence, the measurement is confounded with token frequency or cultural associations — a measurement bias, not a circuit property.

What it does not establish: Whether the bias is “in the circuit” or “in the model.” DIF detects measurement non-equivalence; disentangling the source requires further intervention.

Method:

  1. Generate prompts with three name categories:
    • Common English names (John, Mary, James, …)
    • Less common names (Nigel, Mabel, Rupert, …)
    • Names from different linguistic origins (Hiroshi, Priya, Oluwaseun, …)
  2. Run the circuit on each category, compute logit-diff for each prompt.
  3. For each category pair, compute Cohen’s $d$:

$$d = \frac{\bar{X}_A - \bar{X}_B}{s_{\text{pooled}}}$$

  4. DIF magnitude = max $|d|$ across all group pairs.
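Steps 3 and 4 can be sketched as below. The pooled standard deviation uses the usual sample-variance weighting, which is an assumption since the page does not define $s_{\text{pooled}}$; the group names and logit-diff values are invented.

```python
import math
from itertools import combinations
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def dif_magnitude(groups):
    """Largest |d| over all pairs of name categories."""
    return max(
        abs(cohens_d(groups[x], groups[y])) for x, y in combinations(groups, 2)
    )


# Hypothetical per-prompt logit-diffs for the three name categories.
groups = {
    "common": [3.0, 3.2, 2.8, 3.0],
    "uncommon": [2.9, 3.1, 2.7, 2.9],
    "diverse": [2.0, 2.2, 1.8, 2.0],
}
dif = dif_magnitude(groups)
```

In this toy data the common and diverse groups differ by a full logit with little within-group spread, so the maximum $|d|$ is far above the 0.5 pass threshold.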

Pass condition: Cohen’s $d$ < 0.5 across all group pairs.

Usage:

uv run python EX2_dif.py --tasks ioi --n-prompts 40
uv run python EX2_dif.py --device cpu

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Max $d$ < 0.2 | Negligible DIF: circuit measures syntax, not token frequency |
| Max $d$ 0.2 to 0.5 | Small-to-medium DIF: some confound with name type |
| Max $d$ > 0.5 | Large DIF: circuit performance is substantially confounded with name familiarity |

EX11 — Weber-Fechner / JND (Just-Noticeable Difference)


Source: Weber (1834); Fechner (1860), “Elemente der Psychophysik”; Gescheider (1997), “Psychophysics: The Fundamentals.”

Criteria: Behavioral, Construct Validity

What it establishes: Whether circuit heads follow Weber’s law: the just-noticeable difference (JND) in output is proportional to the stimulus intensity, yielding a constant Weber fraction. This tests whether the circuit’s response follows a principled input-output relationship (logarithmic scaling) rather than arbitrary nonlinearities.

What it does not establish: Whether Weber’s law is the “correct” response function — only whether the circuit’s behavior is consistent with this well-characterized psychophysical pattern.

Method:

  1. For each circuit head, scale its output by factors [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.98, 1.0].
  2. Find the JND: smallest scale change from 1.0 that produces a detectable output change (logit-diff drops by > 5% of baseline).
  3. Test at two baseline levels (full and reduced by 0.5).
  4. Weber fraction = JND / baseline_scale.
  5. Weber consistency = $1 - \sigma(\text{weber\_fractions}) / \mu(\text{weber\_fractions})$.
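Steps 4 and 5 reduce to a ratio and a normalized spread. The sketch assumes the population standard deviation for $\sigma$ (the page does not say which), and the JND values are hypothetical.

```python
from statistics import mean, pstdev


def weber_consistency(jnds, baselines):
    """1 - sigma/mu over the Weber fractions JND / baseline_scale."""
    fractions = [j / b for j, b in zip(jnds, baselines)]
    return 1 - pstdev(fractions) / mean(fractions)


baselines = [1.0, 0.5]  # full and reduced baseline, as in step 3

# JND halves when the baseline halves: constant Weber fraction.
lawful = weber_consistency([0.1, 0.05], baselines)

# JND stays fixed while the baseline halves: Weber fraction doubles.
unlawful = weber_consistency([0.1, 0.1], baselines)
```

The first case scores 1.0 because both Weber fractions equal 0.1; the second scores about 0.67, which would land in the partial-compliance band.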

Pass condition: All circuit heads have detectable JND (all heads contribute measurably).

Usage:

uv run python EX11_weber_fechner.py --tasks ioi --n-prompts 40
uv run python EX11_weber_fechner.py --device cpu

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Weber consistency > 0.8 | Circuit follows Weber’s law: response scales predictably with stimulus |
| Weber consistency 0.5 to 0.8 | Partial Weber compliance: some heads follow the law, others do not |
| Not all heads detectable | Some circuit heads have no measurable contribution; they may be false positives in the circuit definition |
| Different JNDs across baselines | Head sensitivity changes nonlinearly with overall activation level |

Source: Anonymous (2026). “Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion.” arXiv:2602.00038.

Criteria: M1 Reliability, M6 Construct Coverage

What it establishes: How densely safety information is packed across a model’s layers. Low SVE means safety occupies a compact, low-rank subspace; high SVE means safety information is diffusely spread. The LSSF paper shows safety subspaces are stable under fine-tuning, providing M1 Reliability evidence for safety-related representations.

What it does not establish: Whether the safety subspace is “correct” or complete. The metric measures compactness and stability, not the content of the safety representation.

Method:

  1. Compute safety contrast directions at each layer: mean(safe prompts) - mean(contrast prompts) in residual stream space.
  2. Stack direction vectors across layers into a matrix of shape (n_layers, d_model).
  3. SVD to get singular values $s_i$.
  4. Compute SVE:

$$\text{SVE} = -\sum_i p_i \log p_i \quad \text{where} \quad p_i = \frac{s_i^2}{\sum_j s_j^2}$$

  5. Stability test: repeat with perturbed prompt subsets and check SVE consistency.
  6. Effective rank: number of singular values needed to capture 90% of variance.
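Given the singular values from step 3, the entropy and the 90% effective rank can be sketched as follows. The natural logarithm is an assumption (the page does not state the base), and the singular values are made up.

```python
import math


def singular_value_entropy(singular_values):
    """SVE = -sum p_i log p_i with p_i = s_i^2 / sum_j s_j^2 (natural log assumed)."""
    energy = [s * s for s in singular_values]
    total = sum(energy)
    probs = [e / total for e in energy if e > 0]  # zero terms contribute nothing
    return -sum(p * math.log(p) for p in probs)


def effective_rank(singular_values, variance_fraction=0.9):
    """Smallest number of leading singular values capturing the given variance share."""
    energy = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energy)
    running = 0.0
    for rank, e in enumerate(energy, start=1):
        running += e
        if running >= variance_fraction * total:
            return rank
    return len(energy)


# Hypothetical singular values of a stacked (n_layers, d_model) direction matrix.
s = [3.0, 1.0, 1.0, 0.0]
sve = singular_value_entropy(s)
rank90 = effective_rank(s)
```

With most energy in the first direction, the entropy is about 0.60 and two directions are enough to reach 90% of the variance.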

Pass condition: safety_sve < 2.0; stability > 0.8

Usage:

uv run python 136_safety_sve.py --model gpt2 --device cpu
uv run python 136_safety_sve.py --n-prompts 30 --n-stability-runs 5

Reading the scores:

| Pattern | What it means |
| --- | --- |
| Low SVE, high stability | Safety is compactly represented and stable: amenable to subspace-based interventions |
| High SVE, high stability | Safety is diffusely represented but consistently so: no compact safety subspace exists |
| Low SVE, low stability | Compact representation exists but it is prompt-sensitive: findings may not generalize |
| Low effective rank (1 to 3) | Safety information concentrates in very few directions: potentially easy to attack or defend |

Source: Mueller et al. (2025). “MIB.” ICML 2025.

Criteria: Multi-threshold faithfulness

What it establishes: Circuit quality via the area under the faithfulness curve across edge-count thresholds. Rather than evaluating a circuit at a single threshold, this sweeps across thresholds from 0.1% to 100% of edges and measures faithfulness at each. CPR (Cumulative Performance Recovery) is the area under this curve; CMD (Cumulative Metric Deficit) is the area between the curve and perfect faithfulness.

What it does not establish: Whether the circuit is causally necessary or sufficient at any particular threshold — only the aggregate faithfulness profile across all thresholds.

Method:

For each threshold $t$ in [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]:

  1. $n = \max(1, \lfloor t \times \text{total\_edges} \rfloor)$
  2. Keep top-$n$ edges (by layer order).
  3. Convert kept edges to heads, compute faithfulness (logit-diff recovery under mean ablation of non-circuit heads).
  4. Record faithfulness at threshold $t$.

Compute:

$$\text{CPR} = \int_0^1 f(t) \, dt \quad \text{(trapezoidal rule)}$$

$$\text{CMD} = \int_0^1 (1 - f(t)) \, dt$$
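The two areas can be approximated with the trapezoidal rule over the sampled thresholds. This sketch integrates only over the sampled range [0.001, 1.0]; how the interval below the first threshold is handled is not stated on this page, so that choice is an assumption. The faithfulness values are invented.

```python
THRESHOLDS = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]


def trapezoid(xs, ys):
    """Trapezoidal-rule area under the piecewise-linear curve through (xs, ys)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


def cpr_cmd(faithfulness, thresholds=THRESHOLDS):
    """Return (CPR, CMD) for faithfulness values sampled at the thresholds."""
    cpr = trapezoid(thresholds, faithfulness)
    cmd = trapezoid(thresholds, [1 - f for f in faithfulness])
    return cpr, cmd


# Hypothetical faithfulness curve that rises as more edges are kept.
curve = [0.1, 0.15, 0.3, 0.5, 0.7, 0.85, 0.9, 0.95, 1.0, 1.0]
cpr, cmd = cpr_cmd(curve)

# A perfectly faithful circuit at every threshold.
cpr_perfect, cmd_perfect = cpr_cmd([1.0] * len(THRESHOLDS))
```

Because the thresholds are spaced roughly logarithmically, the linear-scale area is dominated by the upper half of the range; CPR and CMD sum to the width of the integrated interval.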

Pass condition: CPR > 0.5

Usage:

uv run python MIB_faithfulness_curve.py --tasks ioi --n-prompts 40
uv run python MIB_faithfulness_curve.py --device cpu

Reading the scores:

| Pattern | What it means |
| --- | --- |
| CPR > 0.7 | Strong circuit: maintains faithfulness even at aggressive pruning thresholds |
| CPR 0.5 to 0.7 | Moderate circuit: faithfulness degrades substantially at low thresholds |
| CPR < 0.5 | Weak circuit: most edges are needed; the circuit is not well-separated from the full model |
| Flat curve near 1.0 | Nearly all edges contribute: the circuit is the whole model |
| Sharp elbow | Clear separation between essential and non-essential edges |

| Metric ID | Name | Criteria | Requires Artifact | Pass Condition |
| --- | --- | --- | --- | --- |
| M07 | Architecture Duality | M2, M6 | Two SAE adapters | agreement > 0.3 |
| M08 | WeightLens Convergence | C5 | One SAE adapter | agreement > 0.3 |
| M09 | DMSAE Core Stability | M1 | Model + hook | Diagnostic (report-only) |
| M10 | PRISM Polysemanticity | M6, E1 | One SAE adapter | Diagnostic (report-only) |
| M11 | Matryoshka Cross-Scale | M1, M2 | Two SAE adapters (small/large) | consistency > 0.7 |
| M12 | Adaptive Sparsity | E1, M6 | One SAE adapter | mismatch_rate < 0.2 |
| M13 | Superposition Regime | M6 | Model only | Diagnostic (report-only) |
| M14 | Safety SVE | M1, M6 | Model only | SVE < 2.0, stability > 0.8 |
| EX1 | d-prime (SDT) | Causal | Model + circuit | $d'$ > 1.0, AUC > 0.7 |
| EX2 | DIF | Behavioral | Model + circuit | Cohen’s $d$ < 0.5 |
| EX11 | Weber-Fechner / JND | Behavioral | Model + circuit | All heads detectable |
| EX24 | SAEBench Audit | M1, M2 | Model only | CV < 0.05, $d$ > 0.8 |
| EX25 | Reproducibility Check | M1 | Model only | dev_rate < 0.05 |
| EX34 | NLA-SAE Convergence | C5 | Model only | overlap > 0.3, cosine > 0.5 |
| MIB | Faithfulness Curve | Faithfulness | Model + circuit | CPR > 0.5 |

The original F01-F08 metrics are documented at their existing pages.

The extended metrics on this page complement F01-F08 by:

  • Deepening reliability testing (M09 Core Stability, EX24 SAEBench Audit, EX25 Reproducibility) — going beyond prompt-level resampling to test decomposition stability, benchmark reliability, and pipeline reproducibility.
  • Adding SAE-specific construct validity (M07 Architecture Duality, M08 WeightLens, M10 PRISM, M11 Matryoshka, M12 Adaptive Sparsity, M13 Superposition Regime) — testing whether the decomposition instrument itself is valid before interpreting its outputs.
  • Importing psychometric rigor (EX1 d-prime, EX2 DIF, EX11 Weber-Fechner) — formalizing sensitivity, measurement bias, and scaling behavior using established frameworks from cognitive science.
  • Extending to safety (M14 Safety SVE) — applying measurement theory to safety-relevant representations.
  • Multi-threshold evaluation (MIB Faithfulness Curve) — replacing single-threshold faithfulness with curve-based analysis.