MI Benchmarks Metrics
This page documents the six benchmark metrics implemented in mechval_v2.core.mechanistic_interpretability.benchmarks. Each benchmark evaluates interpretability artifacts (SAEs, features, circuits) against standardized external evaluation protocols from the literature. These benchmarks contextualize mechanistic claims within community baselines.
B20 — AxBench Concept Detection and Steering
ID. B20.axbench | File. 101_axbench.py
What it evaluates. Whether artifact features detect concepts better than simple baselines (DiffMean, random directions) and whether feature directions steer model behavior.
Evidence family. External (Behavioral)
Key metrics.
| Sub-metric | What it measures | Pass threshold |
|---|---|---|
| Detection AUROC | Best-feature AUROC at separating positive vs. negative concept examples | > 0.6 |
| DiffMean baseline AUROC | AUROC of the mean-difference direction (supervised baseline) | comparison |
| Steering effect | Log-probability change on concept-related continuations when adding/subtracting the feature direction | > 0.0 |
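The detection comparison in the table can be sketched in a few lines. This is a minimal illustration, not the code in 101_axbench.py: the helper names (`auroc`, `diffmean_direction`, `project`) and the toy activations are hypothetical, and each coordinate of the toy vectors stands in for one artifact feature.

```python
from statistics import mean


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def diffmean_direction(pos_acts, neg_acts):
    """Supervised baseline: mean positive activation minus mean negative activation."""
    dim = len(pos_acts[0])
    return [
        mean([a[i] for a in pos_acts]) - mean([a[i] for a in neg_acts])
        for i in range(dim)
    ]


def project(acts, direction):
    """Dot product of each activation vector with the direction."""
    return [sum(x * d for x, d in zip(a, direction)) for a in acts]


# Hypothetical toy data: three concept-positive and three concept-negative examples.
pos = [[1.0, 0.2, 0.5], [0.9, 0.1, 0.4], [0.8, 0.3, 0.6]]
neg = [[0.1, 0.2, 0.5], [0.2, 0.4, 0.3], [0.3, 0.1, 0.7]]

# Detection AUROC: score every candidate feature and keep the best one.
feature_aurocs = [auroc([a[i] for a in pos], [a[i] for a in neg]) for i in range(3)]
best_feature_auroc = max(feature_aurocs)

# DiffMean baseline AUROC, reported for comparison only.
direction = diffmean_direction(pos, neg)
diffmean_auroc = auroc(project(pos, direction), project(neg, direction))

passes_detection = best_feature_auroc > 0.6  # pass threshold from the table
```

In this toy case the best feature and DiffMean both separate the classes perfectly, which illustrates why the baseline is reported alongside the feature score: passing the 0.6 threshold says nothing about beating DiffMean.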
What it establishes. Whether an artifact’s features provide a practical advantage over the simplest possible baselines for concept detection and behavioral steering. AxBench (Wu et al., ICML 2025 Spotlight) showed that SAEs are not competitive with prompting, finetuning, or DiffMean on these tasks. A passing score means the artifact exceeds the baseline threshold, not that it outperforms all alternatives.
What it does not establish. That the features capture genuine computational structure. High detection AUROC is achievable by any direction correlated with the concept; it does not require the direction to be mechanistically meaningful or causally relevant. Steering effectiveness confirms behavioral impact but not that the feature is used by the model’s native computation.
Reference. Wu et al. (2025), “AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders”, ICML 2025 Spotlight.
B22 — Failure Prediction
ID. B22.failure_prediction | File. 103_failure_prediction.py
What it evaluates. Whether artifact features can predict model failures (incorrect outputs), not just successes, using per-feature AUROC at separating success vs. failure prompts.
Evidence family. External (Behavioral)
Key metrics.
| Sub-metric | What it measures | Pass threshold |
|---|---|---|
| Best feature AUROC | AUROC of the single best-separating feature at distinguishing success from failure | > 0.6 |
| Best neuron AUROC | Same metric on raw activation dimensions (baseline) | comparison |
| Combined top-k AUROC | AUROC using mean of top-k most discriminative features | > 0.6 |
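A rough sketch of the combined top-k metric follows. It assumes failure prompts are the positive class, that "most discriminative" means highest per-feature AUROC, and that the combined score is the unweighted mean of the selected features; none of these choices is confirmed by the source, and the function names and toy data are hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def top_k_combined_auroc(fail_acts, success_acts, k):
    """Return per-feature AUROCs, the chosen feature indices, and the combined AUROC."""
    n_features = len(fail_acts[0])
    per_feature = [
        auroc([a[j] for a in fail_acts], [a[j] for a in success_acts])
        for j in range(n_features)
    ]
    # ASSUMPTION: rank features by AUROC with failure as the positive class.
    top = sorted(range(n_features), key=lambda j: per_feature[j], reverse=True)[:k]

    def combine(act):
        # Unweighted mean of the selected features' activations.
        return sum(act[j] for j in top) / k

    combined = auroc([combine(a) for a in fail_acts], [combine(a) for a in success_acts])
    return per_feature, top, combined


# Hypothetical toy activations: rows are prompts, columns are features.
fail = [[0.9, 0.6, 0.1, 0.5], [0.7, 0.8, 0.3, 0.2], [0.4, 0.7, 0.2, 0.9]]
success = [[0.5, 0.3, 0.2, 0.6], [0.2, 0.5, 0.4, 0.4], [0.3, 0.2, 0.1, 0.8]]

per_feature, top, combined = top_k_combined_auroc(fail, success, k=2)
passes = combined > 0.6  # pass threshold from the table
```

Selecting features and scoring them on the same prompts, as this toy does, inflates the result; a held-out split would be needed for an honest estimate.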
What it establishes. Whether individual features carry information about when the model will fail, not just what it knows. Mathew et al. (AAAI 2026) found that linear probes on learned features achieve near-chance performance at predicting failures, suggesting features encode capability but not failure modes.
What it does not establish. That failure-predictive features are mechanistically informative. A feature that fires on “hard” inputs may predict failure by detecting input difficulty rather than identifying the mechanism of failure.
Reference. Mathew et al. (2026), “Why It Failed: Probing Failure Modes of Learned Representations”, AAAI 2026.
C17 — MIB Causal Variable Localization
ID. C17.mib_causal_variable | File. 103_mib_causal_variable.py
What it evaluates. Whether artifact features localize causal variables better than raw neurons, with DAS (Distributed Alignment Search) providing a supervised upper bound.
Evidence family. Internal (Causal)
Key metrics.
| Sub-metric | What it measures | Pass threshold |
|---|---|---|
| Feature AUROC | Best per-feature AUROC at separating positive vs. negative prompts (by logit-diff sign) | comparison |
| Neuron AUROC | Best per-neuron AUROC (baseline) | comparison |
| DAS AUROC | AUROC of the supervised mean-difference direction (ceiling) | comparison |
| Normalized score | (feature - neuron) / (DAS - neuron) — fraction of the neuron-to-DAS gap closed by features | > 0.1 |
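The normalized score is simple arithmetic, and a worked example makes the 0.1 threshold concrete. The function below is a sketch; returning `None` when the DAS ceiling does not exceed the neuron baseline is an assumption about how a degenerate gap might be handled, not documented behavior.

```python
def normalized_score(feature_auroc, neuron_auroc, das_auroc, eps=1e-8):
    """Fraction of the neuron-to-DAS gap closed by features."""
    gap = das_auroc - neuron_auroc
    if gap <= eps:
        # ASSUMPTION: the score is treated as undefined when there is no gap to close.
        return None
    return (feature_auroc - neuron_auroc) / gap


# Hypothetical AUROCs: features 0.74, neurons 0.70, DAS ceiling 0.90.
score = normalized_score(0.74, 0.70, 0.90)   # 0.04 / 0.20, about 0.2
passes = score is not None and score > 0.1   # pass threshold from the table
```

With these numbers, features close about a fifth of the gap between neurons and DAS, which clears the threshold while remaining far from the supervised ceiling.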
What it establishes. Whether the artifact’s decomposition provides a measurable advantage over raw neurons for localizing causal variables. Mueller et al. (ICML 2025) found that SAE features are NOT systematically better than raw neurons, making this a calibration benchmark rather than a validation test. A normalized score near 0 means features add no value over neurons; a score near 1 means features match the supervised DAS ceiling.
What it does not establish. That features are causally meaningful even if they beat neurons. The AUROC metric measures linear separability of a binary classification, which does not require the feature to be a genuine causal variable — it only requires correlation with the task outcome.
Reference. Mueller et al. (2025), “MIB: A Mechanistic Interpretability Benchmark”, ICML 2025.
B23 — Evaluation Awareness Format Control
ID. B23.eval_awareness | File. 104_eval_awareness.py
What it evaluates. Whether artifact features track evaluation FORMAT (benchmark-canonical MCQ structure) rather than evaluation CONTENT (knowledge-testing questions), using a 2x2 format-by-content factorial design.
Evidence family. External (Behavioral)
Key metrics.
| Sub-metric | What it measures | Pass threshold |
|---|---|---|
| Format effect | Mean activation difference between eval-format and free-format conditions | diagnostic |
| Content effect | Mean activation difference between eval-content and non-eval-content conditions | diagnostic |
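| Format ratio | `\|format\| / (\|format\| + \|content\|)`: share of the combined effect magnitude attributable to format | diagnostic |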
| Format confound fraction | Fraction of features with format_ratio > 0.7 | < 0.5 |
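The 2x2 design can be illustrated with per-condition mean activations for a handful of features. In the sketch below, the format and content effects are main effects of the factorial design. The ratio `|format| / (|format| + |content|)` is an assumed definition chosen so that the 0.7 cutoff is meaningful; the condition labels, function names, and numbers are all hypothetical.

```python
from statistics import mean


def format_and_content_effects(cells):
    """cells maps (format, content) -> mean activation of one feature in that condition."""
    format_effect = (
        mean([cells[("eval", "eval")], cells[("eval", "other")]])
        - mean([cells[("free", "eval")], cells[("free", "other")]])
    )
    content_effect = (
        mean([cells[("eval", "eval")], cells[("free", "eval")]])
        - mean([cells[("eval", "other")], cells[("free", "other")]])
    )
    return format_effect, content_effect


def format_ratio(format_effect, content_effect):
    # ASSUMPTION: ratio of absolute effects; defined as 0.0 when both effects vanish.
    total = abs(format_effect) + abs(content_effect)
    return abs(format_effect) / total if total > 0 else 0.0


# Hypothetical features: keys are (format condition, content condition).
features = {
    "format_driven": {("eval", "eval"): 2.0, ("eval", "other"): 1.8,
                      ("free", "eval"): 0.3, ("free", "other"): 0.1},
    "content_driven": {("eval", "eval"): 1.5, ("eval", "other"): 0.2,
                       ("free", "eval"): 1.4, ("free", "other"): 0.1},
    "mixed": {("eval", "eval"): 1.0, ("eval", "other"): 0.5,
              ("free", "eval"): 0.5, ("free", "other"): 0.0},
}

ratios = {name: format_ratio(*format_and_content_effects(c)) for name, c in features.items()}

# Fraction of features dominated by format; the benchmark passes when this is below 0.5.
confound_fraction = sum(r > 0.7 for r in ratios.values()) / len(ratios)
passes = confound_fraction < 0.5
```

Only the first toy feature exceeds the 0.7 cutoff, so the confound fraction is one third and this hypothetical artifact would pass.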
What it establishes. Whether claimed “evaluation awareness” features actually track MCQ formatting artifacts (lettered options, benchmark-canonical structure) rather than genuine awareness that the model is being evaluated. Devbunova (ICLR 2026 Workshop) showed that probe-based evidence for evaluation awareness largely collapses under format control.
What it does not establish. That features passing the format control are genuinely tracking evaluation awareness. A feature could respond to content features (e.g., question marks, academic vocabulary) that correlate with evaluation content without representing awareness per se.
Reference. Devbunova (2026), “Evaluation Awareness in LLMs: Probes Track Format, Not Awareness”, ICLR 2026 Workshop.
EX10 — CE-Bench Contrastive Evaluation
ID. EX10.ce_bench | File. EX10_ce_bench.py
What it evaluates. Whether artifact features respond selectively to specific semantic dimensions, measured via contrastive minimal pairs that differ in exactly one dimension (gender, sentiment, tense, location, quantity, formality, agency, certainty).
Evidence family. Measurement
Key metrics.
| Sub-metric | What it measures | Pass threshold |
|---|---|---|
| Contrastive score | Cohen’s-d-like effect size between feature activations on paired stories | > 0.3 |
| Independence score | `1 - \|corr\|` | … |
| Mean contrastive score | Average across features | > 0.3 |
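One plausible reading of a "Cohen’s-d-like" contrastive score is the absolute mean difference between the two sides of each minimal pair, divided by a pooled standard deviation. The sketch below uses that form as an assumption; the exact normalization in EX10_ce_bench.py may differ, and the activations are made up.

```python
from math import sqrt
from statistics import mean, variance


def contrastive_score(acts_a, acts_b):
    """Absolute mean difference scaled by the pooled sample standard deviation."""
    # ASSUMPTION: equal-weight pooling of the two sample variances.
    pooled_sd = sqrt((variance(acts_a) + variance(acts_b)) / 2)
    if pooled_sd == 0:
        return 0.0  # ASSUMPTION: constant activations are scored as no effect
    return abs(mean(acts_a) - mean(acts_b)) / pooled_sd


# Hypothetical activations of two features on paired stories that differ in one dimension.
selective = contrastive_score([1.2, 1.0, 1.4, 1.1], [0.3, 0.5, 0.2, 0.4])
unselective = contrastive_score([0.5, 0.6, 0.4], [0.4, 0.5, 0.6])

selective_passes = selective > 0.3      # per-feature threshold from the table
unselective_passes = unselective > 0.3
```

The second feature has identical means on both sides of the contrast, so it scores zero regardless of how strongly it activates overall.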
What it establishes. Whether features detect specific semantic dimensions rather than responding to broad distributional properties. CE-Bench (Gulko et al., BlackboxNLP 2025) is fully deterministic and requires no LLM judge, making it a reliable complement to autointerp-based evaluations.
What it does not establish. That semantically selective features are causally relevant. A feature that responds selectively to gender in its activations may not causally influence the model’s gender-related outputs. Selectivity is a necessary but not sufficient condition for causal relevance.
Reference. Gulko et al. (2025), “CE-Bench: A Contrastive Evaluation Benchmark for Feature Interpretability”, BlackboxNLP 2025.
EX9 — SAEBench Multi-Metric Evaluation
ID. EX9.saebench | File. EX9_saebench.py
What it evaluates. Overall SAE quality across five sub-metrics spanning proxy, interpretability, and disentanglement dimensions.
Evidence family. Measurement / External
Key metrics.
| Sub-metric | What it measures | Pass threshold |
|---|---|---|
| EX9a Reconstruction loss | MSE between original activations and SAE reconstruction | < 0.1 |
| EX9b L0 sparsity | Mean number of active features per token | < 50 |
| EX9c Explained variance | 1 - var(residual) / var(original) | > 0.85 |
| EX9d Feature detection | Per-feature AUROC on built-in concept pairs (sentiment, formal/informal, scientific/everyday) | > 0.6 |
| EX9e Feature disentanglement | Mean pairwise \|cosine similarity\| | … |
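The three proxy sub-metrics (EX9a to EX9c) can be computed directly from activations. The sketch below works on plain nested lists and assumes variance is taken over all flattened elements and that "active" means nonzero; both are assumptions about the implementation, and the data is a toy example.

```python
from statistics import mean, pvariance


def flatten(rows):
    return [x for row in rows for x in row]


def reconstruction_mse(original, reconstruction):
    """EX9a: mean squared error over all activation entries."""
    residuals = [o - r for o, r in zip(flatten(original), flatten(reconstruction))]
    return mean([x * x for x in residuals])


def l0_sparsity(feature_acts):
    """EX9b: mean number of nonzero feature activations per token."""
    return mean([sum(1 for a in token if a != 0.0) for token in feature_acts])


def explained_variance(original, reconstruction):
    """EX9c: 1 - var(residual) / var(original), using population variance."""
    residuals = [o - r for o, r in zip(flatten(original), flatten(reconstruction))]
    return 1.0 - pvariance(residuals) / pvariance(flatten(original))


# Hypothetical data: two tokens, two activation dimensions, four SAE features.
original = [[1.0, 2.0], [3.0, 4.0]]
reconstruction = [[1.1, 1.9], [2.9, 4.2]]
feature_acts = [[0.0, 1.5, 0.0, 0.2], [0.7, 0.0, 0.0, 0.0]]

mse = reconstruction_mse(original, reconstruction)
l0 = l0_sparsity(feature_acts)
ev = explained_variance(original, reconstruction)

# Thresholds from the table above.
passes = {"EX9a": mse < 0.1, "EX9b": l0 < 50, "EX9c": ev > 0.85}
```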
What it establishes. Whether an SAE meets basic quality thresholds across multiple evaluation axes. Karvonen et al. (ICML 2025) showed that proxy metric gains (better reconstruction, lower L0) do NOT reliably translate to practical performance improvements on downstream tasks. SAEBench calibrates expectations: passing all sub-metrics is necessary but not sufficient for useful features.
What it does not establish. That the SAE captures genuine computational structure. An SAE can achieve low reconstruction loss, high sparsity, and reasonable detection AUROCs while learning an arbitrary basis that does not correspond to the model’s internal computation. The benchmarks test output quality, not mechanistic correspondence.
Reference. Karvonen et al. (2025), “SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability”, ICML 2025.
Summary Table
| ID | Name | File | Evidence Family | Primary Metric | Threshold |
|---|---|---|---|---|---|
| B20 | AxBench | 101_axbench.py | External | Detection AUROC | > 0.6 |
| B22 | Failure Prediction | 103_failure_prediction.py | External | Best feature AUROC | > 0.6 |
| C17 | MIB Causal Variable | 103_mib_causal_variable.py | Internal | Normalized score | > 0.1 |
| B23 | Eval Awareness | 104_eval_awareness.py | External | Format confound fraction | < 0.5 |
| EX9 | SAEBench | EX9_saebench.py | Measurement | Multi-metric composite | varies |
| EX10 | CE-Bench | EX10_ce_bench.py | Measurement | Mean contrastive score | > 0.3 |
Reading the Benchmarks Together
Proxy vs. practical performance
The central lesson from SAEBench and AxBench is that proxy metrics (reconstruction, L0, explained variance) do not predict practical utility (concept detection, steering, failure prediction). When evaluating an artifact:
- Start with SAEBench (EX9) to confirm basic quality thresholds are met.
- Then run AxBench (B20) to test whether features beat simple baselines on practical tasks.
- Check CE-Bench (EX10) for semantic selectivity without LLM-judge confounds.
- Use MIB (C17) to calibrate against the neuron baseline and the DAS ceiling.
Failure modes to watch for
| Pattern | What it means |
|---|---|
| SAEBench passes, AxBench fails | Proxy quality is high but features are not practically useful — the SAE may have learned an arbitrary basis |
| AxBench passes, MIB fails | Features detect concepts but do not localize causal variables — correlation without causation |
| All pass, Eval Awareness fails | Features track formatting artifacts, not genuine evaluation content |
| MIB normalized score near 0 | Features add no value over raw neurons for causal localization |
| Failure Prediction near chance | Features encode capability but not failure modes |
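The patterns in this table can be encoded as a small triage helper. The sketch below is illustrative only: the `diagnose` function, the dictionary keyed by benchmark ID, and the `tol` value used to decide what counts as "near 0" or "near chance" are all assumptions, not part of the benchmark module.

```python
def diagnose(passed, mib_normalized_score=None, failure_auroc=None, tol=0.05):
    """Map pass/fail results (keyed by benchmark ID) to the interpretations in the table."""
    notes = []
    if passed.get("EX9") and not passed.get("B20", True):
        notes.append("Proxy quality is high but features are not practically useful")
    if passed.get("B20") and not passed.get("C17", True):
        notes.append("Features detect concepts but do not localize causal variables")
    others = [key for key in passed if key != "B23"]
    if others and all(passed[key] for key in others) and not passed.get("B23", True):
        notes.append("Features track formatting artifacts, not genuine evaluation content")
    # ASSUMPTION: "near 0" and "near chance" mean within tol of 0 and 0.5 respectively.
    if mib_normalized_score is not None and abs(mib_normalized_score) <= tol:
        notes.append("Features add no value over raw neurons for causal localization")
    if failure_auroc is not None and abs(failure_auroc - 0.5) <= tol:
        notes.append("Features encode capability but not failure modes")
    return notes


# Hypothetical result sets.
format_only = diagnose(
    {"EX9": True, "B20": True, "C17": True, "B22": True, "EX10": True, "B23": False}
)
weak_artifact = diagnose({"EX9": True, "B20": False}, mib_normalized_score=0.02)
```

Several patterns can fire at once, so the helper returns a list rather than a single verdict.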
Relationship to Other Pages
For the evaluation metrics (circuit faithfulness, safety, transcoders, crosscoders, CLT), see MI Evaluation Metrics. For the genetics lens benchmarks and protocols, see Genetics Metrics. For the overall MI lens framework, see Mechanistic Interpretability.