
This page documents metrics that evaluate validity properties specific to safety-relevant mechanistic claims: reliability, construct stability, and the relationship between alignment and interpretability.



S01 — Dual Mechanism Discriminant Validity

ID. S01.dual_mechanism | File. 125_dual_mechanism.py

What it computes. Decomposes representation directions for a behavioral construct into intrinsic (baseline) and prompted (instruction-following) mechanisms, testing whether they are genuinely distinct and whether each steers independently after removing the shared component.

Evidence family. Construct (C4 Discriminant Validity)

Pass threshold. discriminant_separation > 0.2; independent_steering_effect > 0.1

Reference. Han et al. (2025), NeurIPS 2025 / ICML 2026.
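
The decomposition described above can be sketched in a few lines. This is a minimal illustration, not the contents of 125_dual_mechanism.py: the function names are hypothetical, the shared component is assumed to be the normalized sum of the two unit directions, and discriminant_separation is assumed to be 1 - |cos| between them. The independent steering effect needs model interventions and is not computed here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    n = math.sqrt(dot(u, u))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / n for a in u]

def remove_component(u, shared):
    """Project u onto the orthogonal complement of the unit vector `shared`."""
    c = dot(u, shared)
    return [a - c * s for a, s in zip(u, shared)]

def dual_mechanism_scores(intrinsic, prompted):
    """Return (separation, intrinsic_residual, prompted_residual).

    Assumption: separation = 1 - |cosine similarity| of the two directions.
    Assumption: the shared component is the normalized sum of the unit
    directions; this raises ValueError if they are exactly antiparallel.
    """
    a, b = normalize(intrinsic), normalize(prompted)
    separation = 1.0 - abs(dot(a, b))
    shared = normalize([x + y for x, y in zip(a, b)])
    return separation, remove_component(a, shared), remove_component(b, shared)
```

With intrinsic = [1.0, 0.0, 0.0] and prompted = [0.6, 0.8, 0.0], the cosine is 0.6, so the assumed separation is 0.4, which would clear the 0.2 threshold. The two residuals are the directions one would steer with to test independence.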


S02 — Adversarial Ablation

ID. S02.adversarial_ablation_gap | File. 127_adversarial_ablation.py

What it computes. Tests whether circuit heads that appear necessary under standard (mean) ablation remain necessary under adversarial ablation, where non-circuit heads are replaced with maximally disruptive values to detect false necessity.

Evidence family. Internal (I2 Sufficiency)

Pass threshold. mean adversarial_gap < 0.3

Reference. Sharkey et al. (2026), Goodfire / Apollo Research.
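
As a rough sketch of how the pass threshold could be evaluated, assume each prompt yields a circuit performance score under mean ablation and another under adversarial ablation of the non-circuit heads, and that the gap is their plain difference. Both the definition and the function names are assumptions made for illustration.

```python
from statistics import mean

def adversarial_gap(score_mean_ablation, score_adversarial_ablation):
    """Assumed definition: drop in circuit performance when non-circuit
    heads are set adversarially instead of to their mean activation."""
    return score_mean_ablation - score_adversarial_ablation

def passes_s02(per_prompt_scores, threshold=0.3):
    """per_prompt_scores: iterable of (mean_ablation, adversarial_ablation) pairs.

    Returns (mean_gap, passed), where passed applies the documented
    criterion mean adversarial_gap < 0.3.
    """
    gaps = [adversarial_gap(m, a) for m, a in per_prompt_scores]
    mean_gap = mean(gaps)
    return mean_gap, mean_gap < threshold
```

A small mean gap suggests that the necessity observed under mean ablation survives the adversarial setting. A large gap suggests the mean-ablation result overstated how well the circuit stands on its own.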


S03 — Safety Claim Reliability

ID. S03.safety_claim_reliability | File. 131_safety_claim_reliability.py

What it computes. Tests whether circuit-based behavioral claims are consistent across different prompt templates and ablation calibration seeds, measuring reliability via coefficient of variation.

Evidence family. Measurement (M1 Reliability)

Pass threshold. safety_reliability > 0.7

Reference. Nanda (2025), AlignmentForum.
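
One way to turn a coefficient of variation into a score comparable with the 0.7 threshold is to report 1 - CV, floored at zero. That mapping is an assumption of this sketch, as is the function name. The input is one measured effect size per combination of prompt template and calibration seed.

```python
from statistics import mean, pstdev

def safety_reliability(effects):
    """Assumed mapping: reliability = max(0, 1 - CV), where
    CV = population standard deviation / |mean| of the measured effects."""
    mu = mean(effects)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    cv = pstdev(effects) / abs(mu)
    return max(0.0, 1.0 - cv)
```

For effects of [0.50, 0.55, 0.45, 0.50] the CV is about 0.07, giving a reliability near 0.93. The CV becomes unstable when the mean effect is close to zero, so a low score there may reflect a weak effect and not an inconsistent one.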


S04 — Assistant Axis

ID. S04.assistant_axis | File. 132_assistant_axis.py

What it computes. Tests three validity properties of a dominant representational direction extracted via persona contrasts: causal sufficiency (suppression degrades behavior), reliability (direction stable across extraction runs), and discriminant validity (direction specific to target construct).

Evidence family. Internal (E2 Causal Sufficiency), Measurement (M1 Reliability), Construct (C4 Discriminant)

Pass threshold. causal_deficit > 0.3; direction_stability > 0.8; discriminant_ratio > 2.0

Reference. MATS + Anthropic Fellows (2026).
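
The sketch below shows one plausible way to score direction stability and to combine the three documented thresholds. It assumes stability is the mean pairwise signed cosine similarity between directions from repeated extraction runs. The helper names are hypothetical, and causal_deficit and discriminant_ratio are taken as already-measured inputs.

```python
import math
from itertools import combinations

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if den == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return num / den

def direction_stability(directions):
    """Assumed definition: mean signed cosine over all pairs of runs."""
    pairs = list(combinations(directions, 2))
    if not pairs:
        raise ValueError("need at least two extraction runs")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def assistant_axis_passes(causal_deficit, stability, discriminant_ratio):
    """Apply the three documented thresholds jointly."""
    return causal_deficit > 0.3 and stability > 0.8 and discriminant_ratio > 2.0
```

Signed cosine is used so that a run returning the opposite direction counts against stability. Taking the absolute value instead would treat a sign flip as agreement, which may or may not be appropriate for a persona contrast.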


S05 — Safety Subspace

ID. S05.safety_subspace | File. 135_safety_subspace.py

What it computes. Identifies safety-relevant directions by contrasting activations on safe vs. unsafe prompts, constructs a low-rank subspace via PCA, then tests causal sufficiency (linear classifier accuracy on projected activations) and necessity (refusal behavior change upon subspace ablation).

Evidence family. Internal (E2 Causal Sufficiency, I1 Necessity)

Pass threshold. sufficiency > 0.6; ablation_deficit > 0.3

Reference. NCSU (2025), arXiv:2512.23260.
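
To make the direction-finding step concrete, here is a simplified sketch that extracts only the top direction. It assumes safe and unsafe activations arrive as paired lists, and it runs power iteration on the uncentered second-moment matrix of the safe-minus-unsafe differences. The pairing, the lack of centering, and the function name are all assumptions of this example; a rank-k subspace would need further components, for instance by deflation or a full SVD.

```python
import math

def top_contrast_direction(safe_acts, unsafe_acts, iters=200):
    """Top principal direction of the paired differences (safe - unsafe).

    Uses power iteration on D^T D without forming the matrix, where the rows
    of D are the difference vectors. The uniform start vector must not be
    orthogonal to the top direction, or the iteration will not find it.
    """
    diffs = [[s - u for s, u in zip(sa, ua)]
             for sa, ua in zip(safe_acts, unsafe_acts)]
    dim = len(diffs[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # coeffs = D v, then w = D^T coeffs
        coeffs = [sum(dj * vj for dj, vj in zip(d, v)) for d in diffs]
        w = [sum(c * d[j] for c, d in zip(coeffs, diffs)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("differences are all zero or orthogonal to the start vector")
        v = [x / norm for x in w]
    return v
```

The returned unit vector can then be projected out of the activations for the necessity test, or used as the single feature of a linear classifier for the sufficiency test.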


S06 — Safety One-Shot

ID. S06.safety_one_shot | File. 137_safety_one_shot.py

What it computes. Tests whether the safety construct is compact enough that a single safety example’s gradient aligns with the safety subspace, validating that safety alignment occupies a low-rank gradient structure.

Evidence family. Internal (E2 Causal Sufficiency)

Pass threshold. gradient_alignment > 0.5; recovery_rate > 0.3

Reference. Anonymous (2026), arXiv:2601.01887.
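
Gradient alignment with a subspace can be illustrated as the fraction of the gradient's norm that lies inside the span of an orthonormal basis. Whether 137_safety_one_shot.py uses exactly this quantity is not stated, so treat the definition and the function name as assumptions.

```python
import math

def gradient_alignment(grad, basis):
    """Return ||P g|| / ||g||, where P projects onto span(basis).

    Assumption: the rows of `basis` are orthonormal; otherwise the sum of
    squared coefficients below is not the squared norm of the projection.
    """
    g2 = sum(g * g for g in grad)
    if g2 == 0.0:
        raise ValueError("alignment is undefined for a zero gradient")
    proj2 = sum(sum(g * b for g, b in zip(grad, u)) ** 2 for u in basis)
    return math.sqrt(proj2 / g2)
```

For example, gradient_alignment([3.0, 4.0, 0.0], [[1.0, 0.0, 0.0]]) is 0.6 up to floating-point rounding, which would clear the 0.5 threshold. The value always lies in [0, 1], and it reaches 1 when the gradient sits entirely inside the subspace.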


S07 — Alignment-Interpretability Trade-off

ID. S07.alignment_interpretability | File. 138_alignment_interpretability.py

What it computes. Quantifies the trade-off between interpretability (activation consistency of per-unit responses) and representational richness (effective rank of activation matrices), testing whether they are discriminant constructs and how alignment affects the balance.

Evidence family. Construct (C4 Discriminant Validity)

Pass threshold. Diagnostic (no binary pass/fail). Reports whether alignment improves interpretability at the cost of richness.

Reference. Colin, Oliver, Serre (2026), ICLR 2026 Re-Align Workshop.
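
Effective rank has a standard entropy-based definition (Roy and Vetterli, 2007): the exponential of the Shannon entropy of the normalized singular values. The sketch below computes it from a list of singular values. It is an assumption that the metric uses this particular definition, and the function name is hypothetical.

```python
import math

def effective_rank(singular_values):
    """exp(H(p)), where p_i = s_i / sum(s) and H is Shannon entropy in nats."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    # Zero singular values contribute nothing to the entropy.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Four equal singular values give an effective rank of 4, while a single nonzero singular value gives 1. A drop in this number after alignment, alongside a rise in activation consistency, would be the trade-off pattern the diagnostic reports on.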


| ID | Name | File | Evidence Family | Threshold |
| --- | --- | --- | --- | --- |
| S01 | Dual Mechanism | 125_dual_mechanism.py | Construct | > 0.2 separation |
| S02 | Adversarial Ablation | 127_adversarial_ablation.py | Internal | < 0.3 |
| S03 | Safety Claim Reliability | 131_safety_claim_reliability.py | Measurement | > 0.7 |
| S04 | Assistant Axis | 132_assistant_axis.py | Internal / Measurement | > 0.3 deficit |
| S05 | Safety Subspace | 135_safety_subspace.py | Internal | > 0.6 suff. |
| S06 | Safety One-Shot | 137_safety_one_shot.py | Internal | > 0.5 align. |
| S07 | Alignment-Interpretability | 138_alignment_interpretability.py | Construct | diagnostic |

For artifact quality metrics (SAE features, transcoders, crosscoders), see MI Artifact Quality Metrics. For circuit faithfulness metrics (CLT attribution, cross-prompt consistency, minimality), see MI Faithfulness Metrics. For the overall MI lens framework, see Mechanistic Interpretability.