
These metrics evaluate whether artifact directions (from SAEs, transcoders, or factor banks) can steer model behavior, erase concept representations, or transfer across models. They combine causal interventions (activation addition, subspace erasure) with behavioral measurement (logit-difference shift, dose-response) to test whether learned representations are causally load-bearing. All are implemented in mechval_v2.core.mechanistic_interpretability.methods.


C09 — Contrastive Activation Addition (93_caa.py)


What it computes. Implements CAA (Panickssery et al., ACL 2024) as a validation metric. For each artifact feature direction, computes a steering vector and adds it to the residual stream at inference time across a range of coefficients ($-2, -1, -0.5, 0.5, 1, 2$). Measures the resulting logit-difference shift relative to baseline. Reports steerability (magnitude of behavioral shift relative to baseline), dose-response linearity (Pearson correlation between coefficient and effect), and the fraction of steerable features.

$$\text{steerability}_f = \frac{\max_{\alpha > 0} |\text{LD}(x + \alpha \cdot d_f) - \text{LD}(x)|}{|\text{LD}(x)| + \epsilon}$$
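As a rough illustration of the steerability definition above, the sketch below computes steerability and the dose-response correlation from logit differences that are assumed to have been measured already. The function names and the example numbers are hypothetical; they are not taken from `93_caa.py`.

```python
from math import sqrt


def steerability(ld_base, ld_steered, eps=1e-8):
    """Largest |LD shift| over positive coefficients, relative to |baseline LD|.

    ld_steered maps each steering coefficient alpha to the logit difference
    measured with alpha * d_f added to the residual stream.
    """
    shifts = [abs(ld - ld_base) for alpha, ld in ld_steered.items() if alpha > 0]
    return max(shifts) / (abs(ld_base) + eps)


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def dose_response_r(ld_base, ld_steered):
    """Pearson r between steering coefficient and signed LD shift."""
    alphas = sorted(ld_steered)
    shifts = [ld_steered[a] - ld_base for a in alphas]
    return pearson_r(alphas, shifts)


# Made-up measurements for a single feature direction (illustrative only).
ld_base = 2.0
ld_steered = {-2: 0.4, -1: 1.2, -0.5: 1.6, 0.5: 2.4, 1: 2.8, 2: 3.6}

s = steerability(ld_base, ld_steered)
r = dose_response_r(ld_base, ld_steered)
print(f"steerability={s:.3f}, dose_response_r={r:.3f}")
```

Under the thresholds listed below, a direction like this one would count as steerable because its steerability exceeds 0.3; `steerable_fraction` is then the share of tested directions that clear that bar.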

Evidence family. Causal (intervention via activation addition).

Key metrics.

| Metric | Description | Pass threshold |
| --- | --- | --- |
| steerable_fraction | Fraction of tested directions with steerability $> 0.3$ | $> 0.20$ |
| mean_steerability | Mean steerability across tested features | reported |
| dose_response_r | Mean absolute Pearson $r$ between coefficient and behavioral shift | reported |

What it establishes. Artifact directions actually control model behavior when added as steering vectors. If a substantial fraction of feature directions produce graded, dose-responsive behavioral shifts, the artifact encodes causally relevant structure — not mere correlation.

What it does not establish. Specificity. A direction that steers behavior may also disrupt unrelated computations. The metric does not test whether steering affects only the target behavior. Combine with concept erasure (C15) or selectivity metrics for specificity evidence.

Usage.

uv run python 93_caa.py --tasks ioi --n-prompts 40

C15 — Concept Erasure / LEACE (99_concept_erasure.py)


What it computes. Implements LEACE (Least-Squares Concept Erasure; Belrose et al., NeurIPS 2023) as a dissociation test. Given an artifact adapter’s top-$k$ feature directions as a concept subspace, computes the orthogonal projection matrix via SVD:

$$P = V_r^T V_r, \quad X_{\text{erased}} = X - X \cdot P$$

where $V_r$ holds, as rows, the right singular vectors of the concept directions with non-negligible singular values. The model is re-run with erased activations via hooks, and the behavioral change is measured as the normalized reduction in logit difference.
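The projection described above can be sketched in NumPy as follows. This is a minimal sketch of the subspace-projection step only, under the assumption that the concept directions arrive as a `(k, d)` array; the function names, the relative tolerance `tol`, and the random inputs are hypothetical and do not reflect the internals of `99_concept_erasure.py`.

```python
import numpy as np


def concept_projection(directions, tol=1e-6):
    """Build P = V_r^T V_r from the row space of the concept directions.

    directions: (k, d) array with one feature direction per row.
    tol: relative cutoff for "non-negligible" singular values (an assumption).
    """
    _, s, vt = np.linalg.svd(directions, full_matrices=False)
    v_r = vt[s > tol * s.max()]  # rows: retained right singular vectors
    return v_r.T @ v_r           # (d, d) orthogonal projector


def erase(x, p):
    """X_erased = X - X P for activations x of shape (n, d)."""
    return x - x @ p


rng = np.random.default_rng(0)
directions = rng.normal(size=(3, 16))  # stand-in for top-k directions, k = 3
x = rng.normal(size=(5, 16))           # stand-in for residual-stream activations

p = concept_projection(directions)
x_erased = erase(x, p)

# After erasure, the activations have no component along any concept direction.
print(np.abs(x_erased @ directions.T).max())
```

Because `P` is an orthogonal projector, applying the erasure twice changes nothing further, which is a convenient sanity check when wiring it into a forward hook.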

Evidence family. Causal (subspace erasure intervention).

Key metrics.

| Metric | Description | Pass threshold |
| --- | --- | --- |
| dissociation_strength | $\lvert LD_{\text{clean}} - LD_{\text{erased}} \rvert / \lvert LD_{\text{clean}} \rvert$ | $> 0.3$ |
| erasure_kl | KL divergence between clean and erased output distributions | reported |
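For concreteness, the two metrics in the table can be computed from clean and erased model outputs roughly as follows. This is a sketch under stated assumptions: the source does not say which KL direction is used, so KL(clean ‖ erased) averaged over prompts is assumed here, and the function names and example logits are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def erasure_kl(clean_logits, erased_logits):
    """Mean KL(clean || erased) over prompts; the direction is an assumption."""
    lp = log_softmax(clean_logits)
    lq = log_softmax(erased_logits)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())


def dissociation_strength(ld_clean, ld_erased):
    """|LD_clean - LD_erased| / |LD_clean|."""
    return abs(ld_clean - ld_erased) / abs(ld_clean)


# Made-up logits for one prompt over a three-token vocabulary.
clean = np.array([[2.0, 0.0, -1.0]])
erased = np.array([[0.5, 0.0, -1.0]])

kl = erasure_kl(clean, erased)
ds = dissociation_strength(3.0, 1.5)
print(f"erasure_kl={kl:.4f}, dissociation_strength={ds:.2f}")
```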

What it establishes. The concept subspace defined by the artifact’s top feature directions is load-bearing: erasing it from the residual stream destroys task performance. This is a necessity test for the artifact’s feature directions — they encode information that the model actually uses.

What it does not establish. That the erased directions are the unique encoding of the concept. Other directions may also encode the same information redundantly. LEACE erases a subspace, not individual features, so the dissociation may reflect removal of multiple distinct computations that happen to share direction.

Usage.

uv run python 99_concept_erasure.py --tasks ioi --n-prompts 40

C16 — Representation Engineering / RepE (100_representation_engineering.py)


What it computes. Implements RepE (Zou et al., 2023), generalizing CAA by using PCA on the difference of positive/negative activation distributions to discover multi-component concept directions. For each task: collects residual-stream activations, splits prompts into positive/negative by median logit-diff, computes a paired contrast matrix, and applies PCA. Then steers the model with cumulative PCA components (1, then 1+2, …) and measures the logit-diff shift at each level.
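The discovery stage described above (median split, paired contrast, PCA) might look roughly like this in NumPy. It is a sketch, not the code in `100_representation_engineering.py`: the pairing rule (index order after truncating to equal group sizes), the centring step, and all names are assumptions, and the data are synthetic with a planted concept direction.

```python
import numpy as np


def repe_components(acts, logit_diffs, n_components=5):
    """PCA on a paired positive-minus-negative contrast matrix.

    acts: (n, d) residual-stream activations; logit_diffs: (n,) per-prompt LD.
    """
    median = np.median(logit_diffs)
    pos = acts[logit_diffs > median]
    neg = acts[logit_diffs <= median]
    m = min(len(pos), len(neg))                  # assumption: pair in index order
    contrast = pos[:m] - neg[:m]
    contrast = contrast - contrast.mean(axis=0)  # centre before PCA
    _, _, vt = np.linalg.svd(contrast, full_matrices=False)
    return vt[:n_components]                     # rows: unit-norm principal directions


# Synthetic data: logit difference is carried along one planted direction u.
rng = np.random.default_rng(0)
u = np.zeros(16)
u[0] = 1.0
ld = rng.normal(scale=3.0, size=200)
acts = 0.1 * rng.normal(size=(200, 16)) + ld[:, None] * u

pcs = repe_components(acts, ld)
cos = abs(pcs[0] @ u)  # alignment of PC1 with the planted direction
print(f"|cos(PC1, u)| = {cos:.3f}")
```

The same absolute-cosine computation, taken as a maximum over artifact directions instead of the planted `u`, corresponds to the `artifact_cosine_similarity` metric listed below.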

Evidence family. Causal (PCA-based concept direction discovery + steering).

Key metrics.

| Metric | Description | Pass threshold |
| --- | --- | --- |
| concept_dimensionality | Number of PCA components to reach 90% of max cumulative steering effect | $\leq 5$ |
| steerability | Max cumulative steering effect relative to baseline logit-diff | reported |
| artifact_cosine_similarity | Max absolute cosine similarity between PC1 and artifact directions | reported (if artifact provided) |
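One plausible reading of `concept_dimensionality` is the smallest number of cumulative components whose steering effect reaches 90% of the largest cumulative effect observed. The helper below is a hypothetical sketch of that reading, with made-up effect sizes.

```python
def concept_dimensionality(cumulative_effects, fraction=0.9):
    """Smallest k such that steering with components 1..k reaches
    `fraction` of the maximum cumulative effect.

    cumulative_effects[i] is the |LD shift| when steering with
    components 1..i+1 together.
    """
    target = fraction * max(cumulative_effects)
    for k, effect in enumerate(cumulative_effects, start=1):
        if effect >= target:
            return k
    return len(cumulative_effects)


# Made-up cumulative effects for 1, 1+2, 1+2+3, ... components.
effects = [0.40, 0.70, 0.95, 1.00, 1.00]
k = concept_dimensionality(effects)
print(k)
```

With these numbers the 90% target is 0.9, which is first reached with three components, so the concept would pass the $\leq 5$ threshold.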

What it establishes. Task-relevant concepts occupy low-dimensional subspaces in the residual stream. Low concept dimensionality ($\leq 5$) indicates the concept is compactly represented. When an artifact adapter is provided, the cosine similarity between the discovered PCA directions and artifact directions provides convergent validity.

What it does not establish. That the discovered directions are causally specific to the target concept. PCA captures the largest variance direction in the contrast, which may conflate the target concept with confounded features. Unlike DAS (which optimizes for causal intervention accuracy), RepE is purely observational at the discovery stage.

Usage.

uv run python 100_representation_engineering.py --tasks ioi --n-prompts 40

B21 — Steering-Bench Reliability (102_steering_reliability.py)


What it computes. Implements the Steering-Bench decomposition (Tan et al., NeurIPS 2024). The key insight is that raw steerability conflates baseline model propensity with genuine causal effect. Decomposes steering evaluation into: (1) propensity: $P(\text{correct})$ without steering; (2) raw steerability: change in $P(\text{correct})$ when adding the artifact direction; (3) propensity-corrected steerability:

$$\text{corrected} = \frac{\text{raw steerability}}{1 - \text{propensity}}$$

which corrects for ceiling effects. Also measures dose-response linearity ($R^2$ of coefficient vs. effect).
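A small worked example may help show why the correction matters. The sketch below applies the formula above to two hypothetical baselines with the same raw shift, and computes $R^2$ for a straight-line fit; the function names and all numbers are illustrative assumptions rather than output of `102_steering_reliability.py`.

```python
def corrected_steerability(p_base, p_steered):
    """Raw change in P(correct) divided by the available headroom 1 - p_base.

    Undefined when p_base == 1 (no headroom left); callers should guard for it.
    """
    raw = p_steered - p_base
    return raw / (1.0 - p_base)


def r_squared(xs, ys):
    """R^2 of an ordinary least-squares line ys ~ slope * xs + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Same raw shift (+0.2) at two different baselines (made-up values).
near_ceiling = corrected_steerability(0.6, 0.8)  # headroom 0.4
far_from_ceiling = corrected_steerability(0.1, 0.3)  # headroom 0.9

# Perfectly linear made-up dose-response curve.
r2 = r_squared([0.5, 1.0, 2.0], [0.05, 0.10, 0.20])
print(near_ceiling, far_from_ceiling, r2)
```

The same +0.2 raw shift counts for more when the model starts at 0.6 than at 0.1, because less headroom was available to begin with.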

Evidence family. Behavioral (propensity-corrected steering).

Key metrics.

| Metric | Description | Pass threshold |
| --- | --- | --- |
| corrected_steerability | Propensity-corrected steerability at optimal coefficient | $> 0.15$ |
| dose_response_r2 | $R^2$ of linear fit between steering coefficient and behavioral effect | reported |
| propensity | Baseline $P(\text{correct})$ without steering | reported |

What it establishes. The artifact direction produces genuine behavioral change beyond what the model’s baseline propensity would predict. Propensity correction prevents the artifact from receiving credit for behavior the model already exhibits without intervention.

What it does not establish. That the steering direction is the correct causal variable. A direction that passes propensity correction may still be a confounded proxy for the true mechanism. The metric quantifies the magnitude of causal effect but not its specificity.

Usage.

uv run python 102_steering_reliability.py --tasks ioi --n-prompts 30

EX15 — Cross-Model Steering Transfer (111_cross_model_transfer.py)


What it computes. Implements cross-model steering transfer (Oozeer et al., ICML 2025). For a source model A and target model B: (1) collects paired activations from both models on the same prompts at corresponding layers; (2) learns a linear mapping $M: \mathbb{R}^{d_A} \to \mathbb{R}^{d_B}$ via least-squares regression; (3) extracts a steering vector $v_A$ from model A via mean-difference between positive/negative concept prompts; (4) transfers it: $v_B = M \cdot v_A$; (5) measures whether the transferred vector produces behavioral effects correlated with the native vector’s effects across multiple steering coefficients.
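Steps (2) through (4) above can be sketched with NumPy as follows. This is an illustrative sketch on synthetic data in which model B's activations are an exact linear image of model A's, so the transfer is expected to be near-perfect; the function names, dimensions, and the absence of a bias term are assumptions, not details of `111_cross_model_transfer.py`.

```python
import numpy as np


def fit_linear_map(acts_a, acts_b):
    """Least-squares M of shape (d_B, d_A) with acts_a @ M.T ~= acts_b."""
    w, *_ = np.linalg.lstsq(acts_a, acts_b, rcond=None)  # w: (d_A, d_B)
    return w.T


def mean_difference(pos_acts, neg_acts):
    """Steering vector as the difference of class means."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
d_a, d_b, n = 8, 12, 200
true_map = rng.normal(size=(d_b, d_a))  # hidden relation between the models
acts_a = rng.normal(size=(n, d_a))      # paired activations, model A
acts_b = acts_a @ true_map.T            # paired activations, model B

m = fit_linear_map(acts_a, acts_b)

# Positive prompts are shifted along every axis in model A (made-up concept).
pos_a, neg_a = acts_a[:100] + 1.0, acts_a[100:]
v_a = mean_difference(pos_a, neg_a)
v_b = m @ v_a                           # transferred vector
v_b_native = mean_difference(pos_a @ true_map.T, neg_a @ true_map.T)

print(f"cosine(transferred, native) = {cosine(v_b, v_b_native):.4f}")
```

On real models the relation between activation spaces is only approximately linear, so the held-out fit quality and the behavioral comparison in step (5) carry the actual evidence; the cosine here only checks that the plumbing is consistent.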

Evidence family. External validity (cross-architecture generalization).

Key metrics.

| Metric | Description | Pass threshold |
| --- | --- | --- |
| transfer_fidelity | Pearson correlation between transferred and native steering effects across coefficients | $> 0.3$ |
| cosine_similarity | Cosine similarity between transferred and natively-extracted steering vectors | reported |
| mapping_r2 | $R^2$ of the learned linear mapping on held-out activations | reported |

What it establishes. The concept encoded by the steering vector is genuinely represented in both models’ activation spaces — not a model-specific artifact. If a steering vector transfers across architectures via a simple linear map, the underlying concept representation is shared and likely reflects a general computational strategy.

What it does not establish. That the transferred vector produces the same mechanistic effect. Two models may represent the same concept but implement it through different circuits. Transfer fidelity measures behavioral agreement, not mechanistic agreement.

Usage.

uv run python 111_cross_model_transfer.py --source-model gpt2 --target-model gpt2-medium --tasks ioi