Philosophy of Science — Metrics & Protocols
This page documents the metrics and protocols under the Philosophy of Science lens. These metrics formalize criteria from the philosophy of science (mechanistic explanation, causal discovery, minimality, transportability, and model complexity) to evaluate whether circuit claims meet the evidential standards that philosophers of science use to assess explanatory adequacy.
All metrics on this page address the question: does the circuit qualify as a scientific explanation of the model’s behavior? Some test structural properties (weight-space analysis, logic gates), some test causal robustness (sigma ablation, resample complement, misalignment), some test explanatory adequacy (operation specification, held-out prediction, replacement test), and some test the circuit’s place in a broader theory (minimality, transportability, causal discovery).
Causal Robustness Metrics
These metrics test whether causal claims about the circuit are robust across methods, contexts, and intervention types.
C3 — Sigma Ablation
Source: Woodward (2003), “Making Things Happen.”
Criteria: I1 Necessity (Woodward Interventionism / A04)
What it establishes: Whether the circuit’s causal role is robust across different ablation methods. Measures the coefficient of variation (CV) of faithfulness scores across 8 ablation techniques: zero, mean, resample, noise, causal_resample, soft, attn_knockout, and mean_last. Low CV means the finding is not an artifact of a particular ablation choice.
What it does not establish: Which ablation method is “correct.” The metric tests consistency, not correctness. A circuit could produce consistent results across all methods while still being mischaracterized.
Method:
- For each of 8 ablation methods, ablate all circuit heads and measure faithfulness (logit-diff recovery).
- Compute the coefficient of variation, $\mathrm{CV} = \sigma / \mu$ (standard deviation divided by mean), across the 8 faithfulness scores.
- Low CV indicates method-robust findings; high CV indicates method-dependent findings.
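The CV step above can be sketched in a few lines. This is a minimal illustration, not the implementation in `03_sigma_ablation.py`: the scores are placeholder values, the function name `coefficient_of_variation` is hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

# Placeholder faithfulness scores, one per ablation method (illustrative values only).
faithfulness_per_method = {
    "zero": 0.82, "mean": 0.90, "resample": 0.88, "noise": 0.85,
    "causal_resample": 0.87, "soft": 0.91, "attn_knockout": 0.84, "mean_last": 0.89,
}

def coefficient_of_variation(scores):
    """Return std / |mean| of the scores (population std is an assumption)."""
    mu = mean(scores)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(scores) / abs(mu)

scores = list(faithfulness_per_method.values())
mean_faithfulness = mean(scores)
cv = coefficient_of_variation(scores)
print(f"mean_faithfulness={mean_faithfulness:.3f} cv={cv:.3f}")
```

A single outlying method inflates the CV quickly, so it can help to inspect the per-method scores alongside the summary value.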
Key quantities:
- `cv`: coefficient of variation of faithfulness across ablation methods
- `faithfulness_per_method`: individual faithfulness scores for each of the 8 methods
- `mean_faithfulness`: mean across all methods
Pass condition: Report-only. Lower CV is better; CV < 0.2 indicates strong robustness.
Usage:

    uv run python 03_sigma_ablation.py --model gpt2 --device cpu
    uv run python 03_sigma_ablation.py --tasks ioi --n-prompts 50

Reading the scores:
| Pattern | What it means |
|---|---|
| CV < 0.1 | Very robust: faithfulness is nearly identical across all ablation methods |
| CV 0.1–0.2 | Robust: minor variation, results are trustworthy |
| CV 0.2–0.5 | Moderate variation: some methods give substantially different results |
| CV > 0.5 | High variation: the causal claim depends heavily on which ablation method is used |
| One method outlier | A specific ablation type interacts unusually with the circuit; investigate why |
C35 — Resample Complement
Source: Woodward (2003), “Making Things Happen”; Craver & Bechtel (2007).
Criteria: I1 Necessity (Woodward Interventionism / A04)
What it establishes: Whether non-circuit heads are genuinely uninvolved in the task. Replaces non-circuit head activations with activations from a different prompt (resampling) and measures whether the model’s task performance is preserved. If replacing non-circuit components with activations from unrelated prompts does not degrade performance, they are genuinely irrelevant.
What it does not establish: Why non-circuit heads are irrelevant. The metric confirms they can be replaced without consequence but does not explain what they are doing instead.
Method:
- Select a set of different-prompt activations for each non-circuit head.
- Replace non-circuit head activations with the resampled activations.
- Measure faithfulness (logit-diff recovery) under resampling.
- Compare to mean ablation faithfulness as a reference.
Key quantities:
- `resample_faithfulness`: faithfulness when non-circuit heads are resampled
- `mean_ablation_faithfulness`: faithfulness when non-circuit heads are mean-ablated (reference)
Pass condition: Report-only. resample_faithfulness close to 1.0 indicates genuine circuit-complement independence.
Usage:

    uv run python 35_resample_complement.py --model gpt2 --device cpu
    uv run python 35_resample_complement.py --tasks ioi --n-prompts 50

Reading the scores:
| Pattern | What it means |
|---|---|
| resample_faithfulness > 0.9 | Non-circuit heads are genuinely uninvolved: strong complement independence |
| resample_faithfulness 0.7–0.9 | Moderate complement independence: some non-circuit heads carry residual task information |
| resample_faithfulness < 0.7 | Weak complement independence: “non-circuit” heads contribute substantially |
| resample_faithfulness >> mean_ablation_faithfulness | Mean ablation is destructive in ways that resampling is not, so mean ablation overstates circuit importance |
C37 — Misalignment Score
Source: Woodward (2003), “Making Things Happen.”
Criteria: I1 Necessity (Woodward Interventionism / A04)
What it establishes: Whether the circuit passes both noising (necessity) and denoising (sufficiency) tests consistently. Misalignment = |noising_necessity - denoising_sufficiency| per head. A head that is necessary but not sufficient (or vice versa) has high misalignment, indicating that the causal claim is incomplete — the head’s role is more nuanced than simple “in the circuit / not in the circuit.”
What it does not establish: Which direction of misalignment is worse. A head that is necessary but not sufficient may be part of a distributed mechanism; a head that is sufficient but not necessary may have backup mechanisms. Both are informative.
Method:
- For each head, compute noising necessity: logit-diff drop when the head is corrupted.
- For each head, compute denoising sufficiency: logit-diff recovery when only this head is restored.
- Misalignment = |necessity - sufficiency| per head.
- Severity flag: misalignment > 0.3 for any head.
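The per-head arithmetic above is simple enough to sketch directly. The snippet below is an illustration under assumptions: the head labels and scores are invented, and `misalignment_report` is a hypothetical helper rather than a function from `37_misalignment_score.py`.

```python
SEVERITY_THRESHOLD = 0.3  # severity threshold stated in the method above

def misalignment_report(necessity, sufficiency, threshold=SEVERITY_THRESHOLD):
    """Compute |necessity - sufficiency| per head plus summary statistics."""
    per_head = {head: abs(necessity[head] - sufficiency[head]) for head in necessity}
    values = list(per_head.values())
    return {
        "misalignment_per_head": per_head,
        "mean_misalignment": sum(values) / len(values),
        "max_misalignment": max(values),
        "n_severe": sum(v > threshold for v in values),
    }

# Invented scores for three hypothetical heads.
necessity = {"L9H6": 0.80, "L9H9": 0.75, "L10H0": 0.60}
sufficiency = {"L9H6": 0.70, "L9H9": 0.30, "L10H0": 0.55}

report = misalignment_report(necessity, sufficiency)
passes = report["n_severe"] == 0  # pass only if no head exceeds the threshold
print(report, passes)
```

Here one head is necessary but far from sufficient, so the circuit would be flagged even though the mean misalignment looks moderate.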
Key quantities:
- `misalignment_per_head`: per-head |necessity - sufficiency|
- `mean_misalignment`: mean across all circuit heads
- `max_misalignment`: maximum misalignment (worst-case head)
- `n_severe`: number of heads exceeding the severity threshold (0.3)
Pass condition: misalignment < 0.3 for all heads.
Usage:

    uv run python 37_misalignment_score.py --model gpt2 --device cpu
    uv run python 37_misalignment_score.py --tasks ioi --n-prompts 50

Reading the scores:
| Pattern | What it means |
|---|---|
| All heads < 0.15 | Low misalignment: necessity and sufficiency agree; clean causal roles |
| Some heads 0.15–0.3 | Moderate misalignment: nuanced causal roles worth investigating |
| Any head > 0.3 | Severe misalignment: the head’s role cannot be captured by simple necessity/sufficiency |
| High necessity, low sufficiency | Head is needed but not enough alone: part of a distributed mechanism |
| Low necessity, high sufficiency | Head is enough alone but has backups: redundant mechanism |
C25 — Intervention Specificity
Source: Rubin (1974); Imbens & Rubin (2015).
Criteria: I3 Specificity (Rubin CATE / A03)
What it establishes: Whether the circuit’s causal effect is specific to the target task. Computes the ratio of the circuit’s effect on the target task to its mean effect on non-target tasks. High specificity means the circuit selectively affects the intended task.
What it does not establish: That the circuit has zero effect on other tasks. Some overlap is expected (shared representations); the metric quantifies the degree of selectivity.
Method:
- For the target task and several non-target tasks, ablate the circuit and measure the effect.
- Compute specificity = target_effect / mean(nontarget_effects).
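As a rough sketch of the ratio above: the function below takes effect magnitudes and divides the target effect by the mean non-target effect. The name `specificity_ratio`, the use of absolute values, and returning infinity when non-target effects are all zero are assumptions made for illustration.

```python
def specificity_ratio(target_effect, nontarget_effects):
    """target effect magnitude / mean non-target effect magnitude."""
    if not nontarget_effects:
        raise ValueError("need at least one non-target task")
    mean_nontarget = sum(abs(e) for e in nontarget_effects) / len(nontarget_effects)
    if mean_nontarget == 0:
        return float("inf")  # assumption: no measurable off-target effect
    return abs(target_effect) / mean_nontarget

# Invented ablation effects (logit-diff drops) for a target task and three other tasks.
ratio = specificity_ratio(3.0, [0.5, 1.0, 0.3])
print(f"specificity_ratio={ratio:.2f}")
```

Because the denominator is a mean, one non-target task with a large effect can be hidden by several tasks with small effects, so reporting the individual non-target effects is a reasonable companion to the ratio.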
Key quantities:
- `specificity_ratio`: target effect / mean nontarget effect
- `target_effect`: effect magnitude on the target task
- `mean_nontarget_effect`: mean effect magnitude on non-target tasks
Pass condition: Report-only. Higher ratio indicates more specific intervention.
Usage:

    uv run python 25_intervention_specificity.py --model gpt2 --device cpu
    uv run python 25_intervention_specificity.py --tasks ioi --n-prompts 50

Reading the scores:
| Pattern | What it means |
|---|---|
| Specificity > 5.0 | Highly specific: circuit selectively affects the target task |
| Specificity 2.0–5.0 | Moderately specific: target task affected more than others |
| Specificity < 2.0 | Low specificity: the circuit affects many tasks similarly |
| Specificity near 1.0 | No specificity: the circuit is a general-purpose component, not task-specific |
C6 — CATE (Conditional Average Treatment Effect)
Source: Rubin (1974); Holland (1986); Imbens & Rubin (2015).
Criteria: I3 Specificity (Rubin CATE / A03)
What it establishes: Whether the circuit’s causal effect is heterogeneous across syntactic contexts. Computes the average treatment effect (ATE) of circuit ablation and decomposes it into subgroup effects using Cohen’s d to measure heterogeneity. If the circuit’s effect varies substantially across contexts (e.g., different sentence structures), the causal claim needs qualification.
What it does not establish: Why the effect varies. CATE identifies heterogeneity but not the moderating mechanism.
Method:
- Ablate the circuit across multiple syntactic subgroups.
- Compute ATE within each subgroup.
- Compute Cohen’s d between subgroups to quantify heterogeneity.
- Report overall ATE and heterogeneity measures.
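The subgroup comparison above can be illustrated with a small sketch. It assumes per-prompt ablation effects are already available for each subgroup, uses the pooled-standard-deviation form of Cohen's d, and takes the largest absolute pairwise d as the heterogeneity measure. The subgroup names and numbers are invented, and the actual script may aggregate differently.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("Cohen's d is undefined when both groups have zero variance")
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Invented per-prompt ablation effects, grouped by a hypothetical syntactic template.
effects = {
    "template_abba": [2.1, 1.9, 2.3, 2.0],
    "template_baba": [1.2, 1.0, 1.4, 1.1],
    "template_long": [1.9, 2.2, 2.0, 2.1],
}

subgroup_ates = {name: mean(vals) for name, vals in effects.items()}
ate = mean([v for vals in effects.values() for v in vals])
heterogeneity_d = max(
    abs(cohens_d(effects[a], effects[b])) for a, b in combinations(effects, 2)
)
print(subgroup_ates, round(ate, 3), round(heterogeneity_d, 2))
```

With small subgroups the pooled variance is noisy, so a large d from a handful of prompts is better treated as a prompt to collect more data than as a settled finding.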
Key quantities:
- `ate`: overall average treatment effect
- `subgroup_ates`: per-subgroup treatment effects
- `heterogeneity_d`: Cohen’s d between the subgroups with the largest effect difference
Pass condition: Report-only. Low heterogeneity indicates a robust, context-independent effect.
Usage:

    uv run python 06_cate.py --model gpt2 --device cpu
    uv run python 06_cate.py --tasks ioi --n-prompts 50

Mechanistic Explanation Metrics
These metrics test whether the circuit qualifies as a mechanistic explanation under the MDC/Glennan framework: organized entities performing specifiable operations to produce the phenomenon.
C18 — Weight-Space Analysis
Source: Machamer, Darden & Craver (2000); Glennan (2017).
Criteria: Structural (MDC/Glennan / A05)
What it establishes: Whether the circuit’s weight matrices have structural properties consistent with its proposed function. Three sub-metrics:
- Effective rank (`C18.wqk_effective_rank`): exponential of the entropy of normalized singular values of $W_{QK} = W_Q W_K^\top$. Higher rank means the attention pattern uses more dimensions.
- Cosine alignment (`C18.cosine_alignment`): maximum cosine similarity between the top-3 SVD directions of $W_{OV} = W_V W_O$ projected through $W_U$. High alignment means the head’s output is directionally consistent with specific tokens.
- Spectral norm ratio (`C18.spectral_norm_ratio`): ratio of circuit to non-circuit spectral norms. Higher ratio means circuit heads have more “capacity” in weight space.
What it does not establish: That the weight structure is used at runtime. These are static, weight-space diagnostics that reveal structural capacity but not functional behavior.
Method:
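- For each circuit head, compute $W_{QK} = W_Q W_K^\top$ and its SVD.
- Effective rank = $\exp\left(-\sum_i p_i \log p_i\right)$ where $p_i = \sigma_i / \sum_j \sigma_j$.
- Compute $W_{OV} = W_V W_O$ and its top-3 SVD directions; project through $W_U$ (unembedding); cosine similarity with top promoted tokens.
- Spectral norm ratio = $\lVert W \rVert_2^{\text{circuit}} / \lVert W \rVert_2^{\text{non-circuit}}$.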
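The entropy-based effective rank described above can be sketched once the singular values are in hand. The snippet assumes the singular values have already been computed elsewhere (for example by an SVD routine); `effective_rank` is a hypothetical name and the inputs are toy values.

```python
from math import exp, log

def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    # Zero singular values contribute nothing to the entropy (0 * log 0 -> 0).
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * log(p) for p in probs)
    return exp(entropy)

flat = effective_rank([1.0, 1.0, 1.0, 1.0])    # all directions equally used
peaked = effective_rank([5.0, 0.0, 0.0, 0.0])  # a single dominant direction
skewed = effective_rank([4.0, 1.0, 0.5, 0.1])  # somewhere in between
print(flat, peaked, skewed)
```

The measure ranges from 1 (one dominant direction) up to the number of singular values (a flat spectrum), which is what makes it usable as a continuous stand-in for matrix rank.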
Key quantities:
- `wqk_effective_rank`: effective dimensionality of the QK circuit
- `cosine_alignment`: directional consistency of OV output with token embeddings
- `spectral_norm_ratio`: relative weight magnitude of circuit vs non-circuit
Pass condition: Report-only (CPU-only weight-space diagnostics).
Usage:

    uv run python 18_weight_extended.py --model gpt2 --device cpu

Reading the scores:
| Pattern | What it means |
|---|---|
| High effective rank | Head uses many dimensions for attention — complex query-key interaction |
| Low effective rank | Head attends to a low-dimensional subspace — simple, interpretable attention pattern |
| High cosine alignment | OV circuit strongly promotes specific tokens — consistent with a “lookup” operation |
| High spectral norm ratio | Circuit heads have disproportionately large weights — capacity argument for importance |
C19 — Logic Gates
Source: Machamer, Darden & Craver (2000); Glennan (2017).
Criteria: C2 Structural Plausibility (MDC/Glennan / A05)
What it establishes: How pairs of circuit heads interact: AND (superadditive — both needed), OR (redundant — either sufficient), NOT (inhibitory — one suppresses the other), or ADDITIVE (independent contributions). This reveals the circuit’s computational structure — whether it operates as a serial pipeline, parallel redundant system, or mixed architecture.
What it does not establish: Whether the interaction classification is complete. Pairwise analysis misses higher-order interactions among three or more heads.
Method:
- For each pair of circuit heads, measure logit-diff under four conditions:
  - Both active (clean)
  - Only head A ablated
  - Only head B ablated
  - Both ablated
- Classify the interaction:
  - AND/superadditive: joint effect > sum of individual effects
  - OR/redundant: joint effect < sum of individual effects
  - NOT/inhibitory: one head’s effect reverses when the other is present
  - ADDITIVE: joint effect ≈ sum of individual effects
- Report noising vs denoising completeness delta (consistency between the two test directions).
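One way to turn the four measurements above into a label is sketched below. It rests on an interpretive assumption: each head's "effect" is read as the logit-diff gain it provides over the both-ablated baseline, which is the reading under which an AND gate is superadditive and an OR gate is subadditive. The function name, the relative tolerance `rel_tol`, and the example numbers are all hypothetical, and `19_logic_gates.py` may define these quantities differently.

```python
def classify_pair(ld_clean, ld_a_ablated, ld_b_ablated, ld_both_ablated, rel_tol=0.1):
    """Classify a head pair as AND, OR, NOT, or ADDITIVE from four logit diffs."""
    gain_a_alone = ld_b_ablated - ld_both_ablated  # A active, B ablated
    gain_b_alone = ld_a_ablated - ld_both_ablated  # B active, A ablated
    joint = ld_clean - ld_both_ablated             # both active
    gain_a_with_b = ld_clean - ld_a_ablated        # A's contribution when B is present
    gain_b_with_a = ld_clean - ld_b_ablated        # B's contribution when A is present
    tol = rel_tol * max(abs(joint), abs(gain_a_alone), abs(gain_b_alone), 1e-9)

    def reverses(alone, with_other):
        # A sign flip only counts if both contributions are clearly non-zero.
        return abs(alone) > tol and abs(with_other) > tol and alone * with_other < 0

    if reverses(gain_a_alone, gain_a_with_b) or reverses(gain_b_alone, gain_b_with_a):
        return "NOT"
    interaction = joint - (gain_a_alone + gain_b_alone)
    if abs(interaction) <= tol:
        return "ADDITIVE"
    return "AND" if interaction > 0 else "OR"

print(classify_pair(4.0, 0.0, 0.0, 0.0))  # neither head helps alone
print(classify_pair(4.0, 4.0, 4.0, 0.0))  # either head alone recovers everything
print(classify_pair(4.0, 2.0, 2.0, 0.0))  # contributions simply add up
print(classify_pair(1.0, 3.0, 2.0, 0.0))  # each head hurts when the other is present
```

If effects are instead measured as drops from the clean run, the inequality for AND and OR flips, so it is worth stating which baseline a reported classification uses.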
Key quantities:
- `interaction_counts`: number of head pairs classified as AND, OR, NOT, ADDITIVE
- `completeness_delta`: |noising_completeness - denoising_completeness|
Pass condition: Report-only. Low completeness_delta indicates consistent classification.
Usage:

    uv run python 19_logic_gates.py --model gpt2 --device cpu
    uv run python 19_logic_gates.py --tasks ioi --n-prompts 50

Reading the scores:
| Pattern | What it means |
|---|---|
| Mostly AND | Heads work synergistically — removing any one breaks the circuit |
| Mostly OR | Heads are redundant — the circuit is robust to single-head failures |
| Mixed AND/OR | Multi-path architecture with some critical nodes and some backup paths |
| High completeness_delta | Noising and denoising give different interaction patterns — results are fragile |
F1 — Operation Specification
Source: Machamer, Darden & Craver (2000); Glennan (2017).
Criteria: MDC/Glennan / A05
What it establishes: Whether each circuit head performs a specifiable, consistent operation. Measures two aspects: (1) output consistency — first principal component variance ratio across prompts (does the head always produce output in the same direction?), and (2) attention-weighted OV prediction — R-squared of predicting head output from its attention-weighted OV circuit (is the head’s operation well-described by its attention pattern + OV matrix?).
What it does not establish: What the operation is in human-interpretable terms. The metric tests consistency and predictability, not interpretability.
Method:
- Run the model on task prompts, collecting each circuit head’s output.
- Stack outputs across prompts, compute PCA. First PC variance ratio = output consistency.
- Predict head output from attention_pattern @ V @ W_O. R-squared = OV prediction quality.
- Pass if circuit heads have higher consistency and prediction than non-circuit heads.
Key quantities:
- `output_consistency`: first PC variance ratio (higher = more consistent operation)
- `ov_prediction_r2`: R-squared of attention-weighted OV prediction
Pass condition: Circuit > non-circuit baseline on both measures.
Usage:

    uv run python 70_operation_specification.py --model gpt2 --device cpu

F2 — Held-Out Prediction
Source: Machamer, Darden & Craver (2000); Glennan (2017).
Criteria: MDC/Glennan / A05
What it establishes: Whether the circuit’s operations generalize to held-out data. Trains a characterization of each head’s operation on a training split (principal direction of activation) and tests whether that characterization predicts behavior on a test split.
What it does not establish: Whether the generalization extends to out-of-distribution inputs. The metric tests held-out generalization within the same distribution.
Method:
- Split prompts into train and test sets.
- On train: compute principal direction of each head’s activation magnitude.
- On test: predict activation magnitude from train-derived principal direction.
- Report Pearson r between predicted and actual magnitudes.
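The final scoring step above is a plain Pearson correlation. The sketch below covers only that step and assumes the predicted and actual test-split magnitudes have already been computed; the numbers are placeholders and `pearson_r` here is a hand-rolled helper, not the script's function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Placeholder magnitudes on a held-out split: predictions from the train-derived
# direction versus what the head actually produced.
predicted = [0.9, 1.4, 2.1, 2.8]
actual = [1.0, 1.5, 2.0, 3.0]
r = pearson_r(predicted, actual)
print(f"pearson_r={r:.3f}")
```

Pearson r ignores scale and offset, so a high value shows that the characterization tracks the ordering and spread of magnitudes, not that it predicts their absolute size.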
Key quantities:
- `pearson_r`: correlation between predicted and actual activation magnitudes on held-out data
Pass condition: Report-only. Higher Pearson r indicates better generalization.
Usage:

    uv run python 71_held_out_prediction.py --model gpt2 --device cpu

F3 — Replacement Test
Source: Machamer, Darden & Craver (2000); Glennan (2017).
Criteria: MDC/Glennan / A05
What it establishes: Whether circuit heads can be replaced by simplified approximations without losing task performance. Two variants:
- Constant replacement (`F3.replacement_constant`): replace head output with its mean activation across prompts. If recovery remains high, the head contributes only a constant bias, not prompt-specific computation.
- Linear OV replacement (`F3.replacement_linear_ov`): replace with `resid_pre @ W_V` (ignoring the attention pattern). If recovery remains high, the attention pattern is not doing useful work.
What it does not establish: Whether the replacement captures the same computation. Low recovery under replacement means the head’s full computation is needed; high recovery means a simpler model suffices (possibly the circuit is overspecified).
Method:
- For each circuit head, replace its output with the simplified version.
- Measure recovery = replaced_logit_diff / clean_logit_diff.
- Compare circuit heads vs non-circuit heads.
Key quantities:
- `constant_recovery`: logit-diff recovery under mean-activation replacement
- `linear_ov_recovery`: logit-diff recovery under attention-free OV replacement
Pass condition: Report-only. Low recovery means the head’s full computation is needed (good for the circuit claim).
Usage:

    uv run python 72_replacement_test.py --model gpt2 --device cpu

Reading the scores:
| Pattern | What it means |
|---|---|
| Low constant_recovery | Head contributes prompt-specific information — not just a bias |
| High constant_recovery | Head contributes a near-constant output — may not need to be in the circuit |
| Low linear_ov_recovery | Attention pattern matters — the head is doing nontrivial routing |
| High linear_ov_recovery | Attention pattern is irrelevant — head output is determined by OV alone |
A1 — Procedure Specification
Source: Machamer, Darden & Craver (2000); Glennan (2017).
Criteria: MDC/Glennan / A05
What it establishes: Whether information gain is monotonically ordered along pathway chains in the circuit. If the circuit is a genuine procedure, information should flow in a specifiable order: earlier components should contribute less to the final answer than later components, producing monotonic information gain along the computational pathway.
What it does not establish: That the ordering is the unique correct one. Multiple valid orderings may exist if the circuit has parallel paths.
Method:
- Identify pathway chains through the circuit (sequences of heads connected by residual stream).
- For each chain, measure cumulative information gain (logit-diff recovery) at each step.
- Compute ordering_score = fraction of chains showing monotonic information gain.
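A minimal sketch of the scoring step above, assuming the cumulative recovery at each step of each chain has already been measured. The chain names and values are invented, and treating a plateau (equal consecutive values) as monotonic is an assumption.

```python
def is_monotonic(gains):
    """True if cumulative gain never decreases along the chain."""
    return all(later >= earlier for earlier, later in zip(gains, gains[1:]))

def ordering_score(chains):
    """Fraction of pathway chains whose cumulative gain is monotonic."""
    if not chains:
        raise ValueError("need at least one pathway chain")
    return sum(is_monotonic(gains) for gains in chains.values()) / len(chains)

# Invented cumulative logit-diff recovery after each head in four hypothetical chains.
chains = {
    "chain_a": [0.1, 0.4, 0.9],
    "chain_b": [0.2, 0.1, 0.8],  # dips at the second step, so not monotonic
    "chain_c": [0.0, 0.3, 0.3],
    "chain_d": [0.1, 0.5, 0.7],
}
score = ordering_score(chains)
print(score, score > 0.7)
```

Measurement noise can break strict monotonicity on short chains, so a small tolerance in `is_monotonic` may be worth considering if scores hover near the threshold.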
Key quantities:
- `ordering_score`: fraction of pathway chains with monotonic information gain
Pass condition: ordering_score > 0.7.
Usage:

    uv run python 77_procedure_specification.py --model gpt2 --device cpu

A2 — Composition Test
Source: Machamer, Darden & Craver (2000); Glennan (2017).
Criteria: MDC/Glennan / A05
What it establishes: Whether the circuit’s pathways compose into a functioning whole. Tests pathway-level complement ablation: ablating the complement of each pathway and measuring whether the pathway alone produces meaningful output.
What it does not establish: That the pathways are independent. Composition testing confirms that pathways contribute, not that they operate without interaction.
Method:
- Identify pathways through the circuit.
- For each pathway, ablate everything outside the pathway and measure logit-diff recovery.
- Report full_circuit recovery and max_single_pathway recovery.
Key quantities:
- `full_circuit_recovery`: logit-diff recovery with the full circuit active
- `max_single_pathway`: best single-pathway recovery
Pass condition: `full_circuit_recovery` > 0.30 OR `max_single_pathway` > 0.20.
Usage:

    uv run python 78_composition_test.py --model gpt2 --device cpu

Minimality and INUS Metrics
These metrics test whether the circuit is minimal (whether all its components are genuinely needed) and how each component relates to sufficient conditions for the task.
C4b — Minimality Classification
Source: Hadad, Katz & Bassan (ICLR 2026).
Criteria: Minimality
What it establishes: The minimality class of the circuit: how close it is to containing only necessary components.
- Subset minimal (1.0): no proper subset is sufficient — every head is strictly needed.
- Locally minimal (0.75): no single head can be removed without breaking sufficiency — but subsets of 2+ might be removable.
- Quasi minimal (0.5): removing some heads preserves sufficiency, but not many.
- Not minimal (0.0): many heads can be removed — the circuit is substantially overspecified.
What it does not establish: The unique minimal subset. Multiple minimal subcircuits may exist (the circuit may have redundant pathways that are each individually minimal).
Method:
- Test all single-head removals: if any preserves sufficiency, the circuit is not locally minimal.
- If locally minimal, test all pairwise removals: if any preserves sufficiency, the circuit is not subset minimal.
- Classify accordingly.
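The removal tests above can be sketched against any sufficiency predicate. In the snippet below, `is_sufficient` stands in for "logit-diff recovery stays above threshold with only these heads", the toy predicates are invented, and `quasi_max_fraction` is a hypothetical cut-off for separating quasi minimal from not minimal, since the text does not give a numeric boundary. Following the method as stated, subset minimality is only checked up to pairwise removals.

```python
from itertools import combinations

def classify_minimality(heads, is_sufficient, quasi_max_fraction=0.25):
    """Return (class_name, score) from single-head and pairwise removal tests."""
    heads = list(heads)
    full = set(heads)
    removable = [h for h in heads if is_sufficient(full - {h})]
    if not removable:
        # Locally minimal: no single head can go. Now try removing pairs.
        for pair in combinations(heads, 2):
            if is_sufficient(full - set(pair)):
                return "locally_minimal", 0.75
        return "subset_minimal", 1.0
    # Hypothetical boundary: "some but not many" heads are individually removable.
    if len(removable) / len(heads) <= quasi_max_fraction:
        return "quasi_minimal", 0.5
    return "not_minimal", 0.0

# Toy predicates over hypothetical heads A-D.
print(classify_minimality("AB", lambda s: {"A", "B"} <= s))
print(classify_minimality("ABCD", lambda s: s in ({"A", "B", "C", "D"}, {"A", "B"})))
print(classify_minimality("ABCD", lambda s: {"A", "B", "C"} <= s))
print(classify_minimality("ABCD", lambda s: "A" in s))
```

The second example shows why the pairwise step matters: two heads that only hurt when one is present without the other pass every single-removal test yet can be dropped together.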
Key quantities:
- `minimality_class`: one of `subset_minimal` (1.0), `locally_minimal` (0.75), `quasi_minimal` (0.5), `not_minimal` (0.0)
Pass condition: At least locally_minimal (>= 0.75).
Usage:

    uv run python C4b_minimality_class.py --model gpt2 --device cpu
    uv run python C4b_minimality_class.py --tasks ioi --n-prompts 50

Reading the scores:
| Pattern | What it means |
|---|---|
| Subset minimal (1.0) | Every head is strictly needed — tightest possible circuit |
| Locally minimal (0.75) | No single head is redundant, but some subsets might be |
| Quasi minimal (0.5) | Some redundancy exists but circuit is not grossly overspecified |
| Not minimal (0.0) | Circuit contains many unnecessary heads — needs pruning |
C39 — INUS Conditions
Source: Mackie (1965), “Causes and Conditions.”
Criteria: Regularity / INUS (A10)
What it establishes: Whether each head is an INUS condition — Insufficient but Necessary part of an Unnecessary but Sufficient condition. Finds minimal sufficient subcircuits and classifies each head’s relationship to them:
- Necessary: present in all sufficient subcircuits.
- INUS: present in some but not all sufficient subcircuits.
- Non-redundant necessary: necessary within its subcircuit but the subcircuit itself is not unique.
- Redundant: can be removed from all sufficient subcircuits without breaking sufficiency.
What it does not establish: Whether the sufficient subcircuits are complete. INUS analysis is relative to the heads in the defined circuit; heads outside the circuit are not considered.
Method:
- Enumerate candidate subcircuits (subsets of circuit heads).
- Test each for sufficiency (logit-diff recovery above threshold, default 0.7).
- Find minimal sufficient subcircuits (no proper subset is also sufficient).
- Classify each head based on its membership pattern across minimal sufficient sets.
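The enumeration and classification steps above can be sketched by brute force for a small circuit. Everything here is illustrative: `recovery` is a toy stand-in for a measured logit-diff recovery, the head names are invented, and the sketch covers only the necessary, INUS, and redundant labels (the "non-redundant necessary" class is left out because its exact membership rule is not spelled out above). The search is exponential in the number of heads, so it only suits small circuits.

```python
from itertools import combinations

def minimal_sufficient_sets(heads, recovery, threshold=0.7):
    """Enumerate subsets by increasing size; keep sufficient sets with no sufficient subset."""
    heads = list(heads)
    minimal = []
    for size in range(1, len(heads) + 1):
        for subset in combinations(heads, size):
            candidate = frozenset(subset)
            if any(found <= candidate for found in minimal):
                continue  # contains a smaller sufficient set, so it is not minimal
            if recovery(candidate) > threshold:
                minimal.append(candidate)
    return minimal

def classify_heads(heads, minimal_sets):
    """Label each head by how many minimal sufficient sets contain it."""
    labels = {}
    for head in heads:
        count = sum(head in m for m in minimal_sets)
        if minimal_sets and count == len(minimal_sets):
            labels[head] = "necessary"
        elif count > 0:
            labels[head] = "inus"
        else:
            labels[head] = "redundant"
    return labels

# Toy recovery: the task works if A is present together with B or C.
def recovery(subset):
    return 0.9 if {"A", "B"} <= subset or {"A", "C"} <= subset else 0.2

heads = ["A", "B", "C", "D"]
minimal = minimal_sufficient_sets(heads, recovery)
labels = classify_heads(heads, minimal)
print(sorted(sorted(m) for m in minimal), labels)
```

In this toy case A appears in every minimal sufficient set, B and C each appear in one of the two, and D appears in none.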
Key quantities:
- `head_classifications`: per-head INUS classification
- `n_minimal_sufficient`: number of distinct minimal sufficient subcircuits found
Pass condition: Report-only.
Usage:

    uv run python 39_inus_conditions.py --model gpt2 --device cpu
    uv run python 39_inus_conditions.py --tasks ioi --sufficiency-threshold 0.7

C40 — Actual Causation (Halpern-Pearl)
Source: Halpern & Pearl (2005), “Causes and Explanations: A Structural-Model Approach.”
Criteria: Actual Causation (A11)
What it establishes: Whether each head is an actual cause of the task behavior under the Halpern-Pearl definition (AC1—AC3). Unlike standard counterfactual tests, actual causation handles preemption (a backup mechanism would have produced the same result) and overdetermination (multiple sufficient causes). Detects heads that are actual causes but not standard-necessary (indicating backup mechanisms exist).
What it does not establish: A complete causal model of the task. Actual causation is context-specific — a head may be an actual cause in one context but not another.
Method:
- For each head, test AC1 (the head’s value and the outcome both occurred).
- Test AC2 (there exists a “witness set” of other variables such that changing the head’s value, while holding the witness set fixed, changes the outcome).
- Test AC3 (minimality: no proper subset of the candidate cause satisfies AC1 and AC2).
- Report which heads satisfy all three conditions (actual causes) and which satisfy AC2 but not standard necessity (backup-protected causes).
Key quantities:
- `n_actual_causes`: number of heads satisfying AC1-AC3
- `n_backup_protected`: number of heads that are actual causes but not standard-necessary
- `witness_sets`: the witness sets that establish AC2 for each actual cause
Pass condition: Report-only.
Usage:

    uv run python 40_actual_causation.py --model gpt2 --device cpu
    uv run python 40_actual_causation.py --tasks ioi --n-prompts 50

Reading the scores:
| Pattern | What it means |
|---|---|
| All circuit heads are actual causes | Strong support — every head is genuinely causal |
| n_backup_protected > 0 | Some heads have backup mechanisms; standard ablation would miss their importance |
| Few actual causes | Most circuit heads are not actual causes in the HP sense — circuit may be overspecified |
Protocols
Protocol A09 — MDL/SLT
Source: Rissanen (1978), “Modeling by shortest data description”; Watanabe (2009), “Algebraic Geometry and Statistical Learning Theory.”
Framework: Minimum Description Length and Singular Learning Theory. Tests whether the circuit is a parsimonious description of the model’s behavior (MDL) and whether its learning coefficient indicates regular or singular learning dynamics (SLT).
Metrics and thresholds:
| Metric | Threshold |
|---|---|
| `hyperparam_sensitivity` | < 0.2 |
| `llc` | > 0.0 |
What it establishes: Whether the circuit represents a compressed, stable description of the model’s task behavior. Low hyperparameter sensitivity means the circuit’s quality does not depend critically on evaluation parameters. Positive LLC (local learning coefficient from SLT) indicates the circuit has learnable structure.
What it does not establish: Whether the circuit is the minimum-length description. MDL provides a relative comparison (is this description shorter than alternatives?), not an absolute optimality guarantee.
Protocol A12 — Causal Transportability
Source: Pearl & Bareinboim (2011), “Transportability of Causal and Statistical Relations.”
Framework: Tests whether circuit findings transport across models. A circuit claim is more credible if the same circuit structure (or its analogue) appears in independently trained models.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
| `cross_model_invariance` | > 0.5 |
What it establishes: Whether the circuit is a property of the task rather than an idiosyncrasy of a particular trained model. High cross-model invariance means the circuit structure is transportable — it generalizes beyond the specific model it was discovered in.
What it does not establish: That the circuit is “fundamental” to the task. The same task might be solved by different circuits in different model families (e.g., attention-based vs MLP-based solutions).
Protocol A13 — Causal Discovery
Source: Zheng et al. (2018), “DAGs with NO TEARS”; Spirtes, Glymour & Scheines (2000).
Framework: Automated causal structure learning. Uses continuous optimization methods (NOTEARS) to discover causal relationships between circuit components from observational data.
Metrics and thresholds:
| Metric | Threshold |
|---|---|
| `notears` | > 0.5 |
What it establishes: Whether automated causal discovery algorithms recover a DAG structure consistent with the proposed circuit. If NOTEARS independently identifies the same edges, this is convergent evidence for the circuit’s causal structure.
What it does not establish: That the discovered DAG is the true causal graph. NOTEARS recovers structure from statistical dependencies, which may not correspond to genuine causal relationships (faithfulness assumption violations, latent confounders).
Summary Table
| Metric ID | Name | Criteria | Evidence Family | Pass Condition |
|---|---|---|---|---|
| C3 | Sigma Ablation | I1 Necessity | Causal | Report-only (CV < 0.2 preferred) |
| C6 | CATE | I3 Specificity | Causal | Report-only |
| C18.wqk | WQK Effective Rank | Structural | Weight-space | Report-only |
| C18.cos | Cosine Alignment | Structural | Weight-space | Report-only |
| C18.snr | Spectral Norm Ratio | Structural | Weight-space | Report-only |
| C19 | Logic Gates | C2 Structural Plausibility | Structural | Report-only |
| C25 | Intervention Specificity | I3 Specificity | Causal | Report-only |
| C35 | Resample Complement | I1 Necessity | Causal | Report-only |
| C37 | Misalignment Score | I1 Necessity | Causal | misalignment < 0.3 |
| C39 | INUS Conditions | INUS | Structural-Causal | Report-only |
| C40 | Actual Causation | Actual Causation | Causal | Report-only |
| C4b | Minimality Classification | Minimality | Structural | >= locally_minimal |
| F1 | Operation Specification | MDC/Glennan | Mechanistic | Circuit > baseline |
| F2 | Held-Out Prediction | MDC/Glennan | Mechanistic | Report-only |
| F3.const | Replacement (Constant) | MDC/Glennan | Mechanistic | Report-only |
| F3.ov | Replacement (Linear OV) | MDC/Glennan | Mechanistic | Report-only |
| A1 | Procedure Specification | MDC/Glennan | Mechanistic | ordering_score > 0.7 |
| A2 | Composition Test | MDC/Glennan | Mechanistic | full_circuit > 0.30 or pathway > 0.20 |
| p_a09 | MDL/SLT | Complexity | Protocol | sensitivity < 0.2, LLC > 0.0 |
| p_a12 | Transportability | Cross-model | Protocol | invariance > 0.5 |
| p_a13 | Causal Discovery | NOTEARS | Protocol | notears > 0.5 |
Connection to Philosophy of Science Lens
The philosophy of science lens is documented at the Philosophy of Science lens page. The core insight is that circuit claims are scientific explanations, and scientific explanations must meet standards developed over centuries of philosophical analysis.
The metrics on this page operationalize those standards:
- Causal robustness metrics (C3, C6, C25, C35, C37) implement Woodward’s interventionist criteria (C3, C35, C37) and Rubin’s treatment-effect criteria (C6, C25): stability across methods, proportionality of interventions, and invariance across contexts.
- Mechanistic explanation metrics (C18, C19, F1, F2, F3, A1, A2) implement MDC/Glennan’s framework — specifiable operations, predictive power, non-trivial mechanisms, ordered procedures, and compositional structure.
- Minimality and INUS metrics (C4b, C39, C40) formalize what it means for a circuit to contain “just the right components” — drawing on Mackie’s INUS conditions and Halpern-Pearl actual causation.
- Protocols (A09, A12, A13) test the circuit against broader explanatory desiderata — parsimony (MDL/SLT), transportability (Pearl-Bareinboim), and independent causal discovery (NOTEARS).