Everything available for formatting pages in this site.
Inline status indicators:
Validated, Proposed, Disconfirmed, Triangulated, Underdetermined, Default

Badges also work in sidebar navigation via frontmatter (see the Verdicts section).
Causal Instruments
Pearl SCM, Rubin CATE, Woodward interventionism, mediation analysis, and more.
Structural Instruments
SVD spectral analysis, OV/QK decomposition, weight alignment, effective rank.
Information Instruments
Transfer entropy, PID, mutual information, NOTEARS discovery.
Behavioral Instruments
Logit diff recovery, KL divergence, generalization gap, MDL compression.
Activation patching: Replace activations at a specific position with those from a counterfactual input. Measures the necessity of the component for the behavior.
Causal scrubbing: Recursively verify that each node in a computational graph contributes only the information attributed to it by the hypothesis.
Interchange Intervention Accuracy (IIA): swap a single representation between inputs and measure whether the behavior follows the swapped variable.
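To make the interchange intervention concrete, here is a minimal sketch on a toy two-stage model rather than a real network. Everything in it (`run_model`, the hidden variable `x + y`, and the two hypotheses) is an illustrative assumption, not part of any library or of this site's tooling.

```python
from itertools import product


def run_model(x, y, z, patched_hidden=None):
    """Toy model: hidden = x + y, output = hidden + z.

    Passing patched_hidden overrides the hidden value, which is the
    interchange intervention.
    """
    hidden = x + y if patched_hidden is None else patched_hidden
    return hidden, hidden + z


def interchange_intervention_accuracy(pairs, expected_output):
    """Fraction of (base, source) pairs whose patched output matches
    the output predicted by a hypothesis about the hidden variable."""
    hits = 0
    for base, source in pairs:
        source_hidden, _ = run_model(*source)  # cache from the source run
        _, patched_out = run_model(*base, patched_hidden=source_hidden)
        hits += int(patched_out == expected_output(base, source))
    return hits / len(pairs)


# Hypothesis 1 (correct for this toy model): the hidden value encodes x + y.
def hidden_is_sum(base, source):
    return source[0] + source[1] + base[2]


# Hypothesis 2 (wrong for this toy model): the hidden value encodes x only.
def hidden_is_x_only(base, source):
    return source[0] + base[1] + base[2]


inputs = list(product(range(3), repeat=3))
pairs = [(base, source) for base in inputs for source in inputs]

iia_sum = interchange_intervention_accuracy(pairs, hidden_is_sum)
iia_x_only = interchange_intervention_accuracy(pairs, hidden_is_x_only)
```

On this toy setup the correct hypothesis scores 1.0, while the wrong one matches only when base and source happen to share the same `y`, which is one third of the pairs. The gap between the two scores is what makes the swap informative.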
Basic syntax highlighting:
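    def compute_iia(model, clean_input, patch_input, target_idx,
                    hook_point="blocks.0.hook_resid_pre"):
        """Interchange intervention accuracy."""
        # Assumes a TransformerLens-style model, where run_with_cache returns
        # (logits, cache). hook_point is an illustrative default.
        _, patch_cache = model.run_with_cache(patch_input)

        # Re-run the clean input, overwriting the activation at hook_point
        # with the one cached from the patch run.
        patched_logits = model.run_with_hooks(
            clean_input,
            fwd_hooks=[(hook_point, lambda act, hook: patch_cache[hook.name])],
        )
        return (patched_logits.argmax(-1) == target_idx).float().mean()

With line highlighting and title: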
    def bidirectional_logit_lens(model, factor_bank, tokenizer):
        W_U = model.W_U  # (d_model, d_vocab)
        # Assumes factor_bank.factors has shape (n_factors, d_model).
        factor_logits = factor_bank.factors @ W_U
        top_tokens = factor_logits.topk(10, dim=-1)
        bottom_tokens = factor_logits.topk(10, dim=-1, largest=False)
        # The .indices of each result can be decoded with tokenizer.
        return top_tokens, bottom_tokens

Diff format:
    # Before: uncalibrated
    -score = compute_iia(model, clean, patch, target)
    +score = compute_iia(model, clean, patch, target) - baseline_iia

| Verdict | Tier | Requirements |
|---|---|---|
| Proposed | A | Construct defined, one instrument chosen |
| Causally suggestive | B | One causal instrument with baseline separation |
| Mechanistically supported | C | Necessity + sufficiency + measurement calibration |
| Triangulated | D | Three evidence families converge |
| Validated | E | All five validity types pass threshold |
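
The tier table can be read as a decision ladder. The sketch below is one way to encode it. The `Evidence` fields and the `assign_verdict` function are hypothetical names, and treating each tier as requiring all lower tiers is an assumption that the table itself does not state.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical record of what has been established for a circuit claim."""
    construct_defined: bool = False
    instrument_chosen: bool = False
    causal_with_baseline_separation: bool = False
    necessity: bool = False
    sufficiency: bool = False
    measurement_calibrated: bool = False
    converging_families: int = 0
    validity_types_passed: int = 0


def assign_verdict(ev):
    """Return (verdict, tier) for the highest tier reached, or None.

    Assumption: tiers are cumulative, so a claim cannot skip a tier.
    """
    if not (ev.construct_defined and ev.instrument_chosen):
        return None
    verdict = ("Proposed", "A")
    if not ev.causal_with_baseline_separation:
        return verdict
    verdict = ("Causally suggestive", "B")
    if not (ev.necessity and ev.sufficiency and ev.measurement_calibrated):
        return verdict
    verdict = ("Mechanistically supported", "C")
    if ev.converging_families < 3:
        return verdict
    verdict = ("Triangulated", "D")
    if ev.validity_types_passed < 5:
        return verdict
    return ("Validated", "E")


example = assign_verdict(
    Evidence(
        construct_defined=True,
        instrument_chosen=True,
        causal_with_baseline_separation=True,
    )
)
```

Here `example` is `("Causally suggestive", "B")`: the claim has a causal instrument with baseline separation but no necessity, sufficiency, or calibration evidence yet.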
The framework draws on Bradford Hill’s criteria for causal inference[^1], adapted for the computational setting where controlled intervention is possible[^2].
Key insight: A circuit claim is not binary; it is a pattern across five validity dimensions. The framework makes that pattern visible.
Math rendering requires remark-math + rehype-katex plugins. Once installed, inline math uses single dollar signs and display math uses double:
Inline: $IIA = E[1(f(x) = y)]$
Display:

$$\text{IIA} = \mathbb{E}[\mathbf{1}(f(x_{\text{patched}}) = y_{\text{target}})]$$

Images can be placed in src/assets/ and referenced with standard markdown:
Potential additional plugins/features to explore:
[[page-name]] syntax that resolves to proper links
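
As a rough illustration of what resolving `[[page-name]]` involves, the sketch below rewrites wiki-style links into standard markdown links. It is written in Python only to show the transformation and is not how a site plugin would be implemented. The `resolve_wikilinks` function, the `known_pages` set, and the `/page-name/` URL pattern are all assumptions.

```python
import re

# Matches [[page-name]] where the name contains no square brackets.
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def resolve_wikilinks(text, known_pages, base="/"):
    """Replace [[page-name]] with a markdown link when the page exists.

    Unknown page names are left untouched so broken links stay visible.
    The base path and trailing slash are a hypothetical URL scheme.
    """
    def replace(match):
        target = match.group(1).strip()
        if target not in known_pages:
            return match.group(0)
        return f"[{target}]({base}{target}/)"

    return WIKILINK.sub(replace, text)


pages = {"activation-patching", "verdicts"}
resolved = resolve_wikilinks(
    "See [[activation-patching]] and [[missing-page]].", pages
)
```

Leaving unresolved names as literal `[[...]]` text is a deliberate choice in this sketch: it makes dangling references easy to spot during review.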