
Everything available for formatting pages on this site.


Inline status indicators:

Validated, Proposed, Disconfirmed, Triangulated, Underdetermined, Default

Badges also work in sidebar navigation via frontmatter (see the Verdicts section).


Causal Instruments

Pearl SCM, Rubin CATE, Woodward interventionism, mediation analysis, and more.

Structural Instruments

SVD spectral analysis, OV/QK decomposition, weight alignment, effective rank.

Information Instruments

Transfer entropy, PID, mutual information, NOTEARS discovery.

Behavioral Instruments

Logit diff recovery, KL divergence, generalization gap, MDL compression.



Replace activations at a specific position with those from a counterfactual input. Measures necessity of the component for the behavior.
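A minimal sketch of the patching idea on a toy model, in plain Python. The function names (`layer1`, `readout`, `run`, `patching_effect`) and the weights are hypothetical and chosen only for illustration; a real model would expose activations through hooks instead.

```python
# Toy stand-in for a model: layer1 produces one activation per position and
# readout maps those activations to a scalar output. All names are hypothetical.

def layer1(tokens):
    """Per-position hidden activations (here a simple affine map)."""
    return [2.0 * t + 1.0 for t in tokens]

def readout(acts):
    """Scalar output computed from the hidden activations."""
    weights = [0.5, -1.0, 2.0]
    return sum(w * a for w, a in zip(weights, acts))

def run(tokens, patch=None):
    """Run the toy model, optionally overwriting one position's activation.

    patch is a (position, value) pair, or None for an unpatched run.
    """
    acts = layer1(tokens)
    if patch is not None:
        position, value = patch
        acts[position] = value
    return readout(acts)

def patching_effect(clean, counterfactual, position):
    """Change in output when one clean activation is replaced by the
    activation the counterfactual input produces at the same position."""
    counterfactual_acts = layer1(counterfactual)
    clean_out = run(clean)
    patched_out = run(clean, patch=(position, counterfactual_acts[position]))
    return patched_out - clean_out

clean = [1.0, 2.0, 3.0]
counterfactual = [1.0, 2.0, 0.0]  # differs from clean only at position 2
effects = [patching_effect(clean, counterfactual, p) for p in range(3)]
print(effects)
```

Patching positions 0 and 1 leaves the output unchanged because the two inputs agree there. Only position 2 shifts the output, which is the sense in which the patch probes necessity.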




Basic syntax highlighting:

def compute_iia(model, clean_input, patch_input, target_idx):
    """Interchange intervention accuracy."""
    # Assumes a TransformerLens-style API where run_with_cache returns (logits, cache).
    _, patch_cache = model.run_with_cache(patch_input)
    # hook_point (the name of the activation to patch) must be defined in the enclosing scope.
    patched_logits = model.run_with_hooks(
        clean_input,
        fwd_hooks=[(hook_point, lambda act, hook: patch_cache[hook.name])],
    )
    return (patched_logits.argmax(-1) == target_idx).float().mean()

With line highlighting and title:

lib/analysis/stages/stage_10_factor_decode.py
def bidirectional_logit_lens(model, factor_bank, tokenizer):
    W_U = model.W_U  # (d_model, d_vocab)
    factor_logits = factor_bank.factors @ W_U  # (n_factors, d_vocab)
    top_tokens = factor_logits.topk(10, dim=-1)  # most promoted tokens per factor
    bottom_tokens = factor_logits.topk(10, dim=-1, largest=False)  # most suppressed
    return top_tokens, bottom_tokens
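The projection step can be sketched without any tensor library. The vocabulary, the unembedding values, and the helper names below are made up for illustration; they only assume that a factor is a vector of length d_model and that the unembedding matrix has shape (d_model, d_vocab).

```python
def project_to_vocab(factor, unembed):
    """Compute factor @ W_U for one factor; unembed is (d_model, d_vocab)."""
    d_vocab = len(unembed[0])
    return [
        sum(factor[i] * unembed[i][j] for i in range(len(factor)))
        for j in range(d_vocab)
    ]

def top_and_bottom(logits, vocab, k):
    """Return the k most promoted and k most suppressed tokens."""
    order = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)
    top = [vocab[j] for j in order[:k]]
    bottom = [vocab[j] for j in order[::-1][:k]]
    return top, bottom

# Hypothetical 4-token vocabulary and 2-dimensional model.
vocab = ["cat", "dog", "the", "ran"]
unembed = [
    [1.0, 0.5, 0.0, -1.0],
    [0.0, 2.0, -0.5, 1.0],
]
factor = [1.0, -1.0]

logits = project_to_vocab(factor, unembed)
top, bottom = top_and_bottom(logits, vocab, k=2)
print(logits, top, bottom)
```

Reading both ends of the sorted logits is what makes the lens bidirectional: the top tokens are those the factor promotes, the bottom tokens those it suppresses.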

Diff format:

  # Before: uncalibrated
- score = compute_iia(model, clean, patch, target)
+ score = compute_iia(model, clean, patch, target) - baseline_iia
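To make the calibration step concrete, here is a small sketch that computes IIA as the fraction of patched predictions matching the target and then subtracts a baseline. The page does not define how `baseline_iia` is obtained; this sketch assumes it is the IIA measured under a control patch, and all data values are invented.

```python
def iia(predictions, targets):
    """Fraction of predictions that equal the counterfactual target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    matches = [1 if p == t else 0 for p, t in zip(predictions, targets)]
    return sum(matches) / len(matches)

# Invented example data: target labels and two sets of model predictions.
targets = [3, 1, 4, 1, 5, 9, 2, 6]
patched_preds = [3, 1, 4, 1, 5, 0, 2, 0]   # after patching the component under test
baseline_preds = [3, 0, 0, 1, 0, 0, 0, 0]  # after a control patch (assumption)

raw_iia = iia(patched_preds, targets)
baseline_iia = iia(baseline_preds, targets)
calibrated = raw_iia - baseline_iia
print(raw_iia, baseline_iia, calibrated)
```

Subtracting the baseline keeps a component from receiving credit for accuracy that a control patch would achieve anyway.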

| Verdict | Tier | Requirements |
| --- | --- | --- |
| Proposed | A | Construct defined, one instrument chosen |
| Causally suggestive | B | One causal instrument with baseline separation |
| Mechanistically supported | C | Necessity + sufficiency + measurement calibration |
| Triangulated | D | Three evidence families converge |
| Validated | E | All five validity types pass threshold |
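One way to read the table is as a ladder in which each tier presupposes the ones before it. That cumulative reading is an assumption, not something the table states, and the argument names and the "Unrated" fallback below are hypothetical. Under those assumptions, a verdict lookup might be sketched as:

```python
def assign_verdict(
    *,
    construct_defined=False,
    instruments_chosen=0,
    causal_with_baseline_separation=False,
    necessity=False,
    sufficiency=False,
    measurement_calibrated=False,
    converging_families=0,
    validity_types_passed=0,
):
    """Return (verdict, tier) for the highest rung reached.

    Assumes rungs are cumulative: a tier counts only if every lower tier
    is also satisfied. "Unrated" is a placeholder for no tier reached.
    """
    rungs = [
        ("Proposed", "A", construct_defined and instruments_chosen >= 1),
        ("Causally suggestive", "B", causal_with_baseline_separation),
        ("Mechanistically supported", "C",
         necessity and sufficiency and measurement_calibrated),
        ("Triangulated", "D", converging_families >= 3),
        ("Validated", "E", validity_types_passed >= 5),
    ]
    result = ("Unrated", None)
    for name, tier, satisfied in rungs:
        if not satisfied:
            break
        result = (name, tier)
    return result

example = assign_verdict(
    construct_defined=True,
    instruments_chosen=1,
    causal_with_baseline_separation=True,
)
print(example)
```

Because the loop stops at the first unmet rung, strong evidence at a higher tier does not compensate for a missing lower one. If the tiers are meant to be independent, the `break` would need to go.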

The framework draws on Bradford Hill’s criteria for causal inference [1], adapted for the computational setting where controlled intervention is possible [2].


Key insight: A circuit claim is not binary; it is a pattern across five validity dimensions. The framework makes that pattern visible.


Math rendering requires remark-math + rehype-katex plugins. Once installed, inline math uses single dollar signs and display math uses double:

Inline: $\text{IIA} = \mathbb{E}[\mathbf{1}(f(x) = y)]$
Display:
$$
\text{IIA} = \mathbb{E}[\mathbf{1}(f(x_{\text{patched}}) = y_{\text{target}})]
$$

Images can be placed in src/assets/ and referenced with standard markdown:

![Alt text](../../assets/my-diagram.png)

Potential additional plugins/features to explore:

  • starlight-links-validator — checks for broken internal links at build time
  • starlight-typedoc — auto-generates API docs from TypeScript
  • Custom sidebar icons — per-section icons in the nav
  • View transitions — smooth page-to-page animations (Astro built-in)
  • Obsidian-style wikilinks: [[page-name]] syntax that resolves to proper links
  • Graph view — D3-based visualization of page connections (heavier lift)
  • PDF export — compile sections into a printable document
  • Version selector — if the framework evolves, show v1/v2 side by side

1. Hill, A. B. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58(5), 295–300.

2. Unlike in epidemiology, we can perform true interventions rather than relying on observational associations.