
This framework asks: How much information is lost when we replace the full model with its circuit approximation?

KL divergence provides a principled, distribution-level measure of circuit fidelity. Unlike logit diff (D02), which focuses on a single pair of tokens, KL captures discrepancies across the entire vocabulary — including effects on non-target tokens that may reveal incomplete mechanistic understanding.

As an information-theoretic quantity, KL divergence has a natural interpretation: it measures the expected number of additional nats (or bits, if log base 2 is used) needed to encode samples from the full model’s distribution using a code optimized for the circuit’s distribution.

| Source | Year | Key contribution |
| --- | --- | --- |
| Conmy et al., “Towards Automated Circuit Discovery” | 2023 | KL as primary ACDC optimization target |
| Goldowsky-Dill et al., “Localizing Model Behavior” | 2023 | KL for sufficiency/necessity decomposition |
| Geiger et al., “Causal Abstraction for Faithful Model Interpretability” | 2023 | KL in interchange intervention settings |
| Miller et al., “Transformer Circuit Faithfulness Metrics” | 2024 | Comparison of KL vs. other divergence measures |

Given the full model’s output distribution \(P\) and the circuit’s output distribution \(Q\) at position \(t\):

\[ D_{\text{KL}}(P \,\|\, Q) = \sum_{v \in \mathcal{V}} P(v) \log \frac{P(v)}{Q(v)} \]

For circuit evaluation, we average over prompts in the task distribution:

\[ \overline{D}_{\text{KL}} = \frac{1}{N} \sum_{i=1}^{N} D_{\text{KL}}\bigl(p_{\text{full}}(\cdot \mid x_i) \,\|\, p_{\text{circuit}}(\cdot \mid x_i)\bigr) \]
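A minimal NumPy sketch of the averaged metric. The names `full_probs` and `circuit_probs` are hypothetical: stand-ins for per-prompt next-token distributions of shape `(N, |V|)` produced by the full model and the circuit.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats for two probability vectors over the vocabulary.

    eps guards against log(0) when either distribution has zero entries."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mean_kl(full_probs, circuit_probs):
    """Average D_KL(p_full || p_circuit) over N prompts (one distribution per row)."""
    return float(np.mean([kl_divergence(p, q)
                          for p, q in zip(full_probs, circuit_probs)]))

# Toy example: two prompts over a three-token vocabulary.
full_probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
circuit_probs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
score = mean_kl(full_probs, circuit_probs)  # small positive value in nats
```

Identical distributions give a score of zero; any mismatch pushes it above zero.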

KL is asymmetric: \(D_{\text{KL}}(P \,\|\, Q)\) heavily penalizes the circuit for assigning low probability to tokens the full model considers likely (mode-dropping), but only weakly penalizes it for placing probability on tokens the model ignores. This makes forward KL a one-sided faithfulness test — circuits that hallucinate extra probability mass onto irrelevant tokens can still score well.
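A toy numeric illustration of this asymmetry, with an invented three-token vocabulary and made-up distributions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Forward KL D_KL(p || q) in nats; eps guards against log(0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

full = np.array([0.5, 0.5, 0.0])              # full model: two likely tokens
hallucinating = np.array([0.4, 0.4, 0.2])     # circuit leaks mass onto an ignored token
mode_dropping = np.array([0.98, 0.01, 0.01])  # circuit starves the second mode

mild = kl(full, hallucinating)    # ~0.22 nats: hallucinated mass barely punished
severe = kl(full, mode_dropping)  # ~1.62 nats: dropping a likely token punished hard
```

The hallucinating circuit misplaces a fifth of its mass yet scores far better than the mode-dropping one.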

Causal Scrubbing KL (04_causal_scrubbing.py)


Applies the causal scrubbing protocol, measuring KL between the scrubbed model (circuit-only computation preserved) and the full model.

What it establishes: Distributional faithfulness under the causal scrubbing intervention. What it does not establish: Whether low KL is due to circuit completeness vs. task simplicity.
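The comparison at the heart of this measurement can be sketched as follows. This is a simplified stand-in, not the script’s implementation: real causal scrubbing resamples activations along the hypothesized circuit, and here we simply assume the scrubbed forward pass has already produced `scrubbed_logits` (a hypothetical name).

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(z, float)
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def scrubbing_kl(full_logits, scrubbed_logits):
    """Mean D_KL(full || scrubbed) over the batch, computed from raw logits."""
    log_p = log_softmax(full_logits)
    log_q = log_softmax(scrubbed_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))
```

Working in log space avoids underflow on large vocabularies, and softmax’s shift invariance means adding a constant to all logits leaves the KL unchanged.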

Usage:

uv run python 04_causal_scrubbing.py --tasks ioi sva

Output Metric Variants (21_output_variants.py)


Computes multiple divergence measures (KL, reverse KL, Jensen-Shannon, total variation) to characterize where circuit and model disagree.

What it establishes: Whether KL results are robust to divergence measure choice. What it does not establish: Causal mechanism — only distributional match.
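The four measures can be computed side by side in a few lines of NumPy. This is a sketch of the underlying math; the actual script’s interface may differ, and `divergence_suite` is a hypothetical name.

```python
import numpy as np

def _kl(p, q, eps=1e-12):
    """Forward KL in nats; eps guards against log(0)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def divergence_suite(p, q):
    """KL, reverse KL, Jensen-Shannon (all in nats), and total variation
    (a probability in [0, 1]) between two distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)  # the JS mixture distribution
    return {
        "kl": _kl(p, q),
        "reverse_kl": _kl(q, p),
        "js": 0.5 * _kl(p, m) + 0.5 * _kl(q, m),
        "tv": 0.5 * float(np.abs(p - q).sum()),
    }
```

Unlike forward KL, JS is symmetric and bounded by ln 2, and TV is bounded by 1, which is what makes the cross-measure comparison informative.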

Usage:

uv run python 21_output_variants.py --tasks ioi sva

| Pattern | What it means |
| --- | --- |
| KL < 0.01 nats | Near-perfect distributional match |
| KL 0.01–0.1 nats | Good circuit, minor tail discrepancies |
| KL 0.1–1.0 nats | Meaningful distributional gaps — missing components |
| KL diverges across tasks | Circuit specialization varies by task |
| JS << KL | Mode-dropping dominates over hallucination |
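The KL bands above can be turned into a small triage helper. The thresholds come from the table; the function name is hypothetical and the labels are lightly paraphrased.

```python
def interpret_mean_kl(kl_nats: float) -> str:
    """Map a mean KL (in nats) onto the qualitative bands from the table above."""
    if kl_nats < 0.01:
        return "near-perfect distributional match"
    if kl_nats < 0.1:
        return "good circuit, minor tail discrepancies"
    return "meaningful distributional gaps - likely missing components"
```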