Skip to content

This page documents the information-theoretic metrics that implement the mechanistic interpretability lens. These metrics quantify information flow, shared information, and causal structure within circuits using tools from information theory — mutual information, partial information decomposition, transfer entropy, Granger causality, and structure learning. They are implemented in mechval_v2.core.mechanistic_interpretability.information and can be run independently or as part of a protocol.

All metrics in this family use Direct Logit Attribution (DLA) as the per-head scalar summary: the dot product of each head’s output (projected through WOW_O) with the correct-minus-incorrect unembedding direction. DLA preserves sign, which carries essential information about whether a head promotes or suppresses the correct answer.


C01 — Mutual Information (54_mutual_information.py)

Section titled “C01 — Mutual Information (54_mutual_information.py)”

What it computes. Estimates pairwise mutual information between circuit heads’ DLA values across prompts using binned MI estimation. Each head’s DLA is discretized into 10 equal-frequency quantile bins, and MI is computed from the joint histogram:

I(X;Y)=x,yp(x,y)log2p(x,y)p(x)p(y)I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}

MI is computed for all pairs of circuit heads (within-circuit MI) and between circuit heads and a matched set of random non-circuit heads (between MI). The ratio of mean within-circuit MI to mean between MI quantifies how much more information circuit heads share with each other than with arbitrary heads.

Evidence family. Information-theoretic (observational).

Key metrics.

MetricDescriptionBaseline
ratioMean within-circuit MI / mean circuit-to-random MI>1.0> 1.0 (random baseline)
mean_within_circuit_miAverage MI between all circuit head pairsreported
top_mi_edgesHighest-MI head pairs (MI-weighted graph structure)reported

What it establishes. Circuit heads share more information with each other than with random heads. High within-circuit MI indicates functional coupling: the heads’ outputs co-vary in a task-relevant way. The MI-weighted graph reveals which head pairs are most informationally coupled, potentially reflecting direct information flow paths.

What it does not establish. Directionality or causality. MI is symmetric: I(X;Y)=I(Y;X)I(X; Y) = I(Y; X). High MI between two heads means they share information, but does not indicate which head provides information to which. Use transfer entropy (C03) or Granger causality (C07) for directional evidence. MI also does not distinguish direct from mediated dependencies — use conditional MI (C02) for that.

Usage.

Terminal window
uv run python 54_mutual_information.py --tasks ioi sva --n-prompts 100

C02 — Conditional Mutual Information (55_conditional_mi.py)

Section titled “C02 — Conditional Mutual Information (55_conditional_mi.py)”

What it computes. For triplets of circuit heads (h1,h2,h3)(h_1, h_2, h_3), computes I(h1;h2h3)I(h_1; h_2 \mid h_3) — the MI between h1h_1 and h2h_2 conditioned on h3h_3. Uses residualization: regresses h3h_3‘s DLA out of both h1h_1 and h2h_2‘s DLA values, then computes binned MI on the residuals.

If I(h1;h2)I(h_1; h_2) is high but I(h1;h2h3)0I(h_1; h_2 \mid h_3) \approx 0, then h3h_3 mediates the h1h_1-h2h_2 dependency — removing h3h_3‘s contribution explains away the shared information. For each pair, the metric finds the best mediator among remaining circuit heads and reports the fraction of pairwise MI that is mediated vs direct.

Evidence family. Information-theoretic (observational, mediation analysis).

Key metrics.

MetricDescriptionInterpretation
mean_direct_fractionAverage fraction of MI that persists after conditioning on the best mediatorHigh = direct coupling
mean_mediated_fractionAverage fraction explained away by the best mediatorHigh = hub-mediated circuit

What it establishes. Whether circuit head dependencies are direct or mediated through hub heads. High mediation (>0.5> 0.5) indicates a hub-and-spoke architecture where a small number of heads serve as information bottlenecks. High direct fraction indicates parallel, independent information channels.

What it does not establish. That conditioning on h3h_3 removes a causal pathway. Residualization removes linear association, which may not correspond to the actual causal mediation mechanism. A head that happens to correlate with both h1h_1 and h2h_2 will appear to mediate even if it plays no causal role.

Usage.

Terminal window
uv run python 55_conditional_mi.py --tasks ioi sva --n-prompts 100

C03 — Transfer Entropy (53_transfer_entropy.py)

Section titled “C03 — Transfer Entropy (53_transfer_entropy.py)”

What it computes. Estimates directional information flow between circuit heads across layers. For each directed pair h1h_1 (layer L1L_1) \to h2h_2 (layer L2L_2, L2>L1L_2 > L_1), estimates transfer entropy as the squared partial correlation of h1h_1‘s DLA with h2h_2‘s DLA, controlling for all circuit heads at layers between L1L_1 and L2L_2:

TE(h1h2)rpartial2(h1,h2{hk:L1<Lk<L2})\text{TE}(h_1 \to h_2) \approx r^2_{\text{partial}}(h_1, h_2 \mid \{h_k : L_1 < L_k < L_2\})

Compares TE for known circuit edges versus non-edges.

Evidence family. Information-theoretic (directional, observational).

Key metrics.

MetricDescriptionBaseline
ratioMean TE for circuit edges / mean TE for non-circuit edges>1.0> 1.0 (random baseline)
mean_circuit_teAverage TE proxy for known circuit edgesreported
mean_non_circuit_teAverage TE proxy for non-circuit head pairsreported

What it establishes. Circuit edges carry more directional information flow than non-edges. A ratio substantially above 1.0 means the circuit’s claimed edge structure reflects genuine information transfer: earlier heads provide information that later heads use, beyond what intervening heads already provide.

What it does not establish. True causal information flow. Partial correlation is a linear proxy for transfer entropy, which is itself an observational (not interventional) measure. Two heads can show high TE because they both respond to the same input feature, not because one informs the other. Combine with path patching or activation patching for causal directional evidence.

Usage.

Terminal window
uv run python 53_transfer_entropy.py --tasks ioi sva --n-prompts 100

C04 — Partial Information Decomposition (08_pid.py)

Section titled “C04 — Partial Information Decomposition (08_pid.py)”

What it computes. Decomposes the mutual information between pairs of circuit heads and the model output into four components:

  • Redundancy: information that both heads provide about the output (overlapping).
  • Unique h1h_1: information only h1h_1 provides.
  • Unique h2h_2: information only h2h_2 provides.
  • Synergy: information that neither head provides individually but both provide jointly.

Uses the BROJA PID implementation from the dit library when available, falling back to a binned approximation. DLA values are quantized into 5 equal-frequency bins.

Evidence family. Information-theoretic (multivariate decomposition).

Key metrics.

MetricDescriptionInterpretation
mean_synergyAverage synergistic information across head pairsHigh = cooperative circuit
mean_redundancyAverage redundant information across head pairsHigh = robust, fault-tolerant circuit
Per-pair decompositionFull PID for each head pairIdentifies which pairs cooperate vs overlap

What it establishes. Whether circuit heads carry complementary (synergistic) or overlapping (redundant) information about the output. High synergy means the circuit is a genuine computational unit: pairs of heads jointly encode information that neither encodes alone. High redundancy means the circuit is fault-tolerant but potentially over-specified.

What it does not establish. The causal structure underlying the decomposition. PID quantifies the information structure but not the mechanism by which synergy or redundancy arises. Two heads may be synergistic because they compute complementary functions (genuinely cooperative) or because they respond to different aspects of the same confound. Combine with epistasis (GN2 from the genetics lens) for a causal version of interaction analysis.

Usage.

Terminal window
uv run python 08_pid.py --tasks ioi sva --n-prompts 60

C05 — Information Bottleneck Analysis (57_info_bottleneck.py)

Section titled “C05 — Information Bottleneck Analysis (57_info_bottleneck.py)”

What it computes. For each layer, computes how much information the residual stream retains about the input (I(X;T)I(X; T_\ell)) versus how much it preserves about the output (I(T;Y)I(T_\ell; Y)), where XX is the input token identity and YY is the correct/incorrect binary label. Uses residual-stream activations projected onto the top 10 PCA dimensions, with MI estimated via binned approximation on PCA scores. The result is an information plane: I(X;T)I(X; T) vs I(T;Y)I(T; Y) across layers.

Evidence family. Information-theoretic (information plane analysis).

Key metrics.

MetricDescriptionBaseline
mean_circuit_I_T_YMean I(T;Y)I(T_\ell; Y) at layers containing circuit heads>> mean_non_circuit_I_T_Y
mean_non_circuit_I_T_YMean I(T;Y)I(T_\ell; Y) at non-circuit layersbaseline
info_planeFull I(X;T)I(X; T) and I(T;Y)I(T; Y) at each layerfor visualization

What it establishes. Circuit-critical layers preserve more task-relevant information (I(T;Y)I(T; Y)) than non-circuit layers. This is consistent with the information bottleneck principle: the network compresses irrelevant input information while preserving task-relevant structure, and the circuit heads are located at layers where this task-relevant information is highest.

What it does not establish. That circuit heads are responsible for the high I(T;Y)I(T; Y). The residual stream at a layer reflects all computations up to that point, not just the circuit heads at that layer. The layer-level analysis conflates circuit and non-circuit contributions to the residual stream.

Usage.

Terminal window
uv run python 57_info_bottleneck.py --tasks ioi sva --n-prompts 80

C06 — O-Information (58_o_information.py)

Section titled “C06 — O-Information (58_o_information.py)”

What it computes. Computes the O-information (Rosas et al., 2019) of circuit head DLA values, a multivariate measure that captures the overall balance between redundancy and synergy in a set of variables:

Ω=(n2)H(X1,,Xn)+iH(Xi)iH(X1,,Xi1,Xi+1,,Xn)\Omega = (n - 2) \cdot H(X_1, \ldots, X_n) + \sum_i H(X_i) - \sum_i H(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)

Positive Ω\Omega indicates redundancy-dominated: heads carry overlapping information. Negative Ω\Omega indicates synergy-dominated: heads carry information jointly that no subset carries alone. Compares Ω\Omega for the circuit head set vs random subsets of the same size to test whether the circuit is specifically synergistic or redundant.

Evidence family. Information-theoretic (multivariate redundancy/synergy).

Key metrics.

MetricDescriptionBaseline
omega_circuitO-information for circuit headscompared to random
omega_random_meanMean O-information for random head subsets of the same sizerandom baseline
z_score(Ωcircuitμrandom)/σrandom(\Omega_{\text{circuit}} - \mu_{\text{random}}) / \sigma_{\text{random}}significance
interpretation”redundancy” (Ω>0\Omega > 0) or “synergy” (Ω<0\Omega < 0)qualitative

What it establishes. Whether the circuit as a whole is organized around redundancy (fault tolerance, overlapping representations) or synergy (cooperative computation, emergent joint encoding). A zz-score substantially different from zero means the circuit’s information structure is non-random — it is specifically more synergistic or redundant than an arbitrary subset of heads.

What it does not establish. Which heads drive the synergy or redundancy. O-information is a summary statistic for the entire set; it does not identify specific synergistic or redundant pairs. Use PID (C04) for pairwise decomposition.

Usage.

Terminal window
uv run python 58_o_information.py --tasks ioi sva --n-prompts 100

C07 — Granger Causality (56_granger_causality.py)

Section titled “C07 — Granger Causality (56_granger_causality.py)”

What it computes. Treats the sequence of head activations across layers as a “time series” (layer = time). For each pair of circuit heads h1h_1 (layer L1L_1) and h2h_2 (layer L2L_2, L2>L1L_2 > L_1), tests whether h1h_1‘s DLA Granger-causes h2h_2‘s DLA: does adding h1h_1‘s DLA improve prediction of h2h_2‘s DLA beyond all other circuit heads at earlier layers? Uses an F-test comparing the restricted model (all earlier heads except h1h_1) to the full model (restricted + h1h_1).

Evidence family. Information-theoretic / causal (Granger causality).

Key metrics.

MetricDescriptionBaseline
circuit_significance_rateFraction of circuit edges that are Granger-significant at α=0.05\alpha = 0.05compared to non-circuit rate
non_circuit_significance_rateFraction of non-circuit edges that are Granger-significantbaseline
top_significantTop Granger-significant edges by F-statisticreported

What it establishes. Circuit edges show Granger-causal relationships at a higher rate than non-circuit edges. Granger causality — whether adding h1h_1 improves prediction of h2h_2 beyond the other predictors — is a well-established statistical test for directed functional coupling. If circuit edges are preferentially Granger-significant, the circuit’s edge structure reflects genuine predictive information flow.

What it does not establish. True causality. Granger causality is a statistical (observational) concept: h1h_1 Granger-causes h2h_2 if h1h_1 contains unique predictive information about h2h_2. This can occur due to confounding (both respond to a common upstream signal with different delays). In transformers, all earlier-layer outputs are available to later layers via the residual stream, so Granger causality may reflect shared input features rather than direct information flow.

Usage.

Terminal window
uv run python 56_granger_causality.py --tasks ioi sva --n-prompts 100

C08 — Observational Circuit Discovery / oCSE (07_ocse.py)

Section titled “C08 — Observational Circuit Discovery / oCSE (07_ocse.py)”

What it computes. Two complementary purely observational discovery methods:

  1. Stability selection via bootstrap LassoCV: runs 50 bootstrap LassoCV regressions predicting logit-diff from all 144 head DLAs, selects heads that appear in >50%> 50\% of bootstrap runs. Handles multicollinearity through L1 regularization and bootstrap averaging.
  2. Greedy oCSE (observational Causal Subgraph Extraction): greedy forward selection using conditional mutual information with a permutation-calibrated threshold (95th percentile of max CMI under row permutation).

Both use DLA features (signed head contribution to logit diff) rather than norms, since sign carries essential information about head function.

Evidence family. Information-theoretic / statistical (observational discovery).

Key metrics.

MetricDescriptionInterpretation
f1F1 score of stability selection against known circuitprimary metric
precision / recallPrecision and recall of discovered headsreported
ocse.f1F1 of oCSE forward selectionsecondary metric
combined.f1F1 of union of both methodsconvergent discovery

What it establishes. Circuit heads can be recovered from purely observational data — without any interventions. Stability selection identifies heads whose DLA values are robust predictors of logit-diff across bootstrap samples. oCSE identifies heads whose DLA provides incremental conditional information. Agreement between the two methods (combined F1) provides convergent validity.

What it does not establish. Causal necessity. Observational discovery identifies heads that are statistically predictive, not causally necessary. A head that correlates with logit-diff but is not causally load-bearing will be discovered by both methods. Combine with activation patching or ablation for causal validation of observationally-discovered circuits.

Usage.

Terminal window
uv run python 07_ocse.py --tasks ioi sva --n-prompts 200

C09 — NOTEARS Structure Learning (09_notears.py)

Section titled “C09 — NOTEARS Structure Learning (09_notears.py)”

What it computes. Learns a directed acyclic graph (DAG) over component activations using NOTEARS (Zheng et al., NeurIPS 2018), which reformulates the combinatorial DAG constraint as a continuous optimization:

minW12nXXWF2+λ1W1s.t.tr(eWW)d=0\min_{W} \frac{1}{2n} \|X - XW\|_F^2 + \lambda_1 \|W\|_1 \quad \text{s.t.} \quad \text{tr}(e^{W \circ W}) - d = 0

where the acyclicity constraint h(W)=tr(eWW)d=0h(W) = \text{tr}(e^{W \circ W}) - d = 0 is enforced via augmented Lagrangian. The discovered DAG’s parents of the output node are compared against known circuit heads via F1, precision, and recall. A permutation baseline (shuffled data) calibrates the false-positive rate.

Evidence family. Information-theoretic / causal (continuous structure learning).

Key metrics.

MetricDescriptionBaseline
f1F1 of NOTEARS-discovered parents against known circuitprimary metric
precision / recallPrecision and recall of discovered causal parentsreported
baseline_randomNumber of parents discovered from permuted datafalse-positive calibration

What it establishes. A DAG recovered by continuous optimization — without prior assumptions about circuit structure — identifies circuit heads as causal parents of the output. NOTEARS discovers which heads causally precede the output, enforcing acyclicity as a structural constraint. This is stronger than undirected MI or correlation: the DAG has directionality.

What it does not establish. That the learned DAG reflects the true causal structure. NOTEARS assumes a linear structural equation model, which may not hold for transformer computations. The DAG is learned from observational data only and is subject to the same confounding concerns as Granger causality, though the acyclicity constraint partially mitigates this by imposing structural consistency.

Usage.

Terminal window
uv run python 09_notears.py --tasks ioi sva --n-prompts 80

MetricHigh scoreLow score
C01 (MI)Circuit heads share more information than random heads; functional couplingNo preferential coupling; circuit heads are informationally independent
C02 (CMI)High direct fraction = parallel channels; high mediated = hub architectureMixed or trivial MI; not enough signal to decompose
C03 (Transfer Entropy)Circuit edges carry directional information; sender informs receiverNon-circuit edges carry as much TE; circuit edges are not privileged
C04 (PID)High synergy = cooperative circuit; high redundancy = fault-tolerantLow synergy and redundancy; heads contribute independently
C05 (Info Bottleneck)Circuit layers have high I(T;Y)I(T; Y); task info peaks at circuit layersTask info is uniformly distributed or peaks outside circuit layers
C06 (O-Information)Circuit is specifically synergistic or redundant vs randomCircuit has the same information structure as random head subsets
C07 (Granger)Circuit edges are preferentially Granger-significantGranger significance does not discriminate circuit from non-circuit
C08 (oCSE)Observational discovery recovers circuit heads; high F1Discovery identifies different heads; circuit may be non-predictive
C09 (NOTEARS)Structure learning recovers circuit as DAG parentsDAG parents do not match circuit; structure may be non-linear

The strongest evidence comes from convergent findings across information-theoretic methods that probe different properties:

  • C01 + C04: High within-circuit MI (C01) with high synergy (C04) indicates the circuit is both informationally coupled and cooperatively computational. If MI is high but synergy is low, the coupling is redundant (overlapping, not joint).
  • C03 + C07: Transfer entropy (C03) and Granger causality (C07) both test directional information flow but use different statistical methods (partial correlation vs F-test). Convergence provides robust evidence of directed functional coupling along circuit edges.
  • C02 + C06: Conditional MI (C02) identifies hub heads; O-information (C06) identifies overall synergy/redundancy balance. A synergy-dominated circuit (C06) with strong mediation (C02) suggests a few hub heads orchestrate cooperative computation.
  • C08 + C09: oCSE (C08) discovers via forward selection; NOTEARS (C09) discovers via continuous DAG optimization. If both identify the same circuit heads from observational data alone, the circuit is robustly recoverable without interventions.

Information-theoretic metrics are observational by nature: they measure statistical relationships in activation patterns without intervening. This makes them complementary to — not substitutes for — causal metrics (activation patching, ablation, causal scrubbing). The information-theoretic metrics answer “do these heads share information?” while causal metrics answer “does this head’s information causally affect the output?”

The ideal pattern is convergent: heads identified by MI, TE, and Granger causality should also be identified by activation patching and ablation. Divergence is informative: a head with high MI but low activation patching effect suggests its information is redundant (available elsewhere), while a head with low MI but high patching effect suggests it carries unique, sparse information.