MI Information Metrics
This page documents the information-theoretic metrics that implement the mechanistic interpretability lens. These metrics quantify information flow, shared information, and causal structure within circuits using tools from information theory: mutual information, partial information decomposition, transfer entropy, Granger causality, and structure learning. They are implemented in `mechval_v2.core.mechanistic_interpretability.information` and can be run independently or as part of a protocol.
All metrics in this family use Direct Logit Attribution (DLA) as the per-head scalar summary: the dot product of each head’s output (projected through the output matrix $W_O$) with the correct-minus-incorrect unembedding direction. DLA preserves sign, which carries essential information about whether a head promotes or suppresses the correct answer.
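A minimal sketch of the DLA computation for a single head on a single prompt may help make the definition concrete. It uses plain Python lists on toy dimensions; the names `matvec`, `direct_logit_attribution`, `w_o`, `u_correct`, and `u_incorrect` are illustrative assumptions, and the sketch ignores any final layer-norm scaling the real pipeline may apply.

```python
def matvec(vec, mat):
    """Row vector (length r) times matrix (r x c), returning a list of length c."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def direct_logit_attribution(head_out, w_o, unembed_correct, unembed_incorrect):
    """Signed contribution of one head to the correct-minus-incorrect logit difference."""
    resid = matvec(head_out, w_o)  # head output written into the residual stream
    direction = [c - i for c, i in zip(unembed_correct, unembed_incorrect)]
    return sum(r * d for r, d in zip(resid, direction))


# Toy example (hypothetical values): d_head = 2, d_model = 3.
head_out = [1.0, 2.0]
w_o = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
u_correct = [1.0, 0.0, 0.0]    # unembedding column of the correct token
u_incorrect = [0.0, 1.0, 0.0]  # unembedding column of the incorrect token

dla = direct_logit_attribution(head_out, w_o, u_correct, u_incorrect)
print(dla)  # negative here: this toy head pushes toward the incorrect token
```

Keeping the sign (rather than taking a norm) is what lets downstream metrics tell a promoting head from a suppressing one.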
Metrics
C01: Mutual Information (54_mutual_information.py)
What it computes. Estimates pairwise mutual information between circuit heads’ DLA values across prompts using binned MI estimation. Each head’s DLA is discretized into 10 equal-frequency quantile bins, and MI is computed from the joint histogram:
$$I(X;Y) = \sum_{x,y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)}$$
where $X$ and $Y$ are the binned DLA values of two heads and $p$ denotes the empirical bin frequencies.
MI is computed for all pairs of circuit heads (within-circuit MI) and between circuit heads and a matched set of random non-circuit heads (between MI). The ratio of mean within-circuit MI to mean between MI quantifies how much more information circuit heads share with each other than with arbitrary heads.
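The binned estimator and the within/between ratio can be sketched as follows. This is a stdlib-only illustration on synthetic data, not the code in `54_mutual_information.py`; the helper names (`quantile_bins`, `binned_mi`) and the synthetic heads are assumptions. Note that plug-in MI estimates are biased upward on finite samples, which is one reason a matched random baseline is useful.

```python
import math
import random
from collections import Counter


def quantile_bins(values, n_bins=10):
    """Assign each value to an equal-frequency bin based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def binned_mi(x, y, n_bins=10):
    """Plug-in MI estimate (bits) from the joint histogram of binned x and y."""
    bx, by = quantile_bins(x, n_bins), quantile_bins(y, n_bins)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


# Synthetic DLA values: two "circuit" heads share a signal, one random head does not.
rng = random.Random(0)
shared = [rng.gauss(0, 1) for _ in range(500)]
head_a = [s + 0.3 * rng.gauss(0, 1) for s in shared]
head_b = [s + 0.3 * rng.gauss(0, 1) for s in shared]
random_head = [rng.gauss(0, 1) for _ in range(500)]

within = binned_mi(head_a, head_b)
between = binned_mi(head_a, random_head)
ratio = within / between
print(f"within={within:.3f} between={between:.3f} ratio={ratio:.2f}")
```

In the real metric the numerator and denominator are means over many head pairs; the single-pair version here only shows the shape of the computation.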
Evidence family. Information-theoretic (observational).
Key metrics.
| Metric | Description | Baseline |
|---|---|---|
| `ratio` | Mean within-circuit MI / mean circuit-to-random MI | 1.0 (random baseline) |
| `mean_within_circuit_mi` | Average MI between all circuit head pairs | reported |
| `top_mi_edges` | Highest-MI head pairs (MI-weighted graph structure) | reported |
What it establishes. Circuit heads share more information with each other than with random heads. High within-circuit MI indicates functional coupling: the heads’ outputs co-vary in a task-relevant way. The MI-weighted graph reveals which head pairs are most informationally coupled, potentially reflecting direct information flow paths.
What it does not establish. Directionality or causality. MI is symmetric: $I(X;Y) = I(Y;X)$. High MI between two heads means they share information, but does not indicate which head provides information to which. Use transfer entropy (C03) or Granger causality (C07) for directional evidence. MI also does not distinguish direct from mediated dependencies; use conditional MI (C02) for that.
Usage.
uv run python 54_mutual_information.py --tasks ioi sva --n-prompts 100
C02: Conditional Mutual Information (55_conditional_mi.py)
What it computes. For triplets of circuit heads $(A, B, C)$, computes $I(A; B \mid C)$, the MI between $A$ and $B$ conditioned on $C$. Uses residualization: regresses $C$’s DLA out of both $A$’s and $B$’s DLA values, then computes binned MI on the residuals.
If $I(A; B)$ is high but $I(A; B \mid C) \approx 0$, then $C$ mediates the $A$-$B$ dependency: removing $C$’s contribution explains away the shared information. For each pair, the metric finds the best mediator among the remaining circuit heads and reports the fraction of pairwise MI that is mediated versus direct.
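The residualization approach described above can be sketched with a single candidate mediator. This is an illustrative stdlib-only version on synthetic data; `residualize`, `binned_mi`, and `mediated_fraction` are hypothetical helper names, and clipping the direct fraction at 1.0 is an assumption made here to absorb estimator noise.

```python
import math
import random
from collections import Counter


def residualize(y, z):
    """Residuals of an ordinary least squares fit y ~ a + b*z (one regressor)."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    slope = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
             / sum((zi - mz) ** 2 for zi in z))
    return [(yi - my) - slope * (zi - mz) for zi, yi in zip(z, y)]


def binned_mi(x, y, n_bins=10):
    """Plug-in MI (bits) on equal-frequency (rank-based) bins."""
    def bins(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        out = [0] * len(v)
        for rank, idx in enumerate(order):
            out[idx] = rank * n_bins // len(v)
        return out
    bx, by, n = bins(x), bins(y), len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def mediated_fraction(a, b, mediator):
    """Share of MI(a, b) that disappears after regressing the mediator out of both."""
    mi = binned_mi(a, b)
    cmi = binned_mi(residualize(a, mediator), residualize(b, mediator))
    direct = min(cmi / mi, 1.0) if mi > 0 else 0.0
    return 1.0 - direct


rng = random.Random(1)
n = 1000
hub = [rng.gauss(0, 1) for _ in range(n)]
a = [h + 0.3 * rng.gauss(0, 1) for h in hub]   # both heads read from the hub
b = [h + 0.3 * rng.gauss(0, 1) for h in hub]
unrelated = [rng.gauss(0, 1) for _ in range(n)]

frac_hub = mediated_fraction(a, b, hub)              # hub explains the coupling
frac_unrelated = mediated_fraction(a, b, unrelated)  # an unrelated head does not
print(round(frac_hub, 2), round(frac_unrelated, 2))
```

The full metric would repeat this over every candidate mediator for a pair and keep the one with the largest mediated fraction.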
Evidence family. Information-theoretic (observational, mediation analysis).
Key metrics.
| Metric | Description | Interpretation |
|---|---|---|
| `mean_direct_fraction` | Average fraction of MI that persists after conditioning on the best mediator | High = direct coupling |
| `mean_mediated_fraction` | Average fraction explained away by the best mediator | High = hub-mediated circuit |
What it establishes. Whether circuit head dependencies are direct or mediated through hub heads. High mediation indicates a hub-and-spoke architecture where a small number of heads serve as information bottlenecks. High direct fraction indicates parallel, independent information channels.
What it does not establish. That conditioning on a candidate mediator removes a causal pathway. Residualization removes linear association, which may not correspond to the actual causal mediation mechanism. A head that happens to correlate with both heads in a pair will appear to mediate even if it plays no causal role.
Usage.
uv run python 55_conditional_mi.py --tasks ioi sva --n-prompts 100
C03: Transfer Entropy (53_transfer_entropy.py)
What it computes. Estimates directional information flow between circuit heads across layers. For each directed pair $A$ (layer $l_A$) $\to$ $B$ (layer $l_B$, $l_B > l_A$), estimates transfer entropy as the squared partial correlation of $A$’s DLA with $B$’s DLA, controlling for all circuit heads at layers between $l_A$ and $l_B$:
$$\mathrm{TE}_{A \to B} \approx \rho^2_{AB \cdot Z}$$
where $Z$ is the set of circuit heads at layers between $l_A$ and $l_B$.
Compares TE for known circuit edges versus non-edges.
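A squared partial correlation of this kind can be sketched without any linear algebra library by orthogonalizing the control variables (Gram-Schmidt) and projecting them out of both heads. The function name `te_proxy` and the synthetic three-head chain are assumptions for illustration, not the script's implementation.

```python
import random


def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(v, u):
    """Remove from v its component along u (no-op if u is numerically zero)."""
    uu = dot(u, u)
    if uu < 1e-12:
        return list(v)
    k = dot(v, u) / uu
    return [a - k * b for a, b in zip(v, u)]


def te_proxy(source, target, controls):
    """Squared partial correlation of source and target given the control heads."""
    x, y = center(source), center(target)
    basis = []
    for z in controls:            # orthogonalize controls against each other
        z = center(z)
        for u in basis:
            z = project_out(z, u)
        basis.append(z)
    for u in basis:               # regress the controls out of both variables
        x, y = project_out(x, u), project_out(y, u)
    return dot(x, y) ** 2 / (dot(x, x) * dot(y, y))


# Synthetic chain a -> mid -> b: all of a's influence on b passes through mid.
rng = random.Random(0)
n = 1000
a = [rng.gauss(0, 1) for _ in range(n)]
mid = [ai + rng.gauss(0, 0.5) for ai in a]
b = [mi + rng.gauss(0, 0.5) for mi in mid]

te_unconditioned = te_proxy(a, b, [])
te_conditioned = te_proxy(a, b, [mid])
print(round(te_unconditioned, 3), round(te_conditioned, 4))
```

Conditioning on the intermediate head collapses the proxy toward zero in this toy chain, which is the behavior that lets the metric separate direct edges from relayed ones.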
Evidence family. Information-theoretic (directional, observational).
Key metrics.
| Metric | Description | Baseline |
|---|---|---|
| `ratio` | Mean TE for circuit edges / mean TE for non-circuit edges | 1.0 (random baseline) |
| `mean_circuit_te` | Average TE proxy for known circuit edges | reported |
| `mean_non_circuit_te` | Average TE proxy for non-circuit head pairs | reported |
What it establishes. Circuit edges carry more directional information flow than non-edges. A ratio substantially above 1.0 means the circuit’s claimed edge structure reflects genuine information transfer: earlier heads provide information that later heads use, beyond what intervening heads already provide.
What it does not establish. True causal information flow. Partial correlation is a linear proxy for transfer entropy, which is itself an observational (not interventional) measure. Two heads can show high TE because they both respond to the same input feature, not because one informs the other. Combine with path patching or activation patching for causal directional evidence.
Usage.
uv run python 53_transfer_entropy.py --tasks ioi sva --n-prompts 100
C04: Partial Information Decomposition (08_pid.py)
What it computes. Decomposes the mutual information $I(A, B; Y)$ between pairs of circuit heads $(A, B)$ and the model output $Y$ into four components:
- Redundancy: information that both heads provide about the output (overlapping).
- Unique $A$: information only $A$ provides.
- Unique $B$: information only $B$ provides.
- Synergy: information that neither head provides individually but both provide jointly.
Uses the BROJA PID implementation from the dit library when available, falling back to a binned approximation. DLA values are quantized into 5 equal-frequency bins.
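To illustrate what the four components mean, here is a sketch that uses the minimum-mutual-information (MMI) redundancy, a simple stand-in that is neither BROJA nor necessarily the script's binned fallback. The function names `mi_bits` and `mmi_pid` are hypothetical, and the inputs are assumed to be already discretized.

```python
import math
from collections import Counter


def mi_bits(xs, ys):
    """Plug-in mutual information (bits) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def mmi_pid(a, b, y):
    """PID with redundancy defined as min(I(A;Y), I(B;Y)) (MMI assumption)."""
    i_a, i_b = mi_bits(a, y), mi_bits(b, y)
    i_ab = mi_bits(list(zip(a, b)), y)  # joint source (A, B) versus Y
    redundancy = min(i_a, i_b)
    return {
        "redundancy": redundancy,
        "unique_a": i_a - redundancy,
        "unique_b": i_b - redundancy,
        "synergy": i_ab - i_a - i_b + redundancy,
    }


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]

xor_pid = mmi_pid(a, b, [x ^ z for x, z in zip(a, b)])  # output is XOR of the heads
copy_pid = mmi_pid(a, a, a)                            # both heads copy the output
print(xor_pid)
print(copy_pid)
```

XOR is the textbook purely synergistic case (neither input alone predicts the output), while two identical copies are purely redundant; real head pairs fall somewhere in between.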
Evidence family. Information-theoretic (multivariate decomposition).
Key metrics.
| Metric | Description | Interpretation |
|---|---|---|
| `mean_synergy` | Average synergistic information across head pairs | High = cooperative circuit |
| `mean_redundancy` | Average redundant information across head pairs | High = robust, fault-tolerant circuit |
| Per-pair decomposition | Full PID for each head pair | Identifies which pairs cooperate vs overlap |
What it establishes. Whether circuit heads carry complementary (synergistic) or overlapping (redundant) information about the output. High synergy means the circuit is a genuine computational unit: pairs of heads jointly encode information that neither encodes alone. High redundancy means the circuit is fault-tolerant but potentially over-specified.
What it does not establish. The causal structure underlying the decomposition. PID quantifies the information structure but not the mechanism by which synergy or redundancy arises. Two heads may be synergistic because they compute complementary functions (genuinely cooperative) or because they respond to different aspects of the same confound. Combine with epistasis (GN2 from the genetics lens) for a causal version of interaction analysis.
Usage.
uv run python 08_pid.py --tasks ioi sva --n-prompts 60
C05: Information Bottleneck Analysis (57_info_bottleneck.py)
What it computes. For each layer, computes how much information the residual stream $T$ retains about the input ($I(X;T)$) versus how much it preserves about the output ($I(T;Y)$), where $X$ is the input token identity and $Y$ is the correct/incorrect binary label. Uses residual-stream activations projected onto the top 10 PCA dimensions, with MI estimated via binned approximation on PCA scores. The result is an information plane: $I(X;T)$ vs $I(T;Y)$ across layers.
Evidence family. Information-theoretic (information plane analysis).
Key metrics.
| Metric | Description | Baseline |
|---|---|---|
| `mean_circuit_I_T_Y` | Mean $I(T;Y)$ at layers containing circuit heads | `mean_non_circuit_I_T_Y` |
| `mean_non_circuit_I_T_Y` | Mean $I(T;Y)$ at non-circuit layers | baseline |
| `info_plane` | Full $I(X;T)$ and $I(T;Y)$ at each layer | for visualization |
What it establishes. Circuit-critical layers preserve more task-relevant information ($I(T;Y)$) than non-circuit layers. This is consistent with the information bottleneck principle: the network compresses irrelevant input information while preserving task-relevant structure, and the circuit heads are located at layers where this task-relevant information is highest.
What it does not establish. That circuit heads are responsible for the high $I(T;Y)$. The residual stream at a layer reflects all computations up to that point, not just the circuit heads at that layer. The layer-level analysis conflates circuit and non-circuit contributions to the residual stream.
Usage.
uv run python 57_info_bottleneck.py --tasks ioi sva --n-prompts 80
C06: O-Information (58_o_information.py)
What it computes. Computes the O-information (Rosas et al., 2019) of circuit head DLA values, a multivariate measure that captures the overall balance between redundancy and synergy in a set of variables:
$$\Omega(X_1, \dots, X_n) = (n-2)\,H(X_1, \dots, X_n) + \sum_{j=1}^{n} \left[ H(X_j) - H(X_{-j}) \right]$$
where $H$ is the Shannon entropy and $X_{-j}$ denotes all variables except $X_j$.
Positive $\Omega$ indicates redundancy-dominated: heads carry overlapping information. Negative $\Omega$ indicates synergy-dominated: heads carry information jointly that no subset carries alone. Compares $\Omega$ for the circuit head set vs random subsets of the same size to test whether the circuit is specifically synergistic or redundant.
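The sign convention can be checked on two tiny discrete systems. The sketch below computes O-information from plug-in entropies as (n - 2) times the joint entropy plus the sum over variables of the marginal entropy minus the entropy of the remaining variables; it assumes already-discretized values, and `entropy_bits` and `o_information` are hypothetical names rather than functions from `58_o_information.py`.

```python
import math
from collections import Counter


def entropy_bits(rows):
    """Plug-in Shannon entropy (bits) of a sequence of hashable outcomes."""
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in Counter(rows).values())


def o_information(samples):
    """O-information of discrete samples; each sample is a tuple of variable values."""
    n_vars = len(samples[0])
    total = (n_vars - 2) * entropy_bits(samples)
    for j in range(n_vars):
        marginal = [row[j] for row in samples]
        rest = [row[:j] + row[j + 1:] for row in samples]
        total += entropy_bits(marginal) - entropy_bits(rest)
    return total


# Three variables where the third is the XOR of the first two: pure synergy.
xor_system = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
# Three identical copies of one fair bit: pure redundancy.
copy_system = [(0, 0, 0), (1, 1, 1)]

omega_xor = o_information(xor_system)
omega_copy = o_information(copy_system)
print(omega_xor, omega_copy)
```

For the comparison against random subsets, the same function would be applied to each random head set and the circuit value standardized against that distribution.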
Evidence family. Information-theoretic (multivariate redundancy/synergy).
Key metrics.
| Metric | Description | Baseline |
|---|---|---|
| `omega_circuit` | O-information for circuit heads | compared to random |
| `omega_random_mean` | Mean O-information for random head subsets of the same size | random baseline |
| `z_score` | Significance of the circuit O-information relative to random subsets | |
| `interpretation` | “redundancy” ($\Omega > 0$) or “synergy” ($\Omega < 0$) | qualitative |
What it establishes. Whether the circuit as a whole is organized around redundancy (fault tolerance, overlapping representations) or synergy (cooperative computation, emergent joint encoding). A z-score substantially different from zero means the circuit’s information structure is non-random: it is specifically more synergistic or redundant than an arbitrary subset of heads.
What it does not establish. Which heads drive the synergy or redundancy. O-information is a summary statistic for the entire set; it does not identify specific synergistic or redundant pairs. Use PID (C04) for pairwise decomposition.
Usage.
uv run python 58_o_information.py --tasks ioi sva --n-prompts 100
C07: Granger Causality (56_granger_causality.py)
What it computes. Treats the sequence of head activations across layers as a “time series” (layer = time). For each pair of circuit heads $A$ (layer $l_A$) and $B$ (layer $l_B$, $l_B > l_A$), tests whether $A$’s DLA Granger-causes $B$’s DLA: does adding $A$’s DLA improve prediction of $B$’s DLA beyond all other circuit heads at earlier layers? Uses an F-test comparing the restricted model (all earlier heads except $A$) to the full model (restricted + $A$).
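The restricted-versus-full comparison can be sketched as a nested-model F-statistic. The version below is a stdlib-only illustration that residualizes on the other earlier heads by Gram-Schmidt and then measures how much the candidate head reduces the residual sum of squares; `granger_f` and the synthetic data are assumptions, and converting F to a p-value (for example with an F distribution from a statistics library) is left out.

```python
import random


def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(v, u):
    """Remove from v its component along u (no-op if u is numerically zero)."""
    uu = dot(u, u)
    return list(v) if uu < 1e-12 else [a - dot(v, u) / uu * b for a, b in zip(v, u)]


def granger_f(source, target, others):
    """F-statistic for adding `source` to a linear model target ~ 1 + others."""
    x, y = center(source), center(target)
    basis = []
    for z in others:                      # orthogonalize the other predictors
        z = center(z)
        for u in basis:
            z = project_out(z, u)
        basis.append(z)
    for u in basis:                       # residualize source and target on them
        x, y = project_out(x, u), project_out(y, u)
    rss_restricted = dot(y, y)
    rss_full = rss_restricted - dot(x, y) ** 2 / dot(x, x)
    df_resid = len(y) - len(others) - 2   # intercept + others + source
    return (rss_restricted - rss_full) / (rss_full / df_resid)


rng = random.Random(0)
n = 300
other = [rng.gauss(0, 1) for _ in range(n)]
source = [rng.gauss(0, 1) for _ in range(n)]
unrelated = [rng.gauss(0, 1) for _ in range(n)]
target = [0.8 * s + 0.5 * o + rng.gauss(0, 0.5) for s, o in zip(source, other)]

f_real = granger_f(source, target, [other])
f_null = granger_f(unrelated, target, [other, source])
print(round(f_real, 1), round(f_null, 2))
```

With a single added predictor this F-statistic is a monotone function of the squared partial correlation, which is why C03 and C07 tend to agree on linear dependencies.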
Evidence family. Information-theoretic / causal (Granger causality).
Key metrics.
| Metric | Description | Baseline |
|---|---|---|
| `circuit_significance_rate` | Fraction of circuit edges that are Granger-significant at the chosen significance level | compared to non-circuit rate |
| `non_circuit_significance_rate` | Fraction of non-circuit edges that are Granger-significant | baseline |
| `top_significant` | Top Granger-significant edges by F-statistic | reported |
What it establishes. Circuit edges show Granger-causal relationships at a higher rate than non-circuit edges. Granger causality, which asks whether adding one head’s signal improves prediction of another’s beyond the other predictors, is a well-established statistical test for directed functional coupling. If circuit edges are preferentially Granger-significant, the circuit’s edge structure reflects genuine predictive information flow.
What it does not establish. True causality. Granger causality is a statistical (observational) concept: $A$ Granger-causes $B$ if $A$ contains unique predictive information about $B$. This can occur due to confounding (both respond to a common upstream signal with different delays). In transformers, all earlier-layer outputs are available to later layers via the residual stream, so Granger causality may reflect shared input features rather than direct information flow.
Usage.
uv run python 56_granger_causality.py --tasks ioi sva --n-prompts 100
C08: Observational Circuit Discovery / oCSE (07_ocse.py)
What it computes. Two complementary purely observational discovery methods:
- Stability selection via bootstrap LassoCV: runs 50 bootstrap LassoCV regressions predicting logit-diff from all 144 head DLAs, selects heads that appear in a sufficiently large fraction of bootstrap runs. Handles multicollinearity through L1 regularization and bootstrap averaging.
- Greedy oCSE (observational Causal Subgraph Extraction): greedy forward selection using conditional mutual information with a permutation-calibrated threshold (95th percentile of max CMI under row permutation).
Both use DLA features (signed head contribution to logit diff) rather than norms, since sign carries essential information about head function.
Evidence family. Information-theoretic / statistical (observational discovery).
Key metrics.
| Metric | Description | Interpretation |
|---|---|---|
| `f1` | F1 score of stability selection against known circuit | primary metric |
| `precision` / `recall` | Precision and recall of discovered heads | reported |
| `ocse.f1` | F1 of oCSE forward selection | secondary metric |
| `combined.f1` | F1 of union of both methods | convergent discovery |
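The scoring in this table amounts to set comparisons between discovered heads and the known circuit. A small sketch, with made-up `(layer, head)` tuples and a hypothetical helper `precision_recall_f1`, shows how the per-method and combined scores relate:

```python
def precision_recall_f1(discovered, known):
    """Score a discovered head set against a reference circuit."""
    discovered, known = set(discovered), set(known)
    tp = len(discovered & known)
    precision = tp / len(discovered) if discovered else 0.0
    recall = tp / len(known) if known else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical (layer, head) indices, chosen only for illustration.
known_circuit = {(9, 6), (9, 9), (10, 0), (5, 5)}
stability_heads = {(9, 6), (9, 9), (3, 0)}   # one false positive
ocse_heads = {(9, 9), (10, 0)}

_, _, f1_stability = precision_recall_f1(stability_heads, known_circuit)
_, _, f1_ocse = precision_recall_f1(ocse_heads, known_circuit)
_, _, f1_combined = precision_recall_f1(stability_heads | ocse_heads, known_circuit)
print(round(f1_stability, 3), round(f1_ocse, 3), round(f1_combined, 3))
```

Taking the union can raise recall at the cost of precision, so the combined score is not guaranteed to exceed either individual score.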
What it establishes. Circuit heads can be recovered from purely observational data — without any interventions. Stability selection identifies heads whose DLA values are robust predictors of logit-diff across bootstrap samples. oCSE identifies heads whose DLA provides incremental conditional information. Agreement between the two methods (combined F1) provides convergent validity.
What it does not establish. Causal necessity. Observational discovery identifies heads that are statistically predictive, not causally necessary. A head that correlates with logit-diff but is not causally load-bearing will be discovered by both methods. Combine with activation patching or ablation for causal validation of observationally-discovered circuits.
Usage.
uv run python 07_ocse.py --tasks ioi sva --n-prompts 200
C09: NOTEARS Structure Learning (09_notears.py)
What it computes. Learns a directed acyclic graph (DAG) over component activations using NOTEARS (Zheng et al., NeurIPS 2018), which reformulates the combinatorial DAG constraint as a continuous optimization:
$$\min_{W} \; \frac{1}{2n} \lVert X - XW \rVert_F^2 + \lambda \lVert W \rVert_1 \quad \text{subject to} \quad h(W) = \operatorname{tr}\!\left(e^{W \circ W}\right) - d = 0$$
where $X$ is the $n \times d$ data matrix, $W$ is the weighted adjacency matrix, and the acyclicity constraint $h(W) = 0$ is enforced via augmented Lagrangian. The discovered DAG’s parents of the output node are compared against known circuit heads via F1, precision, and recall. A permutation baseline (shuffled data) calibrates the false-positive rate.
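The acyclicity function at the heart of NOTEARS, h(W) = tr(exp(W∘W)) - d, is zero exactly when the weighted graph W has no directed cycles. A stdlib-only sketch that evaluates it with a truncated power series is shown below; the function names and the 20-term truncation are assumptions for illustration, and this is only the constraint, not the full augmented Lagrangian optimizer.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w, terms=20):
    """h(W) = tr(exp(W∘W)) - d, approximated by a truncated power series."""
    d = len(w)
    squared = [[v * v for v in row] for row in w]        # elementwise square W∘W
    power = [[1.0 if i == j else 0.0 for j in range(d)]  # (W∘W)^0 = identity
             for i in range(d)]
    trace, factorial = float(d), 1.0
    for k in range(1, terms + 1):
        power = matmul(power, squared)
        factorial *= k
        trace += sum(power[i][i] for i in range(d)) / factorial
    return trace - d


# Chain 0 -> 1 -> 2: a valid DAG.
dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]
# Same graph plus an edge 2 -> 0, which closes a cycle.
cyclic = [row[:] for row in dag]
cyclic[2][0] = 1.0

h_dag = acyclicity(dag)
h_cyclic = acyclicity(cyclic)
print(h_dag, h_cyclic)
```

Because h is smooth in W, a gradient-based optimizer can drive it to zero, which is what replaces the combinatorial search over DAGs.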
Evidence family. Information-theoretic / causal (continuous structure learning).
Key metrics.
| Metric | Description | Baseline |
|---|---|---|
| `f1` | F1 of NOTEARS-discovered parents against known circuit | primary metric |
| `precision` / `recall` | Precision and recall of discovered causal parents | reported |
| `baseline_random` | Number of parents discovered from permuted data | false-positive calibration |
What it establishes. A DAG recovered by continuous optimization — without prior assumptions about circuit structure — identifies circuit heads as causal parents of the output. NOTEARS discovers which heads causally precede the output, enforcing acyclicity as a structural constraint. This is stronger than undirected MI or correlation: the DAG has directionality.
What it does not establish. That the learned DAG reflects the true causal structure. NOTEARS assumes a linear structural equation model, which may not hold for transformer computations. The DAG is learned from observational data only and is subject to the same confounding concerns as Granger causality, though the acyclicity constraint partially mitigates this by imposing structural consistency.
Usage.
uv run python 09_notears.py --tasks ioi sva --n-prompts 80
Reading the Scores
Metric-level interpretation
| Metric | High score | Low score |
|---|---|---|
| C01 (MI) | Circuit heads share more information than random heads; functional coupling | No preferential coupling; circuit heads are informationally independent |
| C02 (CMI) | High direct fraction = parallel channels; high mediated = hub architecture | Mixed or trivial MI; not enough signal to decompose |
| C03 (Transfer Entropy) | Circuit edges carry directional information; sender informs receiver | Non-circuit edges carry as much TE; circuit edges are not privileged |
| C04 (PID) | High synergy = cooperative circuit; high redundancy = fault-tolerant | Low synergy and redundancy; heads contribute independently |
| C05 (Info Bottleneck) | Circuit layers have high $I(T;Y)$; task info peaks at circuit layers | Task info is uniformly distributed or peaks outside circuit layers |
| C06 (O-Information) | Circuit is specifically synergistic or redundant vs random | Circuit has the same information structure as random head subsets |
| C07 (Granger) | Circuit edges are preferentially Granger-significant | Granger significance does not discriminate circuit from non-circuit |
| C08 (oCSE) | Observational discovery recovers circuit heads; high F1 | Discovery identifies different heads; circuit may be non-predictive |
| C09 (NOTEARS) | Structure learning recovers circuit as DAG parents | DAG parents do not match circuit; structure may be non-linear |
Cross-metric triangulation
The strongest evidence comes from convergent findings across information-theoretic methods that probe different properties:
- C01 + C04: High within-circuit MI (C01) with high synergy (C04) indicates the circuit is both informationally coupled and cooperatively computational. If MI is high but synergy is low, the coupling is redundant (overlapping, not joint).
- C03 + C07: Transfer entropy (C03) and Granger causality (C07) both test directional information flow but use different statistical methods (partial correlation vs F-test). Convergence provides robust evidence of directed functional coupling along circuit edges.
- C02 + C06: Conditional MI (C02) identifies hub heads; O-information (C06) identifies overall synergy/redundancy balance. A synergy-dominated circuit (C06) with strong mediation (C02) suggests a few hub heads orchestrate cooperative computation.
- C08 + C09: oCSE (C08) discovers via forward selection; NOTEARS (C09) discovers via continuous DAG optimization. If both identify the same circuit heads from observational data alone, the circuit is robustly recoverable without interventions.
Relationship to Causal Metrics
Information-theoretic metrics are observational by nature: they measure statistical relationships in activation patterns without intervening. This makes them complementary to, not substitutes for, causal metrics (activation patching, ablation, causal scrubbing). The information-theoretic metrics answer “do these heads share information?” while causal metrics answer “does this head’s information causally affect the output?”
The ideal pattern is convergent: heads identified by MI, TE, and Granger causality should also be identified by activation patching and ablation. Divergence is informative: a head with high MI but low activation patching effect suggests its information is redundant (available elsewhere), while a head with low MI but high patching effect suggests it carries unique, sparse information.