Skip to content

This lens asks one question: does information flow through the circuit in directed, efficient, self-sustaining patterns?

When we claim that a transformer circuit “processes” information, we are making a claim about communication channels: information enters the circuit at specific components, is transformed and transmitted through intermediate components, and arrives at the output in a form that determines the model’s behavior. Information theory (Shannon 1948; Cover & Thomas 2006) provides the mathematical framework for quantifying how much information flows, how efficiently it is compressed, and whether the flow is directed or merely correlated. Network science extends this by treating the circuit as a graph and asking whether information propagation through it has the epidemiological structure of a self-sustaining process or the fragile structure of a signal that requires constant external reinforcement.

Together, these tools address a gap that the other lenses leave open. The neuroscience lens asks whether a component is causally necessary. The pharmacology lens asks whether the effect scales. The dynamical systems lens asks whether the dynamics are structured. But none of them directly ask whether the circuit’s components communicate with each other in a directed, efficient way — whether the claimed “flow” of information is actually a flow, with measurable direction, quantifiable throughput, and identifiable bottlenecks. A set of components that are each individually causally relevant but share no directed information transfer is not a circuit in any information-theoretic sense. It is a coincidence of co-located effects.

There is a disanalogy worth naming. Shannon’s channel model assumes a sender, a channel, and a receiver with well-defined codebooks. A transformer circuit has none of these — there is no explicit encoder, no negotiated code, and no clean separation between channel and noise. The “information” in a residual stream is not a message in Shannon’s sense; it is the full state of a distributed computation. Information-theoretic quantities (mutual information, transfer entropy) can still be computed on activations, but they measure statistical dependence, not intentional communication. The gap between statistical dependence and computational role is real, and this lens does not close it by itself. What it does is provide evidence that is orthogonal to ablation-based evidence: two components can be statistically independent (no information flow) yet both causally necessary (ablating either degrades performance). The combination of causal evidence from other lenses and information-theoretic evidence from this one is stronger than either alone.

Mutual information I(X;Y)I(X; Y) is symmetric — it tells you that XX and YY share information, but not which causes which. Two attention heads can have high mutual information because they both read from the same position in the residual stream, not because one sends information to the other. Transfer entropy (Schreiber 2000) and Granger causality (Granger 1969) add directionality by conditioning on the past: XX Granger-causes YY if knowing XX‘s past reduces uncertainty about YY‘s future, beyond what YY‘s own past provides.

In MI: two heads in a proposed circuit may show high mutual information between their activations — but this could reflect shared input rather than directed communication. Granger causality applied to layer-by-layer activations (S08) tests the stronger claim: does the earlier component’s activation predict the later component’s activation beyond what the later component’s own history predicts? If so, there is directed information transfer along the claimed circuit edge. If not, the “connection” may be an artifact of shared context.

Sufficient statistics vs raw representations

Section titled “Sufficient statistics vs raw representations”

The information bottleneck principle (Tishby, Pereira & Bialek 1999) formalizes a tradeoff: the optimal representation of input XX for predicting output YY is the one that compresses XX maximally while preserving all information about YY. The sufficient statistic T(X)T(X) satisfies I(T(X);Y)=I(X;Y)I(T(X); Y) = I(X; Y) — it loses no task-relevant information — while minimizing I(T(X);X)I(T(X); X) — it discards everything irrelevant.

In MI: a circuit’s activations are an intermediate representation between input and output. If they are a good sufficient statistic — preserving most of I(input;output)I(\text{input}; \text{output}) while being much lower-dimensional than the input — the circuit is performing genuine compression, extracting the task-relevant signal from the noise. If the circuit’s activations preserve little of I(input;output)I(\text{input}; \text{output}), the circuit is losing task-relevant information and cannot be the primary mechanism for the behavior. If the circuit’s activations preserve I(input;output)I(\text{input}; \text{output}) but are as high-dimensional as the input itself, the circuit is not compressing — it is passing everything through, and “circuit” is a misleading label for “the whole model.”

The SIR model (Kermack & McKendrick 1927) distinguishes two regimes of disease spread based on the basic reproduction number R0R_0. When R0>1R_0 > 1, each infected individual infects more than one other on average — the disease is self-sustaining and spreads exponentially. When R0<1R_0 < 1, the disease dies out without external reintroduction.

In MI: if we model information as “spreading” through circuit components — each component receiving signal from its predecessors and transmitting to its successors — the circuit’s R0R_0 measures whether the signal is self-sustaining through the layers. A circuit with R0>1R_0 > 1 amplifies its signal as it propagates: the information grows stronger or at least maintains itself through the circuit’s depth. A circuit with R0<1R_0 < 1 attenuates: the signal weakens at each stage and would vanish without constant reinforcement from the residual stream outside the circuit. The first pattern is consistent with a genuine computational pathway. The second suggests the “circuit” is not self-contained — it depends on non-circuit components to keep the signal alive, which undermines claims of circuit sufficiency.

The free energy principle, adapted from statistical mechanics and variational inference, frames computation as a tradeoff between accuracy (how well a component’s output predicts the target) and complexity (how much of the input the component uses to achieve that prediction). Free energy F=accuracy+complexityF = -\text{accuracy} + \text{complexity}, and the optimal component minimizes FF — achieving good predictions without unnecessary complexity.

In MI: some attention heads may achieve high attribution scores by using a large fraction of the residual stream’s capacity — they are accurate but wasteful. Others achieve comparable attribution with much lower complexity. The Pareto frontier of accuracy vs complexity identifies the heads that are genuinely efficient: no other head achieves equal accuracy at lower complexity. Heads below the frontier are either redundant (another head does the same job more efficiently) or vestigial (they add complexity without proportionate accuracy). A circuit composed primarily of Pareto-efficient components is well-engineered. A circuit that includes many sub-Pareto components is carrying dead weight.

The signature artifact of information-theoretic evaluation is the information flow graph: a directed graph where nodes are circuit components and edges are weighted by directed information transfer measures (Granger F-statistics, transfer entropy values, or conditional mutual information).

The graph reveals structure that no single ablation can:

  • Sources — components with high out-degree and low in-degree in the information flow graph. These are where task-relevant information enters the circuit. In a well-defined circuit, sources should correspond to components that attend to or extract the task-relevant tokens.
  • Sinks — components with high in-degree and low out-degree. These are where the circuit’s computation converges to produce the output. Sinks should correspond to components with high direct logit attribution.
  • Hubs — components with both high in-degree and high out-degree. These are the bottlenecks through which information must pass. Removing a hub should collapse R0R_0 below 1.0 (I12).
  • Topology match — the empirically measured information flow graph should agree with the circuit’s claimed topology. If the circuit claim says “head A sends to head B,” the edge ABA \to B should have significant directed information transfer. If the information flow graph reveals strong edges not present in the claimed circuit, the circuit description is incomplete.

To construct the graph: collect layer-by-layer activations for circuit components across a prompt set. For each pair of components (A,B)(A, B) where AA is in an earlier layer than BB, compute the Granger F-statistic or transfer entropy from AA‘s activations to BB‘s activations, conditioning on all other components in intermediate layers. Apply Bonferroni correction for multiple comparisons. Retain edges with p<0.05p < 0.05. Report the graph, its degree distribution, and its overlap with the claimed circuit topology.

Effective number of components (Hill diversity)

Section titled “Effective number of components (Hill diversity)”

Raw Shannon entropy HH of a circuit’s activation distribution or contribution weights is difficult to interpret — is H=2.3H = 2.3 bits concentrated or diffuse? The Hill number 1D=2H{}^1D = 2^H (Hill 1973; Jost 2006) converts entropy to an effective number of equally-contributing components: an intuitive count. A circuit with H=3.2H = 3.2 bits behaves as if it had 23.29.22^{3.2} \approx 9.2 equally-active components.

This transformation is standard in ecology (where it converts species-abundance entropy to “effective number of species”) and provides a natural language for localization claims in MI:

  • Low effective count (1D5{}^1D \leq 5) — the computation is concentrated in a few components. The circuit is localized.
  • High effective count (1D20{}^1D \geq 20) — the computation is spread across many components. The circuit is distributed.
  • Effective count close to total count — every component contributes roughly equally. There is no circuit in any meaningful sense; the “circuit” is the whole model.

Report 1D{}^1D alongside entropy whenever characterizing a circuit’s concentration or diffuseness. It turns a number that requires expertise to interpret into one that does not.

SourceYearFieldPrinciple
Shannon, “A mathematical theory of communication”1948Information TheoryChannel capacity and mutual information — the fundamental limits on reliable communication through a noisy channel
Granger, “Investigating causal relations by econometric models and cross-spectral methods”1969EconometricsGranger causalityXX Granger-causes YY if XX‘s past reduces prediction error for YY beyond YY‘s own past
Kermack & McKendrick, “A contribution to the mathematical theory of epidemics”1927EpidemiologySIR model and R0R_0 — self-sustaining propagation (R0>1R_0 > 1) vs die-out (R0<1R_0 < 1); threshold determines whether a signal maintains itself through a network
Hawkes, “Spectra of some self-exciting and mutually exciting point processes”1971StatisticsHawkes process — self-exciting point process where past events increase the rate of future events; models temporal cascading in neural circuits
Tishby, Pereira & Bialek, “The information bottleneck method”1999Information TheoryInformation bottleneck — optimal representations compress input while preserving output-relevant information; the sufficient statistic is the endpoint
Schreiber, “Measuring information transfer”2000Information TheoryTransfer entropy — non-parametric, model-free measure of directed information transfer; generalizes Granger causality beyond linear models
Cover & Thomas, Elements of Information Theory2006Information TheoryData processing inequality — post-processing cannot increase information; downstream circuit components cannot contain more task-relevant information than upstream ones
Hill, “Diversity and evenness: a unifying notation and its consequences”1973EcologyHill numbersqD{}^qD converts entropy of order qq to an effective number of equally-abundant types; 1D=2H{}^1D = 2^H for Shannon entropy
Jost, “Entropy and diversity”2006Ecology / Information TheoryTrue diversity — entropy is not diversity; exponentiation is required for a measure that doubles when two equally-diverse communities are pooled

Data processing inequality (Cover & Thomas 2006): For any Markov chain XYZX \to Y \to Z, I(X;Z)I(X;Y)I(X; Z) \leq I(X; Y). Processing cannot create information. In MI: if a circuit is a pipeline ABCA \to B \to C, the information about the task at CC cannot exceed the information at BB. Any apparent increase indicates that CC receives information from outside the circuit — the circuit description is incomplete.

A set of causally relevant components with no measurable directed information transfer between them is not a circuit — it is a list of individually important parts. To call it a circuit is to claim that the components communicate, that the communication is directional, and that the signal is strong enough to sustain the claimed computation through the circuit’s depth. Information theory provides the tools to test each of these claims independently of ablation evidence.

The primary contribution of this lens is to construct validity (C9) and internal validity (I11, I12), with a secondary contribution to measurement validity (M8). Together, these criteria ask whether the circuit has the information-theoretic properties that would make the claimed computation coherent: compression (C9), directed flow (I11), self-sustaining propagation (I12), and efficient use of model capacity (M8).

CodeCriterionWhat it asksValidity type
C9Information bottleneckDoes the circuit compress input to a sufficient statistic for the output?Construct
I11Directed information flowDoes Granger causality confirm directed transfer along claimed edges?Internal
I12Transmission criticalityDoes the circuit’s information propagation self-sustain, and do hub removals collapse it?Internal
M8Free energy efficiencyAre the circuit’s components Pareto-efficient in accuracy vs complexity?Measurement

The circuit’s activations should compress the input to a low-dimensional sufficient statistic for the output — preserving task-relevant information while discarding irrelevant variation.

What it establishes. The circuit is performing genuine information processing: extracting the signal from the noise, reducing dimensionality, and retaining what matters for the task. This is the information-theoretic analog of structural plausibility (C2) — it asks whether the circuit’s representations have the right information content for the claimed computation, not just the right causal profile.

What it does not establish. That the compression is causally necessary (that is I1), that the compressed representation is interpretable (that is V1), or that the same compression occurs in other models (that is E6). A circuit can compress efficiently and still be one of multiple circuits that achieve the same output.

Threshold. Mutual information ratio I(circuit activations;output)/I(input;output)0.7I(\text{circuit activations}; \text{output}) / I(\text{input}; \text{output}) \geq 0.7, with compression ratio dim(circuit)/dim(input)0.3\dim(\text{circuit}) / \dim(\text{input}) \leq 0.3. The circuit preserves at least 70% of the task-relevant information in at most 30% of the input’s dimensionality.

Minimum reporting. I(circuit;output)I(\text{circuit}; \text{output}), I(input;output)I(\text{input}; \text{output}), their ratio, the dimensionality of the circuit representation, the dimensionality of the input, and the compression ratio. Estimation method for mutual information (binning, KSG estimator, or variational bound) must be stated, as MI estimates are sensitive to the estimator.

Granger causality or transfer entropy should confirm directed information transfer along the circuit’s claimed edges.

What it establishes. The circuit’s components actually communicate — earlier components send information that later components use, beyond what the later components could predict from their own context. This is evidence for the “flow” in “information flow,” which ablation studies assume but do not directly test. A circuit with confirmed directed edges has a verified communication topology, not just a list of individually important components.

What it does not establish. That the directed information is task-relevant (Granger causality detects any predictive relationship, not just task-relevant ones), or that the flow is necessary for the behavior (that requires ablation of specific edges, which is I1). A circuit can have strong directed information transfer along edges that carry non-task-relevant signal.

Threshold. At least 50% of claimed circuit edges show significant directed information transfer after Bonferroni correction (p<0.05p < 0.05). Edge Jaccard similarity between the claimed circuit graph and the empirically measured information flow graph 0.3\geq 0.3. The Jaccard threshold is deliberately low — circuits are typically specified at a coarser granularity than the information flow graph, so partial overlap is expected.

Minimum reporting. Number of claimed edges tested, number significant after correction, Jaccard similarity with claimed topology, the correction method, and the information transfer measure used (Granger F-statistic or transfer entropy). Any strong empirical edges not present in the claimed circuit should be noted — they indicate missing circuit components.

The circuit’s information propagation, modeled as a network SIR process (WC_M13), should be self-sustaining (R0>1R_0 > 1), and removing hub components should collapse propagation (R0<1R_0 < 1).

What it establishes. The circuit is a self-sustaining computational pathway: signal entering the circuit maintains or amplifies itself through the circuit’s depth without requiring constant reinforcement from non-circuit components. Hub components — those whose removal drops R0R_0 below 1 — are the circuit’s structural bottlenecks, the components whose presence is necessary not just for their own contribution but for the entire circuit’s ability to propagate signal.

What it does not establish. That the self-sustaining signal is the task-relevant computation (it could be a general-purpose signal that happens to traverse the circuit), or that hub components are more important than non-hub components for the task output (a non-hub component might have higher direct logit attribution). Transmission criticality characterizes the circuit’s network structure, not the content of what it computes.

Threshold. Circuit R0>1.0R_0 > 1.0 (self-sustaining propagation). Removing each identified hub component individually should drop R0R_0 below 1.01.0. If no single hub removal drops R0R_0 below 1.01.0, the circuit has distributed robustness and no single structural bottleneck — which is informative but does not satisfy the criterion as stated.

Minimum reporting. Circuit R0R_0, the SIR model parameters (transmission rate, recovery rate), the identity of hub components (those with highest R0R_0 sensitivity), and the R0R_0 after removing each hub. The Hawkes process variant (WC_M9) may be used to model temporal cascading within layers — if used, report the branching ratio and the excitation kernel parameters.

The circuit’s components should achieve a good accuracy-complexity tradeoff, as measured by their position on the Pareto frontier (WC_M12).

What it establishes. The circuit is composed of components that are not wasteful — each one achieves its contribution to the task without using more of the residual stream’s capacity than necessary. This is a measurement-validity criterion because it evaluates whether the circuit’s components are well-chosen, not whether the circuit’s causal claims hold. A circuit composed entirely of Pareto-efficient components is a minimal description of the computation. A circuit containing many sub-Pareto components may include redundant or vestigial elements that inflate the circuit size without improving its explanatory power.

What it does not establish. That the Pareto-efficient components are causally necessary (a component can be efficient and still be backed up by another equally efficient component), or that the inefficient components are unimportant (they may carry non-task-relevant computations that are important for other behaviors).

Threshold. At least 50% of circuit heads lie on or near the Pareto frontier (no other head in the model achieves equal accuracy at lower complexity). Mean Pareto efficiency across circuit components 0.6\geq 0.6, where Pareto efficiency is defined as the ratio of a component’s accuracy to the accuracy of the nearest Pareto-optimal component at the same complexity level. A component on the frontier has efficiency 1.01.0; a component far below it has efficiency approaching 00.

Minimum reporting. The accuracy-complexity scatter plot for all heads in the model, with circuit heads highlighted. The Pareto frontier. The fraction of circuit heads on the frontier. Mean Pareto efficiency. The definitions of “accuracy” (e.g., direct logit attribution, task probability contribution) and “complexity” (e.g., effective rank of the head’s WOVW_{OV}, fraction of residual stream variance used) must be stated — these are not standardized quantities, and the result depends on the definitions.

Evidence patternWhat it establishesRecommended language
High compression + directed edges + R0>1R_0 > 1Coherent, directed, self-sustaining information pipeline”The circuit compresses input to a sufficient statistic and transmits it through directed, self-sustaining information flow”
Directed edges without compressionInformation flows but the circuit does not reduce dimensionality”Directed information transfer confirmed along [fraction] of claimed edges; the circuit passes information through without compressing it”
High compression without directed edgesThe circuit compresses but components do not demonstrably communicate”Circuit activations compress task-relevant information; directed flow between components is not confirmed”
R0<1R_0 < 1Signal attenuates through the circuit; not self-sustaining”The circuit’s information propagation is sub-critical (R0=XR_0 = X); the signal requires external reinforcement”
Most components sub-ParetoCircuit includes inefficient components”Only [fraction] of circuit components are Pareto-efficient; the circuit description may include redundant elements”
  • Proposed to Causally suggestive: C9 (information bottleneck) alone does not gate the transition. The information theory lens contributes to construct and internal validity but does not substitute for causal evidence from the neuroscience lens.
  • Causally suggestive to Mechanistically supported: I11 (directed information flow) strengthens the case that the circuit is not just a set of individually causally relevant components but a connected communication pathway. This contributes to internal validity beyond what ablation alone provides.
  • Mechanistically supported to Triangulated: I12 (transmission criticality) and C9 (information bottleneck) provide convergent structural evidence from an information-theoretic evidence family, independent of ablation-based evidence. M8 (free energy efficiency) contributes to measurement validity by evaluating whether the circuit’s composition is parsimonious.
  • Triangulated to Validated: All four criteria satisfied, with directed information flow confirmed on held-out prompts and cross-checkpoint stability of R0R_0 and compression ratio.
  1. Compression. Estimate I(circuit activations;output)I(\text{circuit activations}; \text{output}) and I(input;output)I(\text{input}; \text{output}) on the prompt set. Report the ratio and the compression ratio. State the MI estimator used and its hyperparameters.
  2. Directed flow. For each claimed circuit edge, compute the Granger F-statistic or transfer entropy from the earlier component’s activations to the later component’s activations. Apply Bonferroni correction. Report the fraction of significant edges and Jaccard similarity with the claimed topology.
  3. Transmission criticality. Fit an SIR model to the circuit’s information propagation graph. Report R0R_0. Identify hub components. Remove each hub individually and report the resulting R0R_0. Optionally fit a Hawkes process (WC_M9) for temporal cascading analysis.
  4. Free energy efficiency. Compute accuracy and complexity for all heads in the model. Plot the Pareto frontier. Report the fraction of circuit heads on the frontier and mean Pareto efficiency.
  5. Integration. Construct the information flow graph. Compare its topology to the claimed circuit structure. Note any strong empirical edges absent from the claimed circuit.
  6. A skipped step must be named in the verdict.
  • IOI — the IOI circuit can be analyzed with this lens via S08 (Granger causality) and WC_M13 (SIR transmission) protocols; the claimed name-mover to S-inhibition edges are testable as directed information flow
  • Induction Heads — the two-head composition (previous-token head to induction head) is a natural target for directed information transfer analysis; R0R_0 should exceed 1.0 for the two-layer pipeline