Skip to content

These protocols implement the criteria defined in the control theory lens overview. Each protocol translates a classical control-theoretic concept into a concrete experimental procedure for evaluating transformer circuit claims.

The control theory lens contributes criteria across three validity types: internal validity (settling depth, controllability), external validity (stability margin), and measurement validity (observability). The protocols below operationalize these criteria as runnable experiments with defined metrics, thresholds, and calibration suites.


I14 — Settling Depth (a14_settling_depth.py)

Section titled “I14 — Settling Depth (a14_settling_depth.py)”

Lens: Control Theory | Validity type: Internal | Framework: Causal (perturbation propagation and recovery)

When a circuit component is perturbed at layer LL, how quickly does the residual stream recover? If the circuit is modular and self-contained, perturbations should either persist (the component is necessary) or settle quickly (the network compensates). Long settling depths indicate distributed processing; short settling depths suggest localized computation or self-repair.

This protocol directly measures the step response — the central analytical construct of the control theory lens. The step response curve (perturbation magnitude vs layer depth) is the primary diagnostic artifact, revealing whether the circuit exhibits overdamped self-correction, underdamped overshoot, a persistent plateau, or divergent instability.

MetricDescriptionThreshold
mean_settling_depthAverage number of layers until the perturbation decays below 20% of its initial magnitude3.0\leq 3.0 layers
steady_state_errorResidual perturbation at the final layer, measured as normalized L2L^2 distance from the clean run0.1\leq 0.1

The settling threshold is set at 20% of the initial perturbation magnitude. A component that never reaches this threshold within the model’s remaining layers is classified as “non-settling.”

A circuit with short settling depth (3\leq 3 layers) and low steady-state error (0.1\leq 0.1) has effective self-correction — either through its own internal stability or through backup mechanisms in the network. The distinction matters: settling that survives compound ablation (removing backup components) is circuit-intrinsic; settling that collapses under compound ablation is network-mediated.

Long settling depth combined with high steady-state error means the perturbation propagates through all remaining layers and never dissipates — the circuit component has a lasting, distributed effect on downstream computation. This is consistent with the component being load-bearing but inconsistent with the computation being localized to the proposed circuit boundary.

SourceContribution
Ogata (2010), Modern Control EngineeringSettling time in LTI systems
Elhage et al. (2021), “A Mathematical Framework for Transformer Circuits”Residual stream as a dynamical system
McGrath et al. (2023), “The Hydra Effect”Emergent self-repair in language model computations

Uses CAUSAL_CALIBRATIONS — the standard calibration suite for perturbation-based protocols. Calibrations include random-component baselines and ablation-method comparisons.

Terminal window
uv run python a14_settling_depth.py # all tasks, CPU
uv run python a14_settling_depth.py --device cuda # GPU
uv run python a14_settling_depth.py --tasks ioi induction # specific tasks

E8 — Stability Margin (d08_stability_margin.py)

Section titled “E8 — Stability Margin (d08_stability_margin.py)”

Lens: Control Theory | Validity type: External | Framework: Behavioral (robustness under scaled perturbation)

How robust is the circuit’s behavioral contribution to scaled perturbation? The gain margin is the largest multiplier on the ablation magnitude at which the model still retains more than 50% of clean performance. A large gain margin means the circuit’s contribution is robust and findings from one perturbation strength are likely to hold at nearby strengths. A small gain margin means the circuit is operating near instability — findings are intervention-strength-specific and may not transfer.

The protocol sweeps perturbation magnitudes from 0.5x to 5.0x the standard ablation strength, measuring the fraction of clean logit difference retained at each level. The resulting dose-response curve (perturbation strength vs behavioral metric) is the stability analog of pharmacological dose-response.

MetricDescriptionThreshold
gain_marginLargest perturbation multiplier where performance exceeds 50% of the clean baseline2.0\geq 2.0
performance_at_2xFraction of clean logit difference retained at 2x perturbation strength0.5\geq 0.5

The perturbation magnitudes tested are: {0.5,1.0,1.5,2.0,2.5,3.0,4.0,5.0}\{0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0\}.

A gain margin of 2.0 or higher means the circuit’s behavioral contribution survives doubling the perturbation strength — the finding is not balanced on a knife-edge. This supports external validity: the causal claim generalizes across intervention conditions.

A gain margin below 2.0 is a warning, not a disqualification. It means findings at the standard ablation level may not hold at other strengths, and the causal claim should be qualified as intervention-strength-specific. A gain margin below 1.0 means the circuit’s behavior changes qualitatively even under a weaker-than-standard perturbation — the finding is fragile.

The distinction between quantitative and qualitative degradation matters. Gradual performance decline (the logit difference decreases smoothly with perturbation strength) indicates a stable circuit under stress. A discontinuity — a sharp drop at a critical perturbation strength — indicates a phase transition, where the circuit shifts from one behavioral mode to another. The latter is more informative: it identifies the perturbation strength at which the circuit’s computational regime changes.

SourceContribution
Ogata (2010), Modern Control EngineeringGain margin in feedback systems
Cohen & Saphra (2024), “Evaluating the Faithfulness of Circuit Hypotheses”Faithfulness under varying ablation strength
Wang et al. (2022), “Interpretability in the Wild”IOI ablation studies

Uses BEHAVIORAL_CALIBRATIONS — calibrations designed for behavioral (output-level) protocols, including random-circuit and shuffled-prompt baselines.

Terminal window
uv run python d08_stability_margin.py # all tasks, CPU
uv run python d08_stability_margin.py --device cuda # GPU
uv run python d08_stability_margin.py --tasks ioi induction # specific tasks

M9 — Observability (e09_observability.py)

Section titled “M9 — Observability (e09_observability.py)”

Lens: Control Theory | Validity type: Measurement | Framework: Representational (output-decodability of internal state)

Is the circuit’s internal state decodable from model outputs? If so, the circuit is “observable” in the control-theoretic sense: its hidden state can be reconstructed from the output trajectory. The protocol tests this by training linear probes from final logits to individual circuit head activations. High R2R^2 means the output contains enough information to reconstruct what the circuit computed internally.

Observability is the measurement-validity dual of controllability. A circuit can be controllable (you can steer its internal state by intervening on its inputs) without being observable (you cannot tell what state it reached from its outputs), and vice versa. Both properties are needed for the circuit to be fully characterizable — controllable means interventions are meaningful, observable means measurements are complete.

MetricDescriptionThreshold
mean_probe_r2Mean R2R^2 of linear probes from final logits to individual circuit head activations0.3\geq 0.3
observability_rank_ratioRank of the joint [logits, activations] matrix divided by the number of circuit heads0.8\geq 0.8

High mean_probe_r2 (0.3\geq 0.3) means the circuit’s internal state is at least partially recoverable from the model’s output. The probes are linear, so the recovered information is linearly accessible — the circuit’s internal representations are not hidden behind a nonlinear encoding that requires special decoding to access.

The observability_rank_ratio captures whether the mapping from outputs to internal states is full-rank. A ratio of 1.0 means every circuit head’s activation is independently recoverable from the output. A ratio below 0.8 means some heads’ activations are linearly dependent in the output space — they cannot be individually distinguished from logits alone. This does not mean those heads are unimportant, but it does mean that output-based measurement cannot fully characterize their individual contributions.

Low observability (R2<0.3R^2 < 0.3, rank ratio <0.8< 0.8) means the circuit’s internal dynamics are partially hidden from output-based evaluation. Behavioral metrics — which measure the circuit’s contribution through its effect on logits — are seeing a projection of the circuit’s state, not the full state. Any internal validity claim based solely on behavioral measurement implicitly excludes the unobservable subspace.

SourceContribution
Kalman (1960), “A New Approach to Linear Filtering and Prediction Problems”Observability in control theory
Alain & Bengio (2017), “Understanding Intermediate Layers Using Linear Classifier Probes”Linear probing methodology
Belinkov (2022), “Probing Classifiers: Promises, Shortcomings, and Advances”Probe reliability and limitations

Uses STRUCTURAL_CALIBRATIONS — calibrations appropriate for representational and structural protocols.

Terminal window
uv run python e09_observability.py # all tasks, CPU
uv run python e09_observability.py --device cuda # GPU
uv run python e09_observability.py --tasks ioi induction # specific tasks

EN — Engineering Robustness (engineering.py)

Section titled “EN — Engineering Robustness (engineering.py)”

Lens: Control Theory | Validity type: Internal | Framework: Cross-discipline (Engineering)

How robust is the circuit under stress? This protocol tests four complementary aspects of engineering robustness: whether performance degrades gracefully as components are removed, whether the circuit recovers from component perturbation, how performance holds under extreme conditions, and how local faults cascade through the circuit.

Engineering robustness is the practical counterpart of the gain margin (E8). Where E8 scales a single perturbation continuously, engineering robustness tests qualitatively different stress conditions — component removal, fault injection, extreme inputs, and cascade analysis. Together, they characterize the circuit’s operational envelope: the range of conditions under which the circuit’s computational claims hold.

MetricDescriptionThreshold
graceful_degradationSmooth vs catastrophic performance decline as components are removed0.5\geq 0.5
fault_toleranceRecovery from component perturbation (noise injection, activation corruption)0.5\geq 0.5
stress_testingPerformance under extreme conditions (adversarial inputs, out-of-distribution prompts)0.5\geq 0.5
error_propagationFault cascade analysis — how local faults spread through the circuit0.0\leq 0.0 (lower is better)

Graceful degradation measures the shape of the performance curve as components are ablated one by one. A circuit with graceful degradation loses performance smoothly — each removed component contributes a roughly equal decrement. A circuit with catastrophic degradation has a cliff: performance is stable until a critical component is removed, then collapses. Graceful degradation suggests distributed computation; catastrophic degradation suggests a bottleneck architecture.

Fault tolerance measures recovery from noisy perturbation rather than clean ablation. A fault-tolerant circuit absorbs noise at individual components without propagating errors to the output. Low fault tolerance means the circuit has no error-correction mechanisms — any perturbation at any component reaches the output.

Stress testing pushes the circuit beyond its normal operating conditions. The question is whether the circuit’s computational structure holds under extreme inputs or whether it relies on distributional assumptions that break under stress.

Error propagation tracks how a localized fault cascades through the circuit’s layers and components. Low error propagation means faults are contained locally. High error propagation means a single component failure disrupts the entire circuit.

SourceContribution
Leveson (2011), Engineering a Safer WorldSystems safety and fault containment
Avizienis et al. (2004), “Basic Concepts and Taxonomy of Dependable Computing”Fault tolerance taxonomy
Strogatz (2001), “Exploring Complex Networks”Robustness in complex networks

Uses STRUCTURAL_CALIBRATIONS.

Terminal window
uv run python engineering.py # all tasks, CPU
uv run python engineering.py --device cuda # GPU
uv run python engineering.py --tasks ioi induction # specific tasks

WC_M1 — PID Activation Steering (pid_steering.py)

Section titled “WC_M1 — PID Activation Steering (pid_steering.py)”

Lens: Control Theory | Validity type: Internal | Framework: Wildcard (Control Theory)

Can activation steering be modeled as a PID controller? Standard steering is proportional-only (P-only): a fixed scaling factor α\alpha multiplied by a steering vector. PID control adds two additional terms:

  • Integral (I): accumulated error across layers eliminates steady-state drift. If the steering effect attenuates through layers, the I-term compensates by accumulating the residual error.
  • Derivative (D): rate-of-change damping prevents overshoot. If the circuit overreacts to steering (the effect grows before settling), the D-term suppresses the oscillation.

The dose-response curve of a circuit determines which control regime applies. A steep dose-response (small changes in α\alpha produce large behavioral shifts) needs D-term damping to prevent overshoot. A shallow dose-response (the circuit barely responds to steering) needs I-term accumulation to build up sufficient effect. A moderate dose-response may work with P-only steering — standard activation engineering.

This protocol bridges the control theory lens to the pharmacology lens: the dose-response curve from pharmacological evaluation determines the control regime, and the PID analysis determines the optimal intervention strategy.

MetricDescriptionThreshold
dose_responseCircuit response curve steepness — how sensitively the output responds to steering magnitude0.5\geq 0.5
sigma_ablationNoise tolerance — how much Gaussian noise the steering signal can absorb before losing effect (integral accumulation budget)0.5\geq 0.5
effect_sizeOverall control authority — the maximum behavioral shift achievable through steering0.8\geq 0.8

High dose-response + high effect size indicates a circuit that responds strongly and steeply to steering. The circuit is controllable in the strong sense: small interventions produce large, predictable behavioral changes. This is the regime where D-term damping matters — without it, steering overshoots the target.

Low dose-response + low sigma ablation indicates a circuit that is hard to steer and sensitive to noise. The circuit has low control authority. I-term accumulation across layers may be needed to build sufficient steering effect, but noise sensitivity limits the accumulation budget. This regime suggests the circuit’s computation is distributed or that the steering vector is not well-aligned with the circuit’s functional subspace.

High sigma ablation indicates the steering signal is robust to noise — the circuit’s response to the steering vector survives corruption. This is consistent with the steering vector targeting a low-dimensional, robust feature of the circuit rather than relying on precise alignment with a high-dimensional representation.

The connection to Ziegler-Nichols tuning is direct: the dose-response curve determines the ultimate gain and ultimate period, from which PID gains can be computed. Whether PID steering outperforms P-only steering on a given circuit is an empirical question that this protocol is designed to answer.

SourceContribution
ICLR 2026, “Activation Steering with a Feedback Controller”PID framework for activation steering
Ziegler & Nichols, tuning methodGain selection from dose-response curves

Uses STRUCTURAL_CALIBRATIONS.

Terminal window
uv run python pid_steering.py # all tasks, CPU
uv run python pid_steering.py --device cuda # GPU
uv run python pid_steering.py --tasks ioi induction # specific tasks

The five control theory protocols provide converging evidence when combined:

PatternProtocolsWhat it establishes
Short settling + high gain margin + high observabilityI14 + E8 + M9Fully characterizable, robust control system
Short settling + low gain marginI14 + E8Self-correcting but fragile — findings are intervention-strength-specific
High observability + low controllability (via PID)M9 + WC_M1The circuit is readable but not steerable — internal state is decodable from outputs, but steering interventions have low effect size
Graceful degradation + high fault toleranceENThe circuit absorbs component loss and noise without catastrophic failure
Steep dose-response + high sigma ablationWC_M1Strong, noise-robust control authority — the circuit is a good target for activation engineering
Non-settling step response + high error propagationI14 + ENPerturbations cascade through all remaining layers and propagate to other components — the circuit lacks containment

The control theory lens is complementary to other lenses in the framework:

  • Neuroscience lens: Ablation (I1) establishes necessity; control theory establishes whether the intervention reached the circuit’s full state space (controllability) and whether the measurement captured the full internal state (observability).
  • Pharmacology lens: Dose-response curves (D1) characterize how the circuit responds to graded intervention; PID steering (WC_M1) extends this by asking which control regime (P, PI, PD, PID) produces optimal intervention outcomes.
  • Dynamical systems lens: Koopman spectral decomposition characterizes the modes of the circuit’s dynamics; control theory characterizes whether those modes are reachable (controllable) and measurable (observable).