Control Theory — Metrics & Protocols
Section titled “Control Theory — Metrics & Protocols”These protocols implement the criteria defined in the control theory lens overview. Each protocol translates a classical control-theoretic concept into a concrete experimental procedure for evaluating transformer circuit claims.
The control theory lens contributes criteria across three validity types: internal validity (settling depth, controllability), external validity (stability margin), and measurement validity (observability). The protocols below operationalize these criteria as runnable experiments with defined metrics, thresholds, and calibration suites.
I14 — Settling Depth (a14_settling_depth.py)
Section titled “I14 — Settling Depth (a14_settling_depth.py)”Lens: Control Theory | Validity type: Internal | Framework: Causal (perturbation propagation and recovery)
Question
Section titled “Question”When a circuit component is perturbed at layer , how quickly does the residual stream recover? If the circuit is modular and self-contained, perturbations should either persist (the component is necessary) or settle quickly (the network compensates). Long settling depths indicate distributed processing; short settling depths suggest localized computation or self-repair.
This protocol directly measures the step response — the central analytical construct of the control theory lens. The step response curve (perturbation magnitude vs layer depth) is the primary diagnostic artifact, revealing whether the circuit exhibits overdamped self-correction, underdamped overshoot, a persistent plateau, or divergent instability.
Metrics
Section titled “Metrics”| Metric | Description | Threshold |
|---|---|---|
mean_settling_depth | Average number of layers until the perturbation decays below 20% of its initial magnitude | layers |
steady_state_error | Residual perturbation at the final layer, measured as normalized distance from the clean run |
The settling threshold is set at 20% of the initial perturbation magnitude. A component that never reaches this threshold within the model’s remaining layers is classified as “non-settling.”
Interpretation
Section titled “Interpretation”A circuit with short settling depth ( layers) and low steady-state error () has effective self-correction — either through its own internal stability or through backup mechanisms in the network. The distinction matters: settling that survives compound ablation (removing backup components) is circuit-intrinsic; settling that collapses under compound ablation is network-mediated.
Long settling depth combined with high steady-state error means the perturbation propagates through all remaining layers and never dissipates — the circuit component has a lasting, distributed effect on downstream computation. This is consistent with the component being load-bearing but inconsistent with the computation being localized to the proposed circuit boundary.
Theoretical grounding
Section titled “Theoretical grounding”| Source | Contribution |
|---|---|
| Ogata (2010), Modern Control Engineering | Settling time in LTI systems |
| Elhage et al. (2021), “A Mathematical Framework for Transformer Circuits” | Residual stream as a dynamical system |
| McGrath et al. (2023), “The Hydra Effect” | Emergent self-repair in language model computations |
Calibrations
Section titled “Calibrations”Uses CAUSAL_CALIBRATIONS — the standard calibration suite for perturbation-based protocols. Calibrations include random-component baselines and ablation-method comparisons.
uv run python a14_settling_depth.py # all tasks, CPUuv run python a14_settling_depth.py --device cuda # GPUuv run python a14_settling_depth.py --tasks ioi induction # specific tasksE8 — Stability Margin (d08_stability_margin.py)
Section titled “E8 — Stability Margin (d08_stability_margin.py)”Lens: Control Theory | Validity type: External | Framework: Behavioral (robustness under scaled perturbation)
Question
Section titled “Question”How robust is the circuit’s behavioral contribution to scaled perturbation? The gain margin is the largest multiplier on the ablation magnitude at which the model still retains more than 50% of clean performance. A large gain margin means the circuit’s contribution is robust and findings from one perturbation strength are likely to hold at nearby strengths. A small gain margin means the circuit is operating near instability — findings are intervention-strength-specific and may not transfer.
The protocol sweeps perturbation magnitudes from 0.5x to 5.0x the standard ablation strength, measuring the fraction of clean logit difference retained at each level. The resulting dose-response curve (perturbation strength vs behavioral metric) is the stability analog of pharmacological dose-response.
Metrics
Section titled “Metrics”| Metric | Description | Threshold |
|---|---|---|
gain_margin | Largest perturbation multiplier where performance exceeds 50% of the clean baseline | |
performance_at_2x | Fraction of clean logit difference retained at 2x perturbation strength |
The perturbation magnitudes tested are: .
Interpretation
Section titled “Interpretation”A gain margin of 2.0 or higher means the circuit’s behavioral contribution survives doubling the perturbation strength — the finding is not balanced on a knife-edge. This supports external validity: the causal claim generalizes across intervention conditions.
A gain margin below 2.0 is a warning, not a disqualification. It means findings at the standard ablation level may not hold at other strengths, and the causal claim should be qualified as intervention-strength-specific. A gain margin below 1.0 means the circuit’s behavior changes qualitatively even under a weaker-than-standard perturbation — the finding is fragile.
The distinction between quantitative and qualitative degradation matters. Gradual performance decline (the logit difference decreases smoothly with perturbation strength) indicates a stable circuit under stress. A discontinuity — a sharp drop at a critical perturbation strength — indicates a phase transition, where the circuit shifts from one behavioral mode to another. The latter is more informative: it identifies the perturbation strength at which the circuit’s computational regime changes.
Theoretical grounding
Section titled “Theoretical grounding”| Source | Contribution |
|---|---|
| Ogata (2010), Modern Control Engineering | Gain margin in feedback systems |
| Cohen & Saphra (2024), “Evaluating the Faithfulness of Circuit Hypotheses” | Faithfulness under varying ablation strength |
| Wang et al. (2022), “Interpretability in the Wild” | IOI ablation studies |
Calibrations
Section titled “Calibrations”Uses BEHAVIORAL_CALIBRATIONS — calibrations designed for behavioral (output-level) protocols, including random-circuit and shuffled-prompt baselines.
uv run python d08_stability_margin.py # all tasks, CPUuv run python d08_stability_margin.py --device cuda # GPUuv run python d08_stability_margin.py --tasks ioi induction # specific tasksM9 — Observability (e09_observability.py)
Section titled “M9 — Observability (e09_observability.py)”Lens: Control Theory | Validity type: Measurement | Framework: Representational (output-decodability of internal state)
Question
Section titled “Question”Is the circuit’s internal state decodable from model outputs? If so, the circuit is “observable” in the control-theoretic sense: its hidden state can be reconstructed from the output trajectory. The protocol tests this by training linear probes from final logits to individual circuit head activations. High means the output contains enough information to reconstruct what the circuit computed internally.
Observability is the measurement-validity dual of controllability. A circuit can be controllable (you can steer its internal state by intervening on its inputs) without being observable (you cannot tell what state it reached from its outputs), and vice versa. Both properties are needed for the circuit to be fully characterizable — controllable means interventions are meaningful, observable means measurements are complete.
Metrics
Section titled “Metrics”| Metric | Description | Threshold |
|---|---|---|
mean_probe_r2 | Mean of linear probes from final logits to individual circuit head activations | |
observability_rank_ratio | Rank of the joint [logits, activations] matrix divided by the number of circuit heads |
Interpretation
Section titled “Interpretation”High mean_probe_r2 () means the circuit’s internal state is at least partially recoverable from the model’s output. The probes are linear, so the recovered information is linearly accessible — the circuit’s internal representations are not hidden behind a nonlinear encoding that requires special decoding to access.
The observability_rank_ratio captures whether the mapping from outputs to internal states is full-rank. A ratio of 1.0 means every circuit head’s activation is independently recoverable from the output. A ratio below 0.8 means some heads’ activations are linearly dependent in the output space — they cannot be individually distinguished from logits alone. This does not mean those heads are unimportant, but it does mean that output-based measurement cannot fully characterize their individual contributions.
Low observability (, rank ratio ) means the circuit’s internal dynamics are partially hidden from output-based evaluation. Behavioral metrics — which measure the circuit’s contribution through its effect on logits — are seeing a projection of the circuit’s state, not the full state. Any internal validity claim based solely on behavioral measurement implicitly excludes the unobservable subspace.
Theoretical grounding
Section titled “Theoretical grounding”| Source | Contribution |
|---|---|
| Kalman (1960), “A New Approach to Linear Filtering and Prediction Problems” | Observability in control theory |
| Alain & Bengio (2017), “Understanding Intermediate Layers Using Linear Classifier Probes” | Linear probing methodology |
| Belinkov (2022), “Probing Classifiers: Promises, Shortcomings, and Advances” | Probe reliability and limitations |
Calibrations
Section titled “Calibrations”Uses STRUCTURAL_CALIBRATIONS — calibrations appropriate for representational and structural protocols.
uv run python e09_observability.py # all tasks, CPUuv run python e09_observability.py --device cuda # GPUuv run python e09_observability.py --tasks ioi induction # specific tasksEN — Engineering Robustness (engineering.py)
Section titled “EN — Engineering Robustness (engineering.py)”Lens: Control Theory | Validity type: Internal | Framework: Cross-discipline (Engineering)
Question
Section titled “Question”How robust is the circuit under stress? This protocol tests four complementary aspects of engineering robustness: whether performance degrades gracefully as components are removed, whether the circuit recovers from component perturbation, how performance holds under extreme conditions, and how local faults cascade through the circuit.
Engineering robustness is the practical counterpart of the gain margin (E8). Where E8 scales a single perturbation continuously, engineering robustness tests qualitatively different stress conditions — component removal, fault injection, extreme inputs, and cascade analysis. Together, they characterize the circuit’s operational envelope: the range of conditions under which the circuit’s computational claims hold.
Metrics
Section titled “Metrics”| Metric | Description | Threshold |
|---|---|---|
graceful_degradation | Smooth vs catastrophic performance decline as components are removed | |
fault_tolerance | Recovery from component perturbation (noise injection, activation corruption) | |
stress_testing | Performance under extreme conditions (adversarial inputs, out-of-distribution prompts) | |
error_propagation | Fault cascade analysis — how local faults spread through the circuit | (lower is better) |
Interpretation
Section titled “Interpretation”Graceful degradation measures the shape of the performance curve as components are ablated one by one. A circuit with graceful degradation loses performance smoothly — each removed component contributes a roughly equal decrement. A circuit with catastrophic degradation has a cliff: performance is stable until a critical component is removed, then collapses. Graceful degradation suggests distributed computation; catastrophic degradation suggests a bottleneck architecture.
Fault tolerance measures recovery from noisy perturbation rather than clean ablation. A fault-tolerant circuit absorbs noise at individual components without propagating errors to the output. Low fault tolerance means the circuit has no error-correction mechanisms — any perturbation at any component reaches the output.
Stress testing pushes the circuit beyond its normal operating conditions. The question is whether the circuit’s computational structure holds under extreme inputs or whether it relies on distributional assumptions that break under stress.
Error propagation tracks how a localized fault cascades through the circuit’s layers and components. Low error propagation means faults are contained locally. High error propagation means a single component failure disrupts the entire circuit.
Theoretical grounding
Section titled “Theoretical grounding”| Source | Contribution |
|---|---|
| Leveson (2011), Engineering a Safer World | Systems safety and fault containment |
| Avizienis et al. (2004), “Basic Concepts and Taxonomy of Dependable Computing” | Fault tolerance taxonomy |
| Strogatz (2001), “Exploring Complex Networks” | Robustness in complex networks |
Calibrations
Section titled “Calibrations”Uses STRUCTURAL_CALIBRATIONS.
uv run python engineering.py # all tasks, CPUuv run python engineering.py --device cuda # GPUuv run python engineering.py --tasks ioi induction # specific tasksWC_M1 — PID Activation Steering (pid_steering.py)
Section titled “WC_M1 — PID Activation Steering (pid_steering.py)”Lens: Control Theory | Validity type: Internal | Framework: Wildcard (Control Theory)
Question
Section titled “Question”Can activation steering be modeled as a PID controller? Standard steering is proportional-only (P-only): a fixed scaling factor multiplied by a steering vector. PID control adds two additional terms:
- Integral (I): accumulated error across layers eliminates steady-state drift. If the steering effect attenuates through layers, the I-term compensates by accumulating the residual error.
- Derivative (D): rate-of-change damping prevents overshoot. If the circuit overreacts to steering (the effect grows before settling), the D-term suppresses the oscillation.
The dose-response curve of a circuit determines which control regime applies. A steep dose-response (small changes in produce large behavioral shifts) needs D-term damping to prevent overshoot. A shallow dose-response (the circuit barely responds to steering) needs I-term accumulation to build up sufficient effect. A moderate dose-response may work with P-only steering — standard activation engineering.
This protocol bridges the control theory lens to the pharmacology lens: the dose-response curve from pharmacological evaluation determines the control regime, and the PID analysis determines the optimal intervention strategy.
Metrics
Section titled “Metrics”| Metric | Description | Threshold |
|---|---|---|
dose_response | Circuit response curve steepness — how sensitively the output responds to steering magnitude | |
sigma_ablation | Noise tolerance — how much Gaussian noise the steering signal can absorb before losing effect (integral accumulation budget) | |
effect_size | Overall control authority — the maximum behavioral shift achievable through steering |
Interpretation
Section titled “Interpretation”High dose-response + high effect size indicates a circuit that responds strongly and steeply to steering. The circuit is controllable in the strong sense: small interventions produce large, predictable behavioral changes. This is the regime where D-term damping matters — without it, steering overshoots the target.
Low dose-response + low sigma ablation indicates a circuit that is hard to steer and sensitive to noise. The circuit has low control authority. I-term accumulation across layers may be needed to build sufficient steering effect, but noise sensitivity limits the accumulation budget. This regime suggests the circuit’s computation is distributed or that the steering vector is not well-aligned with the circuit’s functional subspace.
High sigma ablation indicates the steering signal is robust to noise — the circuit’s response to the steering vector survives corruption. This is consistent with the steering vector targeting a low-dimensional, robust feature of the circuit rather than relying on precise alignment with a high-dimensional representation.
The connection to Ziegler-Nichols tuning is direct: the dose-response curve determines the ultimate gain and ultimate period, from which PID gains can be computed. Whether PID steering outperforms P-only steering on a given circuit is an empirical question that this protocol is designed to answer.
Theoretical grounding
Section titled “Theoretical grounding”| Source | Contribution |
|---|---|
| ICLR 2026, “Activation Steering with a Feedback Controller” | PID framework for activation steering |
| Ziegler & Nichols, tuning method | Gain selection from dose-response curves |
Calibrations
Section titled “Calibrations”Uses STRUCTURAL_CALIBRATIONS.
uv run python pid_steering.py # all tasks, CPUuv run python pid_steering.py --device cuda # GPUuv run python pid_steering.py --tasks ioi induction # specific tasksCross-protocol evidence patterns
Section titled “Cross-protocol evidence patterns”The five control theory protocols provide converging evidence when combined:
| Pattern | Protocols | What it establishes |
|---|---|---|
| Short settling + high gain margin + high observability | I14 + E8 + M9 | Fully characterizable, robust control system |
| Short settling + low gain margin | I14 + E8 | Self-correcting but fragile — findings are intervention-strength-specific |
| High observability + low controllability (via PID) | M9 + WC_M1 | The circuit is readable but not steerable — internal state is decodable from outputs, but steering interventions have low effect size |
| Graceful degradation + high fault tolerance | EN | The circuit absorbs component loss and noise without catastrophic failure |
| Steep dose-response + high sigma ablation | WC_M1 | Strong, noise-robust control authority — the circuit is a good target for activation engineering |
| Non-settling step response + high error propagation | I14 + EN | Perturbations cascade through all remaining layers and propagate to other components — the circuit lacks containment |
Relationship to other lenses
Section titled “Relationship to other lenses”The control theory lens is complementary to other lenses in the framework:
- Neuroscience lens: Ablation (I1) establishes necessity; control theory establishes whether the intervention reached the circuit’s full state space (controllability) and whether the measurement captured the full internal state (observability).
- Pharmacology lens: Dose-response curves (D1) characterize how the circuit responds to graded intervention; PID steering (WC_M1) extends this by asking which control regime (P, PI, PD, PID) produces optimal intervention outcomes.
- Dynamical systems lens: Koopman spectral decomposition characterizes the modes of the circuit’s dynamics; control theory characterizes whether those modes are reachable (controllable) and measurable (observable).