Criterion E2 — Graded Response
Section titled “Criterion E2 — Graded Response”| Validity type | External |
| Pass condition | The effect scales monotonically with intervention strength; a threshold and plateau are visible |
| Evidence family | Causal |
| Minimum reporting | ≥7 intervention-strength values; task metric at each; identification of threshold and plateau |
| Common failure mode | Testing only 1–2 intervention strengths; never characterizing the dose-response curve |
What this criterion requires
Section titled “What this criterion requires”A mechanistically real intervention produces a response that:
- Scales monotonically with strength (more intervention → more effect, up to plateau).
- Has a threshold: below some minimal strength, the effect is negligible. A real mechanism has a threshold; random noise does not.
- Has a plateau: at some maximal strength, the effect saturates. Beyond the plateau, off-task metrics should degrade before on-task metrics plateau.
Operationalized via steering multiplier sweep using steering multiplier sweep in factor_analysis.py: sweep the strength parameter from 0 to 20 across ≥7 values; measure the target task metric at each; identify threshold and plateau.
Why the plateau matters
Section titled “Why the plateau matters”A result showing only the increasing part — without the plateau — is consistent with the hypothesis that the intervention is generically disrupting the model rather than specifically saturating a mechanism. A genuine mechanistic intervention will plateau when the specific computation is saturated; generic disruption keeps degrading all metrics without a meaningful plateau.
Minimum reporting rule
Section titled “Minimum reporting rule”- Full dose-response curve with ≥7 strength values.
- Identify threshold and plateau (or note that neither was found, with explanation).
- Compare on-task and off-task metrics at each strength level.
- If only 1–2 strength values tested: graded response is unsatisfied — flag it.