D — How to Choose an Instrument: Criterion → Instrument Lookup

| Criterion | Primary instrument(s) | Evidence family | Notes |
| --- | --- | --- | --- |
| I1 Necessity | Zero ablation, resample ablation, mean ablation | Causal | Run ≥2 methods; if all three agree, I5 partially addressed |
| I2 Sufficiency | Complement ablation (ablate all except circuit) | Causal | Alternative: activation patching into corrupted run |
| I3 Specificity | Control-axis DAS-IIA | Representational | Causal axis (target task) vs. control axis (unrelated); ratio = specificity score |
| I3 Specificity (alt) | Cross-task ablation effect | Causal | Ablate; measure target AND control task degradation; ratio = selectivity |
| I4 Consistency | Sigma-ablation (`c03sigmaablation.py`) | Measurement | 8 ablation variants; sigma = SD of metric |
| I4 Consistency (alt) | Bootstrap prompt subsampling | Measurement | 100 subsamples of 50% prompt set; CI on metric |
| I5 Confound control | Component-specific ablation | Causal | Zero only target head; compare to full-circuit ablation |
| I5 Confound control (alt) | Zero vs. resample vs. mean comparison | Causal | If all three give similar Δ, mean-field confound ruled out |
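
The I4 (alt) row, bootstrap prompt subsampling, could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `bootstrap_ci`, the percentile-interval choice, and the synthetic scores are all assumptions made for the example. Each subsample draws half of the prompts without replacement, following the "100 subsamples of 50% prompt set" note.

```python
import random
import statistics


def bootstrap_ci(per_prompt_metric, n_subsamples=100, fraction=0.5, alpha=0.05, seed=0):
    """Percentile interval of the mean metric over random prompt subsamples.

    Hypothetical helper: draws `fraction` of the prompts without replacement
    `n_subsamples` times and returns (lower, mean_of_means, upper).
    """
    if not per_prompt_metric:
        raise ValueError("need at least one per-prompt metric value")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    k = max(1, int(len(per_prompt_metric) * fraction))
    means = sorted(
        statistics.fmean(rng.sample(per_prompt_metric, k))
        for _ in range(n_subsamples)
    )
    # Percentile interval (an assumption; other interval types are possible).
    lo_idx = int((alpha / 2) * (n_subsamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_subsamples - 1))
    return means[lo_idx], statistics.fmean(means), means[hi_idx]


# Synthetic per-prompt scores, for illustration only (not real results).
scores = [0.8 + 0.01 * (i % 7) for i in range(40)]
low, mean, high = bootstrap_ci(scores)
print(f"metric = {mean:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

A narrow interval relative to the effect size suggests the metric is stable across prompt subsets; a wide one suggests the result depends on which prompts were sampled.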

| Criterion | Primary instrument(s) | Evidence family | Notes |
| --- | --- | --- | --- |
| E1 Intervention reach | Activation delta logging | Causal | Measure act[after] − act[before] at hook; confirm direction + magnitude |
| E2 Graded response | Steering multiplier sweep | Causal | 7+ values 0–20; plot metric vs. multiplier; identify threshold and plateau |
| E3 Selectivity | Cross-task metric ratio | Behavioral | Same intervention; on-task vs. off-task metric; ratio ≥ 2.0 |
| E4 Effect magnitude | Recovery fraction | Behavioral | (`metric_circuit` − `metric_zero`) / (`metric_full` − `metric_zero`) |
| E5 Robustness | Cross-prompt-family transfer | Behavioral | New prompt distribution; report IIA or faithfulness |
| E5 Robustness (alt) | Cross-scale weight transfer F1 | Structural | Weight classifier on Pythia-160M or GPT-2 Medium |
| E6 Cross-architecture | `dictionary_alignment()` | Structural | `mean_max_cos` between `W_dec` of circuit heads across model families |
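
The E4 recovery fraction and the E3 selectivity ratio are plain arithmetic, so a short sketch may help pin down the conventions. The function names, the use of absolute values in the ratio, and the example numbers are assumptions for illustration; only the recovery-fraction formula and the 2.0 threshold come from the table.

```python
def recovery_fraction(metric_circuit, metric_zero, metric_full):
    """(metric_circuit - metric_zero) / (metric_full - metric_zero)."""
    denom = metric_full - metric_zero
    if denom == 0:
        raise ValueError("metric_full equals metric_zero; recovery fraction is undefined")
    return (metric_circuit - metric_zero) / denom


def selectivity_ratio(on_task_effect, off_task_effect, threshold=2.0):
    """Return (ratio, passes) for one intervention measured on and off task.

    Assumption: effects are compared by magnitude, so signs are dropped.
    """
    on, off = abs(on_task_effect), abs(off_task_effect)
    if off == 0:
        # No off-task effect at all: infinitely selective unless on-task is also zero.
        ratio = float("inf") if on != 0 else 0.0
    else:
        ratio = on / off
    return ratio, ratio >= threshold


# Illustrative numbers only (not measured results).
print(recovery_fraction(metric_circuit=2.5, metric_zero=0.5, metric_full=3.0))
print(selectivity_ratio(on_task_effect=0.5, off_task_effect=0.2))
```

A recovery fraction near 1 means the circuit alone restores most of the full-model metric; values above 1 or below 0 are possible and usually worth inspecting rather than clipping.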

| Criterion | Primary instrument(s) | Evidence family | Notes |
| --- | --- | --- | --- |
| C1 Falsifiability | Pre-registered threshold statement | N/A | Not an instrument: a claim structure requirement; must precede data collection |
| C2 Structural plausibility | Weight classifier, SVD of `W_OV`/`W_QK` | Structural | Cos alignment to known role-direction (e.g., low-rank copying for name-movers) |
| C3 Task specificity | Cross-task weight classifier F1 | Structural | F1 on target vs. F1 on control; target should dominate |
| C3 Task specificity (alt) | Cross-task DAS-IIA ratio | Representational | IIA on target vs. IIA on control; ratio ≥ 2.0 |
| C4 Minimality | Per-head ablation pruning | Causal | Remove one head at a time; exclude heads whose removal has no effect |
| C5 Convergent validity | Jaccard(weight-circuit, EAP-circuit) | Measurement | ≥2 instruments, different evidence families; Jaccard ≥ 0.5 = strong convergence |
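
For C5, the Jaccard overlap between two circuits found by different instruments can be computed directly on sets of heads. The sketch below assumes heads are identified by `(layer, head)` tuples and uses made-up head sets; the helper name `jaccard` and the empty-set convention are illustrative choices, while the 0.5 threshold is the one stated in the table.

```python
def jaccard(circuit_a, circuit_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of heads."""
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    if not union:
        # Assumption: two empty circuits provide no convergent evidence.
        return 0.0
    return len(a & b) / len(union)


# Made-up (layer, head) sets, for illustration only.
weight_circuit = {(9, 6), (9, 9), (10, 0), (7, 3)}
eap_circuit = {(9, 6), (9, 9), (10, 0), (8, 10)}

score = jaccard(weight_circuit, eap_circuit)
print(f"Jaccard = {score:.2f}, strong convergence: {score >= 0.5}")
```

Here three heads are shared out of five distinct heads, so the overlap is 0.60 and clears the 0.5 threshold.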

| Criterion | Primary instrument(s) | Evidence family | Notes |
| --- | --- | --- | --- |
| M1 Reliability | Bootstrap CI, seed SD | Measurement | 100 subsamples; 3 seeds; report CI and SD |
| M2 Invariance | Cross-scale F1 transfer | Structural | Classifier trained on GPT-2 Small; evaluated on Pythia-160M |
| M3 Baseline separation | Random-vector IIA, untrained-model IIA | Measurement | Both baselines required; separation = IIA_circuit − IIA_random |
| M4 Sensitivity | AUROC, AUPRC for circuit membership | Measurement | Binary: circuit head vs. non-circuit head |
| M5 Calibration | Published SOTA comparison | Measurement | See `task_reference_baselines.py`; transcoder range 0.40–0.60 for GPT-2 Small SVA |
| M6 Construct coverage | Constrained vs. unconstrained IIA | Representational | Linear constraint vs. none; compare on OOD prompts |
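
As a rough illustration of M4 and M3, AUROC for binary circuit membership can be computed from pairwise comparisons (the probability that a circuit head outscores a non-circuit head, ties counted as one half), and baseline separation is a simple difference. The function names and the scores below are hypothetical; in practice a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; labels are 1 (circuit head) or 0 (non-circuit)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one circuit and one non-circuit head")
    # Count a win when the circuit head scores higher, half a win on ties.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def baseline_separation(iia_circuit, iia_random):
    """M3 separation: IIA_circuit - IIA_random."""
    return iia_circuit - iia_random


# Hypothetical per-head instrument scores and membership labels.
head_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
is_circuit = [1, 1, 1, 0, 0, 0]
print(f"AUROC = {auroc(head_scores, is_circuit):.3f}")
```

The pairwise form is quadratic in the number of heads, which is acceptable for a few hundred heads but not for large candidate sets.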

| Phase | Minimum instruments | Additional if time |
| --- | --- | --- |
| Initial sweep | Weight classifier (C2), DAS-IIA (M3, M5) | Activation patching (I1 partial) |
| First publication | I1 + I2 + M3 | I3, E1, C2 |
| Full publication | I1–I5 + M1–M3 + ≥1 external + ≥1 construct | All remaining criteria |
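
The phase minimums can be encoded as a small checklist that reports which criteria are still missing. This sketch rests on two assumptions not stated in the table: that "≥1 external" means any criterion with an `E` prefix and "≥1 construct" any criterion with a `C` prefix, and that the initial-sweep instruments map to criteria C2, M3, and M5. The names `PHASE_REQUIRED`, `PHASE_ANY_OF`, and `missing_for_phase` are hypothetical.

```python
# Criteria that must all be covered in each phase (read off the table).
PHASE_REQUIRED = {
    "initial_sweep": {"C2", "M3", "M5"},
    "first_publication": {"I1", "I2", "M3"},
    "full_publication": {"I1", "I2", "I3", "I4", "I5", "M1", "M2", "M3"},
}

# Criterion families where at least one member is needed (assumed prefix mapping).
PHASE_ANY_OF = {
    "full_publication": ["E", "C"],
}


def missing_for_phase(phase, covered):
    """Return the criteria still missing for `phase`, given covered criterion IDs."""
    covered = set(covered)
    missing = sorted(PHASE_REQUIRED[phase] - covered)
    for prefix in PHASE_ANY_OF.get(phase, []):
        if not any(c.startswith(prefix) for c in covered):
            missing.append(f"any {prefix}*")
    return missing


print(missing_for_phase("first_publication", {"I1", "M3", "C2"}))
print(missing_for_phase(
    "full_publication",
    {"I1", "I2", "I3", "I4", "I5", "M1", "M2", "M3", "E4"},
))
```

An empty list means the phase minimum is met; anything else names the gaps to close before claiming that phase.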