Validity type: Measurement
Pass condition: Instrument detects real circuits at acceptable hit rates (AUROC ≥ 0.85) without excessive false positives (AUPRC above random baseline)
Evidence family: Measurement
Minimum reporting: AUROC for circuit head membership classification; AUPRC with random baseline comparison
Common failure mode: Reporting only positive results; never characterizing the false positive rate

Sensitivity characterizes the measurement instrument in detection-theory terms: how reliably it flags heads that belong to the circuit, and how often it flags heads that do not.

Satisfied when:

  1. AUROC ≥ 0.85 for circuit head membership classification. AUROC = 0.5 means chance performance; 1.0 is perfect.
  2. AUPRC is above the random baseline. For a circuit of k heads out of n total, random baseline AUPRC = k/n. For GPT-2 Small IOI (26 heads out of 144 total): random baseline = 0.18. The instrument’s AUPRC should substantially exceed this.
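The two conditions above can be checked with a short script. The following is a minimal sketch, not the project's implementation: it assumes binary labels (1 = circuit head) and one real-valued score per head, uses average precision as a step-wise estimate of AUPRC, and the function names (`auroc`, `average_precision`, `sensitivity_report`) and the toy data are hypothetical.

```python
def auroc(labels, scores):
    """Chance that a random circuit head outscores a random non-circuit head."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Pairwise comparison; ties count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean precision at the rank of each positive (tied scores keep input order)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


def sensitivity_report(labels, scores, auroc_min=0.85):
    """Evaluate both pass conditions; the random baseline is k/n."""
    baseline = sum(labels) / len(labels)
    roc, ap = auroc(labels, scores), average_precision(labels, scores)
    return {
        "auroc": roc,
        "auprc": ap,
        "random_baseline": baseline,
        "auroc_ok": roc >= auroc_min,
        "auprc_above_baseline": ap > baseline,
    }


# Hypothetical toy data: 8 heads, 2 of them circuit members (label 1).
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.05, 0.01]
report = sensitivity_report(labels, scores)
print(report)
```

The report only states whether AUPRC is above k/n; how large a margin counts as "substantially" exceeding the baseline is left to the reader, since no numeric margin is specified here.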

The project’s circuit scan results report Test 16 AP = 1.0: the weight classifier achieves perfect average precision on the held-out test set, meaning it separates circuit members from non-members without error at this circuit size and task combination. This is the strongest possible sensitivity result and a primary finding that should be highlighted in any write-up.

A more complete characterization uses d-prime from signal detection theory:

d' = Z(hit_rate) - Z(false_alarm_rate)

where Z is the inverse normal CDF. d’ ≥ 2.0 indicates good discrimination. Report d’ alongside AUROC when circuit membership is known from published circuits.
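As a sketch of the d’ formula, the snippet below uses `statistics.NormalDist().inv_cdf` from the Python standard library as Z. The function names and the confusion counts are hypothetical. The count-based variant adds 0.5 to each count (a common log-linear correction), which is an assumption not stated in this document; without some such correction, a hit rate of exactly 0 or 1 has no finite Z value.

```python
from statistics import NormalDist

Z = NormalDist().inv_cdf  # inverse CDF of the standard normal


def d_prime_from_rates(hit_rate, false_alarm_rate):
    """d' = Z(hit_rate) - Z(false_alarm_rate); both rates must lie strictly in (0, 1)."""
    return Z(hit_rate) - Z(false_alarm_rate)


def d_prime_from_counts(hits, misses, false_alarms, correct_rejections):
    """d' from confusion counts at a fixed decision threshold.

    Assumption: log-linear correction (add 0.5 to each count) so that
    perfect hit or false-alarm rates do not produce infinite Z values.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return d_prime_from_rates(hit_rate, fa_rate)


# Hypothetical counts for a 26-head circuit among 144 heads.
d = d_prime_from_counts(hits=24, misses=2, false_alarms=6, correct_rejections=112)
print(f"d' = {d:.2f}, good discrimination: {d >= 2.0}")
```

Unlike AUROC and AUPRC, d’ is computed at a single decision threshold, so the threshold used should be reported with it.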

AUPRC (not AUROC) is preferred for small circuits. When the positive class is rare, the false positive rate is computed over a large pool of negatives, so AUROC barely moves even when false positives outnumber true positives. For a circuit of 5 heads out of 144 (random baseline AUPRC = 5/144 ≈ 0.035), AUPRC = 0.40 represents excellent sensitivity. Report both metrics.
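A small worked example may make the difference concrete. The ranking below is hypothetical: a 5-head circuit among 144 heads, where 10 non-circuit heads outscore every circuit head. The helper functions are illustrative sketches, with average precision standing in for AUPRC.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean precision at the rank of each positive, ranking by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


n_heads, k = 144, 5
# Hypothetical ranking: 10 false positives first, then the 5 circuit heads.
labels = [0] * 10 + [1] * k + [0] * (n_heads - 10 - k)
scores = [float(n_heads - i) for i in range(n_heads)]  # strictly decreasing

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
baseline = k / n_heads
print(f"AUROC = {roc:.3f}, AUPRC = {ap:.3f}, random baseline = {baseline:.3f}")
```

In this constructed case AUROC is about 0.93, which clears the 0.85 threshold, while AUPRC is about 0.22: the top ten heads by score are all false positives, and only the precision-based metric reflects that.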

Report:

  • AUROC and AUPRC with random baseline.
  • d-prime if circuit membership is known.
  • The Test 16 AP result if computed.
  • If sensitivity was not characterized: flag as partially satisfied.