Validity type: Measurement
Pass condition: Raw scores are interpretable relative to known reference points; a number without context is not a result
Evidence family: Measurement
Minimum reporting: Comparison to ≥1 published baseline on the same task and model; the score in the context of the published range
Common failure mode: Reporting scores without calibration context; treating an absolute number as self-interpreting

Calibration transforms a raw score into an interpretable claim. Every reported score must be placed in the context of at least one of:

  1. A published baseline on the same task and model.
  2. The SOTA range for the metric.
  3. A within-project comparison.

| Task | Metric | Published baseline | Source |
| --- | --- | --- | --- |
| IOI | Logit diff faithfulness | 87% recovery | Wang et al. 2022 |
| IOI | Circuit CMD (lower=better) | UGS: 0.035; EAP(CF): 0.214; random: ~0.75 | MIB benchmark |
| Greater-Than | Prob diff recovery | 89.5% | Hanna et al. 2023 |
| SVA | Logit diff faithfulness | 93% | Lazo et al. 2025 |
| SVA | DAS-IIA (transcoder/CLT) | 0.40–0.60 | Mueller et al. MIB; transcoder papers |
| Gendered pronoun | Logit diff faithfulness | ≥ full model | Mathwin 2023 |
| BLiMP SVA | Behavioral accuracy | 95–97% | Warstadt et al. 2020 |
| BLiMP anaphor_gender | Behavioral accuracy | 99% | Warstadt et al. 2020 |
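
One way to keep these reference points next to the analysis code is a small lookup keyed by task and metric. The sketch below is illustrative only: `BASELINES` and `lookup_baseline` are hypothetical names, and storing percentages as fractions is an assumption of this example. The two rows without a single numeric range are left out: Circuit CMD lists per-method values, and the gendered-pronoun baseline is "≥ full model".

```python
from typing import Dict, Optional, Tuple

# (task, metric) -> (low, high, source).
# Assumption: percentages are stored as fractions, and a single published
# value is stored as a degenerate range with low == high.
BASELINES: Dict[Tuple[str, str], Tuple[float, float, str]] = {
    ("IOI", "Logit diff faithfulness"): (0.87, 0.87, "Wang et al. 2022"),
    ("Greater-Than", "Prob diff recovery"): (0.895, 0.895, "Hanna et al. 2023"),
    ("SVA", "Logit diff faithfulness"): (0.93, 0.93, "Lazo et al. 2025"),
    ("SVA", "DAS-IIA (transcoder/CLT)"): (
        0.40,
        0.60,
        "Mueller et al. MIB; transcoder papers",
    ),
    ("BLiMP SVA", "Behavioral accuracy"): (0.95, 0.97, "Warstadt et al. 2020"),
    ("BLiMP anaphor_gender", "Behavioral accuracy"): (
        0.99,
        0.99,
        "Warstadt et al. 2020",
    ),
}


def lookup_baseline(task: str, metric: str) -> Optional[Tuple[float, float, str]]:
    """Return (low, high, source) for a task/metric pair, or None if absent."""
    return BASELINES.get((task, metric))
```

A `None` return marks a task/metric pair with no recorded baseline, which the report should then state explicitly.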

Every result in the project should include a calibration sentence: “This score of X is [above/within/below] the published range of Y–Z for [task] in [model] ([source]).”

If no published baseline exists for the task/model combination, state this explicitly and propose the relevant comparison.
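
The calibration sentence and the no-baseline fallback can be generated mechanically. The following is a minimal sketch, not a project API: the function name `calibration_sentence`, its parameters, the two-decimal formatting, and the wording of the fallback sentence are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def calibration_sentence(
    score: float,
    task: str,
    model: str,
    baseline: Optional[Tuple[float, float]] = None,
    source: Optional[str] = None,
    proposed_comparison: str = "a within-project comparison",
) -> str:
    """Place a raw score relative to a published range (low, high).

    Pass baseline=None when no published baseline exists for the
    task/model combination; the result then states this explicitly and
    names the proposed comparison (hypothetical fallback wording).
    """
    if baseline is None or source is None:
        return (
            f"No published baseline exists for {task} in {model}; "
            f"the score of {score:.2f} should be compared against "
            f"{proposed_comparison}."
        )

    low, high = baseline
    if low > high:
        raise ValueError("baseline range must satisfy low <= high")

    # Classify the score against the inclusive published range.
    if score < low:
        position = "below"
    elif score > high:
        position = "above"
    else:
        position = "within"

    return (
        f"This score of {score:.2f} is {position} the published range of "
        f"{low:.2f}–{high:.2f} for {task} in {model} ({source})."
    )


if __name__ == "__main__":
    print(calibration_sentence(0.48, "SVA", "GPT-2 Small", (0.40, 0.60), "Mueller et al. MIB"))
    print(calibration_sentence(0.71, "a new task", "GPT-2 Small"))
```

The range check is inclusive at both ends, so a score equal to the lower or upper bound is reported as "within".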

“The DAS-IIA score of 0.48 at L8.MLP for SVA in GPT-2 Small is within the published transcoder baseline range of 0.40–0.60 (Mueller et al. MIB; Lazo et al. 2025), making it competitive with SOTA for this task. Subject to baseline separation confirmation (M3), this constitutes a calibrated, competitive result.”

This is what a calibrated result statement looks like.

Every IIA, faithfulness, or classification score must include a calibration sentence referencing a specific published baseline. If no baseline exists, state this explicitly.