Three baselines are required for any IIA or faithfulness score to be interpretable.


What: Run DAS-IIA with the same architecture but replace circuit factor activations with random unit vectors from the same space.

Why non-negotiable: In high-dimensional spaces, random vectors can produce surprisingly high IIA because the alignment map has enough degrees of freedom to fit noise. For example, IIA(circuit) = 0.48 vs. IIA(random) = 0.44 is not a finding: a separation of 0.04 is small enough to fall within the spread of the random baseline.

How to compute:

  1. Draw 100 random unit vectors in ℝ^d_model, sampled uniformly on the unit sphere.
  2. Run DAS-IIA with each as the “circuit subspace.”
  3. Report the mean, SD, and 95th percentile of the resulting IIA scores.
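The steps above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the project's implementation: `score_fn` is a hypothetical callable standing in for whatever runs DAS-IIA on a single direction, and `placeholder_score` exists only so the example executes. Uniform sampling on the sphere is done by normalizing an isotropic Gaussian draw, and 768 is the hidden size of GPT-2 Small.

```python
import math
import random
import statistics
from typing import Callable, Dict, List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Sample a point uniformly on the unit sphere in R^dim.

    Normalizing an isotropic Gaussian draw gives a uniformly distributed direction.
    """
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def random_vector_baseline(
    score_fn: Callable[[List[float]], float],  # hypothetical: runs DAS-IIA on one direction
    dim: int,
    n: int = 100,
    seed: int = 0,
) -> Dict[str, float]:
    """Score n random directions and summarize mean, SD, and 95th percentile."""
    rng = random.Random(seed)
    scores = [score_fn(random_unit_vector(dim, rng)) for _ in range(n)]
    return {
        "mean": statistics.fmean(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1 denominator)
        "p95": statistics.quantiles(scores, n=20)[-1],  # last of 19 cut points
        "n": n,
    }


# Stand-in scorer so the sketch runs end to end; replace with a real DAS-IIA call.
def placeholder_score(direction: List[float]) -> float:
    return abs(direction[0])


stats = random_vector_baseline(placeholder_score, dim=768, n=100)
print(stats)
```

Fixing the seed keeps the 100 draws reproducible, so the reported mean, SD, and 95th percentile can be regenerated alongside the circuit score.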

Reporting format:

IIA(circuit) = 0.48
IIA(random, mean) = [X] (SD = [Y], 95th pct = [Z], n = 100)
Separation = 0.48 − [X] = [Δ]

What: Run DAS-IIA on a model with the same architecture but randomly initialized weights (no training).

Why it matters: Separates learned signal from architectural priors. If the untrained model produces IIA = 0.30, then only 0.18 of your trained model’s IIA = 0.48 is attributable to learning.

How to compute:

  1. Initialize a model with the same architecture and hyperparameters, leaving the weights at their random initial values.
  2. Run DAS-IIA on 3 random initializations with the same prompt distribution.
  3. Report: mean and SD.

Reporting format:

IIA(circuit, trained) = 0.48
IIA(untrained, mean) = [X] (SD = [Y], n = 3 random inits)
Learning contribution = 0.48 − [X] = [Δ]
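As a rough sketch of the untrained-model procedure and the subtraction in the reporting format, assuming a hypothetical `score_for_seed` callable that builds a randomly initialized model from a seed and returns its DAS-IIA score. The three scores below are illustrative placeholders chosen to average 0.30, matching the example figure above; they are not measurements.

```python
import statistics
from typing import Callable, Dict, Sequence


def untrained_baseline(
    score_for_seed: Callable[[int], float],  # hypothetical: init with seed, run DAS-IIA
    seeds: Sequence[int] = (0, 1, 2),
) -> Dict[str, float]:
    """Run the scorer once per random initialization and summarize the scores."""
    scores = [score_for_seed(seed) for seed in seeds]
    return {
        "mean": statistics.fmean(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1 denominator)
        "n": len(scores),
    }


def learning_contribution(trained_iia: float, untrained_mean: float) -> float:
    """Part of the trained score not explained by the untrained architecture."""
    return trained_iia - untrained_mean


# Illustrative placeholder scores only (not measured values).
placeholder_scores = {0: 0.28, 1: 0.30, 2: 0.32}
baseline = untrained_baseline(lambda seed: placeholder_scores[seed])
delta = learning_contribution(0.48, baseline["mean"])

print(f"IIA(untrained, mean) = {baseline['mean']:.2f} "
      f"(SD = {baseline['sd']:.2f}, n = {baseline['n']} random inits)")
print(f"Learning contribution = 0.48 - {baseline['mean']:.2f} = {delta:.2f}")
```

With only three initializations the SD is a coarse estimate, so it is best read as a sanity check on spread rather than a precise interval.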

What: Compare to the best published baseline for the same task and model.

Reference table (from task_reference_baselines.py):

| Task | Metric | Full Model | Best Circuit | Recovery | Source |
|---|---|---|---|---|---|
| IOI | logit diff | 3.56 | 3.10 | 87% | Wang et al. 2022 |
| Greater-Than | prob diff | 81.7% | 72.7% | 89.5% | Hanna et al. 2023 |
| SVA (base) | logit diff | 0.70 | 0.65 | 93% | Lazo et al. 2025 |
| Gendered pronoun | logit diff | | ≥ full model | 100% | Mathwin 2023 |

IIA-specific reference values:

| Instrument | Task | GPT-2 Small range | Source |
|---|---|---|---|
| Transcoder IIA | SVA | 0.40–0.60 | Published (multiple) |
| DAS IIA | IOI | 0.86–0.95 | MIB benchmark (Mueller et al.) |
| Raw neuron IIA | IOI | 0.60–0.75 | MIB (SAE features < raw neurons) |

Every published IIA score must include:

## Baseline Report: [metric] at [component] on [task]
Observed score: [X]
Random-vector (mean): [Y] (SD = [z], n = 100)
Untrained-model (mean): [W] (SD = [v], n = 3 inits)
Published SOTA: [range or value] ([source])
Separation from random: [X − Y] = [Δ_r]
Separation from untrained: [X − W] = [Δ_u]
Relative to SOTA: [X] is [above/within/below] the [source] range of [range]
Interpretation: [X] is [signal/noise/competitive/below SOTA]
because Δ_r = [Δ_r] and Δ_u = [Δ_u].
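One way to fill in the template mechanically is sketched below. The function names and argument list are assumptions made for illustration. The Interpretation line is deliberately left to the author, since the template gives no numeric threshold for calling a separation signal or noise. In the usage call, 0.44 and 0.30 are the illustrative figures used earlier in this document and both SDs are placeholders, not measured baselines.

```python
def sota_position(score: float, low: float, high: float) -> str:
    """Classify a score against a published range as above, within, or below."""
    if score < low:
        return "below"
    if score > high:
        return "above"
    return "within"


def format_baseline_report(metric, component, task, observed,
                           random_mean, random_sd,
                           untrained_mean, untrained_sd,
                           sota_low, sota_high, source,
                           n_random=100, n_inits=3) -> str:
    """Render the baseline report fields; interpretation is added by hand."""
    delta_r = observed - random_mean
    delta_u = observed - untrained_mean
    sota_range = f"{sota_low:.2f}-{sota_high:.2f}"
    position = sota_position(observed, sota_low, sota_high)
    lines = [
        f"## Baseline Report: {metric} at {component} on {task}",
        f"Observed score: {observed:.2f}",
        f"Random-vector (mean): {random_mean:.2f} "
        f"(SD = {random_sd:.2f}, n = {n_random})",
        f"Untrained-model (mean): {untrained_mean:.2f} "
        f"(SD = {untrained_sd:.2f}, n = {n_inits} inits)",
        f"Published SOTA: {sota_range} ({source})",
        f"Separation from random: {observed:.2f} - {random_mean:.2f} = {delta_r:.2f}",
        f"Separation from untrained: {observed:.2f} - {untrained_mean:.2f} = {delta_u:.2f}",
        f"Relative to SOTA: {observed:.2f} is {position} the {source} range of {sota_range}",
    ]
    return "\n".join(lines)


# Placeholder inputs for illustration only; real baselines must be computed first.
report = format_baseline_report(
    "IIA", "L8.MLP", "SVA", observed=0.48,
    random_mean=0.44, random_sd=0.02,
    untrained_mean=0.30, untrained_sd=0.02,
    sota_low=0.40, sota_high=0.60, source="Transcoder IIA (published)",
)
print(report)
```

Generating the report from the raw numbers keeps the two separation values consistent with the scores they are derived from.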

| Component | IIA | Random-vector | Untrained | SOTA range | Status |
|---|---|---|---|---|---|
| L8.MLP (SVA) | 0.48 | NOT YET COMPUTED | NOT YET COMPUTED | 0.40–0.60 | M3 partial; run both baselines |