This framework asks: Do the individual items within our evaluation instrument measure a single coherent construct?

A circuit faithfulness evaluation typically aggregates scores across multiple prompts, multiple metrics, or multiple ablation targets. Internal consistency asks whether these sub-measurements hang together. If half the prompts suggest the circuit is faithful and the other half suggest it is not, the aggregate score is meaningless — the instrument lacks coherence.

This is the measurement-theoretic analogue of split-half reliability: divide the evaluation items into halves and check whether both halves yield the same conclusion. High internal consistency means the evaluation captures a unitary construct; low consistency means it conflates multiple unrelated signals.

| Source | Year | Key contribution |
| --- | --- | --- |
| Cronbach, “Coefficient alpha and the internal structure of tests” | 1951 | Defined Cronbach’s alpha as a reliability lower bound |
| McDonald, *Test Theory: A Unified Treatment* | 1999 | Omega coefficient as an alternative to alpha |
| Revelle & Zinbarg, “Coefficients alpha, beta, omega, and the glb” | 2009 | Comparison of reliability estimators |
| Sijtsma, “On the use, misuse, and the very limited usefulness of Cronbach’s alpha” | 2009 | Limitations and correct interpretation |

Given $N$ evaluation items (prompts), each yielding a per-item score $x_{ij}$ for circuit $j$, Cronbach’s alpha is:

$$\alpha = \frac{N}{N-1}\left(1 - \frac{\sum_{i=1}^{N} \sigma^2_{x_i}}{\sigma^2_{\text{total}}}\right)$$

where $\sigma^2_{x_i}$ is the variance of item $i$ across circuits and $\sigma^2_{\text{total}}$ is the variance of the total score. For split-half reliability, we compute faithfulness on two random halves of the prompt set and correlate:

$$r_{\text{split}} = \text{Pearson}(\theta_{\text{half}_1}, \theta_{\text{half}_2})$$

then apply the Spearman-Brown correction: $\rho = 2r_{\text{split}} / (1 + r_{\text{split}})$.
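The alpha formula above can be sketched in NumPy. This is an illustrative helper, not part of the evaluation code; the `cronbach_alpha` name and the items-by-circuits matrix layout are assumptions:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (N items x M circuits) score matrix.

    Rows are evaluation items (prompts); columns are circuits. Variances use
    ddof=1 to follow the sample-variance convention.
    """
    n_items = scores.shape[0]
    item_vars = scores.var(axis=1, ddof=1)      # sigma^2_{x_i}: per-item variance across circuits
    total_var = scores.sum(axis=0).var(ddof=1)  # sigma^2_total: variance of the summed score
    return n_items / (n_items - 1) * (1 - item_vars.sum() / total_var)
```

When all items move together (perfectly correlated rows), alpha reaches 1; weakly or negatively correlated items pull it toward 0 or below.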

The bootstrap script includes a split-half mode: it partitions the prompt set into two equal halves, computes faithfulness on each, and reports the Spearman-Brown corrected correlation as the reliability coefficient.

**What it establishes:** that the evaluation prompts measure a single coherent construct — faithfulness scores are not driven by a few outlier prompts.

**What it does not establish:** that the construct being measured is the right one — only that the instrument is internally coherent.

Usage:

```sh
uv run python 11_bootstrap.py --tasks ioi sva --mode split-half
```
| Pattern | What it means |
| --- | --- |
| Alpha > 0.9 | Excellent internal consistency — items form a tight scale |
| Alpha 0.7–0.9 | Good — acceptable for reporting aggregate scores |
| Alpha 0.5–0.7 | Questionable — some items may not belong to the same construct |
| Alpha < 0.5 | Poor — the aggregate score is not meaningful; inspect item-level results |
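The bands above translate directly into a small thresholding helper; the function name and the exact boundary handling (e.g. whether 0.9 itself counts as excellent) are choices made here for illustration:

```python
def interpret_alpha(alpha: float) -> str:
    """Map Cronbach's alpha to the qualitative bands above."""
    if alpha > 0.9:
        return "excellent"
    if alpha >= 0.7:
        return "good"
    if alpha >= 0.5:
        return "questionable"
    return "poor"  # aggregate score not meaningful; inspect item-level results
```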