F05 — Internal Consistency
This framework asks: Do the individual items within our evaluation instrument measure a single coherent construct?
A circuit faithfulness evaluation typically aggregates scores across multiple prompts, multiple metrics, or multiple ablation targets. Internal consistency asks whether these sub-measurements hang together. If half the prompts suggest the circuit is faithful and the other half suggest it is not, the aggregate score is meaningless — the instrument lacks coherence.
This is the measurement-theoretic analogue of split-half reliability: divide the evaluation items into halves and check whether both halves yield the same conclusion. High internal consistency means the evaluation captures a unitary construct; low consistency means it conflates multiple unrelated signals.
Theoretical grounding
| Source | Year | Key contribution |
|---|---|---|
| Cronbach, “Coefficient alpha and the internal structure of tests” | 1951 | Defined Cronbach’s alpha as a reliability lower bound |
| McDonald, “Test Theory: A Unified Treatment” | 1999 | Omega coefficient as alternative to alpha |
| Revelle & Zinbarg, “Coefficients alpha, beta, omega, and the glb” | 2009 | Comparison of reliability estimators |
| Sijtsma, “On the use, misuse, and the very limited usefulness of Cronbach’s alpha” | 2009 | Limitations and correct interpretation |
Core concept
Given $N$ evaluation items (prompts), each yielding a per-item score $x_{ij}$ for circuit $j$, Cronbach's alpha is:
$$\alpha = \frac{N}{N-1}\left(1 - \frac{\sum_{i=1}^{N} \sigma^2_{x_i}}{\sigma^2_{\text{total}}}\right)$$
where $\sigma^2_{x_i}$ is the variance of item $i$ across circuits and $\sigma^2_{\text{total}}$ is the variance of the total score. For split-half reliability, we compute faithfulness on two random halves of the prompt set and correlate:
$$r_{\text{split}} = \text{Pearson}(\theta_{\text{half}_1}, \theta_{\text{half}_2})$$
then apply the Spearman-Brown correction: $\rho = 2r_{\text{split}} / (1 + r_{\text{split}})$.
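The two estimators above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `11_bootstrap.py` implementation: the function names `cronbach_alpha` and `split_half_reliability` and the simulated score matrix are assumptions for the example.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (items x circuits) score matrix."""
    n_items = scores.shape[0]
    item_vars = scores.var(axis=1, ddof=1)      # variance of each item across circuits
    total_var = scores.sum(axis=0).var(ddof=1)  # variance of the total score
    return n_items / (n_items - 1) * (1 - item_vars.sum() / total_var)

def split_half_reliability(scores: np.ndarray, rng: np.random.Generator) -> float:
    """Spearman-Brown corrected correlation between two random halves of the items."""
    n_items = scores.shape[0]
    perm = rng.permutation(n_items)
    half1, half2 = perm[: n_items // 2], perm[n_items // 2 :]
    theta1 = scores[half1].mean(axis=0)  # faithfulness estimate from half 1
    theta2 = scores[half2].mean(axis=0)  # faithfulness estimate from half 2
    r = np.corrcoef(theta1, theta2)[0, 1]
    return 2 * r / (1 + r)

# Simulated data: 10 prompts (items) x 20 circuits sharing one latent factor,
# so both coefficients should come out high.
rng = np.random.default_rng(0)
latent = rng.normal(size=20)
scores = latent + 0.3 * rng.normal(size=(10, 20))

alpha = cronbach_alpha(scores)
rho = split_half_reliability(scores, rng)
```

Because every item here is the same latent signal plus modest noise, both `alpha` and `rho` land well above 0.9; replacing half the items with pure noise would drag both coefficients down, which is exactly the failure mode F05 is designed to catch.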
Instruments under F05
Section titled “Instruments under F05”Split-Half Reliability (11_bootstrap.py)
The bootstrap script includes a split-half mode: it partitions the prompt set into two equal halves, computes faithfulness on each, and reports the Spearman-Brown corrected correlation as the reliability coefficient.
**What it establishes:** That the evaluation prompts measure a single coherent construct — faithfulness scores are not driven by a few outlier prompts.

**What it does not establish:** That the construct being measured is the right one — only that the instrument is internally coherent.
Usage:
```
uv run python 11_bootstrap.py --tasks ioi sva --mode split-half
```

Reading the scores
| Pattern | What it means |
|---|---|
| Alpha > 0.9 | Excellent internal consistency — items form a tight scale |
| Alpha 0.7–0.9 | Good — acceptable for reporting aggregate scores |
| Alpha 0.5–0.7 | Questionable — some items may not belong to the same construct |
| Alpha < 0.5 | Poor — the aggregate score is not meaningful; inspect item-level results |
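For scripted checks, the bands in the table above can be collapsed into a small helper. `interpret_alpha` is an illustrative name, not part of `11_bootstrap.py`, and the boundary handling (which side of 0.9, 0.7, and 0.5 each band owns) is an assumption where the table leaves it ambiguous:

```python
def interpret_alpha(alpha: float) -> str:
    """Map a reliability coefficient to the qualitative bands in the table above."""
    if alpha > 0.9:
        return "excellent"
    if alpha >= 0.7:
        return "good"
    if alpha >= 0.5:
        return "questionable"
    return "poor"
```

Anything in the "questionable" or "poor" bands is a signal to inspect item-level results before trusting the aggregate score.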