Criterion M2 — Invariance
Section titled “Criterion M2 — Invariance”| Validity type | Measurement |
| Pass condition | The instrument gives comparable results across model sizes and model families — the same construct is being measured |
| Evidence family | Structural, Behavioral |
| Minimum reporting | Cross-scale transfer result (F1 or equivalent) with null expectation; template invariance test |
| Common failure mode | Treating cross-scale agreement as assumed; never testing whether the instrument measures the same thing in different models |
What this criterion requires
Section titled “What this criterion requires”Invariance asks: when the same instrument is applied to a different model size or family, is it measuring the same construct?
Cross-scale invariance: Weight classifier trained on GPT-2 Small achieves non-zero F1 when applied to Pythia-160M or GPT-2 Medium, above the null expectation for random cross-model comparison. Operationalized via c13invariance.py. Cross-scale F1 substantially above null indicates invariance; cross-scale F1 equal to null indicates the classifier is measuring something specific to GPT-2 Small’s initialization.
Template invariance: The metric does not vary significantly across prompt templates (e.g., different sentence structures for the same task). Welch ANOVA across template groups; p > 0.05 is a pass. Significant variation across template groups may indicate the instrument measures something that varies with template structure — which may be expected or problematic.
Invariance gates cross-architecture claims
Section titled “Invariance gates cross-architecture claims”Invariance testing at the measurement level is the prerequisite for making cross-architecture generalization claims (E6). You cannot claim the mechanism generalizes across models if you have not established that the instrument is measuring the same thing across models.
Minimum reporting rule
Section titled “Minimum reporting rule”- Cross-scale transfer result with null expectation.
- Template invariance test (Welch ANOVA or equivalent).
- If invariance was not tested: cross-architecture claims (E6) cannot be made — flag as open criterion.