Validity type: Measurement
Pass condition: The instrument gives comparable results across model sizes and model families — the same construct is being measured.
Evidence family: Structural, Behavioral
Minimum reporting: Cross-scale transfer result (F1 or equivalent) with null expectation; template invariance test.
Common failure mode: Treating cross-scale agreement as assumed; never testing whether the instrument measures the same thing in different models.

Invariance asks: when the same instrument is applied to a different model size or family, is it measuring the same construct?

Cross-scale invariance: A weight classifier trained on GPT-2 Small achieves non-zero F1 when applied to Pythia-160M or GPT-2 Medium, above the null expectation for a random cross-model comparison. Operationalized via c13invariance.py. Cross-scale F1 substantially above the null indicates invariance; cross-scale F1 at the null indicates the classifier is measuring something specific to GPT-2 Small’s initialization.
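
A minimal sketch of how a cross-scale transfer test of this shape can be run. The function name, the logistic-regression probe, the binary labels, and the label-permutation null are illustrative assumptions, not the contents of c13invariance.py; the features from both models are assumed to already live in a shared feature space (e.g., fixed-length per-weight statistics).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def cross_scale_transfer(feats_src, labels_src, feats_tgt, labels_tgt,
                         n_null=1000, seed=0):
    """Train a probe on the source model, score F1 on the target model,
    and compare against a label-permutation null."""
    clf = LogisticRegression(max_iter=1000).fit(feats_src, labels_src)
    preds = clf.predict(feats_tgt)
    observed = f1_score(labels_tgt, preds)  # assumes binary labels

    # Null expectation: F1 of the same predictions against shuffled
    # target labels, i.e. the agreement expected with no shared signal.
    rng = np.random.default_rng(seed)
    null = np.array([f1_score(rng.permutation(labels_tgt), preds)
                     for _ in range(n_null)])

    # Empirical p: fraction of null draws that match or beat the observed F1.
    return observed, null.mean(), float((null >= observed).mean())
```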

Template invariance: The metric does not vary significantly across prompt templates (e.g., different sentence structures for the same task). Tested with a Welch ANOVA across template groups; p > 0.05 is a pass. Significant variation across template groups indicates the instrument is sensitive to template structure, which may be expected or may signal a problem, depending on the construct.
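
The Welch ANOVA can be computed without specialized packages. A self-contained sketch, assuming `groups` is a list of 1-D arrays of the metric, one array per prompt template; the function name and input format are assumptions, but the statistic follows the standard Welch formulas.

```python
import numpy as np
from scipy.stats import f as f_dist

def welch_anova(groups):
    """One-way ANOVA without the equal-variance assumption.
    Returns (F, df1, df2, p); p > 0.05 passes the template-invariance test."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])

    w = n / v                               # precision weights
    W = w.sum()
    grand_mean = (w * m).sum() / W

    num = (w * (m - grand_mean) ** 2).sum() / (k - 1)
    lam = (((1 - w / W) ** 2) / (n - 1)).sum()
    den = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    F = num / den
    df2 = (k ** 2 - 1) / (3 * lam)          # Welch-Satterthwaite denominator df
    return F, k - 1, df2, f_dist.sf(F, k - 1, df2)
```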

Invariance gates cross-architecture claims

Invariance testing at the measurement level is the prerequisite for cross-architecture generalization claims (E6): you cannot claim a mechanism generalizes across models without first establishing that the instrument measures the same thing in each model. Minimum reporting:

  • Cross-scale transfer result with null expectation.
  • Template invariance test (Welch ANOVA or equivalent).
  • If invariance was not tested: cross-architecture claims (E6) cannot be made — flag as an open criterion (see the gating sketch below).
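
A hypothetical gating helper that ties the checklist to the E6 decision. The function name, the thresholds (`margin`, `alpha`), and the status strings are illustrative assumptions, not part of the criterion's reference implementation.

```python
def gate_cross_architecture_claim(cross_scale_f1, null_f1, template_p,
                                  margin=0.10, alpha=0.05):
    """Map invariance results onto the E6 gate: 'pass', 'fail', or 'open'."""
    if cross_scale_f1 is None or template_p is None:
        return "open"   # invariance untested: E6 stays an open criterion
    if cross_scale_f1 >= null_f1 + margin and template_p > alpha:
        return "pass"   # same construct measured: E6 claims permitted
    return "fail"       # invariance violated: E6 claims blocked
```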