May 20, 2026 | AI Research

Epistemic Humility in Neural Network Interpretability

Current interpretability methods reveal correlations between model internals and human-legible concepts, but conflating correlation with causal explanation leads to overconfident claims about how these systems actually reason.

Machine Learning, Philosophy of Science, AI

Abstract

The field of neural network interpretability has produced a rich literature of methods for mapping internal representations to human-legible concepts. Techniques such as probing classifiers, activation patching, and sparse autoencoders have generated compelling visualizations and mechanistic narratives. This paper argues that the epistemic standards applied to these findings are frequently insufficient, and that the gap between observation and causal explanation is routinely understated.

The Correlation/Causation Problem

When a linear probe trained on a transformer's residual stream achieves 94% accuracy at detecting the syntactic role of a token, this demonstrates that the information is linearly decodable, not that the model uses a linear representation of syntactic role for its computations. The distinction matters enormously. A model could store syntactic information in a redundant, distributed, or transformed fashion that happens to be linearly separable, while computing with an entirely different internal structure.
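
The gap between decodability and use can be made concrete with a small synthetic sketch. Everything in it is an assumption made for illustration: the "activations" are random vectors rather than a real residual stream, the binary `role` label stands in for syntactic role, and `toy_model_output` is a hypothetical downstream computation that reads a different dimension entirely. A least-squares probe recovers the label almost perfectly, yet projecting out the probe direction leaves the toy output essentially unchanged.

```python
import numpy as np

def make_activations(n=2000, d=16, seed=0):
    """Synthetic activations: 'role' lives in dim 0, the used feature in dim 1."""
    rng = np.random.default_rng(seed)
    role = rng.integers(0, 2, size=n)       # hypothetical binary label
    other = rng.normal(size=n)              # feature the toy model actually reads
    h = 0.1 * rng.normal(size=(n, d))       # small background noise
    h[:, 0] += 2.0 * (2 * role - 1)         # role written to dimension 0
    h[:, 1] += other                        # used feature written to dimension 1
    return h, role

def toy_model_output(h):
    """Hypothetical downstream computation: reads only dimension 1."""
    return np.tanh(h[:, 1])

def fit_linear_probe(h, labels):
    """Least-squares linear probe with a bias term; targets are -1/+1."""
    X = np.hstack([h, np.ones((len(h), 1))])
    w, *_ = np.linalg.lstsq(X, 2.0 * labels - 1.0, rcond=None)
    return w

def probe_accuracy(w, h, labels):
    X = np.hstack([h, np.ones((len(h), 1))])
    return float(((X @ w > 0).astype(int) == labels).mean())

def ablate_direction(h, direction):
    """Project the given direction out of every activation vector."""
    u = direction / np.linalg.norm(direction)
    return h - np.outer(h @ u, u)

h, role = make_activations()
train, test = slice(0, 1500), slice(1500, None)

w = fit_linear_probe(h[train], role[train])
acc = probe_accuracy(w, h[test], role[test])

# Remove the probe direction and check both decodability and model output.
h_ablated = ablate_direction(h, w[:-1])
acc_after = probe_accuracy(w, h_ablated[test], role[test])
max_output_change = float(
    np.max(np.abs(toy_model_output(h) - toy_model_output(h_ablated)))
)

print("probe accuracy:", acc)
print("probe accuracy after ablation:", acc_after)
print("max change in toy output:", max_output_change)
```

In this toy setup the probe result and the ablation result come apart: the information is present and linearly readable, but the computation does not depend on it. A real model is far less tidy, which is the reason the probe accuracy alone cannot settle the question.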

Activation patching experiments, sometimes called "causal tracing," are a step forward, since they test whether a representation is causally necessary for a behavior. But they establish necessity under a particular intervention, which is not the same as identifying the mechanism by which that representation contributes to the behavior. A necessary node in a circuit may be playing a role quite different from the human-interpretable label we assign to it.
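
A minimal sketch of the patching procedure, under loudly stated assumptions: the "model" is a hypothetical two-layer ReLU network with random weights, and `x_clean` and `x_corrupt` are random vectors standing in for a clean and a corrupted prompt. Each hidden unit's activation from the clean run is copied into the corrupted run, and the resulting shift in the output is recorded.

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Run the toy network; optionally overwrite hidden units with cached values.

    patch is a pair (indices, values) applied to the hidden layer.
    """
    hidden = np.maximum(0.0, W1 @ x)        # ReLU hidden layer
    if patch is not None:
        idx, values = patch
        hidden = hidden.copy()
        hidden[idx] = values
    return float(W2 @ hidden), hidden

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))                # hypothetical weights
W2 = rng.normal(size=8)

x_clean = rng.normal(size=4)                # stand-in for a clean input
x_corrupt = rng.normal(size=4)              # stand-in for a corrupted input

y_clean, h_clean = forward(x_clean, W1, W2)
y_corrupt, _ = forward(x_corrupt, W1, W2)

# Patch one hidden unit at a time from the clean run into the corrupted run.
effects = np.array([
    forward(x_corrupt, W1, W2, patch=([i], h_clean[[i]]))[0] - y_corrupt
    for i in range(len(W2))
])

# Patching every hidden unit at once restores the clean output exactly.
y_full, _ = forward(x_corrupt, W1, W2, patch=(np.arange(len(W2)), h_clean))

print("per-unit patching effects:", effects)
print("clean minus corrupted output:", y_clean - y_corrupt)
```

The per-unit effects say which units matter for this particular clean/corrupted pair. They say nothing about what a unit represents, and they sum to the full output difference here only because the toy readout is linear; in a deep network, single-site effects need not add up.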

Toward Stricter Standards

I propose three criteria for interpretability claims:

  1. Causal, not correlational: Claims about what a component "does" should be supported by interventional evidence, not merely associational evidence.

  2. Behavior-specific: Interpretability findings often hold only for the behavior on which they were established, so generalizations across prompts, tasks, or model sizes should be made cautiously.

  3. Uncertainty-explicit: Published work should report the conditions under which an interpretation fails or degrades, not only the conditions under which it holds.

Conclusion

None of this is an argument against interpretability research. The findings in this literature are genuinely interesting and practically useful. The argument is for clearer epistemic accounting: knowing what we know, what we are guessing, and what we do not yet know how to find out. The history of science is full of disciplines that accelerated by becoming more careful about what they were claiming.
