Sparse Attention Post-Training for Mechanistic Interpretability

Explainable & Ethical AI
arXiv: 2512.05865v1
Authors

Florent Draye, Anson Lei, Ingmar Posner, Bernhard Schölkopf

Abstract

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.

Paper Summary

Problem
Large language models (LLMs) have achieved remarkable capabilities, but their increasing complexity makes their internal mechanisms largely opaque. This lack of transparency hinders our understanding of how they work and makes it difficult to improve their reliability and alignment with human values. Even with sophisticated reverse-engineering techniques, the underlying computations implemented by large models can remain highly complex and uninterpretable.
Key Innovation
This research introduces a simple post-training method that makes transformer attention sparse without sacrificing performance. A flexible sparsity regularisation is applied under a constrained-loss objective, allowing pre-trained models to reorganise their connectivity into a much more selective and structured pattern. The original pretraining loss is preserved while attention connectivity is reduced to approximately 0.3% of its edges.
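To make the constrained-loss idea concrete, the sketch below shows one plausible reading of it: minimise a sparsity penalty on the attention distributions subject to the language-modelling loss staying within a small budget of its pre-training value, enforced with a simple dual-ascent (Lagrangian) update. This is a minimal PyTorch sketch rather than the authors' implementation; the entropy-style penalty, the loss_budget value, the dual step size, and the Hugging-Face-style output_attentions interface are illustrative assumptions.

```python
# Minimal sketch of sparsity-regularised post-training under a constrained-loss
# objective (illustrative, not the paper's code). Assumes an HF-style model that
# returns .loss and per-layer .attentions when called with output_attentions=True.
import torch

def attention_sparsity_penalty(attn_probs, eps=1e-8):
    # Mean entropy of the attention rows: low entropy means near-one-hot rows,
    # i.e. few active edges. (One possible surrogate; the paper's regulariser may differ.)
    return -(attn_probs * (attn_probs + eps).log()).sum(dim=-1).mean()

def post_training_step(model, batch, optimizer, lam, loss_budget, lam_lr=0.01):
    """One constrained post-training step: minimise the sparsity penalty subject to
    lm_loss <= loss_budget, handled via a Lagrangian term with dual ascent on `lam`."""
    out = model(**batch, output_attentions=True)
    lm_loss = out.loss                                   # assumes labels are in `batch`
    penalty = torch.stack(
        [attention_sparsity_penalty(a) for a in out.attentions]
    ).mean()

    # Primal step: sparsity objective plus multiplier-weighted constraint violation.
    total = penalty + lam * (lm_loss - loss_budget)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()

    # Dual step: raise `lam` when the loss exceeds its budget, relax it otherwise.
    lam = max(0.0, lam + lam_lr * (lm_loss.item() - loss_budget))
    return lm_loss.item(), penalty.item(), lam
```

In use, loss_budget would sit just above the model's measured pre-training loss, so the multiplier only grows when sparsification actually starts to hurt performance.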
Practical Impact
By making transformer attention sparse, most of a model's behaviour can be recovered from circuits that are orders of magnitude smaller than in the dense case, which positions post-hoc sparse attention as a practical tool for "cleaning up" pre-trained models. Potential applications include improving the transparency, reliability, and alignment of large language models, and guiding the design of more efficient, interpretable architectures.
Analogy / Intuitive Explanation
Think of dense attention as a city where every building is linked by a tangle of streets and alleys: traffic gets through, but no single route is easy to trace. Sparse attention is the same city reduced to a few main roads and clear intersections: the same journeys remain possible, but the routes are now easy to follow. The paper shows that the city can be simplified this drastically (attention connectivity cut to roughly 0.3% of its edges) while the essential functionality (the original pretraining loss) is preserved.
Paper Information

Categories: cs.LG, cs.AI
arXiv ID: 2512.05865v1