
Overview of Transformer Models
This paper examines transformer language models, focusing primarily on models with two or fewer layers of attention blocks. Analyzing these simplified models builds intuition for complex, large-scale models such as GPT-3. The central goal is to reverse-engineer the computations transformers perform, much as one might decompile a complex binary into understandable source code.
Model Simplifications and Assumptions
To streamline the analysis, the research concentrates on "attention-only" transformers, omitting MLP layers and biases. It lays out the high-level structure of autoregressive transformer models, emphasizing the central role of the residual stream: each layer reads from the stream and adds its output back into it, and this purely additive, linear structure is what makes the models tractable to analyze.
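To make that setup concrete, here is a minimal, hypothetical forward-pass sketch of such an attention-only model in NumPy; the weight names and dictionary layout are illustrative, not taken from the paper or any particular codebase:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_only_forward(tokens, W_E, W_U, layers):
    """Sketch of an attention-only transformer: no MLPs, no biases.
    `layers` is a list of layers; each layer is a list of heads, and each
    head is a dict holding W_Q, W_K, W_V, W_O (names are illustrative)."""
    x = W_E[tokens]                                    # residual stream, shape (seq, d_model)
    n = len(tokens)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # causal mask
    for heads in layers:
        x_in = x                                       # heads in a layer read the same stream
        for h in heads:
            scores = (x_in @ h["W_Q"]) @ (x_in @ h["W_K"]).T / np.sqrt(h["W_Q"].shape[-1])
            A = softmax(np.where(mask, -1e9, scores))  # attention pattern
            x = x + A @ (x_in @ h["W_V"]) @ h["W_O"]   # each head writes additively
    return x @ W_U                                     # logits over the vocabulary
```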
Residual Stream and Virtual Weights
The residual stream acts as a shared communication channel: at any point it is simply the sum of the original token embedding and the outputs of all previous layers. Because this structure is linear, one can compute "virtual weights", products of one component's output weights with a later component's input weights, that describe how directly any pair of layers or heads communicates.
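A minimal sketch of the idea, using randomly initialized, purely hypothetical weight matrices: since components interact only by writing to and reading from the same linear stream, the effective connection between an earlier head's output and a later head's input is just a matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 16

# Hypothetical weights: an earlier head writes into the residual stream via
# W_O_early; a later head reads from it via W_V_late.
W_O_early = rng.normal(size=(d_head, d_model))   # head output -> residual stream
W_V_late  = rng.normal(size=(d_model, d_head))   # residual stream -> head values

# Because the residual stream is purely additive and linear, the effective
# ("virtual") weight connecting the two components is simply their product.
W_virtual = W_O_early @ W_V_late                 # shape (d_head, d_head)

# Its overall size (e.g. Frobenius norm) gives a rough signal of how strongly
# the earlier head's output influences the later head's values.
print(np.linalg.norm(W_virtual))
```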
Dividing Attention Heads’ Roles
Attention heads are analyzed as independent, additive operations on the residual stream. Each head factors into a query-key (QK) circuit, which determines where the head attends, and an output-value (OV) circuit, which determines how attending to a token affects the output. Treating the two circuits separately makes it possible to isolate a head's influence on different model behaviors.
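As a rough illustration with hypothetical weights, the two circuits can be formed explicitly as products of the embedding, head, and unembedding matrices; the exact transpose conventions below are assumptions and vary between write-ups.

```python
import numpy as np

rng = np.random.default_rng(1)
d_vocab, d_model, d_head = 1000, 64, 16

# Hypothetical embedding, unembedding, and single-head weights.
W_E = rng.normal(size=(d_vocab, d_model))   # token embedding
W_U = rng.normal(size=(d_model, d_vocab))   # unembedding
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: entry [i, j] scores how much a query token i "wants" to attend
# to a key token j (before the softmax and any positional effects).
QK_circuit = W_E @ W_Q @ W_K.T @ W_E.T      # (d_vocab, d_vocab)

# OV circuit: entry [i, j] describes how attending to token i moves the
# logit of token j, independent of where the head actually attends.
OV_circuit = W_E @ W_V @ W_O @ W_U          # (d_vocab, d_vocab)
```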
Mechanistic Interpretability via Path Expansion
The "path expansion trick" rewrites the transformer's end-to-end function as a sum over paths through the model, each of which can be interpreted on its own. Examining the QK and OV circuits along these paths shows that one-layer models implement "skip-trigram" statistics (patterns of the form [A] ... [B] -> [C]), while two-layer models develop more sophisticated algorithms through the composition of attention heads across layers.
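Sketched roughly in the paper's notation, and abbreviating each head's combined output-value matrix as W_OV^h, the expanded one-layer model splits into a direct path plus one term per head:

```latex
T \;=\; \underbrace{\mathrm{Id} \otimes W_U W_E}_{\text{direct path: bigram statistics}}
\;+\; \sum_{h \in \text{heads}} \underbrace{A^{h} \otimes \bigl(W_U\, W_{OV}^{h}\, W_E\bigr)}_{\text{per-head terms: skip-trigram statistics}}
```

Here each A^h is the head's attention pattern; every term is a separate, independently interpretable path from input tokens to output logits.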
The Role of Induction Heads
Two-layer attention-only transformers use induction heads to implement a simple in-context learning algorithm. An induction head searches the context for an earlier occurrence of the current token, attends to the token that followed it, and predicts that the same continuation will recur, allowing the model to replicate previously seen sequences both exactly and approximately. This mechanism is considerably more powerful than the skip-trigram copying available to one-layer models.
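A toy sketch in plain Python of the behavior, not the mechanism, that an induction head implements:

```python
def induction_prediction(tokens):
    """Toy sketch of induction-head behaviour: find the most recent earlier
    occurrence of the current token and predict the token that followed it,
    i.e. [A][B] ... [A] -> predict [B]. This mimics the input/output
    behaviour only; the real mechanism is two composed attention heads."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards over the prefix
        if tokens[i] == current:
            return tokens[i + 1]               # copy what followed the earlier match
    return None                                # nothing earlier to copy from

# "The cat sat on the mat. The" -> predicts "cat"
print(induction_prediction(["The", "cat", "sat", "on", "the", "mat", "The"]))
```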
Importance of Understanding Higher-Order Compositions
The research also explores higher-order compositions, such as "virtual attention heads" that arise when one head's output feeds into another head's input. Their contribution appears limited in these small models, but they may be far more important in larger, more complex transformers and are flagged as an essential focus for future research.
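A brief sketch, with hypothetical attention patterns and OV matrices, of why two composed heads act like a single virtual head: under this kind of composition the pair behaves as one head whose attention pattern and OV matrix are the products of the originals.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
seq, d_model = 8, 64

# Hypothetical attention patterns (rows sum to one) and full OV matrices for
# an earlier head h1 and a later head h2 that reads what h1 wrote.
A1 = softmax(rng.normal(size=(seq, seq)))
A2 = softmax(rng.normal(size=(seq, seq)))
W_OV1 = rng.normal(size=(d_model, d_model))
W_OV2 = rng.normal(size=(d_model, d_model))

# The composed ("virtual") head moves information with the product of the
# two attention patterns and transforms it with the product of the two
# OV matrices.
A_virtual = A2 @ A1                  # where the virtual head effectively attends
W_OV_virtual = W_OV2 @ W_OV1         # how it transforms what it reads
```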
Future Directions and Practical Implications
This initial work deliberately emphasizes attention-only models, and understanding MLP layers remains a major open challenge; addressing their complexities will require a more complete account of how transformers compute. The ultimate goal is to develop systematic interpretability tools that can anticipate and mitigate safety issues in current and future AI models.
Resource
Read more in A Mathematical Framework for Transformer Circuits.