How to Visually Understand the Self-Attention Equation
Background

Throughout this post, we follow the standard mathematical convention that vectors are represented as column vectors, not row vectors. We’ll start by understanding self-attention for a single token, then generalize to the batched matrix form used in practice.

Single-Token Self-Attention

Consider a sentence with $N$ tokens: $[t_1, t_2, \cdots, t_i, \cdots, t_N]$. For a single token $t_i$, its embedding is $\mathbf{x}_i \in \mathbb{R}^d$. The question is: how do we compute its output vector $\mathbf{z}_i \in \mathbb{R}^{d_2}$ after applying self-attention? ...
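The single-token setup above can be sketched in NumPy. This is a minimal illustration, not the post's reference implementation: the projection matrix names `W_Q`, `W_K`, `W_V`, the dimensions, and the scaled dot-product score are assumptions standard for self-attention, and embeddings are kept as column vectors per the convention stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d2, N = 8, 4, 5  # embedding dim, output dim, number of tokens (illustrative values)

# Token embeddings as column vectors, stacked into a (d, N) matrix.
X = rng.normal(size=(d, N))

# Hypothetical learned projection matrices mapping R^d -> R^{d2}.
W_Q = rng.normal(size=(d2, d))
W_K = rng.normal(size=(d2, d))
W_V = rng.normal(size=(d2, d))

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

def single_token_attention(i):
    """Compute z_i for token i, keeping the column-vector convention."""
    q_i = W_Q @ X[:, i]                # query for token i, shape (d2,)
    K = W_K @ X                        # keys for all tokens, shape (d2, N)
    V = W_V @ X                        # values for all tokens, shape (d2, N)
    scores = K.T @ q_i / np.sqrt(d2)   # one scaled dot-product score per token, shape (N,)
    alpha = softmax(scores)            # attention weights, sum to 1
    return V @ alpha                   # z_i: weighted sum of value vectors, shape (d2,)

z_1 = single_token_attention(0)
print(z_1.shape)  # (4,)
```

Looping `single_token_attention` over all $N$ tokens is exactly what the batched matrix form in the next section collapses into a few matrix multiplications.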