How to Visually Understand the Self-Attention Equation

Background

Throughout this post, we follow the standard mathematical convention that vectors are represented as column vectors, not row vectors. We’ll start by understanding self-attention for a single token, then generalize to the batched matrix form used in practice.

Single Token Self-Attention

Consider a sentence with $N$ tokens: $[t_1, t_2, \cdots, t_i, \cdots, t_N]$. For a single token $t_i$, its embedding is $\mathbf{x}_i \in \mathbb{R}^d$. The question is: how do we compute its output vector $\mathbf{z}_i \in \mathbb{R}^{d_2}$ after applying self-attention? ...
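The single-token view above can be sketched in a few lines of NumPy. This is a minimal illustration under the column-vector convention the post states; the projection names (`W_Q`, `W_K`, `W_V`) and the scaled dot-product form are standard assumptions, not details quoted from the excerpt.

```python
import numpy as np

# Illustrative sketch of self-attention for a single token t_i, with
# embeddings stored as columns (the post's column-vector convention).
# Dimensions N, d, d2 and the weight names W_Q/W_K/W_V are assumptions.
rng = np.random.default_rng(0)
N, d, d2 = 4, 8, 6                     # tokens, input dim, output dim

X = rng.standard_normal((d, N))        # columns are embeddings x_1 .. x_N
W_Q = rng.standard_normal((d2, d))     # query projection
W_K = rng.standard_normal((d2, d))     # key projection
W_V = rng.standard_normal((d2, d))     # value projection

def attend(i):
    """Compute the output vector z_i for token t_i (0-indexed)."""
    q = W_Q @ X[:, i]                  # query for token i, shape (d2,)
    K = W_K @ X                        # keys for all tokens, shape (d2, N)
    V = W_V @ X                        # values for all tokens, shape (d2, N)
    scores = K.T @ q / np.sqrt(d2)     # scaled dot-product scores, shape (N,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over the N tokens
    return V @ weights                 # z_i: weighted sum of value columns

z_i = attend(2)                        # z_i has shape (d2,)
```

Stacking `attend(i)` for all `i` column-by-column gives the batched matrix form the post goes on to derive.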

Published on January 4, 2026 · Updated on January 5, 2026 · 806 words

Building an LLM from Scratch: CS336 Assignment 1

This post documents my journey through Stanford CS336: Language Models from Scratch, specifically Assignment 1, where we implement core components of a language model.

Overview

CS336 is Stanford’s deep dive into building large language models from first principles. Assignment 1 focuses on:

- Tokenization (BPE implementation)
- Transformer architecture components
- Training loop fundamentals

Key Takeaways

1. Byte-Pair Encoding (BPE)

We first need to consider the relevant Python methods. BPE was originally developed for data compression but works remarkably well for subword tokenization in NLP. ...
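The core BPE idea, count adjacent symbol pairs and repeatedly merge the most frequent one, can be sketched in plain Python. This is a toy character-level illustration, not the assignment's byte-level implementation; the helper names and sample vocabulary are assumptions.

```python
from collections import Counter

# Toy sketch of one BPE training step. Words are tuples of symbols,
# mapped to their corpus frequency. (Illustrative only; the assignment
# operates on bytes, but the merge logic is the same idea.)
def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with the fused symbol."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

# One training step on a tiny assumed corpus, starting from characters.
vocab = {tuple("low"): 3, tuple("lower"): 2, tuple("lowest"): 1}
counts = get_pair_counts(vocab)
best = counts.most_common(1)[0][0]     # most frequent adjacent pair
vocab = merge_pair(vocab, best)
```

Repeating this step until a target vocabulary size is reached yields the ordered list of merges that defines the tokenizer.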

Published on January 3, 2026 · Updated on January 4, 2026 · 555 words