This post documents my journey through Stanford CS336: Language Models from Scratch — specifically Assignment 1, where we implement core components of a language model.
Overview
CS336 is Stanford’s deep dive into building large language models from first principles. Assignment 1 focuses on:
- Tokenization (BPE implementation)
- Transformer architecture components
- Training loop fundamentals
Key Takeaways
1. Byte-Pair Encoding (BPE)
We first need to consider the related Python methods. BPE was originally developed for data compression but works remarkably well for subword tokenization in NLP.
UTF-8 introduction. UTF-8 is a variable-width encoding that can represent every character in the Unicode standard while remaining backward compatible with ASCII.
Regular expression.
Regular expression like [?.,]
Coming soon…
2. Attention Mechanism
Coming soon…
3. Training Considerations
Coming soon…
Implementation Highlights
def find_chunk_boundaries(
file: BinaryIO,
desired_num_chunks: int,
split_special_token: bytes,
) -> list[int]:
"""
Chunk the file into parts that can be counted independently.
May return fewer chunks if the boundaries end up overlapping.
"""
assert isinstance(split_special_token, bytes), "Must represent special token as a bytestring"
# Get total file size in bytes
file.seek(0, os.SEEK_END)
file_size = file.tell()
file.seek(0)
chunk_size = file_size // desired_num_chunks
# Initial guesses for chunk boundary locations, uniformly spaced
# Chunks start on previous index, don't include last index
chunk_boundaries = [i * chunk_size for i in range(desired_num_chunks + 1)]
chunk_boundaries[-1] = file_size
mini_chunk_size = 4096 # Read ahead by 4k bytes at a time
for bi in range(1, len(chunk_boundaries) - 1):
initial_position = chunk_boundaries[bi]
file.seek(initial_position) # Start at boundary guess
while True:
mini_chunk = file.read(mini_chunk_size) # Read a mini chunk
# If EOF, this boundary should be at the end of the file
if mini_chunk == b"":
chunk_boundaries[bi] = file_size
break
# Find the special token in the mini chunk
found_at = mini_chunk.find(split_special_token)
if found_at != -1:
chunk_boundaries[bi] = initial_position + found_at
break
initial_position += mini_chunk_size
# Make sure all boundaries are unique, but might be fewer than desired_num_chunks
return sorted(set(chunk_boundaries))Challenges & Lessons Learned
Coming soon…
References
- Stanford CS336 Course Page
- Vaswani et al., “Attention Is All You Need” (2017)
- Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (2016)
This post is a work in progress. Check back for updates!