Summary of Chapter 2 from Building LLMs from Scratch

Chapter 2 focuses on the process of preparing textual data for LLMs, covering the steps from raw text to the numerical format required for deep learning. Raschka begins by pointing out that, unlike humans, computers cannot understand content in the form of words or sentences; they need a different representation, namely arrays of numbers. This need for a numerical representation is what leads to the widespread process of tokenization, in which text is split into smaller pieces called tokens. A token can be as large as a complete word, but it can also be a part of a word or even a single character.
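To make that first splitting step concrete, here is a minimal sketch along the lines of the simple whitespace-and-punctuation tokenizer the chapter starts with, before moving on to fancier schemes (the example sentence is my own):

```python
import re

text = "Hello, world. Is this-- a test?"

# Split on whitespace, punctuation, and double dashes, keeping the delimiters as tokens
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]

print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```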
A major method Raschka covers is Byte Pair Encoding (BPE), a tokenization scheme intended to let models handle words that may not be in their token dictionary¹. BPE starts by breaking words down into their individual characters (e.g., "the" → ["t", "h", "e"]) and then repeatedly merges the most frequently occurring pairs into single tokens. This approach not only keeps the token dictionary, also known as the vocabulary, at a manageable size but also allows the model to handle unfamiliar words by splitting them into familiar chunks. For example, if a model using BPE had to tokenize the word "tissue" and the whole word were not in its vocabulary, it might split it into recognizable pieces such as "t" and "issue". The chapter highlights BPE in particular to give an example of a tokenization method and to show how crucial it is for making an LLM flexible and efficient.
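The chapter uses the tiktoken library's GPT-2 BPE tokenizer for this. A small sketch of the idea (the made-up word is just an example string; the exact subword split depends on the GPT-2 vocabulary):

```python
import tiktoken  # OpenAI's BPE tokenizer library, used in the chapter

tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2's byte pair encoding

# A made-up word gets split into known subword pieces instead of failing
ids = tokenizer.encode("someunknownPlace")
print(ids)                                   # token IDs of the subword chunks
print([tokenizer.decode([i]) for i in ids])  # the chunks themselves

# Decoding the IDs reconstructs the original text exactly
print(tokenizer.decode(ids))                 # "someunknownPlace"
```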
After covering BPE, Raschka turns to the sliding window technique, which is used to break a longer text, such as an article, into smaller segments that fit within a model's context window². This matters because LLMs can only "see", or refer to, a certain amount of text at a time. Sliding a window over the text ensures that every part of it gets used, giving the model a plethora of training examples, even if some of them overlap. From what I gathered, the sliding window method increases the number of training examples and helps the model learn the context and relationships surrounding each word.
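Here is a minimal sketch of that idea: slide a fixed-size window over the token IDs and pair each window with the same window shifted one token to the right, which becomes the training target (the sentence and the toy window size are my own choices, not the book's exact values):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = tokenizer.encode("In the heart of the city stood an old library, a relic from a bygone era.")

context_size = 4  # how many tokens the model "sees" at once (toy value)
stride = 1        # how far the window moves each step; stride < context_size means overlap

# Each input window is paired with the same window shifted one token to the right,
# so the model learns to predict the next token at every position.
for i in range(0, len(token_ids) - context_size, stride):
    inputs = token_ids[i:i + context_size]
    targets = token_ids[i + 1:i + context_size + 1]
    print(inputs, "->", targets)
```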
Once the text is tokenized, the next step toward making it readable for a model is to assign each token an integer ID and then convert those IDs into token embeddings. Each token is mapped to a vector, a unique set of numbers that captures its meaning and its relationship to the other tokens. A neat thing I learned while reading this is that the embeddings are learned during training, so words with similar meanings often end up with similar vectors.
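In PyTorch this is an `nn.Embedding` layer, essentially a lookup table with one trainable vector per vocabulary entry. A minimal sketch (the token IDs and the embedding size here are arbitrary toy values):

```python
import torch

vocab_size = 50257   # size of the GPT-2 BPE vocabulary used in the chapter
embedding_dim = 256  # length of each token vector (a toy choice)

torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

# A batch containing one sequence of four token IDs (arbitrary example values)
token_ids = torch.tensor([[40, 367, 2885, 1464]])

# Look up one embedding vector per token ID
token_embeddings = token_embedding_layer(token_ids)
print(token_embeddings.shape)  # torch.Size([1, 4, 256])
```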
A challenge in language modeling is that models need to know not only which words are present and what they mean, but also the order in which they appear. For example, there is a big difference between "I ate the burger" and "the burger ate me!" To handle this, the chapter introduces the well-known method of positional encoding. Because transformer blocks don't inherently understand the sequence of words, positional encodings are added to the token embeddings. These extra numbers tell the model where each token sits in the sentence, letting it distinguish between a person eating a burger and a burger eating them.
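The chapter does this with learnable absolute positional embeddings: a second embedding layer indexed by position rather than by token ID, whose output is simply added to the token embeddings. A small sketch, reusing the toy dimensions from the previous snippet (the random token embeddings stand in for a real batch):

```python
import torch

context_length = 4   # maximum number of tokens handled at once (toy value)
embedding_dim = 256

torch.manual_seed(123)
token_embeddings = torch.randn(1, context_length, embedding_dim)  # stand-in for a real batch
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

# One learned vector per position 0, 1, ..., context_length - 1
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # shape (4, 256)

# Adding the position vectors to the token vectors injects word-order information
input_embeddings = token_embeddings + pos_embeddings  # broadcasts over the batch dimension
print(input_embeddings.shape)  # torch.Size([1, 4, 256])
```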
The second half of the chapter focuses more on tensors, multidimensional arrays of numbers commonly used to represent embeddings, and on PyTorch, the key tool for handling and processing them. The chapter then walks through the different ranks of tensors: scalars (a single number, zero dimensions), vectors (one dimension), matrices (two dimensions, i.e. tables), and higher-dimensional tensors (cubes and beyond).
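In PyTorch, all of these come from the same tensor machinery; only the shape changes:

```python
import torch

scalar = torch.tensor(3.14)              # 0 dimensions: a single number
vector = torch.tensor([1.0, 2.0, 3.0])   # 1 dimension: a list of numbers
matrix = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])      # 2 dimensions: a table of rows and columns
tensor3d = torch.rand(2, 3, 4)           # 3 dimensions: a "cube" of numbers

for t in (scalar, vector, matrix, tensor3d):
    print(t.shape)
```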
Finally, the chapter shows how all these parts fit together: using PyTorch to create the embedding layers for tokens and their positions, and building an efficient data loader that feeds batches of data into the model.
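A simplified sketch of that pipeline, along the lines of the chapter's sliding-window dataset and data loader (the sample text and the hyperparameter values are placeholders of mine, not the book's exact ones):

```python
import torch
from torch.utils.data import Dataset, DataLoader
import tiktoken

class GPTDataset(Dataset):
    """Simplified sketch of a sliding-window dataset of (input, target) ID pairs."""
    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        # Slide a window of max_length tokens over the text, stepping by stride
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")
raw_text = "In the heart of the city stood an old library, a relic from a bygone era. " * 20
dataset = GPTDataset(raw_text, tokenizer, max_length=4, stride=4)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

# Each batch is ready to be passed through the token and positional embedding layers
inputs, targets = next(iter(dataloader))
print(inputs.shape, targets.shape)  # torch.Size([8, 4]) torch.Size([8, 4])
```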
Thank you for reading,
Abhyut Tangri
¹ Token dictionary: the dictionary of tokens the model holds, e.g. ['the', 'fox', 'blue', 'beans'].
² Context window: the maximum amount of text (measured in tokens) that a language model can "see" and use at one time to generate a response.