"Transformer"s made easy - Part 2

Quick Recap
In the last part we learnt that Transformers are the foundation of Large Language Models (LLMs) like GPT, Llama, etc. The transformer is a deep learning architecture designed to process text efficiently using the self-attention mechanism.
As promised, we will now learn about each of the components of the transformer architecture.
Language learning
We know the stages of the transformer - “Self Attention”, “Parallel Processing”, “Multi-layer Stacking”, “Pre-Training” and “Fine-Tuning”.
Pre-Training is the step where large organizations with huge datasets and unlabelled corpora create the LLM. In this section we will try to break it down:
At the core of the transformer, the two key components are “Encoders” and “Decoders”.
When input text is accepted, it is converted into embeddings by the Encoder, which then applies its own “self-attention technique”.
Encoders are only responsible for embeddings and classification - they perform classification tasks like sentiment analysis and multimodal classification. They do not perform generative tasks - that’s what Decoders are for.
Most LLMs are either Encoder-only or Decoder-only models.
What is an Encoder?
a. Tokenization
Break down the input text into smaller units called tokens. Tokens can be words, subwords or characters, depending on the tokenizer (e.g. WordPiece, Byte Pair Encoding).
For example, “John Doe wants to learn English” could be broken down into “John”, “Doe”, “wants” …
Each token is assigned an ID from the vocabulary: “John” → 101, “Doe” → 314 …
Add special tokens to mark the start and boundaries of sentences: [CLS], [SEP]
Input - “John Doe wants to learn English. He loves the language”
Resulting tokens - [[CLS], 101, 314, 456, 12106, 986, 9011, [SEP], [CLS], …] ([CLS] and [SEP] are also converted into integer tokens)
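To make this concrete, here is a minimal sketch using a Hugging Face tokenizer. The model name bert-base-uncased is just one common choice, and the actual integer IDs depend entirely on that model’s vocabulary, so they will not match the illustrative numbers above.

```python
from transformers import AutoTokenizer

# bert-base-uncased is just one example of a WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "John Doe wants to learn English. He loves the language"
tokens = tokenizer.tokenize(text)   # e.g. ['john', 'doe', 'wants', 'to', 'learn', ...]
ids = tokenizer.encode(text)        # adds [CLS]/[SEP] and maps each token to its vocabulary ID

print(tokens)
print(ids)   # the exact integers depend on the model's vocabulary
```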
b. Embedding generation
Imagine the scale at which these encoders need to operate. If we keep the data at the integer-ID level, the operations needed to capture context and semantics would be impossible to perform.
Here comes the beauty of matrix operations :) If you don’t remember them, don’t worry - just ask ChatGPT about basic matrix operations like addition, subtraction, multiplication, transpose, inverse, etc.
Now imagine if you could compute semantic word relationships like:
“King” - “Man” + “Woman” ≈ “Queen”
“Dancing” - “today” + “yesterday” ≈ “Danced” …
Basic English and maths combined = magic, right?
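If you want to try this yourself, pre-trained word vectors make the analogy above reproducible. Here is a minimal sketch with gensim - it downloads a small set of GloVe vectors on first run, and the exact neighbours you get depend on the model you pick:

```python
import gensim.downloader as api

# Downloads the small pre-trained GloVe vectors on first use
word_vectors = api.load("glove-wiki-gigaword-50")

# "King" - "Man" + "Woman" ≈ ?
result = word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # typically [('queen', ...)]
```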
We already generated tokens in the previous step; however, in order to perform these operations we need to convert the tokens into matrices, or rather vectors. And yes, those are embeddings.
[[CLS], 101, 314, 456, 12106, 986, 9011, [SEP], [CLS], …]
gets converted to
[ -0.0132, 0.0741, -0.0368, 0.0519, 0.0225,
-0.0096, 0.0171, -0.0034, 0.0287, -0.0665,
...
]
Now imagine the power you have to work with the semantics of this vector. You can perform all the matrix operations on it with other vectors and generate semantic vectors that represent meaningful words when decoded.
P.S. There are lots of embedding model libraries available to generate these vector representations, like OpenAI’s text-embedding-ada-002, BERT (from Hugging Face), Sentence Transformers, etc.
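Under the hood, this step is essentially a big lookup table: each integer ID indexes a row of a learned matrix. A minimal PyTorch sketch - the vocabulary size and dimension below are BERT-base values used purely as an example, and a freshly initialised table is random; real models learn these rows during pre-training:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 30522, 768                 # BERT-base sizes, used here as an example
embedding = nn.Embedding(vocab_size, embed_dim)    # one 768-dim row per token ID

# IDs from the tokenization step (illustrative numbers)
token_ids = torch.tensor([[101, 314, 456, 12106, 986, 9011, 102]])
vectors = embedding(token_ids)                     # lookup: each ID becomes a 768-dim vector

print(vectors.shape)                               # torch.Size([1, 7, 768])
```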
c. Positional Embedding
Our ultimate aim is to create contextual embeddings - the embeddings from the previous steps do not carry any knowledge of context.
Imagine these 2 sentences -
“The most beautiful flower is Rose”
“Rose went to the market”
In generic embedding terms, “Rose” would get exactly the same vector in both sentences - but does it represent the same thing in both contexts?
No, right? This is where positional embeddings come in handy: they encode where each token sits in the sentence, so that the later layers can combine position and surrounding words into context.
“Rose” in the first sentence ends up with embedding information closer to the different types of flowers.
“Rose” in the second sentence ends up with embedding information closer to Rose, the person.
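To give the model that positional signal, every position in the sentence gets its own vector which is added to the token embedding. BERT learns these position vectors, while the original Transformer paper uses fixed sinusoids; here is a small numpy sketch of the sinusoidal variant, with shapes chosen only for illustration:

```python
import numpy as np

def sinusoidal_positions(num_tokens: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal positional encodings from the original Transformer paper."""
    positions = np.arange(num_tokens)[:, None]        # (num_tokens, 1)
    freqs = 10000 ** (np.arange(0, dim, 2) / dim)     # one frequency per pair of dimensions
    pe = np.zeros((num_tokens, dim))
    pe[:, 0::2] = np.sin(positions / freqs)           # even dimensions
    pe[:, 1::2] = np.cos(positions / freqs)           # odd dimensions
    return pe

# Add position information to the (illustrative) token embeddings
token_embeddings = np.random.randn(7, 768)
encoder_input = token_embeddings + sinusoidal_positions(7, 768)
```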
d. Self Attention
Lovely! The fact that you have reached this section probably means you are following along :)
From the previous step we have the position-aware embeddings - we can use them for semantic calculations.
But all of this is still computationally very intense in the larger scheme of things.
Remember we talked about how RNNs are sequential modelling techniques that were not practically feasible for real-world applications - then came the self-attention based Transformer, which lets all tokens be processed in parallel while the most important words of the sentence are identified.
When you ask ChatGPT “How is the weather in Delhi?”, the model gives far more attention to “weather” and “Delhi” than to the filler words, which keeps the computation focused and efficient. But how does that happen?
- Compute an attention score for each token:
score(Query, Key) = (Query · Keyᵀ) / √d_k
where Query is the vector for “How”, Key is the vector of each word in the sentence (“is”, “the”, “weather”, “in”, “Delhi”), and d_k is the dimension of those vectors.
Remember that matrix operations can create new semantic values?
Each word also carries a value vector (V).
We then turn the scores into normalized weights using softmax.
The value vector of every token is weighted by these normalized scores:
Attention(Query, Key, V) = softmax(score(Query, Key)) · V
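Here is a minimal numpy sketch of that single-head calculation. In a real encoder, Q, K and V come from three separate learned projections of the token embeddings; the sketch reuses one random matrix for all three just to show the mechanics:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V - one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # relevance of every key to every query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted mix of the value vectors

# Toy example: 6 tokens ("How is the weather in Delhi"), 4-dim vectors
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
out = scaled_dot_product_attention(x, x, x)   # Q = K = V = x, purely for illustration
print(out.shape)                              # (6, 4): one context-aware vector per token
```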
This process is repeated several times in parallel, each repetition with its own learned projections of the “Attention” inputs - this is Multi-Head Self-Attention.
The multi-head self-attention output is then fed into a “Feed Forward Network (FFN)”, which applies non-linear transformations.
From the FFN we get an updated representation for every token, ready for the next layer.
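PyTorch bundles exactly this pair - multi-head self-attention followed by an FFN - into a single building block. A tiny sketch, with BERT-base-like sizes used purely for illustration:

```python
import torch
import torch.nn as nn

# One encoder block: multi-head self-attention + feed-forward network
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072, batch_first=True)

x = torch.randn(1, 7, 768)   # (batch, tokens, embedding dim) - illustrative input
out = layer(x)               # same shape, but every token vector is now context-aware
print(out.shape)             # torch.Size([1, 7, 768])
```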
e. Contextual Vector Generation and Output
After passing through multiple transformer layers, each token embedding becomes a context vector.
The output of the encoder then feeds the desired task:
Classification: the [CLS] vector is passed to a classifier head.
Translation: the context vectors are passed to the decoder.
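Tying the whole encoder together, here is a minimal sketch that runs a sentence through a pre-trained BERT encoder and pulls out both the per-token context vectors and the [CLS] vector; bert-base-uncased is again just one convenient choice:

```python
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("John Doe wants to learn English. He loves the language", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

context_vectors = outputs.last_hidden_state   # (1, num_tokens, 768): one context vector per token
cls_vector = context_vectors[:, 0]            # the [CLS] vector, commonly fed to a classifier head
print(context_vectors.shape, cls_vector.shape)
```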
Done! You now know the key steps of how your input text is transformed in an efficient manner and passed to the decoder for sequence prediction and more.