Decoding AI Jargon

Gaurav-Jethuri
3 min read

Attention Attention Attention

What do we mean by Generative Pre-trained Transformer (a.k.a. GPT)?

But first, let's understand what a “Transformer” is and why we are using it in the first place.

Listen, AI didn't start on November 30, 2022 (when ChatGPT was released). Researchers have been working on AI since 1956, according to Wikipedia. Since then, many models have been developed for tasks like sentence translation, which was considered fascinating at the time.

In 2017, Google introduced a paper called “Attention Is All You Need”.
This paper introduced a new way for machines (like translation tools or chatbots) to understand and process sentences, using a technique called the Transformer. It changed the world of AI because it is simpler, faster, and more powerful than older methods.

Before the Transformer:

  • Machines used RNNs (Recurrent Neural Networks).

  • RNNs read sentences word by word, which was very time-consuming for long sentences.

The Transformer:

  • It looks at all the words in a sentence at once (even in long sentences).

  • It uses something called “attention” to focus on important words.

From here the story begins:

What is Attention?

Imagine you’re reading this sentence:

“The cat sat on the mat because it was tired.”

To know what “it” refers to, your brain pays more attention to “cat”.

That’s what attention does — it lets the model focus on important words when making decisions.
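
To make this concrete, here is a tiny sketch in Python (with NumPy) of how attention turns “how related is this word to that one?” scores into focus weights. The sentence and the little 3-number word vectors are made up purely for illustration; real models learn much bigger vectors on their own.

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention: score every word against the query,
    turn the scores into focus weights (softmax), then mix the values."""
    d_k = query.shape[-1]
    scores = keys @ query / np.sqrt(d_k)      # how related is each word to "it"?
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()         # focus weights that sum to 1
    return weights, weights @ values          # weighted mix of the word vectors

# Toy example: 4 words with made-up 3-dimensional vectors.
words = ["The", "cat", "mat", "it"]
vecs = np.array([
    [0.1, 0.0, 0.2],   # The
    [0.9, 0.8, 0.1],   # cat
    [0.2, 0.1, 0.7],   # mat
    [0.8, 0.7, 0.2],   # it  (deliberately similar to "cat")
])

weights, _ = attention(query=vecs[3], keys=vecs, values=vecs)
for word, w in zip(words, weights):
    print(f"{word:>4}: {w:.2f}")   # "cat" gets the biggest weight: "it" attends to "cat"
```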

Key features of the Transformer:

  • No loops - Unlike RNNs, Transformers don’t go word-by-word. They look at all words together.

  • Self-Attention – The model checks how each word relates to every other word.

  • Multi-Head Attention – Instead of one way of looking, it uses multiple “heads” to look at different relationships between words. Example: One head might look for verbs, another for names, etc.

  • Position Information – Since Transformers read everything at once, they need help knowing the order of words. So we add a special pattern (like a rhythm) to show word positions (there is a small sketch of this right after the list).
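
For the “rhythm” part, here is a rough sketch of the sinusoidal positional encoding described in the original paper. The sizes (a 6-word sentence, 8-dimensional vectors) are toy numbers picked just to keep the printout small.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encoding: each position gets its own
    pattern of sine/cosine waves -- the "rhythm" mentioned above."""
    positions = np.arange(num_positions)[:, None]          # (position, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # even dimensions
    angles = positions / np.power(10000, dims / d_model)   # (position, d_model/2)

    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)                     # even slots: sine
    encoding[:, 1::2] = np.cos(angles)                     # odd slots: cosine
    return encoding

# Toy sizes: 6 words, 8-dimensional embeddings.
pe = positional_encoding(num_positions=6, d_model=8)
print(pe.round(2))   # each row is the unique "rhythm" added to that word's embedding
```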

How does it work, in the simplest way?

There are two big parts:

  1. Encoder – understands the input sentence.

  2. Decoder – generates the output sentence.

Each has multiple layers that repeat ♾️ the same process:

  • Pay attention to the words.

  • Pass the info through a small brain (called the feed-forward network).

  • Normalize and clean the data

Visual Summary (in words)

Imagine the layout below:

Input sentence --> [ encoder layers ] --> [ decoder layers ] --> output sentence
                    (looks at input)       (Generates output)

Each layer inside looks like this:

[ Multi-head Attention ] -> [ Tiny Brain (Feed-Forward) ] -> [ Layer Normalization ]
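
Here is a rough sketch of one such layer in Python/NumPy. It is heavily simplified: a single attention head, no learned query/key/value projections, and random placeholder weights for the “tiny brain”, just to show how the three steps fit together.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each word vector (the "clean the data" step)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def feed_forward(x, w1, w2):
    """The "tiny brain": a small two-layer network applied to every word."""
    return np.maximum(0, x @ w1) @ w2        # ReLU in the middle

def self_attention(x):
    """Every word attends to every other word (single head, no learned projections)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def encoder_layer(x, w1, w2):
    """Attention -> tiny brain -> normalization, with skip connections."""
    x = layer_norm(x + self_attention(x))            # pay attention to the words
    x = layer_norm(x + feed_forward(x, w1, w2))      # pass info through the tiny brain
    return x

# Toy run: 5 "words" with 8-dimensional vectors and random placeholder weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w1, w2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))
print(encoder_layer(x, w1, w2).shape)                # (5, 8): same shape in, same shape out
```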

Attention:

Think of each word having laser beams pointing to other important words in the sentence!

Multi-Head:

Imagine 8 laser beams from different angles — one focuses on verbs, another on names, one on location, etc.
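
A quick sketch of those “laser beams from different angles” in code: chop each word's vector into a few chunks (heads), let each chunk run its own attention, then glue the results back together. Real models also learn separate projections per head, which this toy version skips.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads):
    """Split each word vector into `num_heads` chunks, run attention inside
    each chunk ("head"), then concatenate the results back together."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)  # (heads, words, d_head)

    outputs = []
    for h in heads:                                   # each head looks from its own "angle"
        weights = softmax(h @ h.T / np.sqrt(d_head))
        outputs.append(weights @ h)
    return np.concatenate(outputs, axis=-1)           # back to (words, d_model)

x = np.random.default_rng(1).normal(size=(5, 8))      # 5 words, 8-dim vectors (toy sizes)
print(multi_head_attention(x, num_heads=2).shape)     # (5, 8)
```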

Positional Encoding:

Each word is given a unique rhythm so the model knows the order of words (even without reading one-by-one).

Credits for my learning:
