Decoding AI and LLM Jargon


Disclaimer: The author has used a lot of sarcasm in this article. Don't take the analogies too seriously!
In this weird and exciting new world of large language models (LLMs), things are changing so rapidly that it's hard to keep up. You are not sure if the girl you are talking to online is even real or just outright generated. Let's be real though, would a real girl even talk to someone like you? Someone who is reading this instead of talking to her. But guess what, you could technically generate an online girlfriend for yourself who is a match made in transformer.
How, you may ask? Don't get too excited. That premise was only there to get you just the right amount of excited about LLMs. And now that you are excited, let's switch to a less controv- ahem- more universally loved topic as our core example.
This example that I am talking about is the beautiful game - ⚽️
Football is a game loved by many and will hopefully be an analogy where everyone finds something relatable.
Recurrent Neural Network
• Let’s start from the 8th day. There was this thing in the ancient world called Recurrent Neural Networks or RNNs.
• It was used to process sequential data like text.
• How? Well, I have got two words for you. Ankara Messi!
• Remember the game when Messi scored that goal against Getafe in 2007?
• In that goal, Messi dribbled past everyone in a sequence.
• He skipped past all the players, one at a time, and got to the end result of scoring the goal.
• Each of his touches was heavily dependent on the last one he took.
• That's how an RNN works - it handles a sequence one step at a time, depending heavily on its own past (there is a small code sketch of this at the end of this section).
• Now, the problem with the RNN was that Messi would need to do everything alone.
• Every time the team needs to score, Messi would have to go all the way.
• It seems stupid to put all the burden on one player to do everything when we could take the help of the 10 other players on the field.
• This is what the Transformer solved.
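If you like seeing ideas in code, here is a minimal sketch of that "one step at a time" behaviour. The touches are just random vectors and the weights are made up; the only point is that each new state depends on the previous one.

```python
import numpy as np

# A toy RNN: each hidden state depends on the previous one,
# like each touch in the run depending on the last touch.
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_h = rng.normal(size=(hidden_size, hidden_size))   # how much the past matters
W_x = rng.normal(size=(hidden_size, input_size))    # how much the new touch matters

touches = [rng.normal(size=input_size) for _ in range(5)]  # the dribble, touch by touch
hidden = np.zeros(hidden_size)                              # Messi picks up the ball

for touch in touches:   # strictly one step after another, no parallelism
    hidden = np.tanh(W_h @ hidden + W_x @ touch)

print(hidden)           # the end result carries the whole run inside it
```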
Transformer
• Widely regarded as one of the greatest teams of all time, there is no better way to explain the workings of a transformer than with the example of the Spanish national team between 2008 and 2012.
• That team was a symphony of 11 players moving the ball around the pitch so elegantly it felt like a hive mind.
• Now, the game didn’t only depend on just one player like Messi to dribble past all the players.
• Each player would be communicating almost telepathically with every other player.
• Sergio Busquets would not need to worry only about the last pass he received.
• Instead, he could look at the positions of all the players on the field and decide where to pass.
• Then there was each player’s intellect and the outstanding coaching of Vicente del Bosque that would help the players “weigh” who to pass to or look out for.
• In this symphony, everyone works in parallel – as a team.
• There is no “I” in Transformer.
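For the code-curious, here is a rough sketch of the "everyone works in parallel" part. The players are random vectors, purely for illustration; the point is that a single matrix multiplication lets every player weigh every other player at once, with no one waiting for the previous touch.

```python
import numpy as np

# A rough sketch of the "hive mind": every player looks at every other player
# at the same time, instead of one star dribbling through the sequence alone.
rng = np.random.default_rng(1)
num_players, dim = 11, 4
players = rng.normal(size=(num_players, dim))   # one vector per player (token)

# One matrix multiplication gives every player's view of every other player,
# all in parallel.
communication = players @ players.T             # shape: (11, 11)
print(communication.shape)                      # each row: one player weighing all 11
```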
Tokenization
• Read the sentence below and try to make sense of it. If you are even a casual football fan, it will be a piece of cake.
• “Keylor Navas Marcelo Raphael Varane Sergio Ramos Dani Carvajal Luka Modric Casemiro Toni Kroos Cristiano Ronaldo Karim Benzema Gareth Bale”
• Now, if you really are a football fan, and not someone gaslighting themselves into believing they are, then here is how you would have read it:
• Keylor Navas | Marcelo | Raphael Varane | Sergio Ramos | Dani Carvajal | Luka Modric | Casemiro | Toni Kroos | Cristiano Ronaldo | Karim Benzema | Gareth Bale
• See what your brain just did there? It broke down these random ass words to the best playing XI of the greatest Champions League team in history.
• It broke down that sentence into names of players you already know.
• Show that to someone who doesn’t watch football and they wouldn’t be able to do that.
• But you were able to break down the sentence into players you know.
• This is tokenization.
• You break down a sentence into smaller, known parts (there is a toy tokenizer sketched right after this list).
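Here is a toy tokenizer that chunks that wall of names the way a fan's brain does. The greedy "longest known name first" rule is just for illustration; real tokenizers (like BPE) split text into subword pieces instead.

```python
# A toy tokenizer: greedily match the longest known player name.
KNOWN_PLAYERS = [
    "Keylor Navas", "Marcelo", "Raphael Varane", "Sergio Ramos", "Dani Carvajal",
    "Luka Modric", "Casemiro", "Toni Kroos", "Cristiano Ronaldo",
    "Karim Benzema", "Gareth Bale",
]

def tokenize(text):
    tokens, words = [], text.split()
    i = 0
    while i < len(words):
        # Try two-word names before one-word names.
        for size in (2, 1):
            candidate = " ".join(words[i:i + size])
            if candidate in KNOWN_PLAYERS:
                tokens.append(candidate)
                i += size
                break
        else:
            tokens.append(words[i])   # unknown word: keep it as its own token
            i += 1
    return tokens

line_up = ("Keylor Navas Marcelo Raphael Varane Sergio Ramos Dani Carvajal "
           "Luka Modric Casemiro Toni Kroos Cristiano Ronaldo Karim Benzema Gareth Bale")
print(tokenize(line_up))
```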
Dictionary or Vocabulary
• The football players whose names you know can be termed your dictionary of football players.
• The player names are the “keys” and their squad numbers are their “values”.
• Kind of like this:
o Bale– 11
o Benzema – 9
o Cristiano – 7
• An LLM dictionary is all the words or tokens it can identify.
Vocab size
• It basically is the number of players you know the name of. The total number of players you know is your vocab size, just like an LLM's vocab size is the total number of tokens in its dictionary (see the tiny sketch below).
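In code, the dictionary and the vocab size look something like this. This is a toy example with three players; a real LLM vocabulary has tens of thousands of entries.

```python
# A tiny "dictionary" of players you know: names are keys, squad numbers are values.
vocab = {
    "Bale": 11,
    "Benzema": 9,
    "Cristiano": 7,
}

vocab_size = len(vocab)   # the total number of players (tokens) you know
print(vocab_size)         # 3
```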
Encoder
• So, you can think of encoding as replacing the player names with their squad numbers.
• An LLM understands everything in terms of numbers.
• So, if you would again look at the Real Madrid XI we saw earlier, it would be like this:
• Keylor Navas, Marcelo, Raphael Varane, Sergio Ramos, Dani Carvajal, Luka Modric, Casemiro, Toni Kroos, Cristiano Ronaldo, Karim Benzema, Gareth Bale
• [1, 12, 5, 4, 2, 19, 14, 8, 7, 9, 11]
• Here, we have mapped each player's name to their squad number.
Decoder
• This is the reverse of encoding.
• You convert the encoded value to its original value.
• You convert the squad numbers back to the player names.
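Here is a small sketch of that round trip, using the squad numbers above as token IDs.

```python
# Encoding and decoding with squad numbers as token IDs.
squad_numbers = {
    "Keylor Navas": 1, "Marcelo": 12, "Raphael Varane": 5, "Sergio Ramos": 4,
    "Dani Carvajal": 2, "Luka Modric": 19, "Casemiro": 14, "Toni Kroos": 8,
    "Cristiano Ronaldo": 7, "Karim Benzema": 9, "Gareth Bale": 11,
}
names_by_number = {number: name for name, number in squad_numbers.items()}

line_up = ["Keylor Navas", "Marcelo", "Raphael Varane", "Sergio Ramos", "Dani Carvajal",
           "Luka Modric", "Casemiro", "Toni Kroos", "Cristiano Ronaldo",
           "Karim Benzema", "Gareth Bale"]

encoded = [squad_numbers[name] for name in line_up]    # encoder: names -> numbers
decoded = [names_by_number[num] for num in encoded]    # decoder: numbers -> names

print(encoded)              # [1, 12, 5, 4, 2, 19, 14, 8, 7, 9, 11]
print(decoded == line_up)   # True - we got the original XI back
```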
Vector Embedding
• Just like each player is assigned a position on the field, like LB and LWF, to form a relation between them that helps them play better, LLMs also assign the encoded words a position on their own "field".
• LLMs use a high-dimensional vector space (often hundreds or thousands of dimensions) to assign positions to different tokens.
• Think of it like each player having not just a position like LB or LWF, but a complex stat card with hundreds of attributes describing everything from their speed to their passing accuracy to their leadership qualities. All these attributes together create their 'embedding' in the team.
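A minimal sketch of the "stat card" idea, with made-up numbers: every token ID indexes a row in an embedding table, and that row is the token's card. Real models use hundreds or thousands of attributes, not four.

```python
import numpy as np

# Each token ID gets a dense "stat card" of numbers (its embedding).
vocab_size, embedding_dim = 25, 4
rng = np.random.default_rng(7)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))  # one row per token

encoded_xi = [1, 12, 5, 4, 2, 19, 14, 8, 7, 9, 11]   # the squad numbers from above
embeddings = embedding_table[encoded_xi]              # look up each player's stat card
print(embeddings.shape)                                # (11, 4): 11 players, 4 attributes each
```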
Self-attention
• Well, each player on the pitch can communicate with any other player directly.
• They don’t need an intermediary or go through a certain chain of command to talk to each other.
• An LB will communicate with the LW to make an overlapping run, and then with the RW to play a cross-field pass.
• A simple example: a goalkeeper like Ederson could directly call out to a striker like Haaland to let him know a lofted through ball is coming his way.
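Here is a bare-bones sketch of self-attention over our 11 "players". The projection matrices are random here; in a real model they are learned during training.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)
num_players, dim = 11, 4
X = rng.normal(size=(num_players, dim))           # player (token) embeddings

W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v               # queries, keys, values

scores = Q @ K.T / np.sqrt(dim)                   # every player "talks" to every other player
weights = softmax(scores)                         # how much attention each pays to each
output = weights @ V                              # each player's new, context-aware view
print(weights.shape, output.shape)                # (11, 11) (11, 4)
```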
Multi-headed attention
• While these players are communicating with each other, the coach and the analysts are watching everything from the sidelines.
• What they are doing is paying attention to all the players in parallel, analyzing different aspects and perspectives to find the best strategy.
• This is also very important. Just like Hansi Flick came in and changed the way Barca play, taking them from being runners-up and left out to chasing trophies.
• This parallel self-attention, where the coach pays attention to all players at the same time and the relation between them, is multi-headed attention.
• Just like different coaches might focus on different aspects of the game (one on defense, another on set pieces, etc.), multi-headed attention has multiple 'heads' that learn different patterns and relationships between words.
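And a sketch of the multi-headed version: several "coaches", each with their own (here random) way of looking at the same 11 players, whose views are then combined.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, dim_head, rng):
    # One "coach" watching the game through their own projections.
    W_q, W_k, W_v = (rng.normal(size=(X.shape[1], dim_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(dim_head))
    return weights @ V

rng = np.random.default_rng(3)
X = rng.normal(size=(11, 8))                          # 11 players, 8-dimensional embeddings

num_heads, dim_head = 4, 2                            # four "coaches", each with a narrower view
heads = [attention_head(X, dim_head, rng) for _ in range(num_heads)]
multi_head_output = np.concatenate(heads, axis=-1)    # combine all the perspectives
print(multi_head_output.shape)                        # (11, 8)
```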
Attention score
• We are back in the Galactico Era.
• Zinedine Zidane has the ball.
• Who does he pass the ball to while Real Madrid is attacking?
• Who does he pay his “attention” to?
• Should he pass the ball to Ronaldo? Figo? Or all the way back to Casillas?
• During this forward pass, Zidane gives each passing option in the final third a judgement score.
• That judgement score is the attention score.
• The attention score in a transformer is the score each token assigns to every other token, rating how relevant that relation is.
• The attention score tells the model which other words it should focus on to understand the current word.
Softmax
• So great. Now you have the attention scores.
• But alone they are not everything. The final decision still needs to be made to predict the next pass.
• So, how does Zidane decide who to pass to? Ronaldo? Figo? Or San Iker?
• If you are smart like Zizou, you would take these attention scores and compare them.
• An easy way to do this is to pool them together and turn them into probabilities, then pick the most likely pass based on your judgement scores, as in the sketch below.
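A tiny sketch with made-up judgement scores for Zidane's options: softmax pools them into probabilities that sum to 1.

```python
import numpy as np

# Hypothetical judgement scores for Zidane's passing options (made up for illustration).
options = ["Ronaldo", "Figo", "Casillas"]
scores = np.array([3.0, 2.0, 0.5])

# Softmax turns raw scores into probabilities that sum to 1.
probabilities = np.exp(scores) / np.exp(scores).sum()
for player, p in zip(options, probabilities):
    print(f"{player}: {p:.2f}")
# Ronaldo gets the highest probability, Casillas the lowest - pass it forward, Zizou.
```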
Temperature
• What I am going to explain about temperature is not technically exact, but it will give you a gist of what it means.
• Temperature defines the randomness of the final answer.
• A good example would be comparing different players taking penalty kicks.
• Low temperature (close to 0) is like Jorginho - very predictable, always does the hop-skip and places it in a corner. Similarly, the model will consistently pick the most likely next word.
• Medium temperature (around 0.7) is like Robert Lewandowski - has a signature run-up but might mix things up occasionally. The model balances predictability with some creative variations.
• High temperature (above 1) is like Neymar - might do a Panenka, might fake out the keeper, might shoot a top bins rocket. Here, the model gets wild and creative, sometimes brilliantly, sometimes disastrously.
• Even higher temperature (~2) is Zidane against Buffon. Yeah, you know what I am talking about. That penalty is the ultimate level of randomness, because who in their right mind would try a crossbar Panenka against prime Buffon? As for the model, only god knows what it will predict.
• So, my Padawan, when you want consistent, reliable outputs, go low temperature.
• And when you are feeling creative, or want surprising outputs, crank that temperature up.
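Here is the same idea as a sketch: dividing the (made-up) scores by the temperature before the softmax makes the distribution sharper (low temperature, Jorginho) or flatter (high temperature, Zidane vs Buffon).

```python
import numpy as np

scores = np.array([3.0, 2.0, 0.5])        # the same hypothetical judgement scores
options = ["Ronaldo", "Figo", "Casillas"]

def softmax_with_temperature(scores, temperature):
    scaled = scores / temperature          # low T sharpens, high T flattens
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

for temperature in (0.1, 0.7, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, dict(zip(options, np.round(probs, 2))))
# Low temperature: almost always the top choice (Jorginho).
# High temperature: the other options get a real chance (anything can happen).
```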
Prompt Engineering
• Remember how Pep Guardiola gives extremely specific instructions to his players before and during every match?
• That's exactly what prompt engineering is.
• You're not changing the model (the player), you're giving it specific instructions on how to play the game.
• The better your instructions, the better the performance you'll get.
• Just like some players need more detailed instructions than others, different models need different levels of prompting.
• A world-class model like GPT-4, Grok 4, or Claude 3.7 Sonnet is like your Messi - give it a general direction and it'll figure things out.
• A smaller model might be more like your academy prospect - needs more guidance and explicit instructions.
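A quick, purely hypothetical example of the difference between a vague shout from the bench and the Pep treatment:

```python
# Two hypothetical prompts for the same task - same model, very different instructions.
vague_prompt = "Write about the 2010 World Cup final."

specific_prompt = (
    "You are a football journalist. Write a 150-word match report of the 2010 World Cup final. "
    "Mention the scoreline and Iniesta's winning goal, keep the tone dramatic but factual, "
    "and end with one sentence on what the win meant for Spain."
)
# The model (the player) is unchanged; the second set of instructions will usually
# get you a far more useful performance.
```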
Knowledge Cutoff
• Imagine you're a loyal Barcelona fan who went into a coma right after the 2015 Champions League final.
• Barca just completed their second treble, MSN was destroying everyone, life was good.
• Then you wake up in 2025 and someone tells you Messi is playing in Miami, Barcelona is broke AF, Ronaldo is in Saudi and there is this new 17 year old kid who is the new ray of sunshine for Barcelona.
• You would be like, what in the hell has happened since then?
• That's exactly what knowledge cutoff is for an LLM.
• These models are trained on data up to a certain date, and after that – shhhh.
• Ask GPT-4 about the 2022 World Cup final and it'll give you a blow-by-blow account of that Messi masterclass.
• Ask it about something more recent and it will either hallucinate or say it doesn't know what you are talking about.
• (Note: ChatGPT and some other AI interfaces have search capabilities so it can search for recent events. To test knowledge cutoff, try to use API playground of any LLM you like and keep search off.)
• It has a hard stop in its knowledge timeline, like a football fan who stopped watching after a specific season.
• So, if the model says something totally wrong about recent events, don't be too harsh. It's like asking someone in class 10 about the college syllabus.
So, that's it folks! I hope you understood and learned something here today. If any analogy confuses you, take that excerpt, stick it into an LLM, and try to understand it. Hopefully it will give you an analogy more suited to you.
May you be saved from LLM sycophancy.
🎤 🚮