How I Built My Own LLM from Scratch: A Student’s Deep Dive into AI

Table of contents
- LLM Basics: What Are They, and Why All the Hype?
- What Are Transformers, Really?
- The Stages of Building an LLM (as a Student!)
- LLM Data Preprocessing: Turning Text into Tensors
- Attention Mechanism: How LLMs Focus Like Humans
- Masked Multi-Head Attention: The Brain of the Decoder
- LLM Architecture: Putting It All Together
- Layer Normalization: Stabilizing Deep Learning
- GELU Activation: The Smooth Power Behind Transformers
- Shortcut Connections: Keeping Gradients Flowing Smoothly
- LLM Pretraining Loop
- Decoding Strategies: How the Model Generates Text
- Fine-Tuning: Teaching My LLM to Do Specific Tasks
- Evaluation: How Well Did My LLM Perform?
- Conclusion & Future Scope: A Student’s Takeaway from Building an LLM
I come from an Electrical and Electronics Engineering background—but somewhere in my second year of college, I stumbled into a different world: computer science. What started as a side interest soon turned into a full-blown passion. I taught myself the fundamentals—data structures, algorithms, operating systems, networking. By my third year, I was deep into web development, building projects, learning new frameworks, and soaking up as much as I could.
Whenever I had the chance, I chose computer science electives. I was always drawn to how things worked under the hood.
Then came the question:
How do models like ChatGPT actually work?
Not just on the surface level—but really, from the inside out. How does text get converted into numbers? How does it “understand” language? How does it generate answers that feel so human?
During my final semester, I came across a YouTube channel called Vizuara. That’s where everything clicked. I started learning the basics of machine learning, and eventually discovered their “Build LLMs from Scratch” playlist. The name alone was enough to get me excited. I didn’t want to just use AI—I wanted to build it.
So, I decided to take on the challenge:
Build my own mini LLM from scratch.
This blog is a deep dive into that journey. From understanding the theory to implementing the model, fine-tuning it, and even evaluating it on real-world tasks—I've documented it all here.
LLM Basics: What Are They, and Why All the Hype?
What are LLMs?
LLMs, or Large Language Models, are a type of artificial intelligence designed to understand and generate human language. They work by predicting the next word in a sentence—one word at a time. That might sound simple, but when scaled up with billions of parameters and trained on massive datasets, these models can write essays, answer questions, translate languages, summarize articles, write code, and even generate poetry.
At their core, LLMs are built on deep learning models called transformers—we’ll get into those soon. But for now, think of an LLM as a supercharged auto-complete on steroids.
Why Are They Called “Large”?
The word “large” here refers to two things:
Model Size: These models often have millions or billions of parameters. (GPT-3, for example, has 175 billion!)
Training Data: They're trained on vast amounts of text from the internet—books, articles, websites, social media, etc.
This scale is what gives them their power: the ability to generalize, understand nuanced language, and perform tasks without being explicitly programmed for them.
LLMs vs Earlier NLP Models
Before LLMs, NLP (Natural Language Processing) models used rule-based systems, bag-of-words, or simpler architectures like RNNs and LSTMs. These worked fine for basic tasks but struggled with long-term dependencies and context.
LLMs changed the game.
Thanks to the transformer architecture, they can process an entire sentence—or even a whole paragraph—in parallel. They understand the context of words in a way older models couldn’t. That’s why the jump from traditional NLP to LLMs feels like going from Nokia phones to smartphones.
The Secret Sauce: Transformers
If there’s one innovation responsible for the LLM revolution, it’s the transformer.
Introduced in the 2017 paper “Attention is All You Need”, transformers ditched recurrence entirely and focused on something called self-attention. Instead of processing text word by word like RNNs, transformers can attend to all words at once, understanding how each word relates to the others—no matter their distance in the sentence.
This architecture made it possible to scale models to previously unimaginable sizes and capabilities. We’ll dive deeper into how transformers work in the next section.
So… Why All the Hype?
Because LLMs feel almost magical. You give them a prompt, and they generate responses that seem thoughtful, coherent, and human-like. They can:
Write code from scratch
Help debug errors
Summarize long documents
Translate languages
Hold conversations
Even write entire blog posts 😄
But what’s more impressive is that they can do this across tasks they were never explicitly trained on. That’s the power of generalization—something AI researchers have been chasing for decades.
Coming Up Next: We’ll demystify the transformer architecture—what’s actually going on under the hood of these powerful models.
What Are Transformers, Really?
Before I started building my own LLM, I had heard the word “transformer” tossed around constantly. Every YouTube tutorial, blog, or research paper seemed to echo the same phrase: “Transformers changed everything.” But what are they? And why are they at the heart of every modern LLM?
Let’s break it down.
Why Transformers?
Previously, NLP models relied on RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory networks). While they were decent at handling sequences, they processed data one step at a time, making them slow and inefficient—especially for long texts.
Transformers, introduced in the landmark 2017 paper “Attention is All You Need”, solved this by doing everything in parallel. Instead of reading sentences word-by-word, they look at the entire sentence at once and use a mechanism called attention to figure out which words matter most to each other.
Core Idea: Self-Attention
Self-attention is the heart of a transformer. Suppose you’re reading the sentence:
“The cat sat on the mat because it was warm.”
What does “it” refer to? The mat, not the cat. A self-attention mechanism helps the model understand these relationships by weighing each word’s relevance to others.
Transformer Building Blocks
Here’s what makes up a transformer model:
Input Embeddings – Words are converted into vectors.
Positional Encodings – Since transformers don’t read word-by-word, we add information about position.
Self-Attention Layers – Each word attends to others based on relevance.
Feedforward Networks – After attention, each vector is passed through a neural network.
Normalization & Residual Connections – Help with faster training and gradient flow.
Stacking Layers – All of the above are repeated multiple times (called "layers") to build depth.
This design turns out to be incredibly powerful—capable of understanding complex language patterns, grammar, and meaning.
Encoders, Decoders, and LLMs
Transformers have two parts:
Encoders: Take input and understand its context (used in models like BERT).
Decoders: Generate output word-by-word (used in GPT-like models).
Encoder–Decoder: Used in translation tasks (like T5 or original Transformer model).
Most LLMs (like GPT) are built using decoder-only transformers, trained to predict the next word given the previous ones.
Why Transformers Matter in My Journey
Understanding transformers was the turning point for me. Once I saw how attention works and how everything fits together, the architecture stopped being a black box—and that’s when I really got excited to build one myself.
Transformer Architecture
🔍 Next Up: The stages of building an LLM—from gathering data to training your own model.
The Stages of Building an LLM (as a Student!)
Once I understood the magic behind transformers, I was eager to build one myself. But let me tell you—it’s not just about throwing data at a neural network and hoping for the best. Building an LLM involves a series of well-defined (and sometimes frustrating) stages. Here's how I approached it step-by-step.
1. Data Collection
Every LLM starts with one key ingredient: text. The model learns by reading tons of it.
Since I didn’t have the compute power or resources to train a large model from scratch, I used the open-sourced GPT-2 weights as a starting point. This gave me a solid foundation to work with while still allowing room for customization and experimentation.
For fine-tuning, I turned to a GitHub repository by rasbt (a well-known ML educator). It provided a clean and flexible setup to finetune transformer models on custom datasets—which was perfect for my use case.
2. Preprocessing the Data
This includes tokenization (breaking text into smaller units) and converting them into numbers (token IDs) the model can understand. We’ll dig into tokenization soon, but trust me—this part is more important than it looks.
3. Defining the Architecture
Even though I used GPT-2 weights, I made sure to understand and experiment with the architecture. That meant diving deep into:
Embedding layers
Attention mechanisms
Layer normalization
Activation functions like GELU
Output heads for classification tasks
I used PyTorch to explore and tinker with each of these components manually before moving to higher-level abstractions.
4. Training Loop
Using the rasbt repo and PyTorch under the hood, I set up a robust training loop to process input batches, compute loss, backpropagate gradients, and update weights.
I experimented with various optimizers, learning rates, and batch sizes to get the best performance. Watching the loss curve drop over time? Pure dopamine.
5. Evaluation and Fine-Tuning
I fine-tuned the model on two tasks:
An Email Spam Classifier, achieving 96.73% average accuracy
A Basic Personal Assistant, scoring an average of 52.26 across evaluation prompts
These results gave me valuable insight into how pre-trained models can be adapted to very different domains.
6. Experimentation
Nothing in machine learning goes as planned on the first try. I experimented with:
Different tokenizers
Varying model depths and widths
Attention heads
Fine-tuning strategies
Prompt formatting
Every failure taught me something new, and I kept refining until the pieces started to click.
This was the student-friendly roadmap I followed: start with a solid base, understand every component deeply, and iterate like crazy. The rest of the blog dives deeper into how each of these parts actually works.
📌 Next Up: Let’s break down how raw text is transformed into numerical input the model can learn from—starting with tokenization.
LLM Data Preprocessing: Turning Text into Tensors
Before a language model can learn anything, it needs to understand text the only way computers know how — as numbers. This stage is all about converting human-readable input into something the model can process and learn from.
Here’s how that pipeline works:
1. Tokenization: Breaking Text into Chunks
Raw text is messy. Tokenization is the process of splitting it into smaller units called tokens. These can be:
Words (e.g., "hello", "world")
Subwords (e.g., "play", "##ing")
Characters (e.g., "h", "e", "l", "l", "o")
I used Byte Pair Encoding (BPE), which is what GPT-2 uses. It starts with characters and merges frequent pairs until it creates a vocabulary of useful subwords.
Example:
Input: "Machine Learning is fun"
Tokens: ["Machine", "Learning", "is", "fun"]
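To see this in practice, here is a minimal sketch using the tiktoken library, which implements the same BPE scheme GPT-2 uses (this is just an illustration of the idea, not necessarily the exact tokenizer setup in my code):

```python
import tiktoken

# GPT-2's byte pair encoding tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

text = "Machine Learning is fun"
token_ids = tokenizer.encode(text)                       # list of integer token IDs
tokens = [tokenizer.decode([tid]) for tid in token_ids]  # the subword pieces

print(token_ids)  # the numbers the model actually sees
print(tokens)     # note: BPE keeps leading spaces, e.g. ' Learning'
```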
2. Token Embeddings: Giving Tokens Meaning
Each token is mapped to a dense vector — a bunch of numbers that the model can use. This is the embedding. These vectors carry semantic meaning (e.g., "king" and "queen" will have similar vectors).
At this point, each sentence becomes a matrix of numbers — a 2D tensor that flows into the model.
3. Positional Embeddings: Giving Tokens Order
Transformers don’t have a built-in sense of word order. They see input all at once. That’s why we inject positional embeddings — vectors that indicate each token’s position in the sequence.
So, the model knows that in “The cat sat”, “The” came first and “sat” came last.
4. Input Embeddings = Token + Positional
These two embeddings are added together to form the final input embedding:
InputEmbedding = TokenEmbedding + PositionalEmbedding
This resulting vector is what gets fed into the model for training or prediction.
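Here is a minimal PyTorch sketch of this step. The vocabulary size, context length, and embedding dimension are GPT-2-small-like values chosen purely for illustration:

```python
import torch
import torch.nn as nn

vocab_size, context_len, emb_dim = 50257, 1024, 768  # GPT-2-small-like sizes

token_emb = nn.Embedding(vocab_size, emb_dim)   # one learnable vector per token ID
pos_emb = nn.Embedding(context_len, emb_dim)    # one learnable vector per position

token_ids = torch.tensor([[464, 3797, 3332]])   # a dummy batch with 3 token IDs
positions = torch.arange(token_ids.shape[1])    # [0, 1, 2]

# InputEmbedding = TokenEmbedding + PositionalEmbedding
x = token_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 3, 768])
```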
This preprocessing step is critical. If you mess up here, the model’s learning will be garbage — like trying to read a book where all the letters are jumbled and out of order.
📌 Next Up: We’ll get into the secret sauce of LLMs — the attention mechanism that helps the model decide which words to focus on.
Attention Mechanism: How LLMs Focus Like Humans
When we read a sentence, our brain doesn’t treat all words equally — it pays attention to the important ones based on context. That’s exactly what the attention mechanism helps LLMs do.
Let’s break down four key types of attention, building up step by step.
1. Simplified Self-Attention: The Core Idea
At its core, attention helps the model weigh how much each word should focus on every other word in the sequence.
Imagine the sentence:
“The cat sat on the mat.”
When processing the word “sat,” attention helps the model focus more on “cat” (subject) and “mat” (location), and less on “the.”
2. Self-Attention: Every Word Looks at Every Word
In self-attention, each token generates three vectors:
Query (Q): What this word is looking for.
Key (K): What each word has to offer.
Value (V): The actual information.
The model computes:
$$\text{Attention Score} = \frac{Q \cdot K^T}{\sqrt{d}}$$
This score determines how much attention to pay to each word. It’s then multiplied with the Value to get a weighted sum of all words — this becomes the new representation for the word.
Every word is re-represented in terms of its relationships to other words.
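As a rough sketch of that computation in PyTorch (a single attention head, no masking yet, with dimensions picked arbitrarily for the example):

```python
import torch
import torch.nn as nn

d_in, d_out = 768, 64                  # embedding size and per-head size (illustrative)
W_q = nn.Linear(d_in, d_out, bias=False)
W_k = nn.Linear(d_in, d_out, bias=False)
W_v = nn.Linear(d_in, d_out, bias=False)

x = torch.randn(1, 6, d_in)            # a batch of 1 sentence with 6 tokens

Q, K, V = W_q(x), W_k(x), W_v(x)                  # project into query/key/value spaces
scores = Q @ K.transpose(-2, -1) / d_out ** 0.5   # Q · Kᵀ / √d
weights = torch.softmax(scores, dim=-1)           # attention weights, each row sums to 1
context = weights @ V                             # weighted sum of the values
print(context.shape)                              # torch.Size([1, 6, 64])
```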
3. Causal Attention: Preventing the Future Peek
For LLMs trained to generate text (like GPT), we must prevent the model from “seeing the future.”
Enter causal (or masked) attention: we mask out future tokens during training so the model can only look at the past and present tokens.
This enables auto-regressive generation, where the model predicts the next word one at a time.
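A tiny sketch of what that causal mask looks like in PyTorch, again with arbitrary sizes:

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)             # raw attention scores for 4 tokens

# Lower-triangular matrix: position i may attend to positions 0..i only
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~mask, float("-inf"))  # hide the future
weights = torch.softmax(scores, dim=-1)            # future positions get weight 0

print(weights)  # the upper triangle is exactly zero
```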
4. Multi-Head Attention: Seeing From Different Perspectives
Instead of doing one big attention calculation, we split it into multiple smaller "heads". Each head learns to focus on different relationships.
One head may focus on grammar.
Another might learn syntax.
Another might focus on long-range dependencies.
Then all heads are concatenated and linearly transformed.
This gives the model the ability to capture a rich variety of signals.
🧠 TL;DR
Self-attention tells the model what to focus on.
Causal attention ensures no cheating during training.
Multi-head attention enables diverse understanding.
📌 Next Up: Let’s explore Masked Multi-Head Attention — a combination that powers text generation in GPT-style models.
Masked Multi-Head Attention: The Brain of the Decoder
Masked Multi-Head Attention is the core engine behind language generation in models like GPT. It’s essentially a combination of the two concepts we just learned:
Causal (Masked) Attention
Multi-Head Attention
Here’s how it all comes together.
🔒 Why “Masked”?
In generative settings, we want the model to predict the next word without looking ahead. So, we mask out all future tokens during training using a triangular matrix (called a causal mask).
Example:
Input:  The   dog   chased   the   cat
Tokens: t1    t2    t3       t4    t5
At time step t3, the model can only see t1 and t2, not future tokens like t4. This forces the model to learn sequential prediction, the way humans write one word at a time.
🎯 Why “Multi-Head”?
Instead of computing a single attention map, the model creates multiple attention heads, each focusing on different parts of the sentence or types of relationships.
Each head:
Learns unique attention patterns
Works in parallel
Captures different semantic features
Then, we concatenate all the heads’ outputs and pass them through a linear layer.
⚙️ What Happens Internally?
Input tokens → split into multiple heads
For each head: Compute masked attention
Combine all the head outputs
Apply LayerNorm and residual connection (we’ll get into this soon!)
This is repeated multiple times across layers to form deep representations.
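Putting those steps together, here is a compact sketch of a masked multi-head attention module in PyTorch. It follows the general GPT-style pattern; the class and variable names are illustrative rather than taken from any specific implementation:

```python
import torch
import torch.nn as nn

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, num_heads=12, context_len=1024):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)      # project to Q, K, V in one go
        self.out_proj = nn.Linear(d_model, d_model)     # combine heads afterwards
        mask = torch.tril(torch.ones(context_len, context_len)).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, tokens, head_dim)
        q, k, v = [z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v)]
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(~self.mask[:t, :t], float("-inf"))  # causal mask
        weights = torch.softmax(scores, dim=-1)
        context = (weights @ v).transpose(1, 2).reshape(b, t, d)        # merge heads
        return self.out_proj(context)

x = torch.randn(2, 10, 768)
print(MaskedMultiHeadAttention()(x).shape)  # torch.Size([2, 10, 768])
```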
🧠 Why It Matters
Masked Multi-Head Attention lets the model:
Generate meaningful sequences
Focus on context accurately
Scale efficiently with depth
Without this, LLMs wouldn’t be able to write coherent essays, poems, or even this blog you’re reading.
📌 Next Up: We’ll zoom out and look at the LLM architecture — how all the components like attention, embeddings, and layers are stitched together to build today’s powerful AI.
LLM Architecture: Putting It All Together
Once I understood the nuts and bolts like token embeddings and attention, it was time to architect the full model. It turns out, despite how complex it seems on the surface, a transformer-based LLM is beautifully modular.
🧱 The Building Blocks
Here’s a simplified view of what a decoder-only transformer (like GPT) consists of:
1. Input Embeddings
   - Sum of token embeddings and positional embeddings
   - Represents the input text numerically
2. Stack of Transformer Decoder Blocks, where each block contains:
   - Masked Multi-Head Attention
   - Residual Connections
   - Layer Normalization
   - Feedforward Neural Network (FFN) with GELU activation
3. Final Linear Layer + Softmax
   - Converts the final hidden state into probabilities over the vocabulary to predict the next token
🧩 Architecture Diagram (Simplified)
Each decoder block is identical in structure but with separate weights. The number of layers (blocks), heads, and hidden size are what make a model “small” like GPT-2 or “large” like GPT-3.
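For reference, GPT-2 "small" (the size whose weights I started from) roughly corresponds to a configuration like this; the dictionary is just an illustrative way of writing it down:

```python
GPT2_SMALL_CONFIG = {
    "vocab_size": 50257,     # BPE vocabulary size
    "context_length": 1024,  # maximum number of tokens the model attends over
    "emb_dim": 768,          # hidden / embedding size
    "n_heads": 12,           # attention heads per block
    "n_layers": 12,          # stacked decoder blocks
    "drop_rate": 0.1,        # dropout used during training
}
```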
🔧 My Implementation Choices
In my implementation:
I used open-sourced GPT-2 weights for initial architecture reference.
Number of heads, layers, and embedding dimensions were kept manageable so I could train on a local GPU setup.
💡 Key Insight
The architecture’s strength lies in its depth and parallelism. While earlier NLP models were often shallow and sequential, transformers leverage stacked layers and attention heads to model complex patterns in language.
📌 Next Up: We’ll break down Layer Normalization, a tiny component with a massive impact on training stability and performance.
Layer Normalization: Stabilizing Deep Learning
When building deep models like transformers, one silent killer of performance is unstable gradients. That’s where Layer Normalization (LayerNorm) comes in—a subtle but powerful technique that helps ensure our model trains smoothly and efficiently.
🧠 What Is LayerNorm?
Layer normalization standardizes the inputs across the features (hidden dimensions) of each token, not across the batch. For a given token embedding, it ensures that the values going into each layer have:
Mean = 0
Standard deviation = 1
This helps:
Reduce internal covariate shift
Speed up convergence
Make training more stable—especially important for deep models like LLMs
⚙️ How It Works
Given an input vector $x \in \mathbb{R}^d$, layer normalization computes:
$$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$
Where:
μ is the mean of the elements in x
σ is the standard deviation
γ and β are learnable parameters (scale and shift)
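A from-scratch version is only a few lines in PyTorch; this sketch simply mirrors the formula above (PyTorch also ships a ready-made nn.LayerNorm that does the same thing):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(emb_dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(emb_dim))   # learnable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)              # per-token mean over features
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mean) / torch.sqrt(var + self.eps) * self.gamma + self.beta

x = torch.randn(2, 5, 768)
print(LayerNorm(768)(x).shape)  # torch.Size([2, 5, 768])
```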
🔍 Why Not BatchNorm?
Unlike BatchNorm, which normalizes across the batch dimension, LayerNorm is independent of batch size. This is crucial for NLP tasks where batch sizes might be small or variable and where sequence length plays a more central role.
🧪 In My LLM
I applied LayerNorm:
Before the attention sub-layer (the pre-norm arrangement GPT-2 uses)
Before the feedforward (FFN) sub-layer
This ordering helps stabilize the residual connections and overall flow of gradients during backpropagation.
💡 Fun Fact
Almost all modern LLMs—GPT-2, GPT-3, PaLM, LLaMA—rely on LayerNorm. It’s one of those “set it and forget it” ingredients that quietly powers the whole system.
📌 Next up: Let’s dive into the GELU Activation Function—a quirky but powerful non-linearity used in transformer blocks.
GELU Activation: The Smooth Power Behind Transformers
While classic neural networks often use ReLU or tanh, transformer-based LLMs like GPT and BERT use something a bit more exotic: GELU, or the Gaussian Error Linear Unit.
It might sound fancy, but it plays a key role in helping transformers learn better.
⚙️ What Is GELU?
GELU is an activation function that applies a smooth, probabilistic approach to passing values through a neuron. Instead of deciding a hard threshold like ReLU, GELU weighs inputs by their probability of being meaningful.
The formula:
$$GELU(x)= x \cdot \Phi(x)$$
Where Φ(x) is the cumulative distribution function (CDF) of the standard normal distribution.
In practice, a commonly used approximation is:
$$GELU(x) \approx 0.5x \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \left(x + 0.044715x^3\right)\right]\right)$$
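In code, the tanh approximation above looks like this (PyTorch's built-in nn.GELU covers both the exact and approximate forms; this sketch just makes the formula concrete):

```python
import math
import torch
import torch.nn as nn

class GELU(nn.Module):
    def forward(self, x):
        # Tanh approximation of x * Φ(x)
        return 0.5 * x * (1.0 + torch.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        ))

x = torch.linspace(-3, 3, 7)
print(GELU()(x))  # small negative inputs are damped rather than zeroed out
```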
🤔 Why GELU?
Compared to ReLU:
Smoother: It doesn't introduce sharp cutoffs like ReLU (which zeroes out negatives).
Better for small values: It allows small negative inputs to still pass through (weighted).
Works well in practice: Empirically shown to improve performance in deep transformer models.
🚀 In My LLM
I used GELU inside the feed-forward network (FFN) layers, just after the first linear projection. It helped stabilize training and enabled better gradient flow compared to simpler activations.
🔬 Where Else Is It Used?
You’ll find GELU in almost every modern transformer:
BERT
GPT-2/3/4
T5
LLaMA
PaLM
It’s now the default in most NLP architectures.
📌 Up Next: Shortcut Connections – how they keep our LLM from getting lost in depth.
Shortcut Connections: Keeping Gradients Flowing Smoothly
As transformer models grow deeper, one big challenge arises: vanishing gradients. The deeper the network, the harder it is for gradients to flow back during training. That’s where shortcut connections, also known as residual connections, come in.
🔄 What Are Shortcut (Residual) Connections?
These are simple yet powerful additions. Instead of just passing data through layers sequentially, a residual connection adds the input of a layer directly to its output, like this:
$$Output=Layer(x)+x$$
This “skip” allows information and gradients to flow freely, reducing training issues.
🧠 Why Are They Important?
Prevent vanishing gradients in deep networks.
Preserve information from earlier layers.
Allow much deeper architectures to be trained effectively.
Originally popularized by ResNet in computer vision, they became a core part of the transformer architecture.
🛠️ In My LLM
I used residual connections in two main places inside each transformer block:
Around the multi-head attention block
Around the feed-forward network
Each block looks like this:
x = x + SelfAttention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))
This structure helped me train deeper models more efficiently.
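Here is a self-contained sketch of that block structure in PyTorch. It leans on the built-in nn.MultiheadAttention and nn.LayerNorm rather than the hand-rolled versions above, but the residual wiring is the same two-shortcut pattern:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        t = x.shape[1]
        # True above the diagonal = future positions are masked out
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)  # masked self-attention
        x = x + attn_out                                    # shortcut 1
        x = x + self.ffn(self.ln2(x))                       # shortcut 2
        return x

x = torch.randn(2, 10, 768)
print(TransformerBlock()(x).shape)  # torch.Size([2, 10, 768])
```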
⚡ Fun Insight
Without residuals, transformer models would struggle to learn from deep layers. With them, training becomes faster and more stable—essential when working with millions (or billions) of parameters.
📌 Next Up: Decoding Strategies – how our LLM decides what to say next using Temperature Scaling and Top-k Sampling.
LLM Pretraining Loop
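The pretraining loop itself is conceptually simple: feed in a batch of token IDs, ask the model to predict every next token, measure the error with cross-entropy loss, backpropagate, and update the weights. A stripped-down sketch of such a loop in PyTorch (the model, dataloader, and hyperparameters here are placeholders, not my exact setup):

```python
import torch
import torch.nn as nn

def pretrain(model, dataloader, epochs=3, lr=3e-4, device="cpu"):
    """Minimal next-token-prediction training loop (illustrative only)."""
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        for input_ids, target_ids in dataloader:      # targets are inputs shifted by one token
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)

            logits = model(input_ids)                 # (batch, seq_len, vocab_size)
            loss = loss_fn(logits.flatten(0, 1),      # (batch * seq_len, vocab_size)
                           target_ids.flatten())      # (batch * seq_len,)

            optimizer.zero_grad()
            loss.backward()                           # backpropagate gradients
            optimizer.step()                          # update weights

        print(f"epoch {epoch + 1}: loss = {loss.item():.4f}")
```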
Decoding Strategies: How the Model Generates Text
Once the LLM has been trained, the next challenge is getting it to generate coherent, meaningful output. This is where decoding strategies come in—techniques that control how the model selects the next word or token in a sequence.
🔥 1. Temperature Scaling
Temperature controls how confident or creative the model is during generation. The idea is simple: before picking the next word, the model predicts a probability distribution over the vocabulary. Temperature modifies that distribution:
$$P_i = \frac{\exp(\frac{z_i}{T})}{\sum_j \exp(\frac{z_j}{T})}$$
Low temperature (T < 1): More confident and deterministic (picks the most likely tokens).
High temperature (T > 1): More random and diverse outputs.
In my model:
T = 0.7 gave balanced results, reducing repetition while staying coherent.
T = 1.0 added a bit more variety when I needed creativity.
🎯 2. Top-k Sampling
Rather than sampling from the entire vocabulary, Top-k sampling narrows the choice down to the top k most probable tokens, then randomly picks from them.
k = 50: The model selects from the 50 most likely words.
This reduces the chance of picking low-probability (and often nonsensical) words.
Combining Temperature + Top-k gave the best results in my testing.
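A minimal sketch of how temperature and top-k combine when sampling the next token. The logits would come from the model's final layer; the default T and k below are just the values discussed above:

```python
import torch

def sample_next_token(logits, temperature=0.7, top_k=50):
    """Pick the next token ID from a 1-D tensor of vocabulary logits."""
    # Keep only the k most likely tokens; everything else gets -inf
    top_logits, top_ids = torch.topk(logits, top_k)
    filtered = torch.full_like(logits, float("-inf"))
    filtered[top_ids] = top_logits

    # Temperature scaling: T < 1 sharpens the distribution, T > 1 flattens it
    probs = torch.softmax(filtered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(50257)          # fake logits over a GPT-2-sized vocabulary
print(sample_next_token(logits))     # a single sampled token ID
```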
🧠 Why This Matters
Without smart decoding, even the best-trained model can:
Repeat itself.
Output gibberish.
Or just be plain boring.
These strategies give control over how the model "talks"—whether you want safe and reliable answers or something more diverse and surprising.
🧪 My Results
During my experiments with the email spam classifier and the assistant model:
Lower temperature with top-k = 40–50 gave the most human-like and context-aware completions.
Higher temperatures worked better when I wanted brainstorming-style results.
📌 Next Up: Fine-Tuning – How I taught the model specific tasks like spam detection and following instructions.
Fine-Tuning: Teaching My LLM to Do Specific Tasks
Once the base model is pretrained, it becomes a versatile but general-purpose tool. To make it actually useful—like classifying spam or answering questions—we need to fine-tune it on specific datasets.
In my project, I explored two fine-tuning paths: Instruction Fine-Tuning and Classification Fine-Tuning.
📘 1. Instruction Fine-Tuning
Inspired by models like FLAN-T5 and OpenAssistant (Also Jarvis 😆 ), instruction fine-tuning trains the model to better understand and follow human-written instructions.
Dataset format example:
Instruction: Translate this sentence to French.
Input: Hello, how are you?
Output: Bonjour, comment ça va ?
By feeding in lots of these prompt-response examples, the model learns how to:
Follow directions
Generate answers in a structured way
Perform multi-step reasoning (to some extent)
While I didn’t build a massive instruction-tuned dataset, I tested this using smaller instruction-style prompts—especially in the assistant module I developed later.
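For these instruction-style prompts, each example gets formatted into a single text string before tokenization. The template below is just an illustration of the common Alpaca-style layout, not necessarily the exact one in my final code:

```python
def format_prompt(instruction, input_text=""):
    """Turn an instruction/input pair into a single prompt string (illustrative template)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n"
    )
    if input_text:
        prompt += f"\n### Input:\n{input_text}\n"
    prompt += "\n### Response:\n"
    return prompt

print(format_prompt("Translate this sentence to French.", "Hello, how are you?"))
```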
📨 2. Classification Fine-Tuning – Email Spam Classifier
This is where I saw the strongest results.
I fine-tuned my model to classify emails as spam or not spam using a clean labeled dataset and a simple binary classification head.
Steps:
Feed tokenized emails into the model.
Modify the architecture by adding a classification head (with just two output units, one for spam and one for not spam).
Select the layers to finetune.
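In code, the modification is small. This sketch shows the idea of swapping GPT-2's vocabulary-sized output layer for a two-class head and freezing most of the network; the attribute names (out_head, blocks, final_norm) belong to a hypothetical model class, not a specific library:

```python
import torch.nn as nn

def add_classification_head(model, emb_dim=768, num_classes=2):
    """Adapt a GPT-style model for spam / not-spam classification (illustrative)."""
    # Freeze the pretrained weights so only selected parts are fine-tuned
    for param in model.parameters():
        param.requires_grad = False

    # Replace the vocabulary-sized output layer with a 2-way classifier
    model.out_head = nn.Linear(emb_dim, num_classes)

    # Optionally unfreeze the last transformer block and final LayerNorm as well
    for param in model.blocks[-1].parameters():
        param.requires_grad = True
    for param in model.final_norm.parameters():
        param.requires_grad = True
    return model
```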
🧠 Key Takeaway
Pretraining teaches the model how language works.
Fine-tuning teaches the model what you want it to do.
You can build very targeted applications by just adding the right dataset and training logic on top of your LLM foundation.
📌 Next Up: Evaluating the model—how well did it actually perform, and how do we measure it?
Evaluation: How Well Did My LLM Perform?
After all the effort in pretraining and fine-tuning, it was time to see if the model could actually deliver. Evaluation is where theory meets reality—and I was both nervous and excited.
📊 1. Email Spam Classifier Results
The email spam classifier performed surprisingly well, especially considering the model’s small size and limited compute.
✅ Average Accuracy: 96.73%
📉 Loss: Consistently decreased across epochs
🧪 Test Set Generalization: Strong performance even with unseen spam types
🔍 Accuracy vs Epochs
The model steadily improved and plateaued at high accuracy—always a good sign.
📉 Loss vs Epochs
Clean downward trend, indicating healthy learning.
🧠 Loss vs Tokens Seen
This graph visualized how the model's understanding of language improved with more exposure to data.
💬 2. Personal Assistant Results
I also fine-tuned the model to act as a lightweight personal assistant, capable of answering basic prompts and responding to instruction-like inputs.
📈 Average Score (Evaluation Metric): 52.26
(Based on a small custom benchmark I built for evaluation)
While it’s not ChatGPT (yet 😄), it handled:
Simple Q&A
Date and time questions
Factual prompts
Some follow-up reasoning
🧠 Reflection
These evaluations confirmed two things:
A small, well-designed LLM can be surprisingly capable.
With the right data and objectives, you can specialize even small models for real-world use cases.
📌 Next Up: Wrapping it all up—my takeaways, future goals, and final thoughts.
Conclusion & Future Scope: A Student’s Takeaway from Building an LLM
Building my own Large Language Model from scratch has been one of the most challenging and rewarding projects of my academic journey. Coming from an Electrical and Electronics Engineering background, I never imagined I would one day be writing code that trains neural networks, implements attention mechanisms, and fine-tunes models to classify spam emails with over 96% accuracy.
Along the way, I got a hands-on understanding of how neural networks actually work, not just in theory but in practice. I learned how data flows through layers, how losses are computed and minimized, and most exciting of all—how attention helps models “focus” on the right parts of a sentence. Honestly, the first time I truly understood the attention mechanism, it blew my mind.
Through countless hours of debugging, graph plotting, model crashing, and small wins, I ended up with a working LLM that can not only classify emails but also serve as a basic personal assistant with a respectable average score of 52.26 on a custom benchmark.
🚀 What’s Next?
This journey has only sparked more curiosity. For my next step, I want to train and fine-tune LLMs in my regional languages—Odia and Sambalpuri. There's a lack of AI models for underrepresented languages, and I believe this could be my contribution to the open-source and research community.
There’s still so much to explore—model optimization, instruction tuning at scale, and maybe even building an LLM from scratch without pre-trained weights someday.
But for now, I’m proud of what I’ve built, what I’ve learned, and the doors this experience has opened.