Understanding Reinforcement Learning and Distillation: A Practical C# Example


A lot of heated discussion has been going around regarding DeepSeek-R1 lately. Instead of getting caught up in it, I chose to focus on the underlying technology. I wrote an article about it last week called DeepSeek-R1: A Primer, explaining the methodology they used based on their published paper.
A lot of talk has also focused on distillation and Reinforcement Learning (RL), so I thought it would be a good idea to give examples of how they work. The sample code is pretty basic, but it needs to be “bare metal” so we can filter out the noise and focus on the interesting bits.
On the surface, RL and model distillation sound pretty intimidating. Yet, when we break these concepts down into something more digestible, we will see that they really are not.
Before we start, let’s have a bit of a refresher and talk about RL and distillation.
RL is like training a puppy - you reward good behavior, and over time the puppy learns which actions lead to treats and wants more of them. Speaking in code terms, instead of having a puppy, we have an agent - still after the treats.
When you are trying to learn something, it is much better to have an experienced teacher or a mentor who can show you the ropes rather than going at it on your own, right? It cuts down the learning time tremendously (and therefore makes it cheaper) and is much more fun. That’s what distillation is: a model having an experienced teacher.
The App
I've put together a demo that shows these concepts in action. This is what we are going to build:
I've kept it simple but functional. No fancy neural networks or complex math, just the core ideas in simple, straightforward C#.
Our little application has only 5 files, so let’s take a look at them and see what they do. (Note: Do not worry about the LearningSummary.cs file for now - we’ll come back to that one.)
RLEnvironment.cs
This is our playground. Since we are not dealing with a ready-made environment, aka the model, we are making one up ourselves. In a way, this is our super simple “world” where the learning happens.
public class RLEnvironment
{
    private int currentState;
    private bool isDone;
    private double reward;

    public void SetState(int state)
    {
        currentState = Math.Max(0, Math.Min(10, state));
    }

    public (int state, double reward, bool done) Step(int action)
    {
        if (action == 1) currentState++;
        else if (action == 0) currentState--;
        currentState = Math.Max(0, Math.Min(10, currentState));
        reward = currentState == 5 ? 1.0 : -0.1;
        isDone = currentState == 5;
        return (currentState, reward, isDone);
    }

    public void Reset()
    {
        currentState = 0;
        reward = 0;
        isDone = false;
    }

    public string Visualize()
    {
        var sb = new StringBuilder();
        sb.Append('[');
        for (int i = 0; i <= 10; i++)
        {
            if (i == currentState) sb.Append('A');
            else if (i == 5) sb.Append('G');
            else sb.Append('_');
        }
        sb.Append(']');
        return sb.ToString();
    }
}
This is a simple 1D world with 11 positions (0 to 10). Position 5 is our goal (G), and our agent (A) can move either left (0) or right (1). Our agent gets a reward of +1 for reaching the goal, and -0.1 for any other move.
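To get a feel for it, here is a tiny throwaway snippet that drives the environment by hand. It is just for illustration and is not part of the application:

// Drive the environment by hand - illustration only, not part of the app.
var env = new RLEnvironment();
env.Reset();
env.SetState(2);                          // put the agent at position 2
var (state, reward, done) = env.Step(1);  // move right -> position 3
Console.WriteLine(env.Visualize());       // [___A_G_____]
Console.WriteLine($"{state}, {reward}, {done}"); // 3, -0.1, False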
RLAgent.cs
This is our agent who is tasked with learning.
public class RLAgent
{
    private readonly Dictionary<int, double[]> qTable = new();
    private readonly double learningRate = 0.1; // How much the agent learns from new information, i.e. how fast it updates its knowledge
    private readonly double gamma = 0.99;       // Discount factor: how much less valuable future rewards are
    private readonly double epsilon = 0.1;      // Exploration rate: 10% of the time we will explore
    private readonly Random random = new();
    private int successfulEpisodes = 0;

    public int GetAction(int state)
    {
        // Exploration: Sometimes try random actions to discover new strategies
        if (random.NextDouble() < epsilon)
            return random.Next(2);

        // If we haven't seen this state, initialize it
        if (!qTable.ContainsKey(state))
            qTable[state] = new double[2];

        // Exploitation: Choose the action with the highest expected reward
        return qTable[state].ToList().IndexOf(qTable[state].Max());
    }

    public void Learn(int state, int action, double reward, int nextState)
    {
        if (!qTable.ContainsKey(state))
            qTable[state] = new double[2];
        if (!qTable.ContainsKey(nextState))
            qTable[nextState] = new double[2];

        var oldValue = qTable[state][action];
        var nextMax = qTable[nextState].Max();
        qTable[state][action] = oldValue + learningRate * (reward + gamma * nextMax - oldValue);
    }

    public void IncrementSuccess()
    {
        successfulEpisodes++;
    }

    public int GetSuccessCount()
    {
        return successfulEpisodes;
    }

    public string GetQTableVisualization(int state)
    {
        if (!qTable.ContainsKey(state))
            return "No data for this state";
        return $"Left: {qTable[state][0]:F2}, Right: {qTable[state][1]:F2}";
    }
}
Our agent code uses a Q-table (a dictionary) to store what it learns about each state. You can play with the epsilon, learning rate and gamma values to get the best result. If you are curious about what a Q-table is, please see the Light Reading section at the end of this part.
This class is an implementation of Q-Learning, a fundamental reinforcement learning algorithm.
The agent stores what it learns in the Q-table. Each state (key) maps to an array of two values representing the expected rewards for moving left (0) or right (1).
Then you have the learning parameters. You can play with those values and compare the results.
The GetAction(int state) method implements the epsilon-greedy strategy. This is a reinforcement learning technique that balances exploration and exploitation by choosing between them randomly. There is a good explanation of this method in Appendix A.
In our code, we try random actions 10% of the time and choose the best known action 90% of the time.
How this agent learns happens in the, well, Learn method.
public void Learn(int state, int action, double reward, int nextState)
{
    if (!qTable.ContainsKey(state))
        qTable[state] = new double[2];
    if (!qTable.ContainsKey(nextState))
        qTable[nextState] = new double[2];

    var oldValue = qTable[state][action];
    var nextMax = qTable[nextState].Max();
    qTable[state][action] = oldValue + learningRate * (reward + gamma * nextMax - oldValue);
}
We implement a Q-learning update formula here:
- Get the current Q-value for this state-action pair
- Look at the best possible future reward from the next state
- Update the Q-value using:
  - learningRate controls how much to trust new information
  - gamma weights future rewards
  - the difference between expected and actual outcomes drives learning
In a nutshell, our agent starts out knowing nothing (an empty Q-table), then tries both random and calculated actions and updates its knowledge based on the rewards it gets. Gradually it learns which actions work best and makes better decisions over time.
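To make the update concrete, here is a single update worked out by hand using the parameters above (learningRate = 0.1, gamma = 0.99). The starting Q-values of zero are just an assumption for illustration:

// One hand-worked Q-update, assuming the table starts at zero for illustration.
// The agent is in state 4, moves right (action 1), lands on the goal (state 5), reward = 1.0.
double learningRate = 0.1, gamma = 0.99;
double oldValue = 0.0;   // Q[4][1] before the update
double nextMax = 0.0;    // best known value in state 5 (still zero)
double reward = 1.0;     // reached the goal
double newValue = oldValue + learningRate * (reward + gamma * nextMax - oldValue);
Console.WriteLine(newValue); // 0.1 -- moving right from state 4 now looks slightly better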
This is a very simple environment but the same principles apply to more complex reinforcement learning problems.
Before we continue to see what the Distillation code does, let’s take a look at our Student class - we will compare the student who learns from the teacher with the one who goes at it alone.
Light Reading: Understanding Q-Tables
A Q-table is like a cheat sheet for our AI agent. Imagine you're playing a video game and keeping notes about which moves work best in different situations. That's basically what a Q-table is!
Let’s say you are keeping a diary of your daily commute:
When it is early morning, you can take the train and it gives you a stress-free trip (+10 points).
When it is raining, taking the bus might be more efficient (+15 points).
When it is rush hour, you can take the train, and riding a bike to the station helps you not stress about parking (+20 points).
If it is raining, you cannot walk or bike to the station, so you drive - and you either get stuck in traffic or, if you drive too fast, end up stuck in a body of water (-100 points).
Every time you (our AI agent) make a choice, you check your diary (Q-table) to see what worked well before. The neat thing is that you get smarter with experience. Maybe one day you’ll discover a shortcut that will help you sleep more! You’ll update your notes accordingly.
This is exactly how Q-tables work in AI - they're just bigger diaries that keep track of how good different actions are in different situations. The "Q" stands for "Quality" - like how good is this choice in these conditions?
The more our AI agent experiences different situations and tries different actions, the better its cheat sheet becomes, until it knows exactly what to do in any situation!
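If it helps to see the analogy in code, here is what that commute diary might look like as a miniature Q-table. The situations and point values are just the made-up ones from above, and MaxBy assumes .NET 6 or later:

// The commute diary as a miniature Q-table: situation -> (option -> score).
// The scores are the made-up points from the analogy, not learned values.
var commuteQTable = new Dictionary<string, Dictionary<string, double>>
{
    ["early morning"] = new() { ["train"] = 10 },
    ["raining"]       = new() { ["bus"] = 15, ["drive"] = -100 },
    ["rush hour"]     = new() { ["bike + train"] = 20 }
};

// Pick the best-known option for a situation, just like the agent does with its Q-table.
string bestOption = commuteQTable["raining"].MaxBy(kv => kv.Value).Key;
Console.WriteLine(bestOption); // bus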
Student.cs
Next is our Student class. We have two types of students - one that learns from the teacher and one that tries things on its own - so we can compare the results.
public class Student(bool isLearning)
{
    private readonly Dictionary<int, int> learnedMoves = new();
    private readonly Random random = new();

    public void Learn(int state, int teacherAction)
    {
        if (isLearning)
        {
            learnedMoves[state] = teacherAction;
        }
    }

    public int GetAction(int state)
    {
        if (isLearning && learnedMoves.ContainsKey(state))
        {
            return learnedMoves[state];
        }
        return random.Next(2);
    }

    public string GetType() => isLearning ? "Learning" : "Clueless";
}
This class is simple, especially compared to the RLAgent class. Our Student class demonstrates the concept of learning through imitation versus random behavior.
If you look at the Learn method, it is straightforward imitation learning. If this is a learning student, it just stores what the teacher has done - in state X, do action Y, because my teacher did it.
In the GetAction method, the learning student uses memorized actions from the teacher when available, whereas the clueless student just picks an action at random.
This simpler implementation helps demonstrate how knowledge transfer (distillation) works.
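As a quick illustration of the difference (not part of the application itself), this is how the two students behave after seeing a single demonstration:

// Illustration only: the learning student copies the teacher, the clueless one guesses.
var learningStudent = new Student(true);
var cluelessStudent = new Student(false);

learningStudent.Learn(2, 1);   // the teacher moved right in state 2
cluelessStudent.Learn(2, 1);   // ignored, because isLearning is false

Console.WriteLine(learningStudent.GetAction(2)); // always 1 (right), copied from the teacher
Console.WriteLine(cluelessStudent.GetAction(2)); // 0 or 1, chosen at random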
Speaking of distillation…
Distillation.cs
This code is where knowledge transfer happens.
This code takes a trained agent who is our teacher. It then creates two students and tracks the success of both for comparison.
This code shows how a simpler agent can learn from a more sophisticated one through direct imitation and how much better this performs compared to random actions.
public class Distillation(RLAgent teacher)
{
    private readonly Student learningStudent = new(true);
    private readonly Student cluelessStudent = new(false);
    private readonly List<(double learning, double clueless)> successRates = new();
    private readonly Random random = new();

    public void Distill(int episodes)
    {
        var env = new RLEnvironment();
        Console.WriteLine("\n=== Starting Learning Comparison ===");
        for (var episode = 0; episode < episodes; episode++)
        {
            TrainEpisode(env, episode);
            var (learningSuccess, cluelessSuccess) = TestStudents(env);
            successRates.Add((learningSuccess, cluelessSuccess));
            ShowPerformanceMetrics(episode);
        }
        ShowFinalResults();
    }

    private void TrainEpisode(RLEnvironment env, int episode)
    {
        var startPos = random.Next(0, 11);
        Console.WriteLine($"\nEpisode {episode + 1} (Starting from position {startPos})");

        // Teacher demonstrates
        env.Reset();
        env.SetState(startPos);
        var state = startPos;
        var done = false;
        var steps = 0;

        Console.WriteLine("\nTeacher demonstrating:");
        while (!done && steps < 50)
        {
            var teacherAction = teacher.GetAction(state);
            learningStudent.Learn(state, teacherAction);
            var (nextState, _, isDone) = env.Step(teacherAction);
            Console.WriteLine($"{env.Visualize()} | Action: {(teacherAction == 0 ? "Left" : "Right")}");
            state = nextState;
            done = isDone;
            steps++;
            Thread.Sleep(25);
        }
    }

    private (double learningRate, double cluelessRate) TestStudents(RLEnvironment env)
    {
        int testEpisodes = 5;
        int learningSuccesses = 0, cluelessSuccesses = 0;
        for (int i = 0; i < testEpisodes; i++)
        {
            int startPos = random.Next(0, 11);
            if (RunTest(env, learningStudent, startPos)) learningSuccesses++;
            if (RunTest(env, cluelessStudent, startPos)) cluelessSuccesses++;
        }
        return ((double)learningSuccesses / testEpisodes, (double)cluelessSuccesses / testEpisodes);
    }

    private bool RunTest(RLEnvironment env, Student student, int startPos)
    {
        env.Reset();
        env.SetState(startPos);
        var state = startPos;
        var done = false;
        var steps = 0;
        var maxSteps = 20;
        while (!done && steps < maxSteps)
        {
            var action = student.GetAction(state);
            var (nextState, _, isDone) = env.Step(action);
            state = nextState;
            done = isDone;
            steps++;
        }
        return done;
    }

    private void ShowPerformanceMetrics(int episode)
    {
        var windowSize = Math.Min(5, successRates.Count);
        var recentRates = successRates.TakeLast(windowSize).ToList();
        var avgLearning = recentRates.Average(r => r.learning);
        var avgClueless = recentRates.Average(r => r.clueless);
        Console.WriteLine("\n=== Recent Performance (Last 5 Episodes) ===");
        Console.WriteLine($"Learning Student: {avgLearning:P2}");
        Console.WriteLine($"Clueless Student: {avgClueless:P2}");
    }

    private void ShowFinalResults()
    {
        var finalLearningRate = successRates.Average(r => r.learning);
        var finalCluelessRate = successRates.Average(r => r.clueless);
        Console.WriteLine("\n=== Final Results ===");
        Console.WriteLine($"Learning Student: {finalLearningRate:P2}");
        Console.WriteLine($"Clueless Student: {finalCluelessRate:P2}");
    }
}
Program.cs
Finally everything comes together in our main body of code.
public static void Main()
{
    var env = new RLEnvironment();
    var agent = new RLAgent();
    var summary = new LearningSummary();
    var totalEpisodes = 50;

    Console.WriteLine("Training the teacher agent...");
    for (var episode = 0; episode < totalEpisodes; episode++)
    {
        env.Reset();
        var state = 0;
        var done = false;
        var steps = 0;
        var totalReward = 0.0;

        // Show simple progress indicator
        Console.Write($"\rEpisode {episode + 1}/{totalEpisodes}");

        while (!done && steps < 50)
        {
            var action = agent.GetAction(state);
            var (nextState, reward, isDone) = env.Step(action);
            agent.Learn(state, action, reward, nextState);
            totalReward += reward;
            state = nextState;
            done = isDone;
            steps++;
        }

        if (done) agent.IncrementSuccess();
        var successRate = (double)agent.GetSuccessCount() / (episode + 1);
        summary.AddResult(episode + 1, successRate, totalReward, steps);
    }

    Console.Clear();
    Console.WriteLine("\nStarting knowledge distillation...");
    var distillation = new Distillation(agent);
    distillation.Distill(25);
    summary.DisplaySummaryChart();

    Console.WriteLine("\nPress any key to exit...");
    Console.ReadKey();
}
When we run this code we see how the teacher trains and then teaches the student, and how the clueless student also tries to solve the problem randomly. It goes by pretty fast; you can make it faster or slower by playing with the Thread.Sleep value (25 ms in the code above).
At the end, our application gives us the comparison results which should look something like this:
As you can see, our learning student has a higher success rate than the clueless student - this is how Reinforcement Learning and Distillation work.
Before we conclude this piece, remember the LearningSummary class I said we would come back to? OK, it is time.
LearningSummary.cs
There is an additional class in our application called LearningSummary. This code simply plots the learning data in an ASCII chart. The chart does not look pretty since this is a console application; however, it still does the job.
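The class itself is not listed here (it lives in the repository), but judging from how Program.cs calls it - AddResult and DisplaySummaryChart - a minimal sketch could look something like this. The actual chart formatting in the repo will differ:

// A minimal sketch of what LearningSummary might look like -- the real class in the
// repository may differ; only the AddResult/DisplaySummaryChart shape is taken from Program.cs.
public class LearningSummary
{
    private readonly List<(int episode, double successRate, double totalReward, int steps)> results = new();

    public void AddResult(int episode, double successRate, double totalReward, int steps)
    {
        results.Add((episode, successRate, totalReward, steps));
    }

    public void DisplaySummaryChart()
    {
        Console.WriteLine("\n=== Teacher Training Summary (success rate per episode) ===");
        foreach (var r in results)
        {
            // One row of '#' characters per episode, scaled to a 20-column width.
            var bar = new string('#', (int)(r.successRate * 20));
            Console.WriteLine($"Ep {r.episode,3} | {bar,-20} | {r.successRate:P0}");
        }
    }
}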
Caveats
As good an example as our little application is, there are a couple of caveats we need to mention.
We are running the risk of overfitting in our little app because the learning student is doing pure memorization without any generalization.
The student simply memorizes “In state X, do exactly action Y”, which means it can only handle states it has seen before and cannot adapt to different situations. It also has no understanding of why actions work.
This is problematic because the teacher actually makes decisions based on a more sophisticated Q-learning approach. Since the teacher uses exploration (epsilon-greedy), some demonstrations might not be optimal which means our student copies sub-optimal actions.
A more robust approach would be for the student to learn patterns rather than exact state-action pairs, to keep track of how well each memorized action works, and to include its own exploration mechanism so it can truly learn from the teacher - a rough sketch of that idea follows below.
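As a hypothetical sketch (this class is not part of the repository), such a student might keep a running score for each memorized move and still explore a little on its own:

// Hypothetical "robust" student -- not in the repository, just a sketch of the idea above.
// It remembers the teacher's action per state, tracks how well that action has worked,
// and still explores on its own a small fraction of the time.
public class RobustStudent
{
    private readonly Dictionary<int, int> teacherMoves = new();
    private readonly Dictionary<int, double> moveScores = new();
    private readonly Random random = new();
    private readonly double epsilon = 0.1;   // small amount of own exploration

    public void LearnFromTeacher(int state, int teacherAction) => teacherMoves[state] = teacherAction;

    public void RecordOutcome(int state, double reward)
    {
        // Keep a simple running score of how well the memorized move performs.
        moveScores[state] = moveScores.GetValueOrDefault(state) + reward;
    }

    public int GetAction(int state)
    {
        if (random.NextDouble() < epsilon || !teacherMoves.ContainsKey(state))
            return random.Next(2);           // explore, or improvise in unseen states

        // If the memorized move has been doing badly, stop trusting it blindly.
        return moveScores.GetValueOrDefault(state) < -1.0 ? random.Next(2) : teacherMoves[state];
    }
}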
Having said that, however, I decided to go this route because I was afraid that making a more robust example would take away from the learning point. This code clearly demonstrates:
How an agent can learn through experience
How knowledge can be transferred
The difference between learned behavior and random actions
As a bonus: how overfitting occurs
Conclusion
Now, DeepSeek-R1 is much more complicated than this, but the concept is the same - just more code, more variables and more math. The DeepSeek team uses a teacher to teach their models certain things and then uses distillation to produce smaller, very capable models.
You can try different scenarios and see how the performance changes with different parameters. It’s quite fun to watch.
The source code for the C# console application is available here: https://github.com/tjgokken/RL_Distillation
Appendix A: Epsilon-Greedy Method
In RL, the agent learns how to map situations to actions by maximizing a numerical reward signal. The agent is not told what actions to take but must discover which action yields the most reward.
This leads to something called the multi-armed bandit problem, which is used to formalize the notion of decision-making under uncertainty. In this problem, which sounds like it jumped right out of a western movie, an agent chooses between k different actions and receives a reward based on the chosen action. The goal of the agent is to identify which action to choose to get the maximum reward after a given set of trials.
While it is doing this, the agent uses either Exploration or Exploitation.
Exploration allows an agent to improve its current knowledge about each action, hopefully leading to long-term benefit. Improving the accuracy of the estimated action-values enables an agent to make more informed decisions in the future.
Exploitation, on the other hand, chooses the greedy action to get the most reward by exploiting the agent’s current action-value estimates. However, this may not actually yield the most reward and can lead to sub-optimal behaviour, because the agent has not tested other actions.
Exploration leads to more accurate estimates of action-values, whereas exploitation (being greedy) might get more immediate reward - or it might not. Both cannot be done simultaneously, which is known as the exploration-exploitation dilemma.
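To tie this back to code, here is a tiny, self-contained epsilon-greedy sketch for a 3-armed bandit. The arm payout probabilities and the 10% exploration rate are arbitrary numbers picked for illustration (it also assumes System.Linq for Max and Select):

// Minimal epsilon-greedy sketch for a 3-armed bandit (illustrative numbers only).
var random = new Random();
double[] trueArmProbabilities = { 0.2, 0.5, 0.8 }; // hidden payout chance of each arm
double[] estimates = new double[3];                // our running value estimates
int[] pulls = new int[3];
double epsilon = 0.1;

for (int t = 0; t < 1000; t++)
{
    // Explore 10% of the time, otherwise exploit the best current estimate.
    int arm = random.NextDouble() < epsilon
        ? random.Next(3)
        : Array.IndexOf(estimates, estimates.Max());

    double reward = random.NextDouble() < trueArmProbabilities[arm] ? 1.0 : 0.0;

    // Incremental average: each estimate moves a little toward every new reward it sees.
    pulls[arm]++;
    estimates[arm] += (reward - estimates[arm]) / pulls[arm];
}

Console.WriteLine(string.Join(", ", estimates.Select(e => e.ToString("F2"))));
// The estimates roughly approach 0.20, 0.50, 0.80 -- and most pulls go to the best arm.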
Appendix B: How Does The Agent Learn? A Look at the Q-Table Values
Let’s take a look at the Q-Table values to get a better understanding how our agent learns.
Early in Training (after few episodes)
State 0: [Left: -0.10, Right: 0.05] // Slightly prefers going right
State 1: [Left: -0.15, Right: 0.08] // Learning to move right
State 2: [Left: -0.12, Right: 0.15] // Starting to favor right movement
State 3: [Left: -0.10, Right: 0.25] // Stronger preference for right
State 4: [Left: -0.20, Right: 0.45] // Strong preference for right (near goal)
State 5: [Left: -0.15, Right: -0.15] // Goal state
State 6: [Left: 0.45, Right: -0.20] // Learning to move left
State 7: [Left: 0.25, Right: -0.10] // Preference for left
State 8: [Left: 0.15, Right: -0.12] // Learning to move left
Mid-Training (After ~25 episodes)
State 0: [Left: -0.25, Right: 0.35] // Clear preference for right
State 1: [Left: -0.30, Right: 0.45] // Strong right preference
State 2: [Left: -0.35, Right: 0.55] // Very strong right preference
State 3: [Left: -0.40, Right: 0.65] // Dominant right strategy
State 4: [Left: -0.45, Right: 0.85] // Almost certain right move
State 5: [Left: -0.20, Right: -0.20] // Goal state
State 6: [Left: 0.85, Right: -0.45] // Almost certain left move
State 7: [Left: 0.65, Right: -0.40] // Dominant left strategy
State 8: [Left: 0.55, Right: -0.35] // Very strong left preference
After Training (After 50 episodes)
State 0: [Left: -0.50, Right: 0.75] // Optimal right movement
State 1: [Left: -0.55, Right: 0.85] // Strong right strategy
State 2: [Left: -0.60, Right: 0.90] // Very strong right preference
State 3: [Left: -0.65, Right: 0.95] // Near-perfect right choice
State 4: [Left: -0.70, Right: 0.99] // Optimal right movement
State 5: [Left: -0.25, Right: -0.25] // Goal state
State 6: [Left: 0.99, Right: -0.70] // Optimal left movement
State 7: [Left: 0.95, Right: -0.65] // Near-perfect left choice
State 8: [Left: 0.90, Right: -0.60] // Very strong left preference
Looking at the 3 training phases above, this is how learning happens:
- At First (Early Training):
Our agent is pretty unsure about every choice
Makes lots of random guesses
Has only slight preferences based on a few lucky successes
- After Some Time (Mid-Training):
Starts getting more confident about choices near the goal
When close to the goal, strongly prefers the move toward it
Still a bit unsure about choices in positions far from the goal
Learning that wrong choices waste time (negative values)
- After Lots of Practice (Final Training):
Super confident about the best route from any position to the goal
Really dislikes choices that lead away from the goal
The closer to the goal, the more sure about which direction to pick
When it reaches the goal, it doesn't need to make any more choices
In time, our AI agent gets better at choosing the right direction from each position. The numbers in the Q-table just show how strongly our agent prefers each choice.