Diverging Neural Networks and Debugging Woes

GallantGargoyle

Patience. The single greatest trait a programmer can have.

Patience to grapple with the random AttributeErrors Python decides to throw at you, patience to deal with that sys.exit() you forgot to comment out, and patience to deal with your past self’s questionable program design.

I had to learn this the hard way, after nearly ten days of grappling with Jupyter Notebook (which, for some reason, decided to deny me a functioning debugger).

Anyways, I’m happy to report that the NFQ agent is now able to shuffle around on the board and randomly shoot itself in the foot.

Why NFQ? Or Rather, Why Use Deep RL At All?

Excellent question.

Remember the 1D world we talked about in the last article? (No need to answer, I don’t wish to embarrass you.)

To define a state in that 1D world, all we needed was a single variable to tell us which cell we were in.

Unfortunately, chess (even the downsized 5×5 version I’m currently using) has a massive state space, meaning I’d need a laughably large number of variables to describe the current position on the board and differentiate it from the other possible positions.

That means we can’t use a table to store values for every state-action pair. Bellman has given this situation a delightfully dark name: “the curse of dimensionality”.

Like Gandalf brandishing his staff at the Balrog, I too decided to brandish my chosen weapon — a neural network ( = deep RL) — at this monstrosity.

What Is a Neural Network?

The simplest possible neural network is just a single neuron. A tiny decision-making machine.

We can think of its core equation like this:

💡
prediction = (weight * input) + bias
  • The input is the information we give the neuron.

  • The weight is like a volume knob. It determines how much the neuron cares about the input. The weights are what the network learns.

  • The bias is like the mood of the neuron. A positive bias makes the neuron easier to activate, while a negative bias means the neuron needs a stronger input signal before it fires.

You might ask: where’s the decision-making part? That’s what the activation function is for.

A common activation function is ReLU (Rectified Linear Unit), which is defined as f(x) = max(0, x). ReLU allows only positive outputs to pass through unchanged, and sets negative outputs to zero.

This injects non-linearity; without it, a whole stack of layers would collapse into one fancy (and costly) matrix multiplication. By combining thousands of such neurons, the network can learn to represent incredibly complex patterns.
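
If you want to see that in code, here’s a minimal sketch of a single neuron with ReLU in plain Python (the numbers are made up purely for illustration):

# A single neuron: prediction = (weight * input) + bias, then ReLU.
def relu(x):
    return max(0.0, x)

def neuron(x, weight, bias):
    return relu(weight * x + bias)

# Made-up numbers, just to see the effect of weight and bias.
print(neuron(2.0, weight=0.8, bias=0.1))   # 1.7 -> the neuron fires
print(neuron(2.0, weight=0.8, bias=-2.0))  # 0.0 -> the negative bias silences it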

Input to the Neural Network

The neural network takes a state s as input. Consider a 5×5 chessboard:

I couldn’t possibly pass this into a neural network as-is.

The simplest idea is to give each piece a numerical value, and represent them on a 5×5 grid. Empty squares are represented with zeros, and pieces with their respective values.

Let’s say you gave the pieces the following values:

The more astute among you may have seen a problem with passing the values in as-is.

Take the simple neural network (NN) we just saw. What would be the prediction if we kept the weight and the bias the same?

That’s right, the NN would be led to believe that a Black King (value 12) is worth more than a White King (value 4). And that the bishops and knights are worth more than rooks.
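
Here’s a tiny worked example of that trap, using the neuron equation from earlier (the weight and bias are arbitrary; only the two king values come from my table):

# Same weight and bias, only the input (the piece value) changes.
weight, bias = 0.5, 0.0         # arbitrary numbers for illustration
white_king, black_king = 4, 12  # values from the table above

print(weight * white_king + bias)  # 2.0
print(weight * black_king + bias)  # 6.0 -> the net "thinks" the Black King matters 3x more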

Solving This Problem

There’s a very ingenious solution computer scientists came up with to fix this problem. What if, instead of using one single 5×5 grid, we used multiple, stacked on top of each other?

This means we could use a single 5×5 grid (often called a plane) to represent the positions of only one piece type of a given color.

That gives us the freedom to use 0’s for empty squares and 1’s for pieces, making it a binary plane. So now, the network can no longer form erroneous assumptions about piece value.

In my 5×5 board implementation, I’ve used one King, one Knight, one Bishop, and two Rooks per side. That means I’d need four binary planes per side, each of dimensions 5×5.

So that leads to a total of eight binary planes.

The input to the neural network is now of dimensions 8×5×5.
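
Here’s roughly what that encoding could look like in NumPy. The plane ordering and the input format are placeholders, not my exact chess_logic code:

import numpy as np

# One 5x5 binary plane per (color, piece type): K, N, B, R for White, then Black.
PLANES = ["wK", "wN", "wB", "wR", "bK", "bN", "bB", "bR"]

def encode_state(piece_positions):
    """piece_positions: dict mapping a plane name to a list of (row, col) squares."""
    state = np.zeros((8, 5, 5), dtype=np.float32)
    for i, plane in enumerate(PLANES):
        for row, col in piece_positions.get(plane, []):
            state[i, row, col] = 1.0
    return state

# Example: just the two kings, on opposite back ranks.
s = encode_state({"wK": [(0, 2)], "bK": [(4, 2)]})
print(s.shape)  # (8, 5, 5)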

This doesn’t include the additional planes many top chess AIs use, such as a plane indicating whose turn it is, castling rights, and so on.

Starting to see why a table wouldn’t work?

Output of the Neural Network

Alright, we pass in a state as input to the NN. Great. What do we get out of it?

If you’ve ever played chess before, this notation should seem familiar:

  1. e4 e5

  2. Ke2 Nf6

This is the start of a brilliant opening called the Bongcloud. Try it out. I promise you a glorious victory.

Okay, back to the output. Unfortunately, I can’t exactly get “Knight to f6” as output from my NN.

What I can get, however, is the start square and the end square. Those are just numbers. The NN can handle that.

I split the output into two “heads”, each of size 25. (Because there are 25 squares on my 5×5 chessboard.)

One head, let’s call that Lefty, gives me the square to move from, and the other head, Righty, gives me the square to move to.

Lefty and Righty combine to give me… dun dun dun… a set of 625 possible moves! I can now represent any possible move as a single integer in an array.
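
One simple way to flatten a (from, to) pair into a single index, and back, looks like this. It’s a sketch of the idea rather than my exact implementation:

# Squares are numbered 0..24; a move becomes an index in [0, 624].
def move_to_index(from_sq, to_sq):
    return from_sq * 25 + to_sq

def index_to_move(idx):
    return divmod(idx, 25)  # (from_sq, to_sq)

print(move_to_index(0, 10))  # 10
print(index_to_move(624))    # (24, 24)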

Of course, not all of them are legal. (This is the problem my random agent faced last article!) I couldn’t move a rook diagonally.

That’s what a legal move mask is for. I generate all legal moves from the current position within my chess_logic file, and mask over the illegal moves with a value of -inf (negative infinity).

What do I mean by “masking over with a value”? Well, the 625 possible moves are indexes in an array. Or like identifying numbers for lockers. Locker 1, Locker 2, etc.

What’s inside the lockers? That’s our Q-value. Let’s call it q.

q tells you that, given the current position on the board, if you open this locker (make this move), you can expect to receive q in eventual rewards.

So that’s it. Our NN takes in the state, and gives us an array of possible moves, and how much it likes each move. The higher the Q-value for a given action, the better it thinks that move is.
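
A minimal sketch of that masking step, assuming the legal moves arrive from chess_logic as a list of indices (the exact interface is simplified here):

import numpy as np

def mask_illegal(q_values, legal_move_indices):
    """q_values: array of 625 Q-values; returns a copy with illegal moves set to -inf."""
    masked = np.full_like(q_values, -np.inf)
    masked[legal_move_indices] = q_values[legal_move_indices]
    return masked

q = np.random.randn(625).astype(np.float32)
legal = [10, 35, 261]  # placeholder indices; in reality these come from chess_logic
best_move = int(np.argmax(mask_illegal(q, legal)))  # guaranteed to be a legal index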

Why NFQ?

This one is just to make my life easier. Haha. You were expecting another long explanation, weren’t you?

But no. NFQ is the simplest of the DRL algorithms I learned about, so I started with it.

Neural-Fitted Q Iteration

NFQ has three main steps:

  1. Collecting experiences

  2. Calculating targets

  3. Adjusting the Q-function using a loss function and an optimizer

I’ll get to what all those terms mean in a bit.

Collecting Experiences

An experience tuple is defined as (s, a, r, s', d).

Check this out again.

  • s (state): The current board position, represented by our 8×5×5 binary planes. (In this case, the starting position, before the rook moved.)

  • a (action): White moves a rook from a1 to a3. The action is the integer representing this move. (The index of the locker, to draw from the previous analogy.)

  • r (reward): In chess, the traditional rewards are 0 for a draw, 1 for a win, and -1 for a loss. The game hasn’t terminated yet, so the current reward is 0.

  • s’ (next state): The position after the rook has moved; that’s what you see in the image.

  • d (done): Is the game over? No. So d = False.

In the case of the 5×5 board, each experience basically refers to a move.
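
In code, an experience can be as simple as a named tuple (a sketch; the field values below are placeholders):

from collections import namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

# The rook move from the example above, with placeholder state arrays and move index.
exp = Experience(state=None, action=10, reward=0.0, next_state=None, done=False)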

I asked the agent to collect 1024 experiences before learning.

Why? Why not just learn from each experience on the fly?

Another brilliant question. You’re on fire today.

The problem is that the data will not be independently and identically distributed (IID).

Think about it. Each move in a game is directly correlated to the one before it. Independent data, out the window you go.

The agent will make terrible decisions on the board (at least it’s only on the board, unlike me) in the beginning of its training arc. If it were to learn from that, the data wouldn’t be identically distributed; it’d be skewed in one direction (the wrong one, too, sadly ☹️).

The saved experiences are stored in what’s often referred to as a replay buffer. After it collects enough experiences, I ask the agent to learn from them, sampled at random. This doesn’t entirely solve the IID problem, but it does alleviate it significantly.

Also, there’s a problem with the size of the replay buffer I used — it’s too small. Replay buffers usually contain tens (if not hundreds) of thousands of experiences. As I said, though, I just wanted NFQ up and running. Fiddling with parameters comes later.
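
For reference, here’s a minimal replay buffer sketch (not my exact implementation):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1024):  # tiny for my NFQ run; usually tens of thousands
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling at random breaks up some of the move-to-move correlation.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)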

Calculating Targets

So we have a buffer full of experiences. Great. Now we need a “target” to aim for.

In DRL, this target is the optimal Q-value, and we calculate it using the Bellman equation:

💡
Q*(s, a) = r + gamma * max_a' Q*(s', a')

Let’s translate that from math-speak:

  • The best possible total reward we get by taking action a in state s, written Q*(s, a), is…

  • … the immediate reward r you get…

  • … plus the maximum possible future reward you can get from the next state s’, discounted by a factor gamma.

We do this for all possible actions in a given state. All 625 of the actions our NN provides. Then we mask over the illegal actions with a value so low (-inf) that the agent will never choose them.
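
As a sketch, the target for a single experience might be computed like this (the masking mirrors the earlier locker trick; gamma is 0.99 in my case):

import numpy as np

GAMMA = 0.99

def compute_target(reward, next_q_values, legal_next_moves, done):
    """Bellman target: r + gamma * max_a' Q(s', a'), or just r if the game ended."""
    if done:
        return reward
    masked = np.full_like(next_q_values, -np.inf)
    masked[legal_next_moves] = next_q_values[legal_next_moves]
    return reward + GAMMA * float(masked.max())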

There’s another problem here. (And if you just thought, “again?”, welcome to the club.)

The agent collects experiences, then updates the optimal Q-value (the target) using the Bellman Equation.

But how do we calculate the optimal Q-value? Using the Q-value of the next state. And how do we get that value? By feeding the next state s’ into our neural network.

In other words, we are using our network’s own, often erroneous, predictions to find the optimal Q-value.

This is the infamous moving target problem. It’s the equivalent of attempting to sink a three-pointer in basketball… while you’re moving on a skateboard and the hoop is also moving. The target keeps shifting with every update, which makes training unstable.

Adjusting the Q-Values

Now, the Bellman Equation has given us the target to aim for. How do we get the ball in the basket?

By enlisting the help of two powerful allies:

  • the loss function,

  • and the optimizer

The loss function tells us how far off our current Q-values are from the targets. In other words, how far off from the hoop was your shot?

Now, the optimizer tells us how big an adjustment to make, and in which direction to make it. This is so we don’t accidentally shoot the ball into orbit (you’d be surprised by what AI agents can do), and don’t take longer than the lifetime of the Sun to adjust our shot’s trajectory enough to reach the hoop.

I used mean squared error (MSE) as my loss function and RMSprop as my optimizer. (Not going to delve into an explanation for those here, but there are some wonderful resources available online, check them out.)
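
For the curious, one fitting step looks roughly like this. I’m assuming PyTorch here, and the network below is just a placeholder, not my actual model:

import torch

# Placeholder model: maps a batch of 8x5x5 states to 625 Q-values per state.
q_net = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(8 * 5 * 5, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 625),
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-4)

def fit_step(states, actions, targets):
    """states: (B, 8, 5, 5) float tensor, actions: (B,) long tensor, targets: (B,) float tensor."""
    q_values = q_net(states)                                       # (B, 625)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a) for the actions taken
    loss = loss_fn(q_taken, targets)                               # MSE against the Bellman targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()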

NFQ: A Summary

NFQ collects as many experiences as we tell it to, then it freezes data collection (that is, it stops playing).

It then moves on to steps two and three, and repeats those as many times as we tell it to, slowly moving its Q-values towards the optimal ones.

Repeat this entire process as much as needed.

And that’s it.

Yeah, you did it. You learned what NFQ does. Take a deep breath. Do a little dance. Good job, seriously.

NFQ’s Performance

Okay, full disclosure: I forgot to implement the stacked binary planes trick, and my agent is facing severe divergence issues. This wasn’t the sole reason, but it definitely contributed.

I trained it on about 5,000 games of self-play, and I was seeing the Q-values hit two, three thousand. (No, they should not be that high. With a gamma of 0.99 and the rewards I mentioned earlier, which never exceed 1 per step, the largest value the Bellman equation can ever justify is r_max / (1 - gamma) = 1 / 0.01 = 100.)

Reasons for divergence? To begin with, NFQ is notoriously prone to divergence unless it has near-perfect parameters.

Remember the moving target problem? That’s a significant contributor. This problem is actually what the next generation of RL algorithms, like Deep Q-Networks (DQN), attempt to solve… but that’s a story for next time.

Second, I’m asking it to pull off a good performance in chess. That’s hard even for human brains (most of us, anyway), and we’re way more advanced than NFQ.

That said, it did manage to pull off this checkmate:

Sheer luck. But it’s a start.

Next Up: Lichess

I know, I know, I said I’d work my way all the way up to the actor-critic methods in 5×5 before I went to 8×8, and only then I’d go to Lichess, but I’m changing the plan. Perks of being a solo dev.

I’ll be implementing Dueling DDQN (with PER — Prioritized Experience Replay) in 5×5, then 8×8, and push that to Lichess ASAP.

So yeah, that’s it for now. This was a bit longer than usual, so thanks for sticking with me. And as always, any feedback is greatly appreciated. Drop me a message here.

Hopefully, the next time I see you guys, Knightmare Protocol will be alive and kicking (and ready to play you) on Lichess.

See you soon… and prepare to face your worst Knightmare.
