Tic Tac Toe Reinforcement Learning using Minimax Algorithm

Debdip Mukherjee

Introduction

Brief overview of Tic Tac Toe

Tic Tac Toe — the classic game we’ve all played with pen and paper — is one of the simplest yet most iconic games in existence. Played on a 3×3 grid, two players take turns marking spaces with Xs and Os. The goal? Line up three of your marks in a row — horizontally, vertically, or diagonally — before your opponent does.

Despite its simplicity, Tic Tac Toe is surprisingly deep when it comes to decision-making and strategy, which makes it a great playground for experimenting with game AI.

Introduction to reinforcement learning

Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make decisions by interacting with an environment. The idea is simple: perform an action, observe the result, and learn from the reward or penalty you receive.

Think of it like training a dog — when it does something right, you reward it. Over time, it learns the best way to behave in different situations.

Unlike supervised learning (where we feed the model labeled data) or unsupervised learning (where it finds patterns on its own), RL thrives on trial and error — learning from consequences rather than explicit instruction. This type of technology is widely used in training game bots or self-driving cars.

Introduction to the Minimax algorithm

Minimax is a classic algorithm in game theory used for decision making in two-player turn-based games like Chess, Checkers, and yes — Tic Tac Toe. It works by simulating every possible move and assuming that both players play optimally.

The idea is to maximize your own chance of winning while minimizing the opponent’s chance — hence the name Minimax. This algorithm builds a game tree and traverses it to pick the best possible move.

In zero-sum games like Tic Tac Toe (where one player's win is another's loss), Minimax shines as a logical and reliable strategy.

Purpose and structure of the article

This article is more than just a tutorial — it's a little story of how a second-year ML college project turned into something I'm genuinely proud of.

While the syllabus didn’t require reinforcement learning or game trees, I wanted to go the extra mile and build something that looks cool, works well, and teaches both myself and others something new. And here we are!

Why write this? Because when I was working on this, I couldn’t find a solid, easy-to-follow explanation of Tic Tac Toe using Reinforcement Learning and the Minimax algorithm together. So this article aims to fill that gap — explaining the theory, showing you the code, and breaking everything down step-by-step.

Here’s what we’ll cover:

  • The game logic of Tic Tac Toe

  • How Reinforcement Learning works in this context

  • An in-depth look at the Minimax algorithm

  • Implementation with Python

  • Optimization techniques

  • Evaluation and results

  • Possible extensions and future directions

Whether you’re a fellow student, an AI hobbyist, or just someone who loves games and algorithms, I hope you find this helpful and enjoyable!

# Just a teaser of what's coming next
def minimax(self, state, maximizing):
    # Check if the state has already been evaluated
    state_tuple = tuple(map(tuple, state))  # Convert to tuple for immutability and hashing
    if state_tuple in self.memo:
        return self.memo[state_tuple]  # Return cached result

    # Evaluate the current state
    winner = self.check_winner(state)
    if winner == 1:
        # .......

Understanding Tic Tac Toe

Game rules and objectives

Tic Tac Toe is played on a 3×3 grid by two players: one playing X and the other O. The rules are simple:

  • Players take turns placing their symbol (X or O) in an empty cell.

  • The first player to align three of their symbols horizontally, vertically, or diagonally wins.

  • If the board is filled and no player has won, the game ends in a draw.

Here’s a quick example of a win:

 X | O | X
-----------
 X | O | O
-----------
 X |   |

In this case, X wins by forming a vertical line in the first column.
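
To make the win condition concrete, here is a small sketch of a winner check, assuming the board is stored as a 3×3 grid of 1 (X), -1 (O), and 0 (empty) — the same convention the code later in this article uses. The standalone check_winner below is my own illustration, not the repository's helper.

import numpy as np

def check_winner(board):
    """Return 1 if X has three in a row, -1 if O does, 0 otherwise."""
    board = np.array(board)
    lines = list(board) + list(board.T)                       # rows and columns
    lines += [board.diagonal(), np.fliplr(board).diagonal()]  # both diagonals
    for line in lines:
        if abs(line.sum()) == 3:                              # 1+1+1 or (-1)+(-1)+(-1)
            return int(np.sign(line.sum()))
    return 0

# The board from the example above: X fills the first column
example = [[1, -1,  1],
           [1, -1, -1],
           [1,  0,  0]]
print(check_winner(example))  # -> 1, i.e. X wins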

Possible strategies

Despite being a small game, Tic Tac Toe actually has quite a few possible strategies. Here are a few common ones:

  • Center first: Always try to take the center if it's free. It gives access to the most winning combinations.

  • Block opponent: Watch for two-in-a-row moves from the opponent and block them.

  • Forking: Create a situation where you have two potential winning paths on your next turn — your opponent can only block one!

With perfect play from both players, Tic Tac Toe always ends in a draw. That’s what makes it interesting from an algorithmic standpoint: there is a perfect strategy, and that’s where algorithms like Minimax come into play.

Representing the game as a tree of moves

Now let’s talk about one of the coolest things: turning the game into a tree.

Each state of the game (i.e., each unique board configuration) can be represented as a node in a tree. From any state, each possible move leads to a new state — or in tree terms, to a child node. Here’s what that might look like at the first few levels:

             [Empty Board]
               /   |   \
            [X1] [X2] [X3] ...  ← First moves by X (3 of 9 shown)
             |
          [O1] [O2] ...     ← O's responses to that move

Eventually, the tree leads to terminal nodes — states where the game has been won, lost, or drawn.

This game tree representation is the backbone of the Minimax algorithm. By evaluating the outcome of each possible move down the tree, we can decide which move leads to the best possible result.
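
To make the child-node idea concrete, here is a tiny sketch that expands a board into all of its children. The helper name get_children and the board convention (1 for X, -1 for O, 0 for empty) are my own choices for illustration.

import copy

def get_children(board, player):
    """Return every board reachable when `player` (1 for X, -1 for O) moves once."""
    children = []
    for r in range(3):
        for c in range(3):
            if board[r][c] == 0:              # empty cell -> legal move
                child = copy.deepcopy(board)
                child[r][c] = player
                children.append(child)
    return children

empty_board = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(len(get_children(empty_board, player=1)))  # -> 9 possible first moves for X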

Basics of Reinforcement Learning

Definition and principles

Reinforcement Learning (RL) is a branch of machine learning where an agent learns how to make decisions by interacting with an environment. Instead of learning from examples (like in supervised learning), the agent learns by doing — through trial and error.

At its core, RL is based on a feedback loop:

  • The agent takes an action.

  • The environment responds with a new state and a reward.

  • The agent uses that feedback to adjust its strategy.

Here’s the classic RL cycle:

Agent → takes Action → Environment → gives Reward + New State → Agent

Over time, the agent aims to maximize the total reward it receives.
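
In code, that feedback loop has a very recognizable shape. Here is a deliberately tiny, self-contained sketch with a toy environment and a random "agent", just to show the structure — it is not the Tic Tac Toe code used later.

import random

class ToyEnv:
    """A throwaway environment: action 1 earns a reward, the episode lasts 5 steps."""
    def reset(self):
        self.steps = 0
        return 0                                   # initial state

    def step(self, action):
        self.steps += 1
        reward = 1 if action == 1 else 0           # feedback for the action taken
        done = self.steps >= 5                     # episode ends after 5 steps
        return self.steps, reward, done            # new state, reward, done flag

env = ToyEnv()
state, done, total_reward = env.reset(), False, 0
while not done:
    action = random.choice([0, 1])                 # the agent takes an action
    state, reward, done = env.step(action)         # the environment responds
    total_reward += reward                         # a real agent would learn from this
print("Total reward this episode:", total_reward)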

Difference between reinforcement learning and other machine learning types

To make things clearer, let’s compare RL to the other popular ML paradigms:

Type                   | What it does                     | Example
-----------------------|----------------------------------|----------------------------------------
Supervised Learning    | Learns from labeled data         | Predicting house prices from a dataset
Unsupervised Learning  | Finds patterns in unlabeled data | Clustering customer data
Reinforcement Learning | Learns from actions + feedback   | Self-driving cars

Key differences:

  • No labeled dataset in RL.

  • The agent influences the environment with actions.

  • Learning is sequential — decisions today affect rewards tomorrow.

Applications in game theory

Games are the perfect playgrounds for RL:

  • Clear rules.

  • Defined goals and rewards.

  • A limited (but often deep) action space.

RL has been used in:

  • Chess & Go (AlphaZero)

  • Video Games (Dota 2, StarCraft, Flappy Bird)

  • Robotics, traffic control, and even financial trading

In games like Tic Tac Toe, RL can be used to learn optimal moves by playing multiple rounds and refining its strategy with every game played. Over time, the agent can get better — to the point where it never loses, or always forces a draw when facing a perfect opponent.

Minimax Algorithm Explained

Definition of the Minimax algorithm

The Minimax algorithm is a decision-making algorithm used primarily in two-player turn-based games like chess, tic-tac-toe, and checkers. It’s designed to minimize the possible loss for a worst-case scenario.

In simple terms:

  • One player tries to maximize their score.

  • The other tries to minimize it.

Hence the name: Mini (minimizing opponent’s max gain) + Max (maximizing your own gain).

Think of it like this:

"What’s the best move I can make assuming my opponent also plays perfectly?"

The concept of game trees in Minimax

Minimax works by exploring all possible future moves in the game as a tree.

  • Each node is a game state (i.e., a board position).

  • Each edge is a possible move.

  • The leaf nodes represent end-game states (win, lose, draw).

  • The algorithm recursively backs up values from the leaf nodes to the root (current game state).

Let’s say it’s X’s turn:

  • It simulates all possible moves (each one becomes a new node).

  • For each of those, it assumes O will make the best response.

  • Then X again, and so on...

Eventually, it selects the move that leads to the best worst-case outcome.

        X's Move (root)
        /      |      \
      O        O        O
    / | \    / | \    / | \
   X  X  X  X  X  X  X  X  X

Each leaf has a score:

  • Win = +1

  • Draw = 0

  • Lose = -1

Minimax as a strategy for zero-sum games

Tic Tac Toe is a zero-sum game — meaning one player’s win is the other player’s loss.

  • If X wins, that’s a score of +1 for X and –1 for O.

  • If it’s a draw, it’s 0 for both.

  • If O wins, it's -1 for X and +1 for O.

This makes Minimax a perfect fit:
It aims to maximize your score while minimizing the opponent’s score, keeping in mind that every move you make gives your opponent a chance to respond.

Application to Tic Tac Toe

Here’s how Minimax works in Tic Tac Toe:

  1. Agent plays as X (or O).

  2. At every move, the algorithm:

    • Simulates all possible future games from the current state.

    • Assigns scores to end results (classically win = +1, draw = 0, loss = -1; the implementation below scores a draw as 0.5 to match the reward system).

    • Chooses the move that maximizes the score if it's the agent’s turn, or minimizes it if it's the opponent’s turn.

  3. It always selects the optimal move, assuming the opponent plays optimally too.

def minimax(self, state, maximizing):
    # Check if the state has already been evaluated
    state_tuple = tuple(map(tuple, state))  # Convert to tuple for immutability and hashing
    if state_tuple in self.memo:
        return self.memo[state_tuple]  # Return cached result

    # Evaluate the current state
    winner = self.check_winner(state)
    if winner == 1:
        self.memo[state_tuple] = 1.0
        return 1.0
    elif winner == -1:
        self.memo[state_tuple] = -1.0
        return -1.0
    elif self.is_full(state):
        self.memo[state_tuple] = 0.5  # Draw is scored 0.5 here, matching the reward system used later
        return 0.5

    valid_actions = self.get_valid_actions(state)

    if maximizing:
        max_eval = float('-inf')
        for action in valid_actions:
            next_state = self.apply_action(state, action, player=1)
            score = self.minimax(next_state, maximizing=False)
            max_eval = max(max_eval, score)
        self.memo[state_tuple] = max_eval
        return max_eval
    else:
        min_eval = float('inf')
        for action in valid_actions:
            next_state = self.apply_action(state, action, player=-1)
            score = self.minimax(next_state, maximizing=True)
            min_eval = min(min_eval, score)
        self.memo[state_tuple] = min_eval
        return min_eval
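
The function above returns a value for a state, not a move. To actually pick a move, a small helper can compare the minimax value of each child state. This is my own sketch built on the same helpers (get_valid_actions, apply_action); the class in the repository may select its move differently.

def best_move(self, state):
    """Sketch: choose the action with the highest minimax value for player 1."""
    best_value, best_action = float('-inf'), None
    for action in self.get_valid_actions(state):
        next_state = self.apply_action(state, action, player=1)
        value = self.minimax(next_state, maximizing=False)  # opponent replies next
        if value > best_value:
            best_value, best_action = value, action
    return best_action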

Implementing Reinforcement Learning in Tic Tac Toe

Setting up the environment

Before any learning can happen, we need a game environment that the agent can interact with — just like how OpenAI Gym environments work.

I created a custom environment named TicTacToeEnv. It has everything needed to simulate the game:

📦 Features of the Environment:

  • A 3x3 NumPy board initialized with 0 (empty cells).

  • Two players: 1 (X) and -1 (O).

  • Methods to:

    • reset() the game.

    • step(action) to play a move.

    • get_valid_actions() to fetch playable moves.

    • get_state() to observe the current board.

    • render() to print the board.

    • is_done() to check for game over.

    • get_reward() to provide feedback.

env = TicTacToeEnv()
state = env.reset()
env.render()

This makes it modular and agent-friendly, following the standard Gym interface.

🧠 Fun fact: Structuring your environment like this makes it easier to plug into any RL algorithm later — even libraries like stable-baselines3.
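
For readers who want to see the shape of such an environment, here is a trimmed-down sketch of what TicTacToeEnv could look like. It covers only the methods listed above; the real class in the repository has more detail (for example, the Gym action and observation spaces needed by PPO), so treat this as an illustration rather than the actual implementation.

import numpy as np

class TicTacToeEnv:
    def __init__(self):
        self.reset()

    def reset(self):
        self.board = np.zeros((3, 3), dtype=int)   # 0 = empty, 1 = X, -1 = O
        self.current_player = 1                    # X moves first
        return self.get_state()

    def get_state(self):
        return self.board.copy()

    def get_valid_actions(self):
        return [(r, c) for r in range(3) for c in range(3) if self.board[r, c] == 0]

    def check_winner(self):
        b = self.board
        lines = list(b) + list(b.T) + [b.diagonal(), np.fliplr(b).diagonal()]
        for line in lines:
            if abs(line.sum()) == 3:
                return int(np.sign(line.sum()))    # 1 if X won, -1 if O won
        return 0

    def is_done(self):
        return self.check_winner() != 0 or not self.get_valid_actions()

    def get_reward(self, done, winner):
        if not done:
            return 0
        return {1: 1, -1: -1}.get(winner, 0.5)     # win / lose / draw for Player 1

    def step(self, action):
        r, c = action
        self.board[r, c] = self.current_player
        winner = self.check_winner()
        done = self.is_done()
        reward = self.get_reward(done, winner)
        self.current_player *= -1                  # hand the turn to the other player
        return self.get_state(), reward, done, {}

    def render(self):
        symbols = {1: "X", -1: "O", 0: "."}
        for row in self.board:
            print(" ".join(symbols[int(v)] for v in row))
        print()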

Designing the reward system

The reward system defines how the agent learns — it's like giving it a treat when it performs well.

Here’s how rewards are designed:

Outcome          | Reward
-----------------|-------
Win (Player 1)   | +1
Lose (Player 1)  | -1
Draw             | +0.5
Ongoing          | 0

This reward system:

  • Encourages winning by rewarding +1.

  • Punishes losing with -1, so the agent learns to avoid bad moves.

  • Encourages drawing as a fallback instead of losing.

def get_reward(self, done, winner):
    if not done:
        return 0
    if winner == 1:
        return 1
    elif winner == -1:
        return -1
    else:
        return 0.5  # Draw

🧠 Tip for readers: You can tweak these values to make the agent more aggressive (e.g., reduce draw reward to 0) or more defensive.
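
For example, here is what a more "aggressive" variant of the same function could look like. This is only an illustration of the tip above, not the reward scheme used in training.

def get_reward_aggressive(self, done, winner):
    if not done:
        return 0
    if winner == 1:
        return 1
    elif winner == -1:
        return -1
    else:
        return 0   # no credit for a draw, pushing the agent to play for wins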

Integration of the Minimax algorithm

Here’s where this project stands out: I combined Reinforcement Learning with the Minimax algorithm.

🧩 How do they work together?

  • The RL agent (say Player 1 or X) plays and learns from rewards.

  • The opponent (Player 2 or O) uses Minimax to make optimal moves (a sketch of this wiring appears at the end of this section).

  • This gives the agent a challenging opponent that never makes random mistakes — helping it learn from tough situations.

This approach simulates playing against a perfect opponent, which:

  • Speeds up the learning process.

  • Produces a smarter, more robust agent.

  • Adds complexity and value to the project beyond a typical RL loop.
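
One common way to wire this up is to wrap the base environment so that the Minimax opponent replies inside step(). The sketch below shows that pattern; the method name choose_action is an assumption, and the real TicTacToeOpponentEnv in the repository also exposes things this sketch omits (Gym spaces, current_player, and so on).

class TicTacToeOpponentEnv:
    """Sketch: the RL agent plays X, and a Minimax opponent answers as O inside step()."""

    def __init__(self, opponent):
        self.env = TicTacToeEnv()
        self.opponent = opponent       # e.g. a MinMaxAgent instance

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # RL agent (X) moves first
        obs, reward, done, info = self.env.step(action)
        if not done:
            # Minimax opponent (O) replies immediately; choose_action is assumed
            opp_action = self.opponent.choose_action(self.env.get_state())
            obs, reward, done, info = self.env.step(opp_action)
        # Rewards are already from Player 1's point of view (win = 1, loss = -1, draw = 0.5)
        return obs, reward, done, info

Wrapping the opponent this way is what lets the PPO training loop in the next section treat the whole thing as a single-agent environment.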

Challenges and Optimizations

Training a reinforcement learning agent for a simple game like Tic Tac Toe may seem straightforward — but under the hood, a lot is happening. Here we break down the key challenges and the solutions I implemented.

Computational complexity

The first and most obvious challenge: Tic Tac Toe has at most 9! (362,880) possible move sequences (fewer in practice thanks to symmetry and early game-ending wins), but that is still enough to slow things down during training if not managed properly.

Especially when you're using:

  • Minimax, which explores every possible move recursively.

  • RL training, which needs thousands of games to converge.

⏱️ Without optimizations, the agent would take a long time to train or get stuck playing slow games.

To solve the speed issue, I used memoization in the Minimax function.

Why it works:

Memoization stores the results of previously evaluated board states, so the same calculations don’t need to be repeated over and over.

  • With this technique, Minimax becomes much faster and scalable, even when integrated into a training loop with tens of thousands of steps.

  • It also drastically reduced the training time and made model learning smoother.

🚀 Pro Tip for readers: Memoization is often more effective than rewriting entire logic for performance. Never underestimate a good cache!
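
This is also why the Minimax code converts the board with tuple(map(tuple, state)) before caching: dictionary keys must be hashable, and lists are not. A tiny illustration:

board = [[1, -1, 0], [0, 1, 0], [0, 0, -1]]
key = tuple(map(tuple, board))   # nested tuples are hashable, so they can be dict keys
memo = {}
memo[key] = 1.0                  # works fine
# memo[board] = 1.0              # would raise TypeError: unhashable type: 'list'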

Model Saving and Algorithm Used

I used Proximal Policy Optimization (PPO) from Stable-Baselines3 — a robust policy-gradient-based RL algorithm. Here's how I set it up:

random_opponent = MinMaxAgent()   # despite the variable name, this opponent plays optimal Minimax moves

monitored_env = Monitor(TicTacToeOpponentEnv(random_opponent), log_dir)
checkpoint_callback = CheckpointCallback(save_freq=25000, save_path=model_dir, name_prefix="ppo_model")

model = PPO("MlpPolicy", monitored_env, verbose=1, tensorboard_log=log_dir)

model.learn(total_timesteps=100_000, callback=checkpoint_callback)

  • MinMaxAgent() plays as the opponent, ensuring the RL agent always plays against a tough adversary.

  • Monitor logs training metrics and progress.

  • CheckpointCallback automatically saves the model every 25,000 steps — essential for long training sessions (loading a checkpoint back is shown right after this list).

  • tensorboard_log allows real-time visualization of training using TensorBoard.
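
Once training finishes (or a checkpoint is saved), the model can be loaded back. The file name below is an assumption based on the name_prefix and total_timesteps above, so treat this as a sketch rather than the exact repository code.

from stable_baselines3 import PPO

# Load a saved checkpoint (file name assumed from name_prefix and total_timesteps above)
model = PPO.load(f"{model_dir}/ppo_model_100000_steps", env=monitored_env)

The loaded model is what gets handed to the interactive play_against_model function shown in the evaluation section below.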

Improving Efficiency

Here are a few extra touches I made to improve the efficiency further:

1. Sparse Rewards Simplified

  • Avoided shaping the reward function too much to keep learning stable.

2. Early Stopping & Custom Checks

  • Periodically tested the model during training.

  • If it consistently held its own against the Minimax agent or reached a certain win rate, I could manually pause training (a sketch of how this check can be automated follows this list).

3. Lightweight Environments

  • Since the environment is custom-built and minimal (no rendering or heavy computation), it runs thousands of episodes quickly.
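
For the early-stopping idea in point 2, Stable-Baselines3 also ships callbacks that can automate the periodic check. The sketch below is one way it could be wired in; the reward threshold and evaluation frequency are assumptions, and in practice I relied on manual checks.

from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Stop training once the mean evaluation reward crosses a threshold (threshold value is an assumption)
stop_cb = StopTrainingOnRewardThreshold(reward_threshold=0.9, verbose=1)
eval_cb = EvalCallback(
    monitored_env,                 # evaluate against the same Minimax opponent
    callback_on_new_best=stop_cb,  # fires whenever a new best mean reward is recorded
    eval_freq=10_000,              # evaluate every 10,000 training steps
    n_eval_episodes=50,
    deterministic=True,
)

model.learn(total_timesteps=100_000, callback=[checkpoint_callback, eval_cb])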

Evaluation of the Model

Now that we’ve trained our Tic Tac Toe agent with reinforcement learning and Minimax, the next big question is — how well does it actually play?

The best way to find out is by putting the model to the test against a real human player (that’s you!). Let me walk you through how I set up a simple interactive game where you can challenge the trained AI.

🕹️ Playing Against the Model

Here’s a Python function I wrote to play Tic Tac Toe against the RL-trained model:

def play_against_model(model, env):
    inner_env = env.envs[0].env  # unwrap the actual TicTacToeOpponentEnv
    obs = env.reset()[0]         # unwrap observation
    done = False

    print("You are 'O' (represented by -1), model is 'X' (represented by 1)")
    print("Cell numbers:\n1 | 2 | 3\n4 | 5 | 6\n7 | 8 | 9\n")

    while not done:
        if inner_env.current_player == -1:
            # Human's turn
            valid = inner_env.get_valid_actions()
            inner_env.render()
            print(f"Valid moves: {[r * 3 + c + 1 for r, c in valid]}")
            while True:
                try:
                    move = int(input("Enter your move (1-9): ")) - 1
                    action = (move // 3, move % 3)
                    if action in valid:
                        break
                    print("Invalid move. Try again.")
                except ValueError:
                    print("Please enter a number between 1 and 9.")
            obs, reward, done, _ = inner_env.step(action)
        else:
            # Model's turn
            action_flat, _ = model.predict(obs)
            action_flat = int(action_flat)
            action = (action_flat // 3, action_flat % 3)
            obs, reward, done, _ = inner_env.step(action)

    inner_env.render()
    if reward == 1:
        print("Model won!")
    elif reward == -1:
        print("You won!")
    else:
        print("It's a draw!")

How It Works

  • You play as O (-1), and the AI plays as X (1).

  • The console shows the board with X, O, and dots (.) for empty cells.

  • Each turn, you input a number from 1 to 9 to place your O on the board.

  • The AI picks its move using the trained policy — combining reinforcement learning with Minimax logic.

  • The game ends when someone wins or the board is full (a draw).

  • At the end, the result is printed, letting you know who won!


How Did the Model Perform?

  • Challenging Opponent: The AI almost never loses because it learned to play optimally against a Minimax opponent during training.

  • Draws Are Common: Since Tic Tac Toe is a solved game, a strong player (or AI) often ends matches in a draw.

  • Fun Practice: This interactive test gives a hands-on feel of how reinforcement learning can master classic games — and it’s just plain fun to challenge your own model!


Comparing With Other Strategies

  • Compared to a random move generator, this RL+Minimax model is leagues ahead, winning or drawing nearly every game.

  • Against basic rule-based AI, our model is more adaptive and robust because it learns from experience rather than following fixed scripts.

  • If you want to push it further, you can try improving the reward system or training longer to see if the model gets even better.

Applications and Future Work

Other games and domains that can benefit

While Tic Tac Toe is a classic beginner’s playground, the same techniques apply to much more complex games and scenarios, such as:

  • Connect Four, Checkers, and Chess: These games also have clear win/loss conditions and well-defined moves. Minimax with RL can help build smarter AI opponents.

  • Turn-based Strategy Games: Games where players alternate moves, like Go or certain card games, are great fits for these algorithms.

  • Board Games with Large Move Trees: Reinforcement learning helps navigate huge game trees that are impossible to explore fully with brute force alone.

  • Decision-Making in Robotics and Control Systems: Beyond gaming, RL helps robots learn sequences of actions to achieve goals in dynamic environments.

  • Resource Management and Scheduling: RL can optimize multi-step processes where decisions impact future outcomes.

Potential improvements in the method

Our current Tic Tac Toe model works well, but there’s always room to grow:

  • Deeper Neural Networks: Using more complex models could improve generalization to new game scenarios or more complex games.

  • Better Reward Shaping: Designing nuanced rewards (e.g., rewarding good positioning or defensive moves) can speed up learning.

  • Hybrid Approaches: Combining Minimax with other RL algorithms like Deep Q-Networks (DQN) or Monte Carlo Tree Search (MCTS) might yield stronger strategies.

  • Transfer Learning: Training on simpler games first and transferring that knowledge to harder games could save time and computation.

  • More Efficient Exploration: Techniques like curiosity-driven learning can help agents discover smart moves faster.

Expansion to multi-agent scenarios

Tic Tac Toe is a two-player game, but what if we scale to more players or more complex interactions?

  • Multi-Agent Reinforcement Learning (MARL): Here, multiple AI agents learn simultaneously, adapting to each other’s strategies.

  • Cooperative and Competitive Settings: Agents could learn to collaborate in team-based games or compete in adversarial environments.

  • Real-World Applications: From autonomous vehicle coordination to financial market simulations, MARL opens the door to solving complex multi-entity problems.

Conclusion

In this article, we explored the fascinating intersection of reinforcement learning and the Minimax algorithm through the classic game of Tic Tac Toe. We started with understanding the game’s rules and strategies, then dove into the fundamentals of reinforcement learning and the power of Minimax as a decision-making algorithm for zero-sum games.

By combining these approaches, we built a Tic Tac Toe agent capable of learning optimal moves and making smart decisions, showcasing how classical algorithms and modern AI techniques can work hand-in-hand. This hybrid approach not only improves gameplay but also deepens our understanding of AI in strategic environments.

The journey of applying AI to games like Tic Tac Toe is much more than just a fun project — it’s a stepping stone into the larger world of artificial intelligence and machine learning, where such methods can solve complex problems and drive innovation.

Whether you’re a student, developer, or AI enthusiast, experimenting with projects like this can greatly enhance your portfolio and skills. And remember, the principles you learned here can be applied to countless other domains beyond gaming.

Feel free to check out the full project code on my GitHub — I hope it helps you build your own AI-powered games and inspires your next big project!


GitHub Repository: https://github.com/DebdipWritesCode/Tic_Tac_Toe_RL
