Simple Intro to Reinforcement Learning

Eshan Jairath
9 min read

What do you think of when the word "reinforcement" comes to your mind?

Let's start with the technical definition: Reinforcement learning (RL) is a type of machine learning that involves training an agent to make decisions in an environment by rewarding it for good decisions and punishing it for bad ones. It has been used in many applications such as robotics, gaming, and autonomous vehicles.

RL is a subfield of machine learning where an agent learns to make decisions by interacting with an environment. In RL, the agent receives a reward signal based on its actions and uses this signal to learn a policy that maximizes its long-term reward. The image below clearly represents the working of reinforcement learning in action.

[Image: Basic reinforcement learning model. The agent performs an action, and the environment responds with a new state and a reward.]
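
To make this loop concrete, here is a minimal sketch of the agent-environment cycle using Gymnasium. The environment name (FrozenLake-v1, one of Gymnasium's built-in reference environments) and the purely random action choice are placeholder assumptions for illustration; any Gymnasium environment follows the same reset/step pattern.

import gymnasium as gym

# create one of Gymnasium's built-in reference environments
# (the environment name here is just a placeholder choice)
env = gym.make("FrozenLake-v1")

state, info = env.reset()                 # the agent observes the initial state
done = False

while not done:
    action = env.action_space.sample()    # a random action; no learning yet
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated        # the episode ends on success, failure or a time limit

env.close()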

Why do we need Reinforcement Learning in the first place?

We need it in any scenario where we want to maximize a cumulative reward. It is especially helpful in circumstances where the agent must learn via trial and error because there are no explicit instructions or labeled data.

Here are some cases where reinforcement learning is used instead of traditional machine learning and deep learning.

  1. Complex and Dynamic Environments: Reinforcement learning is useful when the best course of action is unknown beforehand and the environment is complex and dynamic. An agent can interact with the environment, take feedback into account, and gradually modify its behaviour. This makes it appropriate for fields where the agent must continually learn and make judgements in response to shifting circumstances, such as robotics, autonomous cars, and gaming.

  2. Making Decisions in an Uncertain World: Reinforcement learning facilitates decision-making under uncertainty. Agents can learn to investigate their surroundings, gather data, and decide based on probabilistic predictions of future events. This is significant in industries where judgements must be made with noisy or incomplete data, such as banking, healthcare, and logistics.

  3. Sequential Decision-Making: Reinforcement learning is a powerful tool for solving problems that involve sequential decision-making. Each action the agent takes affects the next state and determines how future rewards will be distributed. This sequential aspect is common in areas like gameplay, natural language processing, and recommendation systems.

  4. Transfer Learning and Generalization: RL algorithms are capable of applying previously acquired knowledge to new, analogous settings. This lessens the need for lengthy retraining by allowing agents to apply learnt policies or techniques to new circumstances.

  5. Exploration-Exploitation Trade-off: RL algorithms strike a compromise between exploitation (using known strategies to maximise near-term rewards) and exploration (trying out novel behaviours to identify possibly superior tactics). Thanks to this trade-off, RL agents can acquire useful information and still make wise judgements (see the short sketch after this list).
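
The most common way to implement this trade-off is an epsilon-greedy rule: with probability epsilon the agent explores by acting randomly, otherwise it exploits its current Q-values. The snippet below is only a minimal sketch; the Q-table shape and epsilon value are assumed placeholders, and the same idea appears again in the training loop later in this post.

import numpy as np

def choose_action(q_table, state, epsilon, n_actions, rng=np.random.default_rng()):
    # explore: with probability epsilon, pick a random action
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    # exploit: otherwise, pick the action with the highest Q-value for this state
    return int(np.argmax(q_table[state, :]))

# example usage with placeholder values
q_table = np.zeros((16, 4))            # 16 states, 4 actions (assumed sizes)
action = choose_action(q_table, state=0, epsilon=0.1, n_actions=4)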

Creating an RL environment is pretty straightforward with Gymnasium, which is a standard API for reinforcement learning together with a diverse collection of reference environments.

All you have to do is import the libraries and create an environment class that looks something like this --

import gymnasium as gym
import numpy as np


class GridEnvironment(gym.Env):

    def __init__(self, grid, goal_cell_centroid, max_steps, coordinates):
        # initialize all the relevant variables, lists and dictionaries
        # required to create and operate your environment
        ...

    def render(self, mode='human'):
        # print/display/render the environment, using either Matplotlib
        # or plain print statements; this function is called every time
        # the agent takes a new step so that we can see its current position
        ...

    def reset(self, seed=None):
        # reset the environment to its original state so that the agent
        # gets the same initial observation at the start of every episode
        ...

    def step(self, action):
        # the main function: define where the agent moves, how it moves,
        # and how many steps it takes at once; the reward the agent
        # receives after each step is also defined here
        ...


# let's create and visualize the environment
# (grid, goal_cell_centroid, max_steps, coordinates and goal are assumed
# to have been defined earlier for your particular problem)
env = GridEnvironment(grid, goal_cell_centroid, max_steps, coordinates)
print('Environment created')
print('Observation space: ', env.observation_space)
env.reset()
env.goal = goal
env.render()
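
For readers who want something they can actually run, here is one hypothetical way such a skeleton could be filled in. Everything about it is an assumption made for illustration: a square grid encoded as one discrete state per cell, four movement actions, a fixed start cell, a goal in the opposite corner, and a small step penalty. The real environment in this post may be built quite differently.

# A purely hypothetical, minimal implementation of the skeleton above
import gymnasium as gym
from gymnasium import spaces


class SimpleGridEnvironment(gym.Env):

    def __init__(self, size=4, max_steps=100):
        self.size = size
        self.max_steps = max_steps
        self.observation_space = spaces.Discrete(size * size)  # one state per cell
        self.action_space = spaces.Discrete(4)                 # up, down, left, right
        self.goal = size * size - 1                            # bottom-right corner
        self.state = 0
        self.steps = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = 0
        self.steps = 0
        return self.state, {}

    def step(self, action):
        row, col = divmod(self.state, self.size)
        if action == 0:
            row = max(row - 1, 0)                 # up
        elif action == 1:
            row = min(row + 1, self.size - 1)     # down
        elif action == 2:
            col = max(col - 1, 0)                 # left
        else:
            col = min(col + 1, self.size - 1)     # right
        self.state = row * self.size + col
        self.steps += 1

        terminated = self.state == self.goal      # reached the goal cell
        truncated = self.steps >= self.max_steps  # ran out of steps
        reward = 1.0 if terminated else -0.01     # assumed reward scheme
        return self.state, reward, terminated, truncated, {}

    def render(self, mode='human'):
        cells = ['.'] * (self.size * self.size)
        cells[self.goal] = 'G'
        cells[self.state] = 'A'
        for r in range(self.size):
            print(' '.join(cells[r * self.size:(r + 1) * self.size]))
        print()

As a quick sanity check, demo_env = SimpleGridEnvironment(); demo_env.reset(); demo_env.render() should print a 4x4 grid with the agent 'A' in the top-left corner and the goal 'G' in the bottom-right.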

Now that we have created the environment, we need driver code to train our agent in it. This code contains the number of episodes, the maximum number of steps an agent can take in an episode, exploration and exploitation thresholds (which I will explain in the later part of this blog, so keep reading), and, at last, the learning rate. The driver code to train the agent looks something like this --

# initializing the q table
q_table = np.zeros([env.observation_space.n, env.action_space.n, 1])
q_table = q_table.astype('float32')
# let's print the shape of the q table
print('Shape of the q table: ', q_table.shape)


# defining the hyperparameters
num_episodes = 100 
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

# exploration parameters
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

# let's train the agent for 100 episodes
rewards_all_episodes = []

# Q-learning algorithm
for episode in range(num_episodes):
    # Gymnasium's reset() returns the initial observation and an info dict
    state, info = env.reset()
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):
        env.render()

        # exploration-exploitation trade-off: explore with probability
        # exploration_rate, otherwise exploit the best known action
        if np.random.uniform(0, 1) < exploration_rate:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state, :])

        # take the action and observe the next state and the reward
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # updating the Q table after every step
        q_table[state, action] = q_table[state, action] + learning_rate*(reward + discount_rate*np.max(q_table[new_state, :]) - q_table[state, action])

        state = new_state
        rewards_current_episode += reward

        if done:
            env.render()
            break

    # decay the exploration rate so the agent exploits more over time
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    rewards_all_episodes.append(rewards_current_episode)

print("Training Complete")
env.close()
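
After training, the rewards_all_episodes list gives a quick way to check whether the agent is actually improving. Here is a small sketch that assumes the variables defined in the driver code above; the block size of 10 episodes is an arbitrary choice.

# average reward per block of 10 episodes (block size is an arbitrary choice)
block = 10
for start in range(0, num_episodes, block):
    chunk = rewards_all_episodes[start:start + block]
    print(f"Episodes {start + 1}-{start + len(chunk)}: average reward = {np.mean(chunk):.3f}")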

Important Terms in the Driver Code.

  1. q_table: It represents the Q-table, which is a data structure used in Q-learning algorithms. The Q-table is a two-dimensional table (in this case, with an additional dimension of size 1) that maps states and actions to their corresponding Q-values. The Q-values represent the expected rewards of taking specific actions in specific states.

  2. env: It refers to the environment in which the agent operates. The environment provides the agent with the current state, allows it to take action, and provides feedback in the form of rewards and next states. The environment is typically defined using a reinforcement learning library, and in this code, it has properties like observation_space (representing the possible states) and action_space (representing the available actions).

  3. num_episodes: It denotes the total number of episodes or iterations the agent will go through during training. Each episode consists of multiple steps in which the agent takes actions and learns from the resulting rewards.

  4. max_steps_per_episode: It represents the maximum number of steps the agent can take within each episode. If the agent does not reach a terminal state (end of the episode) within this limit, the episode will be terminated.

  5. learning_rate: It determines the rate at which the agent updates the Q-values based on new information. A higher learning rate means the agent gives more weight to recent experiences when updating the Q-table.

  6. discount_rate: Also known as the discount factor, it determines the importance of future rewards compared to immediate rewards. A discount factor of 0 means the agent only values immediate rewards, while a factor of 1 means the agent considers long-term rewards.

  7. exploration_rate: It represents the probability of the agent taking a random action instead of exploiting the learned knowledge. Initially set to 1, it allows the agent to explore the environment and discover better strategies. The exploration rate typically decreases over time to favor the exploitation of the learned policy.

  8. max_exploration_rate and min_exploration_rate: They define the upper and lower bounds for the exploration rate. The agent's exploration rate is annealed over time, gradually decreasing from the maximum to the minimum value.

  9. exploration_decay_rate: It determines the rate at which the exploration rate decreases. A smaller decay rate results in a slower decrease in exploration, allowing for more exploration of the environment (see the short sketch after this list for how the rate anneals over episodes).

  10. rewards_all_episodes: It is an empty list that will store the cumulative rewards obtained in each episode. It will be useful for analyzing the agent's learning progress.
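
To see how these exploration parameters interact, here is a small sketch that prints the annealed exploration rate for a few episode numbers, using the same decay formula as the training loop above. The sample episode numbers are chosen purely for illustration.

import numpy as np

max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

# sample a few episode numbers to see the annealing schedule
for episode in [0, 50, 100, 1000, 5000]:
    rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    print(f"episode {episode}: exploration_rate = {rate:.3f}")

Note that with num_episodes = 100 and a decay rate of 0.001, the rate only falls to roughly 0.91 by the last episode, so in practice you would either train for more episodes or use a larger decay rate to let the agent shift from exploring towards exploiting what it has learned.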

Something about the Q-Learning Algorithm in Reinforcement Learning.

Q-learning is a frequently used, model-free RL technique that employs a table to learn the Q-values, i.e. the values of state-action pairs, and uses these direct estimates to develop an optimal policy. A Q-value represents the expected cumulative reward for performing a certain action in a specific state. The aim of Q-learning is to find the policy that maximises the expected cumulative reward over the long run.

According to the Bellman equation, the optimal Q-value for a state-action pair is equal to the sum of the immediate reward and the discounted maximum Q-value over the next state's actions. The formula for the equation is:

Q(s,a) = r + γ max(Q(s',a'))

Where Q(s,a) is the Q-value of the state-action pair (s,a), r is the immediate reward for taking action a in state s, s' is the next state, a' is the next action, and γ is the discount factor that establishes the significance of future rewards.

The Q-learning procedure begins with an empty Q-table, with all Q-values set to zero. The agent interacts with the environment by deciding which actions to take based on an exploration-exploitation strategy: it chooses a random action with probability ε (exploration), and otherwise chooses the action with the greatest Q-value (exploitation).

The agent receives a reward for each action, and the Q-value of the state-action pair is updated based on the Bellman equation. The update rule is expressed as follows:

Q(s,a) ← Q(s,a) + α(r + γ max(Q(s',a')) - Q(s,a)) --> This is the same update that has been used in the driver code above.

where α is the learning rate that determines the step size of the Q-value update.
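
To make the update rule concrete, here is a tiny worked example. The reward, Q-values, learning rate, and discount factor below are all arbitrary illustrative numbers, not values from the environment above.

# one Q-learning update with arbitrary illustrative numbers
learning_rate = 0.1      # alpha
discount_rate = 0.99     # gamma

q_sa = 0.5               # current Q(s, a)
reward = 1.0             # immediate reward r
max_q_next = 2.0         # max over a' of Q(s', a')

# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max Q(s',a') - Q(s,a))
q_sa = q_sa + learning_rate * (reward + discount_rate * max_q_next - q_sa)
print(q_sa)  # 0.5 + 0.1 * (1.0 + 0.99 * 2.0 - 0.5) = 0.748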

The Q-learning algorithm continues to interact with the environment, update the Q-values, and refine the policy until it converges to the optimal policy.
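
Once the Q-values have converged (or training simply ends), the greedy policy can be read straight off the Q-table by taking the best-known action in each state. A short sketch, assuming the q_table and env built in the driver code above:

# extract the greedy policy: the best-known action for every state
# (reshape drops the extra trailing dimension of size 1 used above)
greedy_policy = np.argmax(q_table.reshape(env.observation_space.n, env.action_space.n), axis=1)
print('Greedy action per state:', greedy_policy)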

[Image: Simplified example of Q-table updating.]

The above image shows how the values in the Q table are updated after every step in an episode.

Compared to other RL algorithms, Q-learning provides a number of advantages. It can handle large state and action spaces, is computationally efficient, and is easy to implement. Q-learning does, however, have significant drawbacks. Finding the best course of action requires extensive exploration, which can be both time-consuming and inefficient. It can also be sensitive to the selection of hyperparameters, including the learning rate and discount factor.

In conclusion, Q-learning is a potent RL algorithm that directly estimates the values of state-action pairs in order to develop an optimal policy. It is a straightforward and effective approach that can handle large state and action spaces. However, the best policy must first be discovered through extensive exploration, and the algorithm can be sensitive to the choice of hyperparameters. As a result, Q-learning has various uses in robotics, gaming, and autonomous systems and is a helpful tool for tackling RL problems.

Thank you so much for reading this blog till here. Please feel free to email me and ask about any references and resources for this blog; I will be more than happy to provide them.

Written by

Eshan Jairath

Hi Everyone, I am Eshan Jairath from New Delhi, India, currently living in Newcastle Upon Tyne, United Kingdom. As a skilled individual, I have expertise in a range of fields related to computer science🧑‍💻 and artificial intelligence 🤖. I have a deep understanding of data structures and algorithms, with a strong background in programming, specifically in Python 🐍 and JavaScript. I am well-versed in system analysis and design, machine learning, deep learning, and computer vision, and hold a Master's degree in Artificial Intelligence 🎓 as well. I have a strong specialization in developing web applications with machine learning APIs. I have experience working with a variety of databases🗂️ including MongoDB, SQL, and Firebase, and am proficient in cloud technologies ☁️, specifically Microsoft Azure (which I am certified in) 🏅. This combination of skills and education makes me highly qualified to work on a wide range of projects involving machine learning, data analysis, data science, software development, and web development. I am well-equipped 💪 to tackle complex challenges and am dedicated to staying up-to-date with the latest developments in my field. What keeps me going - " Every Great Warrior was once a defenceless child, continuously learning, evolving and waiting for his opportunity to incentivize the world. " Eshan Jairath