What is Reinforcement Learning?


Reinforcement Learning is a trial-and-error learning process where an agent learns to make decisions by interacting with an environment. It is neither supervised nor unsupervised learning, but rather a third paradigm of learning.

Key Components of RL

  1. Agent: The learner or decision-maker.

    • E.g., a robot, a chess player, or Mario in a video game.
  2. Environment: The world the agent lives and interacts in.

    • E.g., the chessboard, the game world in Mario.
  3. State (s): A configuration or situation in the environment.

    • E.g., current position of Mario or current layout of a chessboard.
  4. Action (a): What the agent does at a state.

    • E.g., move left, jump, or capture a chess piece.
  5. Reward (r): A scalar feedback signal for the action taken.

    • Positive for good actions, negative for bad ones.

The Core Idea of RL

Imagine you're training a dog to catch a ball:

  • When the dog catches the ball → you give a cookie (+reward).

  • When the dog fails to catch → no cookie (zero or a negative reward).

Over time, the dog learns to associate the action of catching the ball with the positive outcome (cookie). This is reinforcement at work — learning by experience, not by instruction.

Similarly, an RL agent:

  • Explores the environment

  • Takes actions

  • Receives rewards

  • Updates its behavior to maximize cumulative reward.


The Reinforcement Learning Algorithm (High-Level)

  1. The agent observes the current state.

  2. It selects and performs an action.

  3. It receives a reward and transitions to a new state.

  4. It uses the reward to update its policy (decision-making strategy).

  5. Repeat — until the optimal policy is learned.
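
In code, this loop is usually written against a Gym-style environment interface. Below is a minimal sketch assuming the Gymnasium package is installed; the random placeholder policy stands in for whatever learning algorithm the agent actually uses:

```python
import gymnasium as gym   # assumed dependency; any Gym-style environment works

env = gym.make("FrozenLake-v1")

def select_action(state):
    # placeholder: act at random; a learning agent would use its current policy here
    return env.action_space.sample()

for episode in range(100):
    state, _ = env.reset()                    # 1. observe the current state
    done = False
    while not done:
        action = select_action(state)         # 2. select and perform an action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated        # 3. receive a reward, move to a new state
        # 4. a learning agent would update its policy here using
        #    (state, action, reward, next_state)
        state = next_state                    # 5. repeat until the episode ends
```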


A Simple Example: Grid World

Consider a 3x3 grid with some shaded (bad) states and a goal state:

  • The agent starts at state A.

  • It must reach state I.

  • Shaded states (e.g., B, C, G, H) give negative rewards.

  • Unshaded states give positive rewards.

The agent explores the environment:

  • In the first episode, it takes random actions, possibly hitting shaded states.

  • In the next episodes, it learns from its past mistakes and adjusts its behavior.

  • Eventually, it finds the optimal path from A to I avoiding bad states.

This shows how agents learn over episodes by improving based on the rewards received.
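
A minimal sketch of this grid world in Python. The layout, reward numbers, step cap, and tabular Q-learning update below are illustrative choices (I give unshaded, non-goal states a reward of 0 rather than positive so the shortest safe path is clearly optimal):

```python
import random

# 3x3 grid: states A..I laid out row by row; B, C, G, H are "shaded" (bad) states.
GRID = ["A", "B", "C",
        "D", "E", "F",
        "G", "H", "I"]
SHADED, GOAL = {"B", "C", "G", "H"}, "I"
ACTIONS = {"up": -3, "down": 3, "left": -1, "right": 1}

def step(s, a):
    """Apply action a at state index s; return (next_state, reward, done)."""
    ns = s + ACTIONS[a]
    # stay in place if the move would leave the 3x3 grid
    if ns < 0 or ns > 8 or (a == "left" and s % 3 == 0) or (a == "right" and s % 3 == 2):
        ns = s
    if GRID[ns] == GOAL:
        return ns, 10, True                       # reaching I ends the episode
    return ns, (-1 if GRID[ns] in SHADED else 0), False

# Tabular Q-learning (one simple model-free method; hyperparameters are illustrative).
Q = {(s, a): 0.0 for s in range(9) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(1000):
    s = 0                                         # every episode starts at A
    for t in range(50):                           # cap episode length
        a = (random.choice(list(ACTIONS)) if random.random() < epsilon
             else max(ACTIONS, key=lambda act: Q[(s, act)]))
        ns, r, done = step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(ns, b)] for b in ACTIONS) - Q[(s, a)])
        s = ns
        if done:
            break

# After training, the greedy policy should follow A -> D -> E -> F -> I,
# skirting the shaded states.
```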

How RL Differs from Other Paradigms

| Paradigm | Data Provided | Learning Type |
| --- | --- | --- |
| Supervised Learning | Input-output pairs | Learns from labeled data |
| Unsupervised Learning | Only inputs | Finds patterns, clusters |
| Reinforcement Learning | No fixed dataset | Learns by interacting with the environment |

In RL:

  • There’s no explicit supervision.

  • Agent learns from consequences.

  • Rewards guide learning, not predefined labels.

Example:

  • Supervised: Give the dog explicit commands.

  • RL: Let the dog try and reward it for correct behavior.


Markov Decision Process (MDP)

An MDP is a formal framework for modeling RL problems. It consists of:

  1. States: Different configurations.

  2. Actions: Possible moves.

  3. Transition Probabilities: Likelihood of moving from one state to another given an action.

  4. Reward Function: Maps transitions to scalar rewards.

Markov Property: The next state depends only on the current state (not the full history).

Variants:

  • Markov Chain: Just states and transitions.

  • Markov Reward Process (MRP): States + transitions + rewards.

  • MDP: MRP + actions.
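
Concretely, a small MDP can be written down as plain data. A sketch with two made-up states, two actions, and illustrative numbers:

```python
# A tiny hand-made MDP. P[s][a] is a list of (probability, next_state, reward) triples.
states  = ["sunny", "rainy"]
actions = ["walk", "drive"]

P = {
    "sunny": {
        "walk":  [(0.8, "sunny", +1.0), (0.2, "rainy", -1.0)],
        "drive": [(0.9, "sunny", +0.5), (0.1, "rainy", -0.5)],
    },
    "rainy": {
        "walk":  [(0.3, "sunny", +1.0), (0.7, "rainy", -1.0)],
        "drive": [(0.6, "sunny", +0.5), (0.4, "rainy", -0.5)],
    },
}

# Markov property: the distribution over next states depends only on the
# current state and action, never on how the agent got there.
```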


Fundamental Concepts in RL

Expectation

Expected value is the probability-weighted average of all possible outcomes.
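
A quick worked example: for a fair six-sided die, each face has probability 1/6, so the expected value is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5, even though 3.5 can never actually be rolled.

```python
# Expected value of a fair six-sided die: probability-weighted average of outcomes.
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/6] * 6
expected_value = sum(p * x for p, x in zip(probabilities, outcomes))
print(expected_value)   # 3.5
```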

Action Space

The set of all possible actions:

  • Discrete: [Up, Down, Left, Right]

  • Continuous: e.g., the speed of a car (0–100 km/h)

Policy (π)

A policy defines the agent’s behavior — how it selects actions in states.

  • Deterministic Policy: Maps each state to a specific action.

      π(s) = a
    
  • Stochastic Policy: Maps each state to a probability distribution over actions.

      π(a|s) = P(a given s)
    

Types of stochastic policies:

  • Categorical (for discrete actions)

  • Gaussian (for continuous actions)
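
A sketch of both policy types for a discrete action space, using a lookup table for the deterministic case and a categorical distribution for the stochastic one (the states and probabilities are made up):

```python
import random

ACTIONS = ["Up", "Down", "Left", "Right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "Right", "s1": "Up"}

def act_deterministic(state):
    return deterministic_policy[state]          # π(s) = a

# Stochastic (categorical) policy: each state maps to a distribution over actions.
stochastic_policy = {
    "s0": [0.1, 0.1, 0.1, 0.7],                 # π(a|s0): mostly "Right"
    "s1": [0.6, 0.2, 0.1, 0.1],                 # π(a|s1): mostly "Up"
}

def act_stochastic(state):
    # sample an action according to π(a|state)
    return random.choices(ACTIONS, weights=stochastic_policy[state], k=1)[0]
```

A Gaussian policy for a continuous action space would instead output a mean and standard deviation and sample the action from that normal distribution (e.g., a steering angle or a speed).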

Episode

A complete run from the start state to a terminal state.

  • Trajectory: the sequence of states, actions, and rewards, (s₀, a₀, r₀, s₁, a₁, r₁, ..., s_T).

  • Agents learn over multiple episodes to optimize performance.

Episodic vs Continuous Tasks

  • Episodic: Has a defined end (e.g., playing a chess game).

  • Continuous: No terminal state (e.g., robot vacuum cleaner).

Horizon (Time Frame)

The time frame over which rewards are accumulated.

  • Finite Horizon: Fixed number of steps.

  • Infinite Horizon: Continuous interaction.

Return & Discount Factor (γ)

Return is the total reward an agent accumulates.

For episodic tasks:

G = r₀ + r₁ + ... + r_T

For continuous tasks (or to avoid infinite sums), introduce:

G = r₀ + γr₁ + γ²r₂ + ...

  • γ (gamma): Discount factor (0 ≤ γ ≤ 1)

    • Close to 0 → the agent values immediate rewards more.

    • Close to 1 → the agent values future rewards more.
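
A small sketch of computing the discounted return from a list of rewards (the reward values here are arbitrary):

```python
def discounted_return(rewards, gamma=0.9):
    # G = r0 + gamma*r1 + gamma^2*r2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 0, 0, 10]                 # arbitrary example rewards
print(discounted_return(rewards, 0.9))  # 1 + 0 + 0 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, 0.0))  # only the immediate reward counts: 1.0
```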


Value Functions

Value Function (V)

Expected return from a state s under policy π:

V(s) = E[ G_t | s_t = s ]

Action-Value Function (Q)

Expected return from taking action a in state s, then following policy π:

Q(s, a) = E[ G_t | s_t = s, a_t = a ]

These are used to evaluate how good a state or action is under a given policy.
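
One concrete way to read these definitions: since V(s) and Q(s, a) are expectations over returns, they can be estimated by averaging sampled returns, which is the idea behind Monte Carlo evaluation. A sketch, assuming a hypothetical `run_episode` helper:

```python
from collections import defaultdict

def mc_state_values(run_episode, policy, states, n_episodes=1000):
    """Monte Carlo estimate of V(s): average the returns G observed when
    starting from each state s and following `policy`.

    `run_episode(policy, start_state)` is a hypothetical helper (not defined
    here) that plays one full episode and returns its discounted return G.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for s in states:
        for _ in range(n_episodes):
            totals[s] += run_episode(policy, s)
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in states}
```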


Model-Based vs Model-Free RL

  • Model-Based: Learns (or is given) the environment's transition probabilities and reward function, and can use this model to plan ahead.

  • Model-Free: Directly learns a value function or policy from experience, without modeling the environment.


Types of Environments

  1. Deterministic vs Stochastic:

    • Deterministic: Next state is certain.

    • Stochastic: Outcome is probabilistic.

  2. Discrete vs Continuous:

    • State/action spaces can be finite or infinite.

  3. Episodic vs Non-Episodic:

    • Whether there’s a terminal condition.

  4. Single-Agent vs Multi-Agent:

    • One agent learning vs multiple interacting agents.

Applications of RL

  • Robotics

  • Game playing (Atari, Chess, Go)

  • Finance (portfolio optimization)

  • Healthcare (treatment planning)

  • Recommendation Systems

  • Self-driving cars

  • Smart traffic systems


RL Dictionary (Key Terms)

  • Agent: Learner/decision-maker

  • Environment: The world agent interacts with

  • State: Agent's position/situation

  • Action: Agent's move

  • Reward: Feedback signal

  • Policy: Strategy used by agent

  • Return: Cumulative reward

  • Trajectory: Sequence of states, actions, rewards

  • Discount Factor: Importance of future rewards

  • MDP: Framework for modeling RL problems
