What is Reinforcement Learning?

Reinforcement Learning is a trial-and-error learning process where an agent learns to make decisions by interacting with an environment. It is neither supervised nor unsupervised learning, but rather a third paradigm of learning.
Key Components of RL
Agent: The learner or decision-maker.
- E.g., a robot, a chess player, or Mario in a video game.
Environment: The world the agent lives and interacts in.
- E.g., the chessboard, the game world in Mario.
State (s): A configuration or situation in the environment.
- E.g., the current position of Mario or the current layout of the chessboard.
Action (a): What the agent does at a state.
- E.g., move left, jump, or take a chess piece.
Reward (r): A scalar feedback signal for the action taken.
- Positive for good actions, negative for bad ones.
The Core Idea of RL
Imagine you're training a dog to catch a ball:
When the dog catches the ball → you give a cookie (+reward).
When the dog fails to catch → no cookie (zero or negative reward).
Over time, the dog learns to associate the action of catching the ball with the positive outcome (cookie). This is reinforcement at work — learning by experience, not by instruction.
Similarly, an RL agent:
Explores the environment
Takes actions
Receives rewards
Updates its behavior to maximize cumulative reward.
The Reinforcement Learning Algorithm (High-Level)
The agent observes the current state.
It selects and performs an action.
It receives a reward and transitions to a new state.
It uses the reward to update its policy (decision-making strategy).
Repeat — until the optimal policy is learned.
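A minimal sketch of this loop in Python is below. The `env` and `agent` objects and their methods (`reset`, `step`, `select_action`, `update`) are hypothetical names used for illustration, not a specific library's API:

```python
# Minimal agent-environment interaction loop (illustrative sketch).
def run_episode(env, agent):
    state = env.reset()                                  # 1. observe the current state
    done, total_reward = False, 0.0
    while not done:
        action = agent.select_action(state)              # 2. select and perform an action
        next_state, reward, done = env.step(action)      # 3. receive a reward and the new state
        agent.update(state, action, reward, next_state)  # 4. update the policy
        state = next_state
        total_reward += reward
    return total_reward

# 5. Repeat run_episode over many episodes until the policy stops improving.
```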
A Simple Example: Grid World
Consider a 3x3 grid with some shaded (bad) states and a goal state:
The agent starts at state A.
It must reach state I.
Shaded states (e.g., B, C, G, H) give negative rewards.
Unshaded states give positive rewards.
The agent explores the environment:
In the first episode, it takes random actions, possibly hitting shaded states.
In the next episodes, it learns from its past mistakes and adjusts its behavior.
Eventually, it finds the optimal path from A to I avoiding bad states.
This shows how agents learn over episodes by improving based on the rewards received.
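As a rough sketch, here is tabular Q-learning on a 3x3 grid like the one above. The bad states (B, C, G, H) and the goal (I) follow the example, while the reward values, learning rate, discount factor, and exploration rate are my own assumptions:

```python
import random

# Toy 3x3 grid, states A..I laid out row by row:
#   A B C
#   D E F
#   G H I
# B, C, G, H are the shaded (bad) states; I is the goal.
STATES = "ABCDEFGHI"
BAD, GOAL = set("BCGH"), "I"
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}

def step(state, action):
    i = STATES.index(state)
    j = i + MOVES[action]
    # Stay in place if the move would leave the grid.
    if j < 0 or j > 8 or (action == "left" and i % 3 == 0) or (action == "right" and i % 3 == 2):
        j = i
    nxt = STATES[j]
    reward = 10 if nxt == GOAL else (-5 if nxt in BAD else -1)   # assumed reward values
    return nxt, reward, nxt == GOAL

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # assumed hyperparameters

for _ in range(500):                    # episodes
    s, done = "A", False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # Q-learning update.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in ACTIONS) - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda a: Q[("A", a)]))   # best first move from A after training
```

Early episodes wander into the bad states; as the Q-values are updated, the greedy path from A to I avoids them.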
How RL Differs from Other Paradigms
Paradigm | Data Provided | Learning Type
--- | --- | ---
Supervised Learning | Input-output pairs | Learns from labeled data
Unsupervised Learning | Only inputs | Finds patterns, clusters
Reinforcement Learning | No dataset | Learns by interacting with the environment
In RL:
There’s no explicit supervision.
Agent learns from consequences.
Rewards guide learning, not predefined labels.
Example:
Supervised: Give the dog explicit commands.
RL: Let the dog try and reward it for correct behavior.
Markov Decision Process (MDP)
MDP is a formal framework for modeling RL problems. It consists of:
States: Different configurations.
Actions: Possible moves.
Transition Probabilities: Likelihood of moving from one state to another given an action.
Reward Function: Maps transitions to scalar rewards.
Markov Property: The next state depends only on the current state (not the full history).
Variants:
Markov Chain: Just states and transitions.
Markov Reward Process (MRP): States + transitions + rewards.
MDP: MRP + actions.
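To make this concrete, a tiny MDP can be written out explicitly as transition probabilities and rewards. The two-state example below is my own illustration of the structure; the states, actions, and numbers are made up:

```python
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],   # stochastic transition
    },
    "s1": {
        "stay": [(1.0, "s1", 0.5)],
        "go":   [(1.0, "s0", 0.0)],
    },
}

def expected_reward(state, action):
    # Probability-weighted (expected) immediate reward of taking `action` in `state`.
    return sum(prob * reward for prob, _, reward in P[state][action])

print(expected_reward("s0", "go"))   # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```

Notice that the transitions depend only on the current state and action, which is exactly the Markov property.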
Fundamental Concepts in RL
Expectation
Expected value is the probability-weighted average of all possible outcomes.
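For example, a fair six-sided die has expected value (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5, even though 3.5 can never actually be rolled. In RL, returns are random (actions and transitions can be stochastic), so we reason about their expected values.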
Action Space
The set of all possible actions:
Discrete: [Up, Down, Left, Right]
Continuous: e.g., speed of car (0-100 km/h)
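A quick sketch of the difference in plain Python (the actions and the speed range are just examples):

```python
# Discrete action space: a finite set of choices.
discrete_actions = ["Up", "Down", "Left", "Right"]

# Continuous action space: any value within a range, e.g. a car's speed in km/h.
speed_low, speed_high = 0.0, 100.0

def is_valid_speed(speed):
    return speed_low <= speed <= speed_high
```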
Policy (π)
A policy defines the agent’s behavior — how it selects actions in states.
Deterministic Policy: Maps each state to a specific action.
π(s) = a
Stochastic Policy: Maps each state to a probability distribution over actions.
π(a|s) = P(a given s)
Types of stochastic policies:
Categorical (for discrete actions)
Gaussian (for continuous actions)
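A minimal sketch of these policy types in plain Python, using the standard `random` module for sampling (the states, actions, and probabilities are made up for illustration):

```python
import random

# Deterministic policy: each state maps to exactly one action, π(s) = a.
deterministic_policy = {"s0": "right", "s1": "jump"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic categorical policy: each state maps to a distribution over discrete actions, π(a|s).
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"jump": 0.7, "duck": 0.3},
}

def act_categorical(state):
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

def act_gaussian(mean=50.0, std=5.0):
    # Gaussian policy for a continuous action (e.g. a speed); in practice the
    # mean and standard deviation would be functions of the state.
    return random.gauss(mean, std)
```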
Episode
A complete run from start state to terminal state.
Trajectory: (s₀, a₀, r₀, s₁, a₁, r₁, ..., s_T)
Agents learn over multiple episodes to optimize performance.
Episodic vs Continuous Tasks
Episodic: Has a defined end (e.g., playing a chess game).
Continuous: No terminal state (e.g., robot vacuum cleaner).
Horizon (Time Frame)
The time frame over which rewards are accumulated.
Finite Horizon: Fixed number of steps.
Infinite Horizon: Continuous interaction.
Return & Discount Factor (γ)
Return is the total reward an agent accumulates.
For episodic tasks:
G = r₀ + r₁ + ... + r_T
For continuous tasks (or to avoid infinite sums), introduce a discount factor:
G = r₀ + γr₁ + γ²r₂ + ...
γ (gamma): Discount factor (0 ≤ γ ≤ 1)
Close to 0 → values immediate rewards more.
Close to 1 → values future rewards more.
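A short sketch of computing the discounted return for a given reward sequence (the numbers are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    # G = r0 + γ·r1 + γ²·r2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.0))   # only the immediate reward counts: 1.0
```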
Value Functions
Value Function (V)
The expected return from a state s under policy π:
V(s) = E[ G_t | s_t = s ]
Action-Value Function (Q)
The expected return from state s and action a:
Q(s, a) = E[ G_t | s_t = s, a_t = a ]
These are used to evaluate how good a state or action is under a given policy.
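One simple way to estimate a value function is Monte Carlo: run many episodes under the policy and average the discounted returns observed from the state of interest. The sketch below assumes a hypothetical `env` with `reset_to(state)` and `step(action)` methods and a `policy(state)` function; none of these names come from a specific library:

```python
def estimate_value(env, policy, start_state, gamma=0.9, num_episodes=1000):
    # Monte Carlo estimate of V(s): average discounted return from start_state under policy.
    total = 0.0
    for _ in range(num_episodes):
        state = env.reset_to(start_state)    # hypothetical: start an episode in the given state
        done, g, t = False, 0.0, 0
        while not done:
            action = policy(state)
            state, reward, done = env.step(action)
            g += (gamma ** t) * reward
            t += 1
        total += g
    return total / num_episodes
```

Q(s, a) can be estimated the same way by forcing the first action to be a and following the policy afterwards.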
Model-Based vs Model-Free RL
Model-Based: Learns the transition probabilities and reward functions.
Model-Free: Directly learns the value or policy without modeling the environment.
Types of Environments
Deterministic vs Stochastic:
Deterministic: Next state is certain.
Stochastic: Outcome is probabilistic.
Discrete vs Continuous:
- State/action spaces can be finite or infinite.
Episodic vs Non-Episodic:
- Whether there’s a terminal condition.
Single-Agent vs Multi-Agent:
- One agent learning vs multiple interacting agents.
Applications of RL
Robotics
Game playing (Atari, Chess, Go)
Finance (portfolio optimization)
Healthcare (treatment planning)
Recommendation Systems
Self-driving cars
Smart traffic systems
RL Dictionary (Key Terms)
Agent: Learner/decision-maker
Environment: The world agent interacts with
State: Agent's position/situation
Action: Agent's move
Reward: Feedback signal
Policy: Strategy used by agent
Return: Cumulative reward
Trajectory: Sequence of states, actions, rewards
Discount Factor: Importance of future rewards
MDP: Framework for modeling RL problems