What is Reinforcement Learning?

Reinforcement Learning is a trial-and-error learning process where an agent learns to make decisions by interacting with an environment. It is neither supervised nor unsupervised learning, but rather a third paradigm of learning.
Key Components of RL
Agent: The learner or decision-maker.
- E.g., a robot, a chess player, or Mario in a video game.
Environment: The world the agent lives and interacts in.
- E.g., the chessboard, the game world in Mario.
State (s): A configuration or situation in the environment.
- E.g., the current position of Mario or the current layout of the chessboard.
Action (a): What the agent does at a state.
- E.g., move left, jump, or take a chess piece.
Reward (r): A scalar feedback signal for the action taken.
- Positive for good actions, negative for bad ones.
The Core Idea of RL
Imagine you're training a dog to catch a ball:
When the dog catches the ball → you give a cookie (+reward).
When the dog fails to catch → no cookie (zero or negative reward).
Over time, the dog learns to associate the action of catching the ball with the positive outcome (cookie). This is reinforcement at work — learning by experience, not by instruction.
Similarly, an RL agent:
Explores the environment
Takes actions
Receives rewards
Updates its behavior to maximize cumulative reward.
The Reinforcement Learning Algorithm (High-Level)
The agent observes the current state.
It selects and performs an action.
It receives a reward and transitions to a new state.
It uses the reward to update its policy (decision-making strategy).
Repeat — until the optimal policy is learned.
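A minimal sketch of this loop in Python is below. The `env` and `agent` objects and their methods (`reset`, `step`, `select_action`, `update`) are hypothetical names used for illustration, not a specific library's API:

```python
# Minimal agent-environment interaction loop (illustrative sketch).
def run_episode(env, agent):
    state = env.reset()                                  # 1. observe the current state
    done, total_reward = False, 0.0
    while not done:
        action = agent.select_action(state)              # 2. select and perform an action
        next_state, reward, done = env.step(action)      # 3. receive a reward and the new state
        agent.update(state, action, reward, next_state)  # 4. update the policy
        state = next_state
        total_reward += reward
    return total_reward

# 5. Repeat run_episode over many episodes until the policy stops improving.
```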
A Simple Example: Grid World
Consider a 3x3 grid with some shaded (bad) states and a goal state:
The agent starts at state A.
It must reach state I.
Shaded states (e.g., B, C, G, H) give negative rewards.
Unshaded states give positive rewards.
The agent explores the environment:
In the first episode, it takes random actions, possibly hitting shaded states.
In the next episodes, it learns from its past mistakes and adjusts its behavior.
Eventually, it finds the optimal path from A to I avoiding bad states.
This shows how agents learn over episodes by improving based on the rewards received.
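As a rough sketch, here is tabular Q-learning on a 3x3 grid like the one above. The bad states (B, C, G, H) and the goal (I) follow the example, while the reward values, learning rate, discount factor, and exploration rate are my own assumptions:

```python
import random

# Toy 3x3 grid, states A..I laid out row by row:
#   A B C
#   D E F
#   G H I
# B, C, G, H are the shaded (bad) states; I is the goal.
STATES = "ABCDEFGHI"
BAD, GOAL = set("BCGH"), "I"
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}

def step(state, action):
    i = STATES.index(state)
    j = i + MOVES[action]
    # Stay in place if the move would leave the grid.
    if j < 0 or j > 8 or (action == "left" and i % 3 == 0) or (action == "right" and i % 3 == 2):
        j = i
    nxt = STATES[j]
    reward = 10 if nxt == GOAL else (-5 if nxt in BAD else -1)   # assumed reward values
    return nxt, reward, nxt == GOAL

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # assumed hyperparameters

for _ in range(500):                    # episodes
    s, done = "A", False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # Q-learning update.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in ACTIONS) - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda a: Q[("A", a)]))   # best first move from A after training
```

Early episodes wander into the bad states; as the Q-values are updated, the greedy path from A to I avoids them.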
How RL Differs from Other Paradigms
Paradigm | Data Provided | Learning Type
--- | --- | ---
Supervised Learning | Input-output pairs | Learns from labeled data
Unsupervised Learning | Only inputs | Finds patterns, clusters
Reinforcement Learning | No dataset | Learns by interacting with the environment
In RL:
There’s no explicit supervision.
Agent learns from consequences.
Rewards guide learning, not predefined labels.
Example:
Supervised: Give the dog explicit commands.
RL: Let the dog try and reward it for correct behavior.
Markov Decision Process (MDP)
MDP is a formal framework for modeling RL problems. It consists of:
States: Different configurations.
Actions: Possible moves.
Transition Probabilities: Likelihood of moving from one state to another given an action.
Reward Function: Maps transitions to scalar rewards.
Markov Property: The next state depends only on the current state (not the full history).
Variants:
Markov Chain: Just states and transitions.
Markov Reward Process (MRP): States + transitions + rewards.
MDP: MRP + actions.
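To make this concrete, a tiny MDP can be written out explicitly as transition probabilities and rewards. The two-state example below is my own illustration of the structure; the states, actions, and numbers are made up:

```python
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],   # stochastic transition
    },
    "s1": {
        "stay": [(1.0, "s1", 0.5)],
        "go":   [(1.0, "s0", 0.0)],
    },
}

def expected_reward(state, action):
    # Probability-weighted (expected) immediate reward of taking `action` in `state`.
    return sum(prob * reward for prob, _, reward in P[state][action])

print(expected_reward("s0", "go"))   # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```

Notice that the transitions depend only on the current state and action, which is exactly the Markov property.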
Fundamental Concepts in RL
Expectation
Expected value is the probability-weighted average of all possible outcomes.
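For example, a fair six-sided die has expected value (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5, even though 3.5 can never actually be rolled. In RL, returns are random (actions and transitions can be stochastic), so we reason about their expected values.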
Action Space
The set of all possible actions:
Discrete: [Up, Down, Left, Right]
Continuous: e.g., speed of car (0-100 km/h)
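A quick sketch of the difference in plain Python (the actions and the speed range are just examples):

```python
# Discrete action space: a finite set of choices.
discrete_actions = ["Up", "Down", "Left", "Right"]

# Continuous action space: any value within a range, e.g. a car's speed in km/h.
speed_low, speed_high = 0.0, 100.0

def is_valid_speed(speed):
    return speed_low <= speed <= speed_high
```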
Policy (π)
A policy defines the agent’s behavior — how it selects actions in states.
Deterministic Policy: Maps each state to a specific action.
π(s) = a
Stochastic Policy: Maps each state to a probability distribution over actions.
π(a|s) = P(a given s)
Types of stochastic policies:
Categorical (for discrete actions)
Gaussian (for continuous actions)
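A minimal sketch of these policy types in plain Python, using the standard `random` module for sampling (the states, actions, and probabilities are made up for illustration):

```python
import random

# Deterministic policy: each state maps to exactly one action, π(s) = a.
deterministic_policy = {"s0": "right", "s1": "jump"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic categorical policy: each state maps to a distribution over discrete actions, π(a|s).
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"jump": 0.7, "duck": 0.3},
}

def act_categorical(state):
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

def act_gaussian(mean=50.0, std=5.0):
    # Gaussian policy for a continuous action (e.g. a speed); in practice the
    # mean and standard deviation would be functions of the state.
    return random.gauss(mean, std)
```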
Episode
A complete run from start state to terminal state.
Trajectory: (s₀, a₀, r₀, s₁, a₁, r₁, ..., s_T)
Agents learn over multiple episodes to optimize performance.
Episodic vs Continuous Tasks
Episodic: Has a defined end (e.g., playing a chess game).
Continuous: No terminal state (e.g., robot vacuum cleaner).
Horizon (Time Frame)
The time frame over which rewards are accumulated.
Finite Horizon: Fixed number of steps.
Infinite Horizon: Continuous interaction.
Return & Discount Factor (γ)
Return is the total reward an agent accumulates.
For episodic tasks:
G = r₀ + r₁ + ... + r_T
For continuous tasks (or to avoid infinite sums), introduce a discount factor:
G = r₀ + γr₁ + γ²r₂ + ...
γ (gamma): Discount factor (0 ≤ γ ≤ 1)
Close to 0 → values immediate rewards more.
Close to 1 → values future rewards more.
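A short sketch of computing the discounted return for a given reward sequence (the numbers are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    # G = r0 + γ·r1 + γ²·r2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.0))   # only the immediate reward counts: 1.0
```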
Value Functions
Value Function (V)
The expected return from a state s under policy π:
V(s) = E[ G_t | s_t = s ]
Action-Value Function (Q)
The expected return from state s and action a:
Q(s, a) = E[ G_t | s_t = s, a_t = a ]
These are used to evaluate how good a state or action is under a given policy.
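One simple way to estimate a value function is Monte Carlo: run many episodes under the policy and average the discounted returns observed from the state of interest. The sketch below assumes a hypothetical `env` with `reset_to(state)` and `step(action)` methods and a `policy(state)` function; none of these names come from a specific library:

```python
def estimate_value(env, policy, start_state, gamma=0.9, num_episodes=1000):
    # Monte Carlo estimate of V(s): average discounted return from start_state under policy.
    total = 0.0
    for _ in range(num_episodes):
        state = env.reset_to(start_state)    # hypothetical: start an episode in the given state
        done, g, t = False, 0.0, 0
        while not done:
            action = policy(state)
            state, reward, done = env.step(action)
            g += (gamma ** t) * reward
            t += 1
        total += g
    return total / num_episodes
```

Q(s, a) can be estimated the same way by forcing the first action to be a and following the policy afterwards.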
Model-Based vs Model-Free RL
Model-Based: Learns the transition probabilities and reward functions.
Model-Free: Directly learns the value or policy without modeling the environment.
Types of Environments
Deterministic vs Stochastic:
Deterministic: Next state is certain.
Stochastic: Outcome is probabilistic.
Discrete vs Continuous:
- State/action spaces can be finite or infinite.
Episodic vs Non-Episodic:
- Whether there’s a terminal condition.
Single-Agent vs Multi-Agent:
- One agent learning vs multiple interacting agents.
Applications of RL
Robotics
Game playing (Atari, Chess, Go)
Finance (portfolio optimization)
Healthcare (treatment planning)
Recommendation Systems
Self-driving cars
Smart traffic systems
RL Dictionary (Key Terms)
Agent: Learner/decision-maker
Environment: The world agent interacts with
State: Agent's position/situation
Action: Agent's move
Reward: Feedback signal
Policy: Strategy used by agent
Return: Cumulative reward
Trajectory: Sequence of states, actions, rewards
Discount Factor: Importance of future rewards
MDP: Framework for modeling RL problems