Chapter 1

Introduction

Discover the fundamentals of Reinforcement Learning — a computational approach to learning from interaction to achieve goals.

Agent

The learner and decision-maker

Environment

Everything the agent interacts with

Section 1.1

What is Reinforcement Learning?

Learning what to do — how to map situations to actions — to maximize reward.

Think of it like training a dog!

When you train a dog to sit, you don't explain the physics of sitting. Instead, whenever the dog sits, you give it a treat (reward). When it doesn't, no treat. Over time, the dog learns on its own that sitting = treat. That's reinforcement learning in action!

Reinforcement Learning (RL)

Reinforcement Learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, but must discover which actions yield the most reward by trying them.

In simple terms: RL is about an agent learning the best behavior through trial-and-error, getting feedback in the form of rewards or punishments.

Two Key Characteristics That Define RL

1. Trial-and-Error Search

The agent must try different actions and observe their consequences to learn which ones lead to good outcomes.

Example: When you first learned to ride a bike, nobody could tell you exactly how to balance. You had to try, fall, adjust, and try again until your brain figured out the right movements.

2. Delayed Reward

Actions may affect not just the immediate reward but also future situations and all subsequent rewards.

Example: In chess, a move that looks bad now (like sacrificing your queen) might lead to checkmate 5 moves later. The reward is delayed!

How is RL Different from Other Machine Learning?

There are three main types of machine learning. Understanding the differences helps you know when to use each one:

Supervised Learning

Learn from labeled examples with correct answers provided

Like: A teacher grading your test and showing you the right answers. You learn from the corrections.

Example: Email spam detection — you show the algorithm thousands of emails labeled "spam" or "not spam"

Unsupervised Learning

Find hidden structure in unlabeled data

Like: Organizing your closet by finding natural groupings — shirts here, pants there — without anyone telling you how.

Example: Customer segmentation — grouping customers with similar buying patterns

Reinforcement Learning

Learn to maximize reward through trial-and-error interaction

Like: Learning to play a video game without reading the manual — you just try things and see what gets you more points.

Example: Training an AI to play games, control robots, or make trading decisions

Why RL is Special

The key difference: in RL, there is no teacher providing correct answers. The agent must discover good behavior on its own through interaction with the environment. It's like learning to swim by actually getting in the water, not by reading a book about swimming!

Section 1.2

Examples of RL in Action

From chess to robots, RL principles appear everywhere agents learn from interaction.

RL is everywhere! Here are some real-world examples to help you understand how these concepts apply:

Chess Master

Planning moves ahead + position evaluation

Refinery Controller

Real-time optimization balancing multiple objectives

Gazelle Calf

Learning to run at 20 mph within half an hour of birth

Mobile Robot

Explore vs. recharge battery decisions

Breakfast Prep

Hierarchical goals and sensorimotor coordination

Let's Break Down One Example: The Chess Master

What's the Agent?

The chess-playing program (or the human player learning chess)

What's the Environment?

The chess board, the pieces, and the opponent

What's the State?

The current position of all pieces on the board

What are the Actions?

All legal moves the agent can make (e.g., "move knight to e4")

What's the Reward?

+1 for winning, -1 for losing, 0 for a draw. Notice how the reward only comes at the very end — this is "delayed reward"!

What's the Goal?

Learn to play moves that maximize the chance of winning

Common Pattern in All Examples

Notice that all these examples share a common structure: an agent interacts with an environment, takes actions, receives rewards, and tries to learn the best behavior. This agent-environment interaction is the heart of RL!
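
To make this common structure concrete, here is a minimal Python sketch of the agent-environment loop. The ToyEnvironment and RandomAgent below are made-up stand-ins (not a real chess engine or robot); they only illustrate the cycle of state, action, reward, next state.

```python
import random

class ToyEnvironment:
    """A hypothetical two-state world used only to illustrate the loop."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 from state 0 reaches the goal: reward +1 and the episode ends.
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 1.0, True    # (next state, reward, done)
        return self.state, 0.0, False       # anything else: no reward, keep going

class RandomAgent:
    """Pure trial-and-error: picks actions at random, no learning yet."""
    def act(self, state):
        return random.choice([0, 1])

env, agent = ToyEnvironment(), RandomAgent()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(state)               # agent chooses an action...
    state, reward, done = env.step(action)  # ...environment responds with a new state and reward
    total_reward += reward
print("Total reward this episode:", total_reward)
```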

Section 1.3

Elements of Reinforcement Learning

Four main subelements of any RL system.

Every RL system has four key components. Understanding these is crucial for building or analyzing any RL application. Let's explore each one with simple explanations:

Policy (π)

π: S → A

The policy is the agent's decision-making rule — it tells the agent what action to take in each situation.

🎯 Think of it like this:

A policy is like a strategy guide or a rulebook. If you're playing Pac-Man, your policy might be: "If a ghost is nearby, run away. If a power pellet is close, eat it. Otherwise, eat the nearest dot."

The policy can be simple (a lookup table) or complex (a neural network). The goal of RL is usually to find the best policy.
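
As a rough sketch of how simple a policy can be, here is one written in Python. The state names and rules are hypothetical, loosely following the Pac-Man rules above; a real policy would map actual game states to actual moves.

```python
# A lookup-table policy: each situation maps directly to an action.
table_policy = {
    "ghost_nearby": "run_away",
    "pellet_close": "eat_power_pellet",
    "otherwise":    "eat_nearest_dot",
}

def act(state):
    """Return the action the policy prescribes for this situation."""
    return table_policy.get(state, table_policy["otherwise"])

# The same idea written as rules over (made-up) distances instead of a table.
def pacman_policy(ghost_distance, pellet_distance):
    if ghost_distance <= 2:
        return "run_away"
    if pellet_distance <= 3:
        return "eat_power_pellet"
    return "eat_nearest_dot"

print(act("ghost_nearby"))                                  # run_away
print(pacman_policy(ghost_distance=8, pellet_distance=1))   # eat_power_pellet
```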

Reward Signal (R)

Rₜ

The reward is the immediate feedback the agent receives after each action. It's a single number: positive = good, negative = bad.

🎯 Think of it like this:

Rewards are like the score in a video game. Eating a pellet in Pac-Man = +10 points. Getting caught by a ghost = -100 points (or game over!). The agent's goal is to maximize total points.

Important: The reward defines what we want the agent to achieve, but not how to achieve it. The agent figures out the "how".
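
A reward signal really is just a single number handed back after each step. Here is a tiny sketch using the made-up Pac-Man point values from above; the event names are hypothetical.

```python
def reward(event):
    """Map what just happened to a single number: positive = good, negative = bad."""
    if event == "ate_pellet":
        return +10.0
    if event == "caught_by_ghost":
        return -100.0
    return 0.0   # nothing notable this step

# The agent's objective is to make this running total as large as possible.
events = ["ate_pellet", "ate_pellet", "caught_by_ghost"]
print(sum(reward(e) for e in events))   # -80.0
```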

Value Function (V)

V(s)

The value of a state is the total reward the agent expects to receive in the future, starting from that state.

🎯 Think of it like this:

Imagine you're at a fork in a hiking trail. The reward at each fork is how nice the current view is. But the value considers the entire hike ahead: one path might have a mediocre view now but lead to a spectacular waterfall!

Reward (immediate)

"How good is this moment right now?"

Value (long-term)

"How good is my future from here?"

Model of Environment (Optional)

P(s'|s,a)

A model predicts what the environment will do next — it's the agent's internal simulation of how the world works.

🎯 Think of it like this:

When you plan a chess move, you think "If I move here, they'll probably respond like this, then I can do that..." That mental simulation is your model of the game.

Model-based vs Model-free: Some RL methods build a model to plan ahead. Others learn directly from experience without building a model.
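
Here is a rough sketch of what a model can look like: a table of transition probabilities P(s'|s,a) over hypothetical states, plus a one-step lookahead that "imagines" each action before taking it for real.

```python
# A hypothetical model: model[(state, action)] -> list of (next_state, probability).
model = {
    ("start", "left"):  [("dead_end", 0.9), ("goal", 0.1)],
    ("start", "right"): [("goal", 0.7), ("dead_end", 0.3)],
}

# Assumed values of the possible next states (where a value function would plug in).
state_value = {"goal": 1.0, "dead_end": 0.0}

def lookahead(state, action):
    """Expected value of an action, predicted by the model instead of tried for real."""
    return sum(p * state_value[s_next] for s_next, p in model[(state, action)])

for action in ("left", "right"):
    print(action, lookahead("start", action))   # planning says "right" looks better (0.7 vs 0.1)
```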

Value Functions are Key!

Value estimation is the central component of almost all RL algorithms, and it is arguably the most important insight from 60 years of RL research!

Why? Because rewards tell us what's good right now, but values tell us what's good in the long run. Smart decisions require thinking ahead!

Section 1.4

The Exploration-Exploitation Dilemma

Should you try something new or stick with what works?

The Restaurant Dilemma

Imagine you're choosing where to eat dinner. You know a restaurant you like (7/10). There's a new place you've never tried — it might be amazing (10/10) or terrible (3/10). Do you exploit what you know works, or explore the unknown?

This is the exploration-exploitation tradeoff, and it's unique to RL. You must balance:

🔍 Exploration

Trying new actions to gather more information about the environment.

Pros: Might discover better options
Cons: Might waste time on bad options

🎯 Exploitation

Using current knowledge to make the best decision.

Pros: Reliable, known outcomes
Cons: Might miss better opportunities

Why This Matters
  • Too much exploration: You never benefit from what you've learned
  • Too much exploitation: You might get stuck with a suboptimal choice forever
  • The challenge: Finding the right balance is one of the core challenges in RL!

We'll explore specific strategies for handling this dilemma in Chapter 2 (ε-greedy, UCB, etc.).
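
As a small preview of those Chapter 2 strategies, here is a sketch of ε-greedy applied to the restaurant dilemma. The ratings are invented starting guesses: most of the time the agent exploits the best-rated option, but a fraction ε of the time it explores at random.

```python
import random

estimated_rating = {"usual_place": 7.0, "new_place": 5.0}   # invented current estimates (0-10)
epsilon = 0.1                                               # explore 10% of the time

def choose_restaurant():
    if random.random() < epsilon:
        return random.choice(list(estimated_rating))          # explore: pick any option
    return max(estimated_rating, key=estimated_rating.get)    # exploit: best known option

print(choose_restaurant())   # usually "usual_place", occasionally "new_place"
```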

Section 1.5

Temporal-Difference Learning

The key update rule that makes RL work.

TD learning is one of the most important ideas in RL. It allows an agent to learn before the final outcome is known, updating estimates step-by-step as it goes.

The Weather Prediction Analogy

Imagine you're trying to predict if it will rain on Saturday. It's Monday, and you predicted 30% chance of rain. By Wednesday, new information arrives and you update to 50%. You didn't wait until Saturday to update — you used your updated prediction (Wednesday) to improve your earlier prediction (Monday). That's TD learning!

The TD Update Rule

V(Sₜ) ← V(Sₜ) + α[V(Sₜ₊₁) − V(Sₜ)]

"Update your estimate of the current state based on your estimate of the next state"

  • V(Sₜ), the value of the current state: "How good is where I am now?"
  • α (alpha), the learning rate (0 to 1): "How much should I adjust?"
  • V(Sₜ₊₁) − V(Sₜ), the TD error: "Was the next state better or worse?"

Step-by-Step Example

Let's say you're learning to navigate a maze. You're at position A (estimated value = 0.5) and move to position B (estimated value = 0.8). With learning rate α = 0.1:

TD Error = V(B) - V(A) = 0.8 - 0.5 = 0.3

New V(A) = 0.5 + 0.1 × 0.3 = 0.5 + 0.03 = 0.53

Position A's value increased slightly because it led to a better position (B). Over time, values propagate backward from good outcomes!
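
The same arithmetic, written as a short Python sketch of the simplified TD update shown above (no reward term, matching the rule in this section):

```python
def td_update(v_current, v_next, alpha=0.1):
    """Nudge the current state's value toward the next state's value."""
    td_error = v_next - v_current       # was the next state better or worse than expected?
    return v_current + alpha * td_error

V = {"A": 0.5, "B": 0.8}                # estimated values of maze positions
V["A"] = td_update(V["A"], V["B"])      # we just moved from A to B
print(round(V["A"], 2))                 # 0.53: A's value rises because it led somewhere better
```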

Why TD Learning is Powerful
  • No need to wait: Unlike Monte Carlo methods, you don't need to wait until the episode ends to learn
  • Bootstrap: Uses estimated values to update other estimated values (learns from guesses!)
  • Foundation: This idea is the basis for powerful algorithms like Q-learning and SARSA (Chapter 6)

Section 1.6

Summary

Key takeaways from Chapter 1.

Congratulations on completing Chapter 1! Here's what you've learned:

RL is learning from interaction

An agent learns by trying actions and receiving feedback (rewards), not from labeled examples

Trial-and-error + delayed reward

The two key characteristics that define RL problems

Four elements: Policy, Reward, Value, Model

Every RL system has these components (model is optional)

Value functions look ahead

While rewards are immediate, values estimate long-term future success

Exploration vs exploitation

The fundamental dilemma: try new things or stick with what works?

TD learning updates incrementally

Learn from the difference between consecutive predictions, not just final outcomes

What's Next?

In Chapter 2, we'll dive into the multi-armed bandit problem — the simplest RL setting where we focus purely on the exploration-exploitation tradeoff. You'll learn concrete strategies like ε-greedy, UCB, and gradient bandits!