Introduction
Discover the fundamentals of Reinforcement Learning — a computational approach to learning from interaction to achieve goals.
Agent
The learner and decision-maker
Environment
Everything the agent interacts with
What is Reinforcement Learning?
Learning what to do — how to map situations to actions — to maximize reward.
Think of it like training a dog!
When you train a dog to sit, you don't explain the physics of sitting. Instead, whenever the dog sits, you give it a treat (reward). When it doesn't, no treat. Over time, the dog learns on its own that sitting = treat. That's reinforcement learning in action!
Reinforcement Learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, but must discover which actions yield the most reward by trying them.
In simple terms: RL is about an agent learning the best behavior through trial-and-error, getting feedback in the form of rewards or punishments.
Two Key Characteristics That Define RL
1. Trial-and-Error Search
The agent must try different actions and observe their consequences to learn which ones lead to good outcomes.
Example: When you first learned to ride a bike, nobody could tell you exactly how to balance. You had to try, fall, adjust, and try again until your brain figured out the right movements.
2. Delayed Reward
Actions may affect not just the immediate reward but also future situations and all subsequent rewards.
Example: In chess, a move that looks bad now (like sacrificing your queen) might lead to checkmate 5 moves later. The reward is delayed!
How is RL Different from Other Machine Learning?
There are three main types of machine learning. Understanding the differences helps you know when to use each one:
Supervised Learning
Learn from labeled examples with correct answers provided
Like: A teacher grading your test and showing you the right answers. You learn from the corrections.
Example: Email spam detection — you show the algorithm thousands of emails labeled "spam" or "not spam"
Unsupervised Learning
Find hidden structure in unlabeled data
Like: Organizing your closet by finding natural groupings — shirts here, pants there — without anyone telling you how.
Example: Customer segmentation — grouping customers with similar buying patterns
Reinforcement Learning
Learn to maximize reward through trial-and-error interaction
Like: Learning to play a video game without reading the manual — you just try things and see what gets you more points.
Example: Training an AI to play games, control robots, or make trading decisions
The key difference: in RL, there is no teacher providing correct answers. The agent must discover good behavior on its own through interaction with the environment. It's like learning to swim by actually getting in the water, not by reading a book about swimming!
Examples of RL in Action
From chess to robots, RL principles appear everywhere agents learn from interaction.
RL is everywhere! Here are some real-world examples to help you understand how these concepts apply:
Chess Master
Planning moves ahead + position evaluation
Refinery Controller
Real-time optimization balancing multiple objectives
Gazelle Calf
Learning to run 20 mph within half an hour of birth
Mobile Robot
Explore vs. recharge battery decisions
Breakfast Prep
Hierarchical goals and sensorimotor coordination
Let's Break Down One Example: The Chess Master
What's the Agent?
The chess-playing program (or the human player learning chess)
What's the Environment?
The chess board, the pieces, and the opponent
What's the State?
The current position of all pieces on the board
What are the Actions?
All legal moves the agent can make (e.g., "move knight to e4")
What's the Reward?
+1 for winning, -1 for losing, 0 for a draw. Notice how the reward only comes at the very end — this is "delayed reward"!
What's the Goal?
Learn to play moves that maximize the chance of winning
Notice that all these examples share a common structure: an agent interacts with an environment, takes actions, receives rewards, and tries to learn the best behavior. This agent-environment interaction is the heart of RL!
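To make that shared structure concrete, here is a minimal sketch of the agent-environment loop in Python. The tiny "corridor" environment and random agent below are invented for illustration only; they are not from any particular library.

```python
import random

# A toy "corridor" environment invented for illustration: the agent starts at
# position 0 and earns +1 for reaching position 3, with a small -0.01 cost per
# step along the way (the big payoff is delayed until the end).
class CorridorEnv:
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):              # action: -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        done = self.pos == 3
        reward = 1.0 if done else -0.01
        return self.pos, reward, done

class RandomAgent:
    def select_action(self, state):
        return random.choice([-1, +1])   # pure trial and error: no knowledge yet

def run_episode(env, agent):
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = agent.select_action(state)            # agent acts
        state, reward, done = env.step(action)         # environment responds
        total += reward                                 # rewards accumulate
    return total

print(run_episode(CorridorEnv(), RandomAgent()))
```

A learning agent would replace `RandomAgent` with something that updates its behavior from the rewards it observes; that is exactly what the rest of this chapter builds toward.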
Elements of Reinforcement Learning
Four main subelements of any RL system.
Every RL system has four key components. Understanding these is crucial for building or analyzing any RL application. Let's explore each one with simple explanations:
Policy (π)
π: S → A
The policy is the agent's decision-making rule: it tells the agent what action to take in each situation.
🎯 Think of it like this:
A policy is like a strategy guide or a rulebook. If you're playing Pac-Man, your policy might be: "If a ghost is nearby, run away. If a power pellet is close, eat it. Otherwise, eat the nearest dot."
The policy can be simple (a lookup table) or complex (a neural network). The goal of RL is usually to find the best policy.
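As a toy illustration of the "lookup table" case (the states and actions below are made up for the Pac-Man analogy), a policy can literally be a dictionary mapping situations to actions:

```python
# A lookup-table policy for a made-up Pac-Man-like situation:
# the policy is just a mapping from states to actions.
policy = {
    "ghost_nearby": "run_away",
    "power_pellet_close": "eat_pellet",
    "otherwise": "eat_nearest_dot",
}

def act(state):
    # Fall back to the default rule when the state isn't listed explicitly.
    return policy.get(state, policy["otherwise"])

print(act("ghost_nearby"))   # run_away
print(act("open_corridor"))  # eat_nearest_dot
```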
Reward Signal (R)
Rₜ
The reward is the immediate feedback the agent receives after each action. It's a single number: positive = good, negative = bad.
🎯 Think of it like this:
Rewards are like the score in a video game. Eating a pellet in Pac-Man = +10 points. Getting caught by a ghost = -100 points (or game over!). The agent's goal is to maximize total points.
Important: The reward defines what we want the agent to achieve, but not how to achieve it. The agent figures out the "how".
Value Function (V)
V(s)
The value of a state is the total reward the agent expects to receive in the future, starting from that state.
🎯 Think of it like this:
Imagine you're at a fork in a hiking trail. The reward at each fork is how nice the current view is. But the value considers the entire hike ahead: one path might have a mediocre view now but lead to a spectacular waterfall!
Reward (immediate)
"How good is this moment right now?"
Value (long-term)
"How good is my future from here?"
Model of Environment (Optional)
P(s'|s,a)
A model predicts what the environment will do next: it's the agent's internal simulation of how the world works.
🎯 Think of it like this:
When you plan a chess move, you think "If I move here, they'll probably respond like this, then I can do that..." That mental simulation is your model of the game.
Model-based vs Model-free: Some RL methods build a model to plan ahead. Others learn directly from experience without building a model.
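As a sketch of what a very simple model might look like (the transition table below is invented for illustration), a model lets the agent "imagine" the result of an action without actually taking it:

```python
# A tiny hand-written model of a made-up grid world: for each (state, action)
# pair it predicts the next state and the reward, so the agent can plan ahead
# ("if I go right from A, I expect to land in B with reward 0").
model = {
    ("A", "right"): ("B", 0.0),
    ("B", "right"): ("goal", 1.0),
}

def imagine(state, action):
    """Simulate one step using the model instead of the real environment."""
    return model.get((state, action), (state, 0.0))  # unknown pairs: stay put

print(imagine("A", "right"))  # ('B', 0.0)
print(imagine("B", "right"))  # ('goal', 1.0)
```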
Value estimation is the central component of almost all RL algorithms. This is arguably the most important insight from 60 years of RL research!
Why? Because rewards tell us what's good right now, but values tell us what's good in the long run. Smart decisions require thinking ahead!
The Exploration-Exploitation Dilemma
Should you try something new or stick with what works?
The Restaurant Dilemma
Imagine you're choosing where to eat dinner. You know a restaurant you like (7/10). There's a new place you've never tried — it might be amazing (10/10) or terrible (3/10). Do you exploit what you know works, or explore the unknown?
This is the exploration-exploitation tradeoff, and it's unique to RL. You must balance:
🔍 Exploration
Trying new actions to gather more information about the environment.
Pros: Might discover better options
Cons: Might waste time on bad options
🎯 Exploitation
Using current knowledge to make the best decision.
Pros: Reliable, known outcomes
Cons: Might miss better opportunities
- Too much exploration: You never benefit from what you've learned
- Too much exploitation: You might get stuck with a suboptimal choice forever
- The challenge: Finding the right balance is one of the core challenges in RL!
We'll explore specific strategies for handling this dilemma in Chapter 2 (ε-greedy, UCB, etc.).
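As a small preview of Chapter 2, here is a minimal sketch of the ε-greedy idea: with probability ε the agent explores a random action, otherwise it exploits the action with the highest estimated value. The value estimates below are invented for the restaurant analogy.

```python
import random

# Minimal ε-greedy action selection: explore with probability epsilon,
# otherwise exploit the action with the highest estimated value.
def epsilon_greedy(estimated_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(estimated_values))  # explore: pick anything
    # exploit: pick the index of the highest estimate
    return max(range(len(estimated_values)), key=lambda a: estimated_values[a])

# Hypothetical value estimates for three restaurants
# (0: the known 7/10 favourite, 1 and 2: untried places).
values = [7.0, 5.0, 5.0]
print(epsilon_greedy(values, epsilon=0.2))
```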
Temporal-Difference Learning
The key update rule that makes RL work.
TD learning is one of the most important ideas in RL. It allows an agent to learn before the final outcome is known, updating estimates step-by-step as it goes.
The Weather Prediction Analogy
Imagine you're trying to predict if it will rain on Saturday. It's Monday, and you predicted 30% chance of rain. By Wednesday, new information arrives and you update to 50%. You didn't wait until Saturday to update — you used your updated prediction (Wednesday) to improve your earlier prediction (Monday). That's TD learning!
The TD Update Rule
"Update your estimate of the current state based on your estimate of the next state"
Step-by-Step Example
Let's say you're learning to navigate a maze. You're at position A (estimated value = 0.5) and move to position B (estimated value = 0.8). Assume the move itself gives no immediate reward and there is no discounting. With learning rate α = 0.1:
TD Error = V(B) - V(A) = 0.8 - 0.5 = 0.3
New V(A) = 0.5 + 0.1 × 0.3 = 0.5 + 0.03 = 0.53
Position A's value increased slightly because it led to a better position (B). Over time, values propagate backward from good outcomes!
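Here is that calculation as a few lines of Python: a minimal sketch of the TD(0) update under the same assumptions as the example (no immediate reward on this step, no discounting).

```python
# TD(0) update for the maze example: V(A) moves toward (reward + value of next state).
alpha = 0.1       # learning rate
gamma = 1.0       # no discounting in this example
reward = 0.0      # no immediate reward for the A -> B step

V = {"A": 0.5, "B": 0.8}

td_error = reward + gamma * V["B"] - V["A"]   # 0.8 - 0.5 = 0.3
V["A"] = V["A"] + alpha * td_error            # 0.5 + 0.1 * 0.3 = 0.53

print(round(td_error, 2), round(V["A"], 2))   # 0.3 0.53
```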
- No need to wait: Unlike Monte Carlo methods, you don't need to wait until the episode ends to learn
- Bootstrap: Uses estimated values to update other estimated values (learns from guesses!)
- Foundation: This idea is the basis for powerful algorithms like Q-learning and SARSA (Chapter 6)
Summary
Key takeaways from Chapter 1.
Congratulations on completing Chapter 1! Here's what you've learned:
RL is learning from interaction
An agent learns by trying actions and receiving feedback (rewards), not from labeled examples
Trial-and-error + delayed reward
The two key characteristics that define RL problems
Four elements: Policy, Reward, Value, Model
Every RL system has these components (model is optional)
Value functions look ahead
While rewards are immediate, values estimate long-term future success
Exploration vs exploitation
The fundamental dilemma: try new things or stick with what works?
TD learning updates incrementally
Learn from the difference between consecutive predictions, not just final outcomes
What's Next?
In Chapter 2, we'll dive into the multi-armed bandit problem — the simplest RL setting where we focus purely on the exploration-exploitation tradeoff. You'll learn concrete strategies like ε-greedy, UCB, and gradient bandits!