Chapter 3

Finite Markov Decision Processes

MDPs are the mathematical framework for sequential decision-making. They formalize the agent-environment interaction with states, actions, rewards, and transitions.

  • States (S): situation descriptions
  • Actions (A): available choices
  • Rewards (R): feedback signals

Section 3.1

The Agent-Environment Interface

At each time step, the agent observes a state, takes an action, receives a reward, and transitions to a new state.

Think of it like playing a video game!

You (the agent) see the game screen (the state), press buttons (take actions), and see your score change (receive rewards). The game then shows you the next screen (the new state). This loop continues until the game ends. An MDP is just a formal way to describe this interaction!

Markov Decision Process (MDP)

An MDP is defined by a tuple (S, A, p, r, γ) where:

  • S - set of all possible states (where you can be)
  • A - set of all possible actions (what you can do)
  • p(s',r|s,a) - dynamics function (what happens when you do something)
  • r(s,a) - expected reward function (how good was that?)
  • γ - discount factor (how much do you care about future rewards?)

The Agent-Environment Loop

This is the heartbeat of RL. It happens at every time step t = 0, 1, 2, 3, ...

[Diagram: the agent-environment interaction loop]

At each step, the environment sends state St and reward Rt to the agent. The agent responds with action At, causing a transition to St+1.

The Dynamics Function (Don't Panic!)

p(s', r | s, a) = \Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}

📖 What Each Symbol Means:

p(...)= probability of something happening (a number between 0 and 1)
s'= the next state (the prime symbol ' means "next")
r= the reward received (a number, like +1, -10, or 0)
s= the current state (where you are now)
a= the action taken (what you chose to do)
|= "given that" or "assuming that" (conditional probability)
Pr{}= "the probability that..."
St, Rt= State and Reward at time step t (subscript t = time)

🎯 In plain English:

"If I'm in state s and take action a, what's the probability that I'll end up in state s' and receive reward r?"

📝 How to Read This Formula:

Read it left to right like a sentence:

"p(s', r | s, a)" = "The probability of ending up in state s' with reward r, given that we started in state s and took action a"

🎮 Concrete Example - Slippery Grid:

You're at position (2,3) and press "move right":

p((3,3), +1 | (2,3), right) = 0.9 → 90% chance: move to (3,3), get +1
p((2,3), 0 | (2,3), right) = 0.1 → 10% chance: slip, stay at (2,3), get 0

Notice: probabilities add up to 1.0 (something has to happen!)
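
To make this concrete, here is a minimal Python sketch of the slippery-grid dynamics above, stored as a plain dictionary. The `dynamics` name and the (next_state, reward, probability) layout are illustrative choices, not a standard API.

```python
# A tiny sketch of the dynamics function p(s', r | s, a) for the slippery
# grid example above: the key is (state, action), the value is a list of
# (next_state, reward, probability) outcomes.
dynamics = {
    ((2, 3), "right"): [
        ((3, 3), +1, 0.9),   # 90% chance: move to (3,3), get +1
        ((2, 3),  0, 0.1),   # 10% chance: slip, stay at (2,3), get 0
    ],
}

# Sanity check: for each (s, a), the outcome probabilities must sum to 1.
for (s, a), outcomes in dynamics.items():
    total = sum(p for _, _, p in outcomes)
    assert abs(total - 1.0) < 1e-9, (s, a, total)
print("all outcome distributions sum to 1")
```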

Real-World MDP Examples

Let's see how the MDP framework applies to real problems. In each case, we identify States, Actions, and Rewards:

Self-Driving Car
States: Camera images, sensor readings, speed, position on road
Actions: Accelerate, brake, steer left/right, change lane
Rewards: +1 for safe driving, -100 for crash, +10 for reaching destination

Medical Treatment
States: Patient vitals, test results, medical history, current symptoms
Actions: Prescribe medication, order test, recommend surgery, wait and observe
Rewards: +10 for recovery, -5 for side effects, -50 for deterioration

Stock Trading
States: Stock prices, portfolio holdings, market indicators, news sentiment
Actions: Buy X shares, sell Y shares, hold, diversify
Rewards: Profit/loss from trades (positive or negative dollars)

Video Game AI
States: Game screen pixels, player position, enemy positions, health, score
Actions: Move up/down/left/right, jump, shoot, use power-up
Rewards: Points scored, +100 for level complete, -1 per second (encourages speed)

💡 Notice the pattern: Every decision-making problem can be framed as an MDP. The hard part is choosing the right states, actions, and rewards for your specific problem!

The Markov Property (The Key Insight!)

A state is Markov if the future depends only on the present, not the past.

🎯 Think of it like this:

In chess, the current board position tells you everything you need to make a decision. It doesn't matter how you got to that position — the best move is the same regardless of history. The current state "summarizes" all relevant history.

Common Misconceptions (Don't Make These Mistakes!)

Wrong: "The Markov property means we ignore history"

Right: The state encodes all relevant history. If you need history, make it part of your state! (e.g., "velocity" encodes recent positions)

Wrong: "The agent knows the dynamics function p(s',r|s,a)"

Right: Usually the agent does NOT know the dynamics! It must learn from experience. Only "model-based" RL assumes known dynamics.

Wrong: "Reward tells the agent HOW to do the task"

Right: Reward only says WHAT is good/bad. The agent must figure out HOW by itself — that's the whole point of learning!

Quick Self-Check: Do You Understand Section 3.1?

Can you answer these questions?

What are the 5 components of an MDP? (S, A, p, r, γ)
What does p(s',r|s,a) mean in plain English?
Why is the Markov property important for RL?
Can you frame a real-world problem as an MDP?

If you're unsure about any of these, re-read the section before moving on!

For the Curious Mind

What if the Markov property doesn't hold?

Real-world problems often violate the Markov property. The solution? Make the state include more history! The general framework for this situation is the POMDP (Partially Observable MDP), where the agent doesn't observe the full state. Deep RL often uses recurrent neural networks to remember history automatically.

Did you know? Life is an MDP!

Biologists use MDPs to model animal behavior. Your decisions about what to eat, where to live, who to befriend — they can all be framed as RL! Evolution essentially "trained" your brain to solve the MDP of survival.

Open Research Question

How do we design rewards that capture what we really want? This is called the reward design problem or AI alignment. Getting it wrong can lead to agents that technically maximize reward but do things we didn't intend!

Section 3.2

Goals and Rewards

The reward hypothesis: all goals can be described as maximizing expected cumulative reward.

The Reward Hypothesis

"All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (reward)."

In simple terms: Any goal can be expressed as "get the most total points."

Reward Design

Rewards should signal what you want achieved, not how to achieve it.

✓ Good: +1 for reaching the goal

✗ Bad: +1 for moving toward the goal

Let the agent discover HOW on its own!

Intrinsic vs Extrinsic

Two types of rewards in RL systems:

Extrinsic: From environment (score, win/loss)

Intrinsic: Internal motivation (curiosity, novelty)

Real-World Reward Examples

  • Chess: +1 win, -1 loss, 0 draw. Sparse reward, given only at game end.
  • Maze: -1 per step, +100 at goal. Encourages finding the shortest path.
  • Trading: profit/loss per trade. A direct monetary signal.
  • Robot balancing: +1 per step upright, -100 for falling. A continuing task.

Section 3.3

Returns and Episodes

The return is the total discounted reward from time t onward.

What's a Return?

The return is your "total score" from now until the end. But there's a twist: future rewards are worth less than immediate rewards (we "discount" them). It's like money: $100 today is worth more than $100 next year!

Definition of Return

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

📖 What Each Symbol Means:

Gt= "Return at time t" - your total future score starting from time t
Rt+1= reward received at time t+1 (the immediate next reward)
γ (gamma)= discount factor, a number between 0 and 1 (like 0.9 or 0.99)
γk= γ multiplied by itself k times (γ² = γ×γ, γ³ = γ×γ×γ)
Σ= "sum of" - add up all the terms
k=0 to ∞= k goes from 0, 1, 2, 3, ... forever (or until episode ends)

📝 Expanding the Sum Step-by-Step:

When k=0: γ⁰ × Rt+0+1 = 1 × Rt+1 = Rt+1
When k=1: γ¹ × Rt+1+1 = γ × Rt+2
When k=2: γ² × Rt+2+1 = γ² × Rt+3
... and so on

Note: γ⁰ = 1 (anything to the power of 0 equals 1)

🔢 Concrete Example (γ = 0.9):

Suppose you receive rewards: R1=10, R2=5, R3=20, then the game ends.

G0 = 10 + 0.9×5 + 0.9²×20
G0 = 10 + 4.5 + 16.2 = 30.7

Without discounting (γ=1): 10 + 5 + 20 = 35. With discounting: 30.7 (future rewards worth less!)
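
As a quick check, the same sum can be computed in a few lines of Python; the `discounted_return` helper below is just for illustration.

```python
# Compute G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite episode.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(round(discounted_return([10, 5, 20], 0.9), 2))  # 30.7, matching the example above
print(discounted_return([10, 5, 20], 1.0))            # 35.0 with no discounting
```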

🔑 Key Insight: The Recursive Property

Here's a beautiful fact that makes RL algorithms work:

G_t = R_{t+1} + \gamma G_{t+1}

Derivation:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ...
// Factor out γ from all terms after the first:
Gt = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + ...)
// But that thing in parentheses is just Gt+1!
Gt = Rt+1 + γGt+1

💡 This means: "Total future reward = immediate reward + discounted future-of-future reward"
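
A common way to compute returns in practice is to walk the episode backwards and apply this recursion at each step; here is a minimal sketch (the helper name is illustrative).

```python
# Verify G_t = R_{t+1} + gamma * G_{t+1} by computing returns backwards.
def returns_from_rewards(rewards, gamma):
    g = 0.0                         # the return after the episode has ended is 0
    returns = []
    for r in reversed(rewards):     # walk the episode backwards from the last reward
        g = r + gamma * g           # the recursive property, applied directly
        returns.append(g)
    return list(reversed(returns))  # returns[t] is G_t

print(returns_from_rewards([10, 5, 20], 0.9))  # approximately [30.7, 23.0, 20.0]
```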

Episodic Tasks

Tasks that naturally end: games, mazes, trials.

Examples: Chess game ends with checkmate. A maze run ends when you reach the goal.

γ can be 1 (no discounting needed)

Continuing Tasks

Tasks that go on forever without a natural end.

Examples: A robot that operates 24/7. A thermostat controlling temperature continuously.

Must use γ < 1 to keep returns finite

Discount Factor Worked Example (γ = 0.9)

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ...

Suppose the reward sequence is 1, 1, 1, 1, 1, 10: a reward of 1 at each of the first five steps, then a final reward of 10. Each reward is weighted by the discount factor raised to the number of steps you wait for it:

  • t=0: 1 × γ⁰ = 1 × 1.000 = 1.000
  • t=1: 1 × γ¹ = 1 × 0.900 = 0.900
  • t=2: 1 × γ² = 1 × 0.810 = 0.810
  • t=3: 1 × γ³ = 1 × 0.729 = 0.729
  • t=4: 1 × γ⁴ = 1 × 0.656 = 0.656
  • t=5: 10 × γ⁵ = 10 × 0.590 = 5.905

Total return: G0 ≈ 10.00.

Interpretation: with γ close to 1 the agent is far-sighted, willing to wait for larger future rewards; with γ close to 0 it is myopic and only immediate rewards matter.

Notice how the final reward of 10 at t=5 is significantly discounted when γ is low. With γ = 1 all rewards are weighted equally; with γ = 0 only the immediate reward matters.

Why Discount Future Rewards?
  • Mathematical: Ensures returns are finite for never-ending tasks
  • Uncertainty: Future is uncertain; a bird in hand is worth two in the bush
  • Time value: Like money, immediate rewards can be "invested"
  • Natural preference: Humans and animals prefer sooner rewards

Intuitive Analogy: Discounting is Like Money in the Bank

Imagine you have two choices:

$100 today

$100 next year

Most people prefer $100 today. Why?

  • You could invest today's $100 and have $110 next year
  • Next year's $100 is uncertain (inflation, risk)
  • You can enjoy the $100 sooner

💡 The discount factor γ captures exactly this intuition. If γ = 0.9, then $100 next year is only worth $90 today!

What Different γ Values Mean

γ = 0

"I only care about the immediate next reward. The future doesn't exist to me."

Extremely short-sighted

γ = 0.9

"I care about future rewards, but prefer sooner ones. 10 steps ahead is worth 0.9¹⁰ ≈ 0.35 of today."

Balanced (common choice)

γ = 0.99

"I care almost equally about near and far future. 100 steps ahead is worth 0.99¹⁰⁰ ≈ 0.37."

Far-sighted

For the Curious Mind

Neuroscience Connection: Hyperbolic Discounting

Humans don't actually discount exponentially (γᵗ) — we use hyperbolic discounting! This means we're extra impatient about immediate rewards but more patient about distant ones. That's why you procrastinate even though you know you shouldn't!

What if γ = 1? Infinite returns?

With γ = 1 in continuing tasks, returns become infinite! Some researchers use average reward formulations instead, where you optimize the long-run average rather than the sum. This is common in operations research.

Fun Puzzle: What's the "right" discount factor?

If γ = 0.99 and your agent takes 1 action per second, then rewards 1 hour away are worth 0.99³⁶⁰⁰ ≈ 0 to it! For real-world robots, choosing γ is a subtle art.
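
A quick back-of-the-envelope script makes this concrete by printing the weight a reward receives k steps in the future. The 1/(1 - γ) "effective horizon" line is a common rule of thumb that is not discussed above, included here only as a rough guide.

```python
# How much is a reward k steps in the future worth, for different gammas?
for gamma in (0.9, 0.99):
    for k in (10, 100, 3600):
        print(f"gamma={gamma}, reward {k} steps away has weight {gamma ** k:.3g}")
    # A common rule of thumb (an extra heuristic, not from the text above):
    # 1 / (1 - gamma) gives a rough "effective horizon" in steps.
    print(f"gamma={gamma}: effective horizon ~ {1 / (1 - gamma):.0f} steps")
```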

Section 3.4

Unified Notation for Episodic and Continuing Tasks

A single notation that works for both episodic and continuing tasks.

The Absorbing State Trick

We can treat episodic tasks as continuing by adding a special "absorbing" terminal state that loops to itself forever with zero reward.

G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k

📖 What Each Symbol Means:

Gt= return (total discounted reward) from time t
Σ= sum from k = t+1 to T
T= terminal time step (can be ∞ for continuing tasks)
γk-t-1= discount factor raised to (k - t - 1)
Rk= reward at time step k

📝 Expanding This Sum:

When k = t+1: γ^((t+1)-t-1) = γ⁰ = 1, so we get: Rt+1
When k = t+2: γ^((t+2)-t-1) = γ¹ = γ, so we get: γRt+2
When k = t+3: γ^((t+3)-t-1) = γ², so we get: γ²Rt+3
... continues until k = T

This gives us: Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... (same as before!)

🎯 Why this helps:

Now we can write the same formulas for both types of tasks! When T = ∞, it's continuing. When T is finite, it's episodic. Same math, different settings.

  • Episodic: T = terminal step (finite); γ ∈ [0, 1] (any value OK)
  • Continuing: T = ∞ (never ends); γ ∈ [0, 1) (must be strictly less than 1!)

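Here is a tiny sketch of why the absorbing-state trick is harmless: once the episode has terminated, every additional reward is 0, so padding the reward sequence with zeros leaves the return unchanged (the helper below is illustrative).

```python
# The absorbing-state trick: after termination the agent stays in a special
# state that yields reward 0 forever, so the return is unaffected.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode = [-1, -1, +10]                 # an episodic reward sequence
padded  = episode + [0, 0, 0, 0, 0]     # "continuing" version via the absorbing state

print(discounted_return(episode, 0.9) == discounted_return(padded, 0.9))  # True
```
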
Section 3.5

Policies and Value Functions

Value functions estimate how good states and actions are under a given policy.

The Big Picture

A policy tells you what to do in each state. A value function tells you how good a state (or action) is. The goal of RL is to find the best policy — the one that leads to the highest values!

Policy (π)

A policy π is a mapping from states to action probabilities. π(a|s) is the probability of taking action a in state s.

🎯 Example:

In a grid, a policy might say: "In cell (1,1), go right with 80% probability, go up with 20%"
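
In code, a tabular stochastic policy is often just a nested mapping from states to action probabilities; the sketch below uses the grid-cell example above, with illustrative names.

```python
import random

# A tabular policy pi(a|s): for each state, a dictionary of action probabilities.
policy = {
    (1, 1): {"right": 0.8, "up": 0.2},   # the example from the text
    (1, 2): {"right": 1.0},              # a deterministic choice elsewhere
}

# Each state's action probabilities should sum to 1.
for state, action_probs in policy.items():
    assert abs(sum(action_probs.values()) - 1.0) < 1e-9, state

# Sampling an action from the policy in state (1, 1):
actions, probs = zip(*policy[(1, 1)].items())
print(random.choices(actions, weights=probs, k=1)[0])  # "right" about 80% of the time
```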

State-Value Function vπ(s)

v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]

📖 Symbol by Symbol:

v = value (how good)
π = the policy being followed
𝔼 = expected value (average)
Gt = return (total future reward)
| = "given that"
St=s = we're in state s

🎯 In plain English:

"If I start in state s and follow policy π, how much total reward can I expect on average?"

Action-Value Function qπ(s,a)

q_\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]

📖 Symbol by Symbol:

q = quality (of state-action pair)
π = policy followed after action
s = the state you're in
a = the action you take first
At=a = we take action a

🎯 In plain English:

"If I'm in state s, take action a, then follow π, how much total reward can I expect?"

🔗 How v and q Relate to Each Other

Getting v from q:

v_\pi(s) = \sum_a \pi(a|s) \cdot q_\pi(s,a)

"Average over all actions, weighted by how likely π is to choose each one"

Getting q from v:

q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)[r + \gamma v_\pi(s')]

"Immediate reward + discounted value of wherever we end up"

Bellman Equations & Backup Diagrams

[Backup diagram: the root is state s; it branches by π(a₁|s) and π(a₂|s) to actions a₁ and a₂, which branch via the dynamics to next states s'₁, s'₂, s'₃, s'₄.]
Bellman Equation for v_π
vπ(s) = Eπ[Gt | St=s]
= Σa π(a|s) Σs',r p(s',r|s,a)[r + γvπ(s')]
Key Insight

The value of a state is the weighted sum over all actions (weighted by policy) of the expected value of taking that action.

The backup diagram shows how value propagates backward from successor states. This recursive relationship is the foundation of all RL algorithms.

The Bellman Equation (The Heart of RL!)

v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]

📖 Breaking Down Every Piece:

vπ(s)= value of state s under policy π (what we're calculating)
Σa= sum over all possible actions a
π(a|s)= probability that policy π chooses action a in state s
Σs',r= sum over all possible next states s' and rewards r
p(s',r|s,a)= probability of transitioning to s' with reward r
r= immediate reward received
γ= discount factor (how much we care about future)
vπ(s')= value of the next state (recursive!)

📐 Full Derivation (Step by Step):

Step 1: Start with the definition of vπ(s)

v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]

Step 2: Use the recursive property of returns Gt = Rt+1 + γGt+1

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s]

Step 3: Split the expectation (linearity)

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} | S_t = s] + \gamma \mathbb{E}_\pi[G_{t+1} | S_t = s]

Step 4: The expectation of Gt+1 given the next state is vπ(s')

\mathbb{E}_\pi[G_{t+1} | S_{t+1} = s'] = v_\pi(s')

Step 5: Account for all ways to reach different (s', r) pairs

v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)[r + \gamma v_\pi(s')]

🧠 Reading It Like a Story:

The value of being in state s equals:

  1. For each action a I might take (weighted by probability π(a|s))...
  2. For each place I might end up (weighted by probability p(s',r|s,a))...
  3. Add up: immediate reward + discount × value of where I land

🎯 The Key Insight:

The value of a state = expected immediate reward + discounted value of next state

This is recursive: the value of now depends on the value of later. It's like asking "how good is my position in chess?" — it depends on "how good will my future positions be?"

Why Bellman Equations Matter

The Bellman equation is the foundation of ALL RL algorithms. It gives us a way to compute values by "backing up" information from future states to current states. Dynamic Programming, TD learning, Q-learning — they all use this idea!

Why Do Value Functions Matter? Real Applications

🎮 Game Playing AI

AlphaGo uses value functions to evaluate board positions. Instead of searching millions of moves, it asks: "What's the value of this position?" and focuses on promising moves.

🤖 Robot Navigation

A robot can learn v(s) for each location in a building. High values near charging stations, low values near stairs. Then it just moves toward higher-value states!

📈 Trading Systems

q(market_state, buy) vs q(market_state, sell) tells you which action leads to more profit. The action-value function directly informs trading decisions.

🏥 Treatment Planning

Value functions can estimate expected patient outcomes for different treatment sequences, helping doctors choose optimal care paths.

Intuitive Difference: v(s) vs q(s,a)

Imagine you're standing at a crossroads in a video game:

v(crossroads) = 50

"On average, starting from this crossroads and following my policy, I'll earn 50 points."

This is an average over all actions I might take.

q(crossroads, go_left) = 70

"If I go LEFT specifically, then follow my policy, I'll earn 70 points."

This is for a specific first action.

💡 Key insight: If q(s, left) > v(s), then "left" is better than average! This is how we improve policies — find actions with q > v.

For the Curious Mind

The Bellman Equation is Everywhere!

Richard Bellman discovered this in the 1950s for optimal control, not AI. It's used in economics, operations research, finance, and even DNA sequence alignment! The same recursive structure appears whenever you optimize over time.

Why is solving the Bellman equation hard?

With n states and m actions, you have n equations with n unknowns — that's solvable! But real problems have billions of states. This is the "curse of dimensionality" that Bellman himself named. Deep RL uses neural networks to approximate values without storing them all.
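
To see the "n equations, n unknowns" point in action, here is a minimal sketch that solves the Bellman equation for vπ exactly as a linear system, for a hypothetical 2-state MDP whose transition matrix and rewards are made up.

```python
# Solving the Bellman equation for v_pi exactly, as a linear system,
# for a made-up 2-state MDP under some fixed policy pi.
import numpy as np

gamma = 0.9

# P_pi[s, s']: probability of moving from s to s' under the policy pi.
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])

# r_pi[s]: expected immediate reward in state s under pi.
r_pi = np.array([1.0, -1.0])

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
# Rearranged: (I - gamma * P_pi) v = r_pi, then solve the linear system.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)  # one value per state
```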

Mind-Bending Thought Experiment

If you knew the true value function vπ(s) for every state, you could play perfectly with zero learning — just always move toward higher-value states! The whole challenge of RL is that we don't know the true values.

Section 3.6

Optimal Policies and Optimal Value Functions

Finding the policy that maximizes expected return from all states.

The Ultimate Goal

Find the optimal policy π* — the one that does better than (or at least as well as) every other policy in every state. The amazing thing? Such a policy always exists for finite MDPs!

Optimal Value Functions

Optimal state-value:

v_*(s) = \max_\pi v_\pi(s)
v*(s)= optimal value of state s (the * means "optimal")
maxπ= take the maximum over all possible policies

🎯 "The best possible value for state s, if we use the best possible policy"

Optimal action-value:

q_*(s, a) = \max_\pi q_\pi(s, a)
q*(s,a)= optimal value of taking action a in state s

🎯 "The best return I can get if I take action a in state s, then act optimally"

🔑 Key Relationships

v_*(s) = \max_a q_*(s, a)

"The value of a state = value of the best action from that state"

q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t=a]

"Value of action = reward + discounted value of next state"

Bellman Optimality Equations

The optimal values satisfy special recursive equations (like regular Bellman, but with max instead of averaging over policy):

For state-value v*:

v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]

📖 Each Symbol:

v*(s)= optimal value of state s
maxa= pick the action that gives the highest value
Σs',r= average over possible outcomes (next states & rewards)
[r + γv*(s')]= immediate reward + discounted optimal future value

📐 How We Get This:

1. Start with: v*(s) = maxa q*(s, a)
(optimal value = value of best action)
2. Substitute the formula for q*(s, a):
q*(s,a) = 𝔼[r + γv*(s') | s, a] = Σs',r p(s',r|s,a)[r + γv*(s')]
3. Result: v*(s) = maxa Σs',r p(s',r|s,a)[r + γv*(s')] ✓

🎯 "Optimal value = pick the best action, then average over outcomes of (reward + discounted future optimal value)"

For action-value q*:

q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]

📖 Each Symbol:

q*(s,a)= optimal value of action a in state s
r= immediate reward from taking action a
maxa' q*(s',a')= best action value in the next state = v*(s')

📐 How We Get This:

1. q*(s,a) = 𝔼[Rt+1 + γv*(St+1) | St=s, At=a]
(value of action = reward + discounted value of next state)
2. But v*(s') = maxa' q*(s', a'), so substitute:
3. Result: q*(s,a) = Σs',r p(s',r|s,a)[r + γ maxa' q*(s',a')] ✓

🎯 "Value of action a = expected reward + discounted value of the best action in wherever I end up"

🔄 Regular Bellman vs Optimality Bellman:

Regular (for any policy π):

vπ(s) = Σa π(a|s) Σs',r p(...)[r + γvπ(s')]

Averages over actions (using π)

Optimality (for best policy):

v*(s) = maxa Σs',r p(...)[r + γv*(s')]

Picks the best action (using max)

💡 The difference: Regular Bellman averages over what policy π would do. Optimality Bellman maximizes to find what the best policy would do.
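
One standard way to exploit the optimality equation is value iteration: repeatedly apply the max-backup until the values stop changing. Below is a minimal sketch on a tiny made-up MDP; the `dynamics` table, state names, and action names are purely illustrative.

```python
# A minimal value-iteration sketch using the Bellman optimality backup:
# v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma * v(s')].
# dynamics[(s, a)] is a list of (probability, next_state, reward) tuples.
gamma = 0.9

dynamics = {
    ("A", "left"):  [(1.0, "A", 0.0)],
    ("A", "right"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "left"):  [(1.0, "A", 0.0)],
    ("B", "right"): [(1.0, "B", 2.0)],
}
states = ["A", "B"]
actions = ["left", "right"]

v = {s: 0.0 for s in states}
for _ in range(1000):                      # sweep until (approximately) converged
    delta = 0.0
    for s in states:
        backups = [sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
                   for a in actions]
        new_v = max(backups)               # max over actions: the optimality backup
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:
        break

print(v)  # approximate v*(s) for each state
```

Chapter 4 covers this family of methods (Dynamic Programming) properly; this sketch is only meant to show the max-backup in code.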

Gridworld Example: Seeing the Values

4x4 Gridworld

[Grid figure: a 4x4 grid with GOAL (terminal) cells in the top-left and bottom-right corners; the agent (A) starts in an interior cell and every non-terminal value is initialized to 0.0.]

Environment Details
  • Reward: -1 per step (except terminal)
  • Terminal states: Top-left, Bottom-right
  • Actions: Up, Down, Left, Right
  • Discount factor (γ): 0.9

Running policy evaluation on this gridworld shows the value function converging under a random policy. The resulting values are the expected returns from each state.

Step-by-Step: How to "Solve" an MDP

Here's the roadmap for going from an MDP definition to an optimal policy:

Step 1: Define the MDP
Specify states S, actions A, rewards R, transitions p, and discount γ

Step 2: Find the Optimal Value Function
Compute v*(s) or q*(s,a) using the Bellman optimality equations (Dynamic Programming, etc.)

Step 3: Extract the Optimal Policy
In each state, pick the action with the highest q* value: π*(s) = argmaxa q*(s,a)

Step 4: Execute the Policy
Follow π*(s) in every state — you're now acting optimally!

💡 The hard part: Step 2! Computing optimal values is computationally expensive. The rest of this book explores efficient methods to do this.

From q* to π*: It's Easy!

Once we have q*(s,a), finding the optimal policy is trivial:

\pi_*(s) = \arg\max_a q_*(s, a)

📖 What This Notation Means:

π*(s)= the optimal action to take in state s
argmaxa= "the argument (value of a) that maximizes..."
q*(s,a)= the optimal action-value function

🎮 Concrete Example:

Suppose you're in state s, and you have 3 actions with these q* values:

q*(s, left) = 2.5
q*(s, right) = 7.8 ← highest!
q*(s, up) = 4.2

Then: π*(s) = right (because it has the highest q-value)

💡 In plain English: "In each state, just pick the action with the highest q* value!"
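
In code, the extraction step is a single argmax over a q-table; the sketch below reuses the made-up q* numbers from the example.

```python
# Extracting a greedy policy from an action-value table.
q_star = {
    ("s", "left"):  2.5,
    ("s", "right"): 7.8,
    ("s", "up"):    4.2,
}

def greedy_action(q, state, actions):
    # pi*(state) = argmax_a q(state, a)
    return max(actions, key=lambda a: q[(state, a)])

print(greedy_action(q_star, "s", ["left", "right", "up"]))  # -> "right"
```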

Section 3.7

Optimality and Approximation

Why finding the true optimal policy is often impractical.

Reality Check: Why Can't We Always Find the Best?

In theory, we can solve the Bellman optimality equations to find π*. In practice, real problems have so many states that we can't even list them all, let alone compute values for each one!

The Curse of Dimensionality

State spaces grow exponentially with the number of variables.

A robot with 10 joints, each with 100 positions, has 100¹⁰ = 10²⁰ states! That's more than the number of atoms in a grain of sand.

The Solution: Approximation

Use function approximation (neural networks, etc.) to generalize.

Instead of storing V(s) for every state, learn a function V(s; θ) with parameters θ that estimates values for any state, even unseen ones.
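
As a flavour of what V(s; θ) can look like, here is a minimal linear function approximation sketch; the feature construction and parameter sizes are hypothetical, chosen only to show that one small parameter vector can produce a value estimate for any state, seen or unseen.

```python
# A minimal sketch of value-function approximation: instead of a table,
# represent V(s; theta) as a linear function of simple features.
import numpy as np

def features(state):
    # state is assumed to be a tuple of raw numbers; append a bias term.
    return np.array(list(state) + [1.0])

theta = np.zeros(3)                # parameters for a 2-dimensional state plus bias

def v_hat(state, theta):
    # V(s; theta) = theta . x(s): the same small parameter vector scores
    # every state, including states never visited during learning.
    return float(theta @ features(state))

print(v_hat((0.5, -1.2), theta))   # 0.0 before any learning has happened
```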

State Space Sizes: A Reality Check

Problem          States     Can Store All?
4x4 Gridworld    16         Yes, easily
Tic-Tac-Toe      ~5,000     Yes
Backgammon       10²⁰       No way!
Chess            10⁴⁷       Impossible
Go               10¹⁷⁰      More than atoms in the universe

The Road Ahead

The rest of this book (and modern RL) is about finding approximate solutions efficiently. We'll learn methods that:

  • Learn from limited experience (not all states)
  • Generalize to unseen states
  • Make good (not perfect) decisions

Section 3.8

Summary

Key takeaways from Chapter 3.

Quick Reference Glossary

Keep this handy! These are the key terms from Chapter 3:

MDP (Markov Decision Process)

The mathematical framework for sequential decision-making: (S, A, p, r, γ)

State (s)

A complete description of the current situation

Action (a)

A choice the agent can make in a state

Reward (r)

Immediate feedback signal (scalar number)

Return (Gt)

Total discounted future reward from time t

Discount Factor (γ)

How much to weight future vs immediate rewards (0 to 1)

Policy (π)

A strategy: mapping from states to action probabilities

State-Value Function vπ(s)

Expected return starting from state s, following policy π

Action-Value Function qπ(s,a)

Expected return taking action a in state s, then following π

Bellman Equation

Recursive formula: v(s) = E[r + γv(s')]

Optimal Policy (π*)

The policy that maximizes value in all states

Markov Property

Future depends only on present state, not history

Congratulations! You've learned the formal framework for RL. Here's what you should remember:

MDPs formalize sequential decision-making

States, actions, rewards, transitions — the complete framework

The agent-environment loop

State → Action → Reward → New State, repeated forever

Markov property

The future depends only on the present state, not on history

Returns are discounted cumulative rewards

γ controls how much we care about future vs. present

Value functions estimate expected returns

v(s) for states, q(s,a) for state-action pairs

Bellman equations are recursive

Value now = immediate reward + discounted value later

Optimal policies exist and are extractable from q*

Just pick argmax_a q*(s,a) in every state

Real problems require approximation

State spaces are too big for exact solutions

Formula Cheat Sheet

The essential formulas from Chapter 3:

Return (discounted sum of rewards):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}

Bellman Equation for vπ:

v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]

Bellman Optimality Equation for v*:

v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]

Optimal Policy Extraction:

\pi_*(s) = \arg\max_a q_*(s, a)

What's Next?

Now that we have the framework, Chapter 4 introduces Dynamic Programming — methods for computing optimal policies when we know the full MDP model. It's the foundation for understanding all RL algorithms!