Chapter 3

Finite Markov Decision Processes

MDPs are the mathematical framework for sequential decision-making. They formalize the agent-environment interaction with states, actions, rewards, and transitions.

  • States (S): situation descriptions
  • Actions (A): available choices
  • Rewards (R): feedback signals

Section 3.1

The Agent-Environment Interface

At each time step, the agent observes a state, takes an action, receives a reward, and transitions to a new state.

Think of it like playing a video game!

You (the agent) see the game screen (the state), press buttons (take actions), and see your score change (receive rewards). The game then shows you the next screen (the new state). This loop continues until the game ends. An MDP is just a formal way to describe this interaction!

Markov Decision Process (MDP)

An MDP is defined by a tuple (S, A, p, r, γ) where:

  • S - set of all possible states (where you can be)
  • A - set of all possible actions (what you can do)
  • p(s',r|s,a) - dynamics function (what happens when you do something)
  • r(s,a) - expected reward function (how good was that?)
  • γ - discount factor (how much do you care about future rewards?)

The Agent-Environment Loop

This is the heartbeat of RL. It happens at every time step t = 0, 1, 2, 3, ...

[Diagram: the agent-environment interaction loop]

At each step, the environment sends state St and reward Rt to the agent. The agent responds with action At, causing a transition to St+1.

The Dynamics Function (Don't Panic!)

p(s', r | s, a) = \Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}

📖 What Each Symbol Means:

p(...)= probability of something happening (a number between 0 and 1)
s'= the next state (the prime symbol ' means "next")
r= the reward received (a number, like +1, -10, or 0)
s= the current state (where you are now)
a= the action taken (what you chose to do)
|= "given that" or "assuming that" (conditional probability)
Pr{}= "the probability that..."
St, Rt= State and Reward at time step t (subscript t = time)

🎯 In plain English:

"If I'm in state s and take action a, what's the probability that I'll end up in state s' and receive reward r?"

📝 How to Read This Formula:

Read it left to right like a sentence:

"p(s', r | s, a)" = "The probability of ending up in state s' with reward r, given that we started in state s and took action a"

🎮 Concrete Example - Slippery Grid:

You're at position (2,3) and press "move right":

p((3,3), +1 | (2,3), right) = 0.9 → 90% chance: move to (3,3), get +1
p((2,3), 0 | (2,3), right) = 0.1 → 10% chance: slip, stay at (2,3), get 0

Notice: probabilities add up to 1.0 (something has to happen!)
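
To make this concrete, here is a minimal Python sketch of the slippery-grid dynamics above, stored as a plain dictionary. The `dynamics` name and the (next_state, reward, probability) layout are illustrative choices, not a standard API.

```python
# A tiny sketch of the dynamics function p(s', r | s, a) for the slippery
# grid example above: the key is (state, action), the value is a list of
# (next_state, reward, probability) outcomes.
dynamics = {
    ((2, 3), "right"): [
        ((3, 3), +1, 0.9),   # 90% chance: move to (3,3), get +1
        ((2, 3),  0, 0.1),   # 10% chance: slip, stay at (2,3), get 0
    ],
}

# Sanity check: for each (s, a), the outcome probabilities must sum to 1.
for (s, a), outcomes in dynamics.items():
    total = sum(p for _, _, p in outcomes)
    assert abs(total - 1.0) < 1e-9, (s, a, total)
print("all outcome distributions sum to 1")
```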

Real-World MDP Examples

Let's see how the MDP framework applies to real problems. In each case, we identify States, Actions, and Rewards:

Self-Driving Car
States: Camera images, sensor readings, speed, position on road
Actions: Accelerate, brake, steer left/right, change lane
Rewards: +1 for safe driving, -100 for crash, +10 for reaching destination

Medical Treatment
States: Patient vitals, test results, medical history, current symptoms
Actions: Prescribe medication, order test, recommend surgery, wait and observe
Rewards: +10 for recovery, -5 for side effects, -50 for deterioration

Stock Trading
States: Stock prices, portfolio holdings, market indicators, news sentiment
Actions: Buy X shares, sell Y shares, hold, diversify
Rewards: Profit/loss from trades (positive or negative dollars)

Video Game AI
States: Game screen pixels, player position, enemy positions, health, score
Actions: Move up/down/left/right, jump, shoot, use power-up
Rewards: Points scored, +100 for level complete, -1 per second (encourages speed)

💡 Notice the pattern: Every decision-making problem can be framed as an MDP. The hard part is choosing the right states, actions, and rewards for your specific problem!

The Markov Property (The Key Insight!)

A state is Markov if the future depends only on the present, not the past.

🎯 Think of it like this:

In chess, the current board position tells you everything you need to make a decision. It doesn't matter how you got to that position — the best move is the same regardless of history. The current state "summarizes" all relevant history.

Common Misconceptions (Don't Make These Mistakes!)

Wrong: "The Markov property means we ignore history"

Right: The state encodes all relevant history. If you need history, make it part of your state! (e.g., "velocity" encodes recent positions)

Wrong: "The agent knows the dynamics function p(s',r|s,a)"

Right: Usually the agent does NOT know the dynamics! It must learn from experience. Only "model-based" RL assumes known dynamics.

Wrong: "Reward tells the agent HOW to do the task"

Right: Reward only says WHAT is good/bad. The agent must figure out HOW by itself — that's the whole point of learning!

Quick Self-Check: Do You Understand Section 3.1?

Can you answer these questions?

What are the 5 components of an MDP? (S, A, p, r, γ)
What does p(s',r|s,a) mean in plain English?
Why is the Markov property important for RL?
Can you frame a real-world problem as an MDP?

If you're unsure about any of these, re-read the section before moving on!

For the Curious Mind

What if the Markov property doesn't hold?

Real-world problems often violate the Markov property. The solution? Make the state include more history! The general framework for this situation is the POMDP (Partially Observable MDP), where the agent doesn't observe the full state. Deep RL often uses recurrent neural networks to remember history automatically.

Did you know? Life is an MDP!

Biologists use MDPs to model animal behavior. Your decisions about what to eat, where to live, who to befriend — they can all be framed as RL! Evolution essentially "trained" your brain to solve the MDP of survival.

Open Research Question

How do we design rewards that capture what we really want? This is called the reward design problem or AI alignment. Getting it wrong can lead to agents that technically maximize reward but do things we didn't intend!

Section 3.2

Goals and Rewards

The reward hypothesis: all goals can be described as maximizing expected cumulative reward.

The Reward Hypothesis

"All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (reward)."

In simple terms: Any goal can be expressed as "get the most total points."

Reward Design

Rewards should signal what you want achieved, not how to achieve it.

✓ Good: +1 for reaching the goal

✗ Bad: +1 for moving toward the goal

Let the agent discover HOW on its own!

Intrinsic vs Extrinsic

Two types of rewards in RL systems:

Extrinsic: From environment (score, win/loss)

Intrinsic: Internal motivation (curiosity, novelty)

Real-World Reward Examples

  • Chess: +1 win, -1 loss, 0 draw. Sparse reward, given only at game end.
  • Maze: -1 per step, +100 at goal. Encourages finding the shortest path.
  • Trading: profit/loss per trade. A direct monetary signal.
  • Robot balancing: +1 per step upright, -100 for falling. A continuing task.

Section 3.3

Returns and Episodes

The return is the total discounted reward from time t onward.

What's a Return?

The return is your "total score" from now until the end. But there's a twist: future rewards are worth less than immediate rewards (we "discount" them). It's like money: $100 today is worth more than $100 next year!

Definition of Return

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

📖 What Each Symbol Means:

Gt= "Return at time t" - your total future score starting from time t
Rt+1= reward received at time t+1 (the immediate next reward)
γ (gamma)= discount factor, a number between 0 and 1 (like 0.9 or 0.99)
γk= γ multiplied by itself k times (γ² = γ×γ, γ³ = γ×γ×γ)
Σ= "sum of" - add up all the terms
k=0 to ∞= k goes from 0, 1, 2, 3, ... forever (or until episode ends)

📝 Expanding the Sum Step-by-Step:

When k=0: γ⁰ × Rt+0+1 = 1 × Rt+1 = Rt+1
When k=1: γ¹ × Rt+1+1 = γ × Rt+2
When k=2: γ² × Rt+2+1 = γ² × Rt+3
... and so on

Note: γ⁰ = 1 (anything to the power of 0 equals 1)

🔢 Concrete Example (γ = 0.9):

Suppose you receive rewards: R1=10, R2=5, R3=20, then the game ends.

G0 = 10 + 0.9×5 + 0.9²×20
G0 = 10 + 4.5 + 16.2 = 30.7

Without discounting (γ=1): 10 + 5 + 20 = 35. With discounting: 30.7 (future rewards worth less!)
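
As a quick check, the same sum can be computed in a few lines of Python; the `discounted_return` helper below is just for illustration.

```python
# Compute G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite episode.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(round(discounted_return([10, 5, 20], 0.9), 2))  # 30.7, matching the example above
print(discounted_return([10, 5, 20], 1.0))            # 35.0 with no discounting
```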

🔑 Key Insight: The Recursive Property

Here's a beautiful fact that makes RL algorithms work:

G_t = R_{t+1} + \gamma G_{t+1}

Derivation:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ...
// Factor out γ from all terms after the first:
Gt = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + ...)
// But that thing in parentheses is just Gt+1!
Gt = Rt+1 + γGt+1

💡 This means: "Total future reward = immediate reward + discounted future-of-future reward"
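
A common way to compute returns in practice is to walk the episode backwards and apply this recursion at each step; here is a minimal sketch (the helper name is illustrative).

```python
# Verify G_t = R_{t+1} + gamma * G_{t+1} by computing returns backwards.
def returns_from_rewards(rewards, gamma):
    g = 0.0                         # the return after the episode has ended is 0
    returns = []
    for r in reversed(rewards):     # walk the episode backwards from the last reward
        g = r + gamma * g           # the recursive property, applied directly
        returns.append(g)
    return list(reversed(returns))  # returns[t] is G_t

print(returns_from_rewards([10, 5, 20], 0.9))  # approximately [30.7, 23.0, 20.0]
```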

Episodic Tasks

Tasks that naturally end: games, mazes, trials.

Examples: Chess game ends with checkmate. A maze run ends when you reach the goal.

γ can be 1 (no discounting needed)

Continuing Tasks

Tasks that go on forever without a natural end.

Examples: A robot that operates 24/7. A thermostat controlling temperature continuously.

Must use γ < 1 to keep returns finite

Discount Factor Worked Example (γ = 0.9)

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ...

Suppose the reward sequence is 1, 1, 1, 1, 1, 10: a reward of 1 at each of the first five steps, then a final reward of 10. Each reward is weighted by the discount factor raised to the number of steps you wait for it:

  • t=0: 1 × γ⁰ = 1 × 1.000 = 1.000
  • t=1: 1 × γ¹ = 1 × 0.900 = 0.900
  • t=2: 1 × γ² = 1 × 0.810 = 0.810
  • t=3: 1 × γ³ = 1 × 0.729 = 0.729
  • t=4: 1 × γ⁴ = 1 × 0.656 = 0.656
  • t=5: 10 × γ⁵ = 10 × 0.590 = 5.905

Total return: G0 ≈ 10.00.

Interpretation: with γ close to 1 the agent is far-sighted, willing to wait for larger future rewards; with γ close to 0 it is myopic and only immediate rewards matter.

Notice how the final reward of 10 at t=5 is significantly discounted when γ is low. With γ = 1 all rewards are weighted equally; with γ = 0 only the immediate reward matters.

Why Discount Future Rewards?
  • Mathematical: Ensures returns are finite for never-ending tasks
  • Uncertainty: Future is uncertain; a bird in hand is worth two in the bush
  • Time value: Like money, immediate rewards can be "invested"
  • Natural preference: Humans and animals prefer sooner rewards

Intuitive Analogy: Discounting is Like Money in the Bank

Imagine you have two choices:

$100 today

$100 next year

Most people prefer $100 today. Why?

  • You could invest today's $100 and have $110 next year
  • Next year's $100 is uncertain (inflation, risk)
  • You can enjoy the $100 sooner

💡 The discount factor γ captures exactly this intuition. If γ = 0.9, then $100 next year is only worth $90 today!

What Different γ Values Mean

γ = 0

"I only care about the immediate next reward. The future doesn't exist to me."

Extremely short-sighted

γ = 0.9

"I care about future rewards, but prefer sooner ones. 10 steps ahead is worth 0.9¹⁰ ≈ 0.35 of today."

Balanced (common choice)

γ = 0.99

"I care almost equally about near and far future. 100 steps ahead is worth 0.99¹⁰⁰ ≈ 0.37."

Far-sighted

For the Curious Mind

Neuroscience Connection: Hyperbolic Discounting

Humans don't actually discount exponentially (γᵗ) — we use hyperbolic discounting! This means we're extra impatient about immediate rewards but more patient about distant ones. That's why you procrastinate even though you know you shouldn't!

What if γ = 1? Infinite returns?

With γ = 1 in continuing tasks, returns become infinite! Some researchers use average reward formulations instead, where you optimize the long-run average rather than the sum. This is common in operations research.

Fun Puzzle: What's the "right" discount factor?

If γ = 0.99 and your agent takes 1 action per second, then rewards 1 hour away are worth 0.99³⁶⁰⁰ ≈ 0 to it! For real-world robots, choosing γ is a subtle art.
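
A quick back-of-the-envelope script makes this concrete by printing the weight a reward receives k steps in the future. The 1/(1 - γ) "effective horizon" line is a common rule of thumb that is not discussed above, included here only as a rough guide.

```python
# How much is a reward k steps in the future worth, for different gammas?
for gamma in (0.9, 0.99):
    for k in (10, 100, 3600):
        print(f"gamma={gamma}, reward {k} steps away has weight {gamma ** k:.3g}")
    # A common rule of thumb (an extra heuristic, not from the text above):
    # 1 / (1 - gamma) gives a rough "effective horizon" in steps.
    print(f"gamma={gamma}: effective horizon ~ {1 / (1 - gamma):.0f} steps")
```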

Section 3.4

Unified Notation for Episodic and Continuing Tasks

A single notation that works for both episodic and continuing tasks.

The Absorbing State Trick

We can treat episodic tasks as continuing by adding a special "absorbing" terminal state that loops to itself forever with zero reward.

G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k

📖 What Each Symbol Means:

Gt= return (total discounted reward) from time t
Σ= sum from k = t+1 to T
T= terminal time step (can be ∞ for continuing tasks)
γk-t-1= discount factor raised to (k - t - 1)
Rk= reward at time step k

📝 Expanding This Sum:

When k = t+1: γ^((t+1)-t-1) = γ⁰ = 1, so we get: Rt+1
When k = t+2: γ^((t+2)-t-1) = γ¹ = γ, so we get: γRt+2
When k = t+3: γ^((t+3)-t-1) = γ², so we get: γ²Rt+3
... continues until k = T

This gives us: Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... (same as before!)

🎯 Why this helps:

Now we can write the same formulas for both types of tasks! When T = ∞, it's continuing. When T is finite, it's episodic. Same math, different settings.

  • Episodic: T = terminal step (finite); γ ∈ [0, 1] (any value OK)
  • Continuing: T = ∞ (never ends); γ ∈ [0, 1) (must be strictly less than 1!)

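Here is a tiny sketch of why the absorbing-state trick is harmless: once the episode has terminated, every additional reward is 0, so padding the reward sequence with zeros leaves the return unchanged (the helper below is illustrative).

```python
# The absorbing-state trick: after termination the agent stays in a special
# state that yields reward 0 forever, so the return is unaffected.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode = [-1, -1, +10]                 # an episodic reward sequence
padded  = episode + [0, 0, 0, 0, 0]     # "continuing" version via the absorbing state

print(discounted_return(episode, 0.9) == discounted_return(padded, 0.9))  # True
```
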
Section 3.5

Policies and Value Functions

Value functions estimate how good states and actions are under a given policy.

The Big Picture

A policy tells you what to do in each state. A value function tells you how good a state (or action) is. The goal of RL is to find the best policy — the one that leads to the highest values!

Policy (π)

A policy π is a mapping from states to action probabilities. π(a|s) is the probability of taking action a in state s.

🎯 Example:

In a grid, a policy might say: "In cell (1,1), go right with 80% probability, go up with 20%"
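
In code, a tabular stochastic policy is often just a nested mapping from states to action probabilities; the sketch below uses the grid-cell example above, with illustrative names.

```python
import random

# A tabular policy pi(a|s): for each state, a dictionary of action probabilities.
policy = {
    (1, 1): {"right": 0.8, "up": 0.2},   # the example from the text
    (1, 2): {"right": 1.0},              # a deterministic choice elsewhere
}

# Each state's action probabilities should sum to 1.
for state, action_probs in policy.items():
    assert abs(sum(action_probs.values()) - 1.0) < 1e-9, state

# Sampling an action from the policy in state (1, 1):
actions, probs = zip(*policy[(1, 1)].items())
print(random.choices(actions, weights=probs, k=1)[0])  # "right" about 80% of the time
```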

State-Value Function vπ(s)

v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]

📖 Symbol by Symbol:

v = value (how good)
π = the policy being followed
𝔼 = expected value (average)
Gt = return (total future reward)
| = "given that"
St=s = we're in state s

🎯 In plain English:

"If I start in state s and follow policy π, how much total reward can I expect on average?"

Action-Value Function qπ(s,a)

q_\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]

📖 Symbol by Symbol:

q = quality (of state-action pair)
π = policy followed after action
s = the state you're in
a = the action you take first
At=a = we take action a

🎯 In plain English:

"If I'm in state s, take action a, then follow π, how much total reward can I expect?"

🔗 How v and q Relate to Each Other

Getting v from q:

v_\pi(s) = \sum_a \pi(a|s) \cdot q_\pi(s,a)

"Average over all actions, weighted by how likely π is to choose each one"

Getting q from v:

q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)[r + \gamma v_\pi(s')]

"Immediate reward + discounted value of wherever we end up"

Bellman Equations & Backup Diagrams

[Backup diagram: the root is state s; it branches by π(a₁|s) and π(a₂|s) to actions a₁ and a₂, which branch via the dynamics to next states s'₁, s'₂, s'₃, s'₄.]
Bellman Equation for v_π
vπ(s) = Eπ[Gt | St=s]
= Σa π(a|s) Σs',r p(s',r|s,a)[r + γvπ(s')]
Key Insight

The value of a state is the weighted sum over all actions (weighted by policy) of the expected value of taking that action.

The backup diagram shows how value propagates backward from successor states. This recursive relationship is the foundation of all RL algorithms.

The Bellman Equation (The Heart of RL!)

v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]

📖 Breaking Down Every Piece:

vπ(s)= value of state s under policy π (what we're calculating)
Σa= sum over all possible actions a
π(a|s)= probability that policy π chooses action a in state s
Σs',r= sum over all possible next states s' and rewards r
p(s',r|s,a)= probability of transitioning to s' with reward r
r= immediate reward received
γ= discount factor (how much we care about future)
vπ(s')= value of the next state (recursive!)

📐 Full Derivation (Step by Step):

Step 1: Start with the definition of vπ(s)

v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]

Step 2: Use the recursive property of returns Gt = Rt+1 + γGt+1

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s]

Step 3: Split the expectation (linearity)

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} | S_t = s] + \gamma \mathbb{E}_\pi[G_{t+1} | S_t = s]

Step 4: The expectation of Gt+1 given the next state is vπ(s')

\mathbb{E}_\pi[G_{t+1} | S_{t+1} = s'] = v_\pi(s')

Step 5: Account for all ways to reach different (s', r) pairs

v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)[r + \gamma v_\pi(s')]

🧠 Reading It Like a Story:

The value of being in state s equals:

  1. For each action a I might take (weighted by probability π(a|s))...
  2. For each place I might end up (weighted by probability p(s',r|s,a))...
  3. Add up: immediate reward + discount × value of where I land

🎯 The Key Insight:

The value of a state = expected immediate reward + discounted value of next state

This is recursive: the value of now depends on the value of later. It's like asking "how good is my position in chess?" — it depends on "how good will my future positions be?"

Why Bellman Equations Matter

The Bellman equation is the foundation of ALL RL algorithms. It gives us a way to compute values by "backing up" information from future states to current states. Dynamic Programming, TD learning, Q-learning — they all use this idea!

Why Do Value Functions Matter? Real Applications

🎮 Game Playing AI

AlphaGo uses value functions to evaluate board positions. Instead of searching millions of moves, it asks: "What's the value of this position?" and focuses on promising moves.

🤖 Robot Navigation

A robot can learn v(s) for each location in a building. High values near charging stations, low values near stairs. Then it just moves toward higher-value states!

📈 Trading Systems

q(market_state, buy) vs q(market_state, sell) tells you which action leads to more profit. The action-value function directly informs trading decisions.

🏥 Treatment Planning

Value functions can estimate expected patient outcomes for different treatment sequences, helping doctors choose optimal care paths.

Intuitive Difference: v(s) vs q(s,a)

Imagine you're standing at a crossroads in a video game:

v(crossroads) = 50

"On average, starting from this crossroads and following my policy, I'll earn 50 points."

This is an average over all actions I might take.

q(crossroads, go_left) = 70

"If I go LEFT specifically, then follow my policy, I'll earn 70 points."

This is for a specific first action.

💡 Key insight: If q(s, left) > v(s), then "left" is better than average! This is how we improve policies — find actions with q > v.

For the Curious Mind

The Bellman Equation is Everywhere!

Richard Bellman discovered this in the 1950s for optimal control, not AI. It's used in economics, operations research, finance, and even DNA sequence alignment! The same recursive structure appears whenever you optimize over time.

Why is solving the Bellman equation hard?

With n states and m actions, you have n equations with n unknowns — that's solvable! But real problems have billions of states. This is the "curse of dimensionality" that Bellman himself named. Deep RL uses neural networks to approximate values without storing them all.
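
To see the "n equations, n unknowns" point in action, here is a minimal sketch that solves the Bellman equation for vπ exactly as a linear system, for a hypothetical 2-state MDP whose transition matrix and rewards are made up.

```python
# Solving the Bellman equation for v_pi exactly, as a linear system,
# for a made-up 2-state MDP under some fixed policy pi.
import numpy as np

gamma = 0.9

# P_pi[s, s']: probability of moving from s to s' under the policy pi.
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])

# r_pi[s]: expected immediate reward in state s under pi.
r_pi = np.array([1.0, -1.0])

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
# Rearranged: (I - gamma * P_pi) v = r_pi, then solve the linear system.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)  # one value per state
```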

Mind-Bending Thought Experiment

If you knew the true value function vπ(s) for every state, you could play perfectly with zero learning — just always move toward higher-value states! The whole challenge of RL is that we don't know the true values.

Section 3.6

Optimal Policies and Optimal Value Functions

Finding the policy that maximizes expected return from all states.

The Ultimate Goal

Find the optimal policy π* — the one that does better than (or at least as well as) every other policy in every state. The amazing thing? Such a policy always exists for finite MDPs!

Optimal Value Functions

Optimal state-value:

v_*(s) = \max_\pi v_\pi(s)
v*(s)= optimal value of state s (the * means "optimal")
maxπ= take the maximum over all possible policies

🎯 "The best possible value for state s, if we use the best possible policy"

Optimal action-value:

q_*(s, a) = \max_\pi q_\pi(s, a)
q*(s,a)= optimal value of taking action a in state s

🎯 "The best return I can get if I take action a in state s, then act optimally"

🔑 Key Relationships

v_*(s) = \max_a q_*(s, a)

"The value of a state = value of the best action from that state"

q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t=a]

"Value of action = reward + discounted value of next state"

Bellman Optimality Equations

The optimal values satisfy special recursive equations (like regular Bellman, but with max instead of averaging over policy):

For state-value v*:

v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]

📖 Each Symbol:

v*(s)= optimal value of state s
maxa= pick the action that gives the highest value
Σs',r= average over possible outcomes (next states & rewards)
[r + γv*(s')]= immediate reward + discounted optimal future value

📐 How We Get This:

1. Start with: v*(s) = maxa q*(s, a)
(optimal value = value of best action)
2. Substitute the formula for q*(s, a):
q*(s,a) = 𝔼[r + γv*(s') | s, a] = Σs',r p(s',r|s,a)[r + γv*(s')]
3. Result: v*(s) = maxa Σs',r p(s',r|s,a)[r + γv*(s')] ✓

🎯 "Optimal value = pick the best action, then average over outcomes of (reward + discounted future optimal value)"

For action-value q*:

q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]

📖 Each Symbol:

q*(s,a)= optimal value of action a in state s
r= immediate reward from taking action a
maxa' q*(s',a')= best action value in the next state = v*(s')

📐 How We Get This:

1. q*(s,a) = 𝔼[Rt+1 + γv*(St+1) | St=s, At=a]
(value of action = reward + discounted value of next state)
2. But v*(s') = maxa' q*(s', a'), so substitute:
3. Result: q*(s,a) = Σs',r p(s',r|s,a)[r + γ maxa' q*(s',a')] ✓

🎯 "Value of action a = expected reward + discounted value of the best action in wherever I end up"

🔄 Regular Bellman vs Optimality Bellman:

Regular (for any policy π):

vπ(s) = Σa π(a|s) Σs',r p(...)[r + γvπ(s')]

Averages over actions (using π)

Optimality (for best policy):

v*(s) = maxa Σs',r p(...)[r + γv*(s')]

Picks the best action (using max)

💡 The difference: Regular Bellman averages over what policy π would do. Optimality Bellman maximizes to find what the best policy would do.
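
One standard way to exploit the optimality equation is value iteration: repeatedly apply the max-backup until the values stop changing. Below is a minimal sketch on a tiny made-up MDP; the `dynamics` table, state names, and action names are purely illustrative.

```python
# A minimal value-iteration sketch using the Bellman optimality backup:
# v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma * v(s')].
# dynamics[(s, a)] is a list of (probability, next_state, reward) tuples.
gamma = 0.9

dynamics = {
    ("A", "left"):  [(1.0, "A", 0.0)],
    ("A", "right"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "left"):  [(1.0, "A", 0.0)],
    ("B", "right"): [(1.0, "B", 2.0)],
}
states = ["A", "B"]
actions = ["left", "right"]

v = {s: 0.0 for s in states}
for _ in range(1000):                      # sweep until (approximately) converged
    delta = 0.0
    for s in states:
        backups = [sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
                   for a in actions]
        new_v = max(backups)               # max over actions: the optimality backup
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:
        break

print(v)  # approximate v*(s) for each state
```

Chapter 4 covers this family of methods (Dynamic Programming) properly; this sketch is only meant to show the max-backup in code.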

Gridworld Example: Seeing the Values

4x4 Gridworld

[Grid figure: a 4x4 grid with GOAL (terminal) cells in the top-left and bottom-right corners; the agent (A) starts in an interior cell and every non-terminal value is initialized to 0.0.]

Environment Details
  • Reward: -1 per step (except terminal)
  • Terminal states: Top-left, Bottom-right
  • Actions: Up, Down, Left, Right
  • Discount factor (γ): 0.9

Running policy evaluation on this gridworld shows the value function converging under a random policy. The resulting values are the expected returns from each state.

Step-by-Step: How to "Solve" an MDP

Here's the roadmap for going from an MDP definition to an optimal policy:

Step 1: Define the MDP
Specify states S, actions A, rewards R, transitions p, and discount γ

Step 2: Find the Optimal Value Function
Compute v*(s) or q*(s,a) using the Bellman optimality equations (Dynamic Programming, etc.)

Step 3: Extract the Optimal Policy
In each state, pick the action with the highest q* value: π*(s) = argmaxa q*(s,a)

Step 4: Execute the Policy
Follow π*(s) in every state — you're now acting optimally!

💡 The hard part: Step 2! Computing optimal values is computationally expensive. The rest of this book explores efficient methods to do this.

From q* to π*: It's Easy!

Once we have q*(s,a), finding the optimal policy is trivial:

\pi_*(s) = \arg\max_a q_*(s, a)

📖 What This Notation Means:

π*(s)= the optimal action to take in state s
argmaxa= "the argument (value of a) that maximizes..."
q*(s,a)= the optimal action-value function

🎮 Concrete Example:

Suppose you're in state s, and you have 3 actions with these q* values:

q*(s, left) = 2.5
q*(s, right) = 7.8 ← highest!
q*(s, up) = 4.2

Then: π*(s) = right (because it has the highest q-value)

💡 In plain English: "In each state, just pick the action with the highest q* value!"
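
In code, the extraction step is a single argmax over a q-table; the sketch below reuses the made-up q* numbers from the example.

```python
# Extracting a greedy policy from an action-value table.
q_star = {
    ("s", "left"):  2.5,
    ("s", "right"): 7.8,
    ("s", "up"):    4.2,
}

def greedy_action(q, state, actions):
    # pi*(state) = argmax_a q(state, a)
    return max(actions, key=lambda a: q[(state, a)])

print(greedy_action(q_star, "s", ["left", "right", "up"]))  # -> "right"
```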

Section 3.7

Optimality and Approximation

Why finding the true optimal policy is often impractical.

Reality Check: Why Can't We Always Find the Best?

In theory, we can solve the Bellman optimality equations to find π*. In practice, real problems have so many states that we can't even list them all, let alone compute values for each one!

The Curse of Dimensionality

State spaces grow exponentially with the number of variables.

A robot with 10 joints, each with 100 positions, has 100¹⁰ = 10²⁰ states! That's more than the number of atoms in a grain of sand.

The Solution: Approximation

Use function approximation (neural networks, etc.) to generalize.

Instead of storing V(s) for every state, learn a function V(s; θ) with parameters θ that estimates values for any state, even unseen ones.
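
As a flavour of what V(s; θ) can look like, here is a minimal linear function approximation sketch; the feature construction and parameter sizes are hypothetical, chosen only to show that one small parameter vector can produce a value estimate for any state, seen or unseen.

```python
# A minimal sketch of value-function approximation: instead of a table,
# represent V(s; theta) as a linear function of simple features.
import numpy as np

def features(state):
    # state is assumed to be a tuple of raw numbers; append a bias term.
    return np.array(list(state) + [1.0])

theta = np.zeros(3)                # parameters for a 2-dimensional state plus bias

def v_hat(state, theta):
    # V(s; theta) = theta . x(s): the same small parameter vector scores
    # every state, including states never visited during learning.
    return float(theta @ features(state))

print(v_hat((0.5, -1.2), theta))   # 0.0 before any learning has happened
```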

State Space Sizes: A Reality Check

Problem          States     Can Store All?
4x4 Gridworld    16         Yes, easily
Tic-Tac-Toe      ~5,000     Yes
Backgammon       10²⁰       No way!
Chess            10⁴⁷       Impossible
Go               10¹⁷⁰      More than atoms in the universe

The Road Ahead

The rest of this book (and modern RL) is about finding approximate solutions efficiently. We'll learn methods that:

  • Learn from limited experience (not all states)
  • Generalize to unseen states
  • Make good (not perfect) decisions

Section 3.8

Summary

Key takeaways from Chapter 3.

Quick Reference Glossary

Keep this handy! These are the key terms from Chapter 3:

MDP (Markov Decision Process)

The mathematical framework for sequential decision-making: (S, A, p, r, γ)

State (s)

A complete description of the current situation

Action (a)

A choice the agent can make in a state

Reward (r)

Immediate feedback signal (scalar number)

Return (Gt)

Total discounted future reward from time t

Discount Factor (γ)

How much to weight future vs immediate rewards (0 to 1)

Policy (π)

A strategy: mapping from states to action probabilities

State-Value Function vπ(s)

Expected return starting from state s, following policy π

Action-Value Function qπ(s,a)

Expected return taking action a in state s, then following π

Bellman Equation

Recursive formula: v(s) = E[r + γv(s')]

Optimal Policy (π*)

The policy that maximizes value in all states

Markov Property

Future depends only on present state, not history

Congratulations! You've learned the formal framework for RL. Here's what you should remember:

MDPs formalize sequential decision-making

States, actions, rewards, transitions — the complete framework

The agent-environment loop

State → Action → Reward → New State, repeated forever

Markov property

The future depends only on the present state, not on history

Returns are discounted cumulative rewards

γ controls how much we care about future vs. present

Value functions estimate expected returns

v(s) for states, q(s,a) for state-action pairs

Bellman equations are recursive

Value now = immediate reward + discounted value later

Optimal policies exist and are extractable from q*

Just pick argmax_a q*(s,a) in every state

Real problems require approximation

State spaces are too big for exact solutions

Formula Cheat Sheet

The essential formulas from Chapter 3:

Return (discounted sum of rewards):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}

Bellman Equation for vπ:

v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]

Bellman Optimality Equation for v*:

v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]

Optimal Policy Extraction:

\pi_*(s) = \arg\max_a q_*(s, a)

What's Next?

Now that we have the framework, Chapter 4 introduces Dynamic Programming — methods for computing optimal policies when we know the full MDP model. It's the foundation for understanding all RL algorithms!