Finite Markov Decision Processes
MDPs are the mathematical framework for sequential decision-making. They formalize the agent-environment interaction with states, actions, rewards, and transitions.
- States (S) - descriptions of the situation the agent is in
- Actions (A) - the choices available to the agent
- Rewards (R) - scalar feedback signals
The Agent-Environment Interface
At each time step, the agent observes a state, takes an action, receives a reward, and transitions to a new state.
Think of it like playing a video game!
You (the agent) see the game screen (the state), press buttons (take actions), and see your score change (receive rewards). The game then shows you the next screen (the new state). This loop continues until the game ends. An MDP is just a formal way to describe this interaction!
An MDP is defined by a tuple (S, A, p, r, γ), where:
- S - set of all possible states (where you can be)
- A - set of all possible actions (what you can do)
- p(s',r|s,a) - dynamics function (what happens when you do something)
- r(s,a) - expected reward function (how good was that?)
- γ - discount factor (how much do you care about future rewards?)
The Agent-Environment Loop
This is the heartbeat of RL. It happens at every time step t = 0, 1, 2, 3, ...
At each step, the environment sends state St and reward Rt to the agent. The agent responds with action At, causing a transition to St+1.
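To make the loop concrete, here is a minimal Python sketch of one episode of interaction. The two-state ToyEnv and its reset()/step() interface are made up for illustration; they are not part of the MDP definition above.

```python
# A minimal sketch of the agent-environment loop with a toy two-state
# environment. The environment and its reset()/step() interface are
# illustrative, not from the text.
import random

class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state                       # S0

    def step(self, action):
        # Action 1 from state 0 reaches the goal state 1 and ends the episode.
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, +1.0, True       # S_{t+1}, R_{t+1}, done
        return self.state, -1.0, False          # wandering costs -1 per step

def run_episode(env, actions=(0, 1), max_steps=100):
    state = env.reset()                         # environment sends S0
    total = 0.0
    for t in range(max_steps):
        action = random.choice(actions)         # agent chooses At (random policy here)
        state, reward, done = env.step(action)  # environment returns R_{t+1}, S_{t+1}
        total += reward
        if done:
            break
    return total

print(run_episode(ToyEnv()))
```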
The Dynamics Function (Don't Panic!)
p(s', r | s, a) = Pr{St+1 = s', Rt+1 = r | St = s, At = a}
📖 What Each Symbol Means:
- s, a - the current state and the action you take
- s', r - a possible next state and reward
- Pr{...} - the probability of that particular outcome
🎯 In plain English:
"If I'm in state s and take action a, what's the probability that I'll end up in state s' and receive reward r?"
📝 How to Read This Formula:
Read it left to right like a sentence: "the probability of ending up in s' with reward r, given that I was in s and took action a."
🎮 Concrete Example - Slippery Grid:
You're at position (2,3) and press "move right". On a slippery floor the intended move doesn't always happen: for example, you might move right with probability 0.8, slip up with probability 0.1, and slip down with probability 0.1 (illustrative numbers).
Notice: the probabilities add up to 1.0 (something has to happen!)
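A dynamics function like this can be written down as a table. The sketch below uses made-up slip probabilities (0.8 / 0.1 / 0.1) just to show the shape of p(s',r|s,a) and the fact that the outcome probabilities must sum to 1.

```python
# Hypothetical dynamics for a slippery gridworld: from state (2, 3), pressing
# "right" usually works, but you sometimes slip up or down. Probabilities are
# illustrative, not from the original example.
p = {
    ((2, 3), "right"): [
        # (next_state, reward, probability)
        ((2, 4), 0, 0.8),   # intended move succeeds
        ((1, 3), 0, 0.1),   # slip up
        ((3, 3), 0, 0.1),   # slip down
    ],
}

outcomes = p[((2, 3), "right")]
assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9  # probabilities sum to 1
```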
Real-World MDP Examples
Let's see how the MDP framework applies to real problems. In each case, we identify States, Actions, and Rewards:
| Problem | States | Actions | Rewards |
|---|---|---|---|
| Self-Driving Car | Sensor readings: position, speed, nearby objects | Steer, accelerate, brake | Progress toward the destination; penalties for collisions |
| Medical Treatment | Patient vitals, test results, history | Treatment choices and dosages | Improvement in patient health |
| Stock Trading | Prices, indicators, current holdings | Buy, sell, hold | Profit and loss |
| Video Game AI | The game screen or internal game state | Controller inputs | Score changes, win/loss |
💡 Notice the pattern: Every decision-making problem can be framed as an MDP. The hard part is choosing the right states, actions, and rewards for your specific problem!
The Markov Property
A state is Markov if the future depends only on the present, not the past.
🎯 Think of it like this:
In chess, the current board position tells you everything you need to make a decision. It doesn't matter how you got to that position — the best move is the same regardless of history. The current state "summarizes" all relevant history.
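Formally (in the same notation as the dynamics function above), a state St has the Markov property if
Pr{St+1 = s', Rt+1 = r | St, At} = Pr{St+1 = s', Rt+1 = r | S0, A0, R1, ..., St, At}
that is, conditioning on the entire history adds nothing beyond the current state and action.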
Common Misconceptions (Don't Make These Mistakes!)
Wrong: "The Markov property means we ignore history"
Right: The state encodes all relevant history. If you need history, make it part of your state! (e.g., "velocity" encodes recent positions)
Wrong: "The agent knows the dynamics function p(s',r|s,a)"
Right: Usually the agent does NOT know the dynamics! It must learn from experience. Only "model-based" RL assumes known dynamics.
Wrong: "Reward tells the agent HOW to do the task"
Right: Reward only says WHAT is good/bad. The agent must figure out HOW by itself — that's the whole point of learning!
Quick Self-Check: Do You Understand Section 3.1?
Before moving on, check that you can explain in your own words: what the tuple (S, A, p, r, γ) contains, what the dynamics function p(s',r|s,a) tells you, what the Markov property assumes, and what the agent does and does not know about the environment.
If you're unsure about any of these, re-read the section before moving on!
For the Curious Mind
What if the Markov property doesn't hold?
Real-world problems often violate the Markov property. The solution? Make the state include more history! This leads to POMDPs (Partially Observable MDPs), where the agent doesn't see the full state. Deep RL uses recurrent neural networks to remember history automatically.
Did you know? Life is an MDP!
Biologists use MDPs to model animal behavior. Your decisions about what to eat, where to live, who to befriend — they can all be framed as RL! Evolution essentially "trained" your brain to solve the MDP of survival.
Open Research Question
How do we design rewards that capture what we really want? This is called the reward design problem or AI alignment. Getting it wrong can lead to agents that technically maximize reward but do things we didn't intend!
Goals and Rewards
The reward hypothesis: all goals can be described as maximizing expected cumulative reward.
The Reward Hypothesis
"All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
In simple terms: Any goal can be expressed as "get the most total points."
Reward Design
Rewards should signal what you want achieved, not how to achieve it.
✓ Good: +1 for reaching the goal
✗ Bad: +1 for moving toward the goal
Let the agent discover HOW on its own!
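As a rough sketch, here is what the two designs might look like in code for a hypothetical gridworld with a goal cell; the GOAL position and the Manhattan-distance shaping are assumptions for illustration only.

```python
# Two hypothetical reward functions for a gridworld, illustrating the design advice above.

GOAL = (3, 3)

def good_reward(state, next_state):
    """Reward WHAT we want: reaching the goal."""
    return 1.0 if next_state == GOAL else 0.0

def distance(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])  # Manhattan distance

def risky_reward(state, next_state):
    """Reward HOW we think it should be done: moving toward the goal.
    An agent can game this, e.g. by oscillating back and forth to collect
    'progress' reward forever without ever reaching the goal."""
    return 1.0 if distance(next_state, GOAL) < distance(state, GOAL) else 0.0
```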
Intrinsic vs Extrinsic
Two types of rewards in RL systems:
Extrinsic: From environment (score, win/loss)
Intrinsic: Internal motivation (curiosity, novelty)
Real-World Reward Examples
| Task | Reward | Notes |
|---|---|---|
| Chess | +1 win, -1 loss, 0 draw | Sparse: only at game end |
| Maze | -1 per step, +100 at goal | Encourages finding the shortest path |
| Trading | Profit/loss per trade | Direct monetary signal |
| Robot | +1 per step upright, -100 on falling | Continuing task |
Returns and Episodes
The return is the total discounted reward from time t onward.
What's a Return?
The return is your "total score" from now until the end. But there's a twist: future rewards are worth less than immediate rewards (we "discount" them). It's like money: $100 today is worth more than $100 next year!
Definition of Return
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = Σk γᵏ Rt+k+1 (sum over k = 0, 1, 2, ...)
📖 What Each Symbol Means:
- Gt - the return: the total discounted reward from time t onward
- Rt+1, Rt+2, ... - the rewards received at each future time step
- γ - the discount factor, a number between 0 and 1
📝 Expanding the Sum Step-by-Step:
Gt = γ⁰Rt+1 + γ¹Rt+2 + γ²Rt+3 + ... = Rt+1 + γRt+2 + γ²Rt+3 + ...
Note: γ⁰ = 1 (anything to the power of 0 equals 1), so the very first reward is not discounted at all.
🔢 Concrete Example (γ = 0.9):
Suppose you receive rewards R1 = 10, R2 = 5, R3 = 20, then the game ends.
G0 = 10 + 0.9 × 5 + 0.9² × 20 = 10 + 4.5 + 16.2 = 30.7
Without discounting (γ = 1): 10 + 5 + 20 = 35. With discounting: 30.7 (future rewards are worth less!)
🔑 Key Insight: The Recursive Property
Here's a beautiful fact that makes RL algorithms work:
Gt = Rt+1 + γGt+1
Derivation:
Gt = Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + ...
   = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + ...)
   = Rt+1 + γGt+1
💡 This means: "Total future reward = immediate reward + discounted future-of-future reward"
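A few lines of Python make the recursion tangible: sweeping backward through the example episode above reproduces G0 = 30.7.

```python
# A small sketch checking the recursive property G_t = R_{t+1} + gamma * G_{t+1}
# on the example rewards from above (gamma = 0.9).

def returns(rewards, gamma):
    """Compute G_0, G_1, ..., G_{T-1} by sweeping backward through an episode."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G          # the recursion: G_t = R_{t+1} + gamma * G_{t+1}
        out.append(G)
    return list(reversed(out))

print(returns([10, 5, 20], gamma=0.9))  # approximately [30.7, 23.0, 20.0]
```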
Episodic Tasks
Tasks that naturally end: games, mazes, trials.
Examples: Chess game ends with checkmate. A maze run ends when you reach the goal.
γ can be 1 (no discounting needed)
Continuing Tasks
Tasks that go on forever without a natural end.
Examples: A robot that operates 24/7. A thermostat controlling temperature continuously.
Must use γ < 1 to keep returns finite
Interactive Discount Factor Demo
In the interactive version, you can adjust γ and customize the reward sequence to watch the total return G0 change. The return being computed is
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ...
With a high γ the agent is far-sighted: it is willing to wait for larger future rewards. Notice how a final reward of 10 at t = 5 is heavily discounted when γ is low; with γ = 1 all rewards are weighted equally, and with γ = 0 only the immediate reward matters.
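If you are reading a static copy, the following sketch reproduces the idea of the demo; the reward sequence (five rewards of 1 followed by a 10 at t = 5) is an assumption chosen to match the description above.

```python
# In place of the interactive demo: compare the return of the same reward
# sequence under different discount factors. The reward sequence is illustrative.

rewards = [1, 1, 1, 1, 1, 10]   # a reward of 10 arrives at t = 5

for gamma in (0.0, 0.5, 0.9, 1.0):
    G0 = sum(gamma**k * r for k, r in enumerate(rewards))
    print(f"gamma = {gamma:4.2f} -> G0 = {G0:.2f}")
```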
Why Discount? Four Reasons
- Mathematical: Ensures returns are finite for never-ending tasks
- Uncertainty: Future is uncertain; a bird in hand is worth two in the bush
- Time value: Like money, immediate rewards can be "invested"
- Natural preference: Humans and animals prefer sooner rewards
Intuitive Analogy: Discounting is Like Money in the Bank
Imagine you have two choices:
$100 today
$100 next year
Most people prefer $100 today. Why?
- You could invest today's $100 and have $110 next year
- Next year's $100 is uncertain (inflation, risk)
- You can enjoy the $100 sooner
💡 The discount factor γ captures exactly this intuition. If γ = 0.9, then $100 next year is only worth $90 today!
What Different γ Values Mean
γ = 0
"I only care about the immediate next reward. The future doesn't exist to me."
Extremely short-sighted
γ = 0.9
"I care about future rewards, but prefer sooner ones. 10 steps ahead is worth 0.9¹⁰ ≈ 0.35 of today."
Balanced (common choice)
γ = 0.99
"I care almost equally about near and far future. 100 steps ahead is worth 0.99¹⁰⁰ ≈ 0.37."
Far-sighted
For the Curious Mind
Neuroscience Connection: Hyperbolic Discounting
Humans don't actually discount exponentially (γᵗ) — we use hyperbolic discounting! This means we're extra impatient about immediate rewards but more patient about distant ones. That's why you procrastinate even though you know you shouldn't!
What if γ = 1? Infinite returns?
With γ = 1 in continuing tasks, returns can become infinite! Some researchers use average reward formulations instead, where you optimize the long-run average reward rather than the discounted sum. This is common in operations research.
Fun Puzzle: What's the "right" discount factor?
If γ = 0.99 and your agent takes 1 action per second, then rewards 1 hour away are worth 0.99³⁶⁰⁰ ≈ 0 to it! For real-world robots, choosing γ is a subtle art.
Unified Notation for Episodic and Continuing Tasks
A single notation that works for both episodic and continuing tasks.
The Absorbing State Trick
We can treat episodic tasks as continuing by adding a special "absorbing" terminal state that loops to itself forever with zero reward.
Gt = Σk=t+1 to T γ^(k−t−1) Rk
📖 What Each Symbol Means:
- T - the final time step of the episode (T = ∞ for a continuing task)
- γ^(k−t−1) - the discount applied to the reward Rk received at time k
📝 Expanding This Sum:
This gives us: Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... (same as before!)
🎯 Why this helps:
Now we can write the same formulas for both types of tasks! When T = ∞, it's continuing. When T is finite, it's episodic. Same math, different settings.
Episodic
γ ∈ [0, 1] (any value OK)
Continuing
γ ∈ [0, 1) (must be less than 1!)
Policies and Value Functions
Value functions estimate how good states and actions are under a given policy.
The Big Picture
A policy tells you what to do in each state. A value function tells you how good a state (or action) is. The goal of RL is to find the best policy — the one that leads to the highest values!
A policy π is a mapping from states to action probabilities: π(a|s) is the probability of taking action a in state s.
🎯 Example:
In a grid, a policy might say: "In cell (1,1), go right with 80% probability, go up with 20%"
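As a sketch, a tabular stochastic policy can be stored as a nested dictionary and sampled from; the states, actions, and probabilities below are the hypothetical grid example above.

```python
# A sketch of a stochastic policy as a table: pi[state][action] = probability.
import random

pi = {
    (1, 1): {"right": 0.8, "up": 0.2},   # the example cell above
    (1, 2): {"right": 1.0},              # deterministic in this cell
}

def sample_action(pi, state):
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, (1, 1)))  # "right" about 80% of the time
```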
State-Value Function vπ(s)
vπ(s) = Eπ[Gt | St = s]
📖 Symbol by Symbol:
- Eπ[·] - expected value, given that the agent follows policy π
- Gt - the return (total discounted reward) from time t
- St = s - the condition that the agent is in state s at time t
🎯 In plain English:
"If I start in state s and follow policy π, how much total reward can I expect on average?"
Action-Value Function qπ(s,a)
qπ(s,a) = Eπ[Gt | St = s, At = a]
📖 Symbol by Symbol:
- Gt - the return from time t
- St = s, At = a - the conditions that the agent is in state s and takes action a at time t
- after that first action, all actions are chosen according to π
🎯 In plain English:
"If I'm in state s, take action a, then follow π, how much total reward can I expect?"
🔗 How v and q Relate to Each Other
Getting v from q:
vπ(s) = Σa π(a|s) qπ(s,a)
"Average over all actions, weighted by how likely π is to choose each one"
Getting q from v:
qπ(s,a) = Σs',r p(s',r|s,a)[r + γvπ(s')]
"Immediate reward + discounted value of wherever we end up"
Bellman Equations & Backup Diagrams
Bellman Equation for v_π
Key Insight
The value of a state is the weighted sum over all actions (weighted by policy) of the expected value of taking that action.
The backup diagram shows how value propagates backward from successor states. This recursive relationship is the foundation of all RL algorithms.
The Bellman Equation (The Heart of RL!)
vπ(s) = Σa π(a|s) Σs',r p(s',r|s,a)[r + γvπ(s')]
📖 Breaking Down Every Piece:
- π(a|s) - the probability the policy chooses action a in state s
- p(s',r|s,a) - the probability of landing in s' with reward r after taking a in s
- r + γvπ(s') - the immediate reward plus the discounted value of the next state
📐 Full Derivation (Step by Step):
Step 1: Start with the definition of vπ(s): vπ(s) = Eπ[Gt | St = s]
Step 2: Use the recursive property of returns Gt = Rt+1 + γGt+1: vπ(s) = Eπ[Rt+1 + γGt+1 | St = s]
Step 3: Split the expectation (linearity): vπ(s) = Eπ[Rt+1 | St = s] + γEπ[Gt+1 | St = s]
Step 4: The expectation of Gt+1 given that the next state is s' equals vπ(s')
Step 5: Account for all ways to reach different (s', r) pairs by summing over π(a|s) and p(s',r|s,a):
vπ(s) = Σa π(a|s) Σs',r p(s',r|s,a)[r + γvπ(s')]
🧠 Reading It Like a Story:
The value of being in state s equals:
- For each action a I might take (weighted by probability π(a|s))...
- For each place I might end up (weighted by probability p(s',r|s,a))...
- Add up: immediate reward + discount × value of where I land
🎯 The Key Insight:
The value of a state = expected immediate reward + discounted value of next state
This is recursive: the value of now depends on the value of later. It's like asking "how good is my position in chess?" — it depends on "how good will my future positions be?"
The Bellman equation is the foundation of ALL RL algorithms. It gives us a way to compute values by "backing up" information from future states to current states. Dynamic Programming, TD learning, Q-learning — they all use this idea!
Why Do Value Functions Matter? Real Applications
🎮 Game Playing AI
AlphaGo uses value functions to evaluate board positions. Instead of searching millions of moves, it asks: "What's the value of this position?" and focuses on promising moves.
🤖 Robot Navigation
A robot can learn v(s) for each location in a building. High values near charging stations, low values near stairs. Then it just moves toward higher-value states!
📈 Trading Systems
q(market_state, buy) vs q(market_state, sell) tells you which action leads to more profit. The action-value function directly informs trading decisions.
🏥 Treatment Planning
Value functions can estimate expected patient outcomes for different treatment sequences, helping doctors choose optimal care paths.
Intuitive Difference: v(s) vs q(s,a)
Imagine you're standing at a crossroads in a video game:
v(crossroads) = 50
"On average, starting from this crossroads and following my policy, I'll earn 50 points."
This is an average over all actions I might take.
q(crossroads, go_left) = 70
"If I go LEFT specifically, then follow my policy, I'll earn 70 points."
This is for a specific first action.
💡 Key insight: If q(s, left) > v(s), then "left" is better than average! This is how we improve policies — find actions with q > v.
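A tiny sketch of that idea, using the crossroads numbers above (the extra actions and their q-values are made up): pick the action whose q-value is largest and check that it beats v(s).

```python
# Sketch of the policy-improvement idea above: in state s, prefer any action
# whose q-value beats the current state value. Data structures are the
# assumed tabular dictionaries from the earlier sketches.

def greedy_improvement(q, s):
    """Return the action with the highest q(s, a) -- at least as good as v(s)."""
    return max(q[s], key=q[s].get)

q = {"crossroads": {"go_left": 70.0, "go_right": 40.0, "go_straight": 45.0}}
v = {"crossroads": 50.0}
best = greedy_improvement(q, "crossroads")
print(best, q["crossroads"][best] > v["crossroads"])  # go_left True
```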
For the Curious Mind
The Bellman Equation is Everywhere!
Richard Bellman discovered this in the 1950s for optimal control, not AI. It's used in economics, operations research, finance, and even DNA sequence alignment! The same recursive structure appears whenever you optimize over time.
Why is solving the Bellman equation hard?
With n states and m actions, you have n equations with n unknowns — that's solvable! But real problems have billions of states. This is the "curse of dimensionality" that Bellman himself named. Deep RL uses neural networks to approximate values without storing them all.
Mind-Bending Thought Experiment
If you knew the true value function vπ(s) for every state, you could play perfectly with zero learning — just always move toward higher-value states! The whole challenge of RL is that we don't know the true values.
Optimal Policies and Optimal Value Functions
Finding the policy that maximizes expected return from all states.
The Ultimate Goal
Find the optimal policy π* — the one that does better than (or at least as well as) every other policy in every state. The amazing thing? Such a policy always exists for finite MDPs!
Optimal state-value:
v*(s) = maxπ vπ(s)
🎯 "The best possible value for state s, if we use the best possible policy"
Optimal action-value:
q*(s,a) = maxπ qπ(s,a)
🎯 "The best return I can get if I take action a in state s, then act optimally"
🔑 Key Relationships
v*(s) = maxa q*(s,a)
"The value of a state = value of the best action from that state"
q*(s,a) = E[Rt+1 + γv*(St+1) | St = s, At = a]
"Value of action = reward + discounted value of next state"
Bellman Optimality Equations
The optimal values satisfy special recursive equations (like regular Bellman, but with max instead of averaging over policy):
For state-value v*:
v*(s) = maxa Σs',r p(s',r|s,a)[r + γv*(s')]
📖 Each Symbol:
- maxa - choose the action with the highest expected value
- p(s',r|s,a) - the dynamics: the probability of each (next state, reward) outcome
- r + γv*(s') - immediate reward plus discounted optimal value of the next state
📐 How We Get This:
Start from v*(s) = maxa q*(s,a), then expand q*(s,a) as the expected reward plus the discounted optimal value of the next state.
🎯 "Optimal value = pick the best action, then average over outcomes of (reward + discounted future optimal value)"
For action-value q*:
q*(s,a) = Σs',r p(s',r|s,a)[r + γ maxa' q*(s',a')]
📖 Each Symbol:
- Σs',r p(s',r|s,a) - average over all possible next states and rewards
- maxa' q*(s',a') - in the next state s', act optimally by taking the best action a'
📐 How We Get This:
Start from q*(s,a) = E[Rt+1 + γv*(St+1) | St = s, At = a], then substitute v*(s') = maxa' q*(s',a').
🎯 "Value of action a = expected reward + discounted value of the best action in wherever I end up"
🔄 Regular Bellman vs Optimality Bellman:
Regular (for any policy π):
vπ(s) = Σa π(a|s) Σs',r p(...)[r + γvπ(s')]
Averages over actions (using π)
Optimality (for best policy):
v*(s) = maxa Σs',r p(...)[r + γv*(s')]
Picks the best action (using max)
💡 The difference: Regular Bellman averages over what policy π would do. Optimality Bellman maximizes to find what the best policy would do.
Interactive Gridworld: See Optimal Values
4x4 Gridworld
Environment Details
- Reward: -1 per step (except terminal)
- Terminal states: Top-left, Bottom-right
- Actions: Up, Down, Left, Right
- Discount factor (γ): 0.9
Click "Run Policy Evaluation" to see the value function converge under a random policy. Values show expected returns from each state.
Step-by-Step: How to "Solve" an MDP
Here's the roadmap for going from an MDP definition to an optimal policy:
Define the MDP
Specify states S, actions A, rewards R, transitions p, and discount γ
Find Optimal Value Function
Compute v*(s) or q*(s,a) using Bellman optimality equations (Dynamic Programming, etc.)
Extract Optimal Policy
In each state, pick the action with the highest q* value: π*(s) = argmaxa q*(s,a)
Execute the Policy
Follow π*(s) in every state — you're now acting optimally!
💡 The hard part: Step 2! Computing optimal values is computationally expensive. The rest of this book explores efficient methods to do this.
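Here is a compact sketch of steps 2 and 3 for a small tabular MDP, using value iteration (one standard way to solve the Bellman optimality equation) and greedy policy extraction. The dynamics table p[(s, a)] follows the same assumed layout as the earlier sketches.

```python
# Value iteration (Bellman optimality backups) followed by greedy policy
# extraction. Assumes tabular dynamics p[(s, a)] -> list of (s', r, prob);
# terminal states can be handled with the absorbing-state trick (a self-loop
# with reward 0).

def value_iteration(states, actions, p, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(
                sum(prob * (r + gamma * V[s2]) for s2, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

def extract_policy(states, actions, p, V, gamma=0.9):
    """pi*(s) = argmax_a sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]"""
    return {
        s: max(actions,
               key=lambda a: sum(prob * (r + gamma * V[s2]) for s2, r, prob in p[(s, a)]))
        for s in states
    }
```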
Once we have q*(s,a), finding the optimal policy is trivial:
π*(s) = argmaxa q*(s,a)
📖 What This Notation Means:
- argmaxa - "the action a that maximizes the expression that follows" (it returns the best action itself, not the best value)
🎮 Concrete Example:
Suppose you're in state s, and you have 3 actions with these q* values (illustrative numbers): q*(s, left) = 2.0, q*(s, up) = 3.5, q*(s, right) = 7.0.
Then: π*(s) = right (because it has the highest q-value)
💡 In plain English: "In each state, just pick the action with the highest q* value!"
Optimality and Approximation
Why finding the true optimal policy is often impractical.
Reality Check: Why Can't We Always Find the Best?
In theory, we can solve the Bellman optimality equations to find π*. In practice, real problems have so many states that we can't even list them all, let alone compute values for each one!
The Curse of Dimensionality
State spaces grow exponentially with the number of variables.
A robot with 10 joints, each with 100 positions = 100¹⁰ = 10²⁰ states! That's more than atoms in a grain of sand.
The Solution: Approximation
Use function approximation (neural networks, etc.) to generalize.
Instead of storing V(s) for every state, learn a function V(s; θ) with parameters θ that estimates values for any state, even unseen ones.
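A minimal sketch of the idea, assuming a linear approximator over hand-crafted features; the feature function and the Monte Carlo-style update below are illustrative, not a specific algorithm from this chapter.

```python
# Value-function approximation: instead of a table, a linear function
# v_hat(s; theta) over features of the state.
import numpy as np

def features(state):
    """Hypothetical features for a 2-D state: position plus a bias term."""
    x, y = state
    return np.array([x, y, 1.0])

theta = np.zeros(3)                  # parameters to be learned

def v_hat(state, theta):
    return features(state) @ theta   # generalizes to states never seen before

def mc_update(state, G, theta, alpha=0.01):
    """One gradient step toward an observed return G from `state`."""
    return theta + alpha * (G - v_hat(state, theta)) * features(state)
```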
State Space Sizes: A Reality Check
| Problem | States | Can Store All? |
|---|---|---|
| 4x4 Gridworld | 16 | Yes, easily |
| Tic-Tac-Toe | ~5,000 | Yes |
| Backgammon | 10²⁰ | No way! |
| Chess | 10⁴⁷ | Impossible |
| Go | 10¹⁷⁰ | More than atoms in the universe |
The rest of this book (and modern RL) is about finding approximate solutions efficiently. We'll learn methods that:
- Learn from limited experience (not all states)
- Generalize to unseen states
- Make good (not perfect) decisions
Summary
Key takeaways from Chapter 3.
Quick Reference Glossary
Keep this handy! These are the key terms from Chapter 3:
MDP (Markov Decision Process)
The mathematical framework for sequential decision-making: (S, A, p, r, γ)
State (s)
A complete description of the current situation
Action (a)
A choice the agent can make in a state
Reward (r)
Immediate feedback signal (scalar number)
Return (Gt)
Total discounted future reward from time t
Discount Factor (γ)
How much to weight future vs immediate rewards (0 to 1)
Policy (π)
A strategy: mapping from states to action probabilities
State-Value Function vπ(s)
Expected return starting from state s, following policy π
Action-Value Function qπ(s,a)
Expected return taking action a in state s, then following π
Bellman Equation
Recursive formula: v(s) = E[r + γv(s')]
Optimal Policy (π*)
The policy that maximizes value in all states
Markov Property
Future depends only on present state, not history
Congratulations! You've learned the formal framework for RL. Here's what you should remember:
MDPs formalize sequential decision-making
States, actions, rewards, transitions — the complete framework
The agent-environment loop
State → Action → Reward → New State, repeated forever
Markov property
The future depends only on the present state, not on history
Returns are discounted cumulative rewards
γ controls how much we care about future vs. present
Value functions estimate expected returns
v(s) for states, q(s,a) for state-action pairs
Bellman equations are recursive
Value now = immediate reward + discounted value later
Optimal policies exist and are extractable from q*
Just pick argmax_a q*(s,a) in every state
Real problems require approximation
State spaces are too big for exact solutions
Formula Cheat Sheet
The essential formulas from Chapter 3:
Return (discounted sum of rewards):
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = Σk γᵏ Rt+k+1
Bellman Equation for vπ:
vπ(s) = Σa π(a|s) Σs',r p(s',r|s,a)[r + γvπ(s')]
Bellman Optimality Equation for v*:
v*(s) = maxa Σs',r p(s',r|s,a)[r + γv*(s')]
Optimal Policy Extraction:
π*(s) = argmaxa q*(s,a)
What's Next?
Now that we have the framework, Chapter 4 introduces Dynamic Programming — methods for computing optimal policies when we know the full MDP model. It's the foundation for understanding all RL algorithms!