Dynamic Programming
When we know the environment model, we can compute optimal policies exactly. DP methods bootstrap: they update estimates based on other estimates.
Think of it like having a perfect map!
Imagine you're planning a road trip with a complete map that shows every road, every intersection, and exactly how long each route takes. With this perfect information, you can sit down and calculate the best route before you even start driving. That's what Dynamic Programming does: when you know everything about how the environment works, you can compute the optimal strategy through pure calculation!
- Model-based: requires the model p(s',r|s,a)
- Bootstrapping: updates estimates from other estimates
- Guaranteed: converges to the optimal solution
The Big Picture: What is Dynamic Programming?
Dynamic Programming (DP) is a method for solving complex problems by breaking them down into simpler subproblems. In RL, we use DP to find optimal policies when we have complete knowledge of the environment.
The Key Requirement
DP needs a model: complete knowledge of p(s',r|s,a), meaning we know exactly what happens for every action in every state.
The Two Main Problems DP Solves
- Prediction (Policy Evaluation): "How good is this policy?" → compute v_π
- Control (Finding the Optimal Policy): "What's the best policy?" → compute π* and v*
Policy Evaluation (Prediction)
Computing the state-value function for a given policy.
Given a policy π, compute the state-value function v_π. This is the prediction problem: predicting expected returns under the policy.
Think of it like rating neighborhoods!
Imagine you're a delivery driver with a fixed route plan (your policy). You want to know: "If I follow this route plan, how much will I earn starting from each neighborhood?"
Policy Evaluation answers this question: it tells you the "value" of each starting point when you follow a specific plan.
Iterative Policy Evaluation
Start with an arbitrary v_0 and iterate until convergence. The sequence v_0, v_1, v_2, ... converges to v_π as k → ∞. Each sweep applies the update

v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_k(s')]

What Each Symbol Means
- π(a|s): the probability that policy π picks action a in state s
- p(s',r|s,a): the probability of landing in state s' with reward r after taking a in s
- γ: the discount factor (how much future reward is worth now)
- v_k(s'): the current, iteration-k estimate of the next state's value
Where This Formula Comes From
Step 1: Start with the Bellman equation for v_π
Step 2: We don't know v_π, so use our current estimate v_k
Step 3: Repeat until v_{k+1} ≈ v_k (convergence!)
When nothing changes, we've found v_π.
Reading the Formula Like a Story
For each action a: consider how likely π is to pick it, π(a|s)
For each outcome (s', r): consider how likely it is, p(s',r|s,a)
For each outcome: calculate the reward now plus the discounted future, [r + γ v_k(s')]
Combine: the weighted average of all of these is the new value estimate
In plain English: "The value of being in a state equals the weighted average of (immediate reward + discounted future value), across all the actions you might take and all the places you might end up."
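To make the update concrete, here is a minimal Python sketch. The three-state chain MDP and the model format P[(s, a)] -> [(prob, next_state, reward), ...] are invented for illustration, not taken from the chapter; any environment whose dynamics you know would work the same way.

```python
# Minimal sketch of iterative policy evaluation with synchronous sweeps.
# The tiny chain MDP below is a made-up example; any known model works.

GAMMA = 0.9      # discount factor
THETA = 1e-6     # stop when the largest value change falls below this

STATES = [0, 1, 2]          # state 2 is terminal (its value stays 0)
ACTIONS = ["left", "right"]
P = {                        # P[(s, a)] -> list of (prob, next_state, reward)
    (0, "left"):  [(1.0, 0, 0.0)],
    (0, "right"): [(1.0, 1, 0.0)],
    (1, "left"):  [(1.0, 0, 0.0)],
    (1, "right"): [(1.0, 2, 1.0)],   # reaching the terminal state pays +1
}

def evaluate_policy(policy):
    """v_{k+1}(s) = sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * [r + gamma * v_k(s')]."""
    v = dict.fromkeys(STATES, 0.0)
    while True:
        new_v = dict(v)
        for s in STATES:
            if s == 2:                      # terminal state: value stays 0
                continue
            new_v[s] = sum(
                policy[s][a] * sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])
                for a in ACTIONS
            )
        delta = max(abs(new_v[s] - v[s]) for s in STATES)
        v = new_v
        if delta < THETA:                   # values stopped changing: v is close to v_pi
            return v

random_policy = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
print(evaluate_policy(random_policy))       # values under the equiprobable policy
```

Note the synchronous sweep: all of v_{k+1} is computed from v_k before the old values are replaced, matching the update shown above.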
Interactive Policy Evaluation Demo
Policy: random (equal probability for all four actions)
Each iteration updates all state values simultaneously using the Bellman equation. Under the random policy, corner states have the lowest values (furthest from terminals).
Why Does Iteration Work?
Think of it like a game of telephone, but backwards and with numbers:
First iteration: Values are random guesses
Each update: States "ask their neighbors" what they're worth
Over time: True values "propagate" from terminal states backwards
Eventually: All values stabilize at their true values
DP uses expected updates: we average over all possible next states, weighted by their probabilities. This requires knowing the model p(s',r|s,a).
For the Curious Mind
DP Powered the Space Race!
NASA used Dynamic Programming to compute optimal trajectories for the Apollo missions. The same math that evaluates your video game policy helped land humans on the moon!
What if the state space is infinite?
DP needs to store a value for every state. With infinite states (like continuous positions), this is impossible! Chapter 9 introduces function approximation, where neural networks learn to generalize across states.
Connection to Google Maps!
Dijkstra's algorithm for shortest paths is closely related to value iteration: think γ = 1 and a reward of -1 per step. When you use GPS navigation, you're benefiting from DP!
Policy Improvement
Making a policy better using its value function.
Think of it like upgrading your strategy!
Now that we know how good each state is under our current plan (from policy evaluation), we can ask: "What if I made different choices?" If switching to a different action would lead to a better outcome, we should switch! That's policy improvement: using what we've learned to make better decisions.
Policy Improvement Theorem
If q_π(s, π'(s)) ≥ v_π(s) for all states s, then π' is at least as good as π.
What This Really Means
In everyday language:
q_π(s, a) = "If I'm in state s, take action a, and then follow my usual policy π, what's my expected return?"
v_π(s) = "If I'm in state s and just follow policy π from the start, what's my expected return?"
The insight: If trying a different action first gives you at least as much value as your current choice, then you should switch to that action!
Greedy Improvement
Select the action that maximizes the action-value under the current value function.
In plain English: "Always pick the action that looks best right now."
Why It Works
By acting greedily w.r.t. vΟ, we ensure the new policy Ο' is at least as good in every state. If strictly better in any state, the overall policy improves.
Think of it as: "If you can't make things worse, and might make them better, do it!"
For finite MDPs, there always exists a deterministic optimal policy. Policy improvement always produces a deterministic greedy policy.
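Here is a hedged sketch of the greedy step, reusing the same hypothetical model format as the policy-evaluation example; the helper names q_value and greedy_policy are my own, not from the text.

```python
# Minimal sketch of greedy policy improvement via one-step lookahead.

GAMMA = 0.9

def q_value(P, v, s, a):
    """q(s, a) = sum over (s', r) of p(s',r|s,a) * [r + gamma * v(s')]."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])

def greedy_policy(P, v, states, actions):
    """pi'(s) = argmax_a q(s, a): keep whichever action looks best under v."""
    return {s: max(actions, key=lambda a: q_value(P, v, s, a)) for s in states}

# Hypothetical chain MDP (same format as the earlier sketch).
P = {(0, "left"): [(1.0, 0, 0.0)], (0, "right"): [(1.0, 1, 0.0)],
     (1, "left"): [(1.0, 0, 0.0)], (1, "right"): [(1.0, 2, 1.0)]}
v = {0: 0.53, 1: 0.74, 2: 0.0}    # placeholder values standing in for v_pi
print(greedy_policy(P, v, states=[0, 1], actions=["left", "right"]))
# -> {0: 'right', 1: 'right'}: both states prefer moving toward the +1 terminal
```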
Policy Iteration
Alternating evaluation and improvement until convergence.
Think of it like iterative recipe testing!
Imagine you're a chef trying to perfect a recipe:
- Evaluate: Taste your current dish and rate every ingredient's contribution
- Improve: Replace any ingredient with a better alternative you discovered
- Repeat: Keep evaluating and improving until you can't make it better
That's policy iteration: evaluate your strategy, improve it, repeat!
Policy Iteration Algorithm
Initialize
V(s) and π(s) arbitrarily for all s ∈ S
Just start with random guesses; they'll get better!
Policy Evaluation
Compute V_π exactly (or approximately)
Figure out how good each state is under your current strategy
Policy Improvement
π(s) ← argmax_a Σ_{s',r} p(s',r|s,a) [r + γ V(s')]
For each state, switch to whichever action looks best now
Check Convergence
If policy unchanged, stop. Otherwise, go to step 2.
If nothing changed, you've found the optimal policy!
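Putting the four steps together, here is a minimal sketch on the same invented chain MDP as before; the evaluate helper and the initialization details are illustrative assumptions, not the chapter's reference implementation.

```python
# Minimal sketch of policy iteration: evaluate, improve, repeat until stable.
import random

GAMMA, THETA = 0.9, 1e-6
STATES, TERMINAL, ACTIONS = [0, 1, 2], 2, ["left", "right"]
P = {(0, "left"): [(1.0, 0, 0.0)], (0, "right"): [(1.0, 1, 0.0)],
     (1, "left"): [(1.0, 0, 0.0)], (1, "right"): [(1.0, 2, 1.0)]}

def backup(v, s, a):
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])

def evaluate(pi, v):                      # Step 2: policy evaluation (in place)
    while True:
        delta = 0.0
        for s in STATES:
            if s == TERMINAL:
                continue
            new_v = backup(v, s, pi[s])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < THETA:
            return v

def policy_iteration():
    v = dict.fromkeys(STATES, 0.0)        # Step 1: arbitrary initialization
    pi = {s: random.choice(ACTIONS) for s in STATES if s != TERMINAL}
    while True:
        v = evaluate(pi, v)
        stable = True
        for s in pi:                      # Step 3: greedy improvement
            best = max(ACTIONS, key=lambda a: backup(v, s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                        # Step 4: policy unchanged -> optimal
            return pi, v

print(policy_iteration())                 # e.g. ({0: 'right', 1: 'right'}, {...})
```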
Interactive Policy Iteration Demo
Policy Iteration Algorithm
- 1. Evaluate current policy (find V_π)
- 2. Improve policy greedily w.r.t. V_π
- 3. Repeat until policy is stable
Policy iteration alternates between evaluating the current policy and improving it. The arrows show the greedy action under the current policy.
The Policy Iteration Dance
Evaluate (how good is π?) → Improve (make π better) → repeat until π can't improve → optimal!
Policy iteration converges in a finite number of steps because there are only finitely many deterministic policies. Each improvement is strict (or the policy is optimal).
Value Iteration
Combining evaluation and improvement into a single update.
Think of it like speed-dating for algorithms!
Policy Iteration does full evaluation before each improvement, like spending months getting to know someone before deciding if you should date them.
Value Iteration is more impatient: after just ONE update, it immediately asks "what's the best action?" It's faster per iteration, though it might need more iterations overall.
Instead of fully evaluating a policy, do just one sweep of evaluation and immediately improve. This is equivalent to turning the Bellman optimality equation into an update rule.
Value Iteration Update

v_{k+1}(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γ v_k(s')]

The symbols are the same as in policy evaluation; the only change is the max over actions in place of the average over the policy.
Where This Comes From:
Start with the Bellman optimality equation: v*(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γ v*(s')]
Turn it into an iterative update by using v_k in place of v*.
As k → ∞, v_k → v* (it converges to the optimal value function!)
This single update combines one sweep of policy evaluation with policy improvement. Converges to v* under the same conditions as policy iteration.
The Key Difference: max vs Σ
Policy Evaluation (uses Σ)
Σ_a π(a|s) × [...]
"Average over all actions according to the policy"
Value Iteration (uses max)
max_a [...]
"Take the best action": implicitly improving!
The insight: by using max instead of averaging, we're simultaneously figuring out values AND picking the best actions!
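A minimal sketch of this combined update, on the same invented chain MDP used in the earlier examples:

```python
# Minimal sketch of value iteration: sweep with a max over actions, then
# extract the greedy policy from the resulting optimal values.

GAMMA, THETA = 0.9, 1e-6
STATES, TERMINAL, ACTIONS = [0, 1, 2], 2, ["left", "right"]
P = {(0, "left"): [(1.0, 0, 0.0)], (0, "right"): [(1.0, 1, 0.0)],
     (1, "left"): [(1.0, 0, 0.0)], (1, "right"): [(1.0, 2, 1.0)]}

def backup(v, s, a):
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])

def value_iteration():
    v = dict.fromkeys(STATES, 0.0)
    while True:
        delta = 0.0
        for s in STATES:
            if s == TERMINAL:
                continue
            new_v = max(backup(v, s, a) for a in ACTIONS)   # max, not a policy average
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < THETA:
            break
    pi = {s: max(ACTIONS, key=lambda a: backup(v, s, a))    # greedy policy from v*
          for s in STATES if s != TERMINAL}
    return v, pi

print(value_iteration())   # optimal values and the extracted greedy policy
```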
Interactive Value Iteration Demo
Value iteration finds V* directly by taking the max over actions at each state. The arrows show the optimal policy extracted from V*.
Policy Iteration
- Full evaluation each iteration
- Fewer outer iterations
- More work per iteration
Like doing thorough research before each decision
Value Iteration
- One sweep per iteration
- More outer iterations
- Less work per iteration
Like making quick decisions and adjusting on the fly
Common Mistakes (Avoid These!)
Wrong: "DP can be used when we don't know the model"
Right: DP requires knowing p(s',r|s,a). Without a model, use Monte Carlo or TD methods instead.
Wrong: "Value Iteration and Policy Iteration give different answers"
Right: Both converge to the same optimal solution! They're just different paths to get there.
Wrong: "We need to initialize V(s) correctly for DP to work"
Right: You can initialize with any values! DP will still converge (though good initialization can speed things up).
For the Curious Mind
Why does DP matter if it needs the model?
Even though RL often doesn't know the model, DP provides the theoretical foundation. Algorithms like Q-learning are essentially doing DP "in the dark": sampling the model rather than knowing it. Understanding DP helps you understand what all RL algorithms are trying to approximate!
The Contraction Mapping Theorem
Why does DP always converge? There's beautiful math behind it! The Bellman operator is a contraction: each iteration brings the values closer to the true answer. Like repeatedly pressing "=" on a calculator when computing √2, it always converges!
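For readers who want the formal statement behind this claim, here is the standard version in my own notation (not spelled out in the text above): write T for the Bellman optimality operator; T is a γ-contraction in the max norm, so repeated application converges to its unique fixed point v* from any starting values.

```latex
(T v)(s) \;=\; \max_a \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v(s') \,\bigr]
\qquad
\lVert T v - T u \rVert_\infty \;\le\; \gamma \,\lVert v - u \rVert_\infty
\quad \text{for all value functions } u, v.
```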
AlphaGo Used DP!
AlphaGo combined Monte Carlo Tree Search (a form of simulation-based DP) with neural networks. During self-play, it performed millions of "mini DP updates" to evaluate board positions. The fundamentals you're learning here powered the AI that beat the world champion!
Asynchronous Dynamic Programming
Updating states in any order, not necessarily in sweeps.
Think of it like painting a house!
Synchronous DP is like painting every room at once, waiting for all rooms to dry, then applying a second coat to every room.
Asynchronous DP is like painting rooms one by one, or focusing on the rooms that need the most work first. As long as you eventually paint every room, you'll get the job done, and sometimes faster!
Flexibility in Updates
Asynchronous DP algorithms update states individually, in any order. They can focus computation on the most relevant states and still converge to the optimal solution.
In-Place DP
Update values immediately, using new values as soon as they're computed. Often converges faster than synchronous sweeps.
Like using your latest test score immediately rather than waiting for all tests
Prioritized Sweeping
Update states with the largest Bellman errors first. Focuses computation where it's most needed.
Like studying the subjects you're worst at first
Real-Time DP
Update only states the agent actually visits. Interleaves computation with action selection.
Like learning about places only when you travel there
Asynchronous DP converges as long as all states continue to be updated (no state is "starved" of updates indefinitely). The order doesn't matter for correctness.
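As one concrete example of the asynchronous idea, here is a hedged sketch of prioritized sweeping in the planning setting; the priority-queue details and the tiny chain MDP are my own illustrative choices, not the chapter's algorithm verbatim.

```python
# Minimal sketch of prioritized sweeping: always update the state whose
# current Bellman error is largest, instead of sweeping states in order.
import heapq

GAMMA, THETA = 0.9, 1e-6
STATES, TERMINAL, ACTIONS = [0, 1, 2], 2, ["left", "right"]
P = {(0, "left"): [(1.0, 0, 0.0)], (0, "right"): [(1.0, 1, 0.0)],
     (1, "left"): [(1.0, 0, 0.0)], (1, "right"): [(1.0, 2, 1.0)]}

def bellman_backup(v, s):
    return max(sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)]) for a in ACTIONS)

def prioritized_sweeping(max_updates=1000):
    v = dict.fromkeys(STATES, 0.0)
    nonterminal = [s for s in STATES if s != TERMINAL]
    # Priority = size of the Bellman error; heapq is a min-heap, so negate it.
    pq = [(-abs(bellman_backup(v, s) - v[s]), s) for s in nonterminal]
    heapq.heapify(pq)
    for _ in range(max_updates):
        if not pq:
            break
        _, s = heapq.heappop(pq)            # state with the largest error
        v[s] = bellman_backup(v, s)
        for s2 in nonterminal:              # requeue any state whose error is still large
            err = abs(bellman_backup(v, s2) - v[s2])
            if err > THETA:
                heapq.heappush(pq, (-err, s2))
    return v

print(prioritized_sweeping())   # converges to the same values as a full sweep
```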
Generalized Policy Iteration
The unifying principle behind all DP and RL algorithms.
This is one of the most important ideas in RL!
GPI (Generalized Policy Iteration) is the big picture that ties everything together. No matter what specific algorithm you use, they all share the same fundamental pattern:
- Evaluation: Estimate how good your current strategy is
- Improvement: Make your strategy greedier based on those estimates
These two processes "dance" together, pushing each other toward the optimal solution!
GPI refers to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.
The GPI Dance
In the diagram, improvement moves π toward being greedy w.r.t. the current V, while evaluation moves V toward being consistent with the current π.
The Dance of GPI
Watch how policy and value chase each other, spiraling toward the optimal solution. This interaction is the essence of almost all RL algorithms.
Key Insight
Evaluation and improvement are both driving toward the same goal from different directions:
- Evaluation makes V consistent with the current π: "Given this strategy, how good is each state?"
- Improvement makes π greedy w.r.t. the current V: "Given these values, what's the best action?"
When both processes stabilize, we have found π* and v*. This is the fixed point where both conditions are satisfied simultaneously.
Why GPI Matters for the Rest of the Book
Every RL algorithm you'll learn follows this same GPI pattern.
They differ in how they evaluate and improve, but the fundamental dance is always the same!
Efficiency of Dynamic Programming
DP is surprisingly efficient compared to alternatives.
Is DP slow? Actually, it's remarkably fast!
You might think: "Iterating over every state sounds slow!" But compare it to the alternative: trying every possible policy. If you have 10 states and 3 actions, there are 3^10 = 59,049 possible deterministic policies! DP runs in polynomial time, which is much faster than checking every possibility.
Polynomial Time
DP methods find optimal policies in time polynomial in |S| and |A|. This is much better than the exponential time required by direct search.
O(|S|² × |A|) vs O(|A|^|S|): a massive difference!
Curse of Dimensionality
The main limitation is the size of the state space, which grows exponentially with the number of state variables. This is unavoidable for any method.
A robot with 10 joints, each with 100 positions = 100^10 states!
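A quick sanity check of the numbers above, assuming (as in the examples) 10 states with 3 actions and a robot with 10 joints at 100 positions each:

```python
# Arithmetic behind the efficiency claims: brute-force policy enumeration vs
# a rough per-sweep DP cost, plus the curse-of-dimensionality example.
n_states, n_actions = 10, 3
print("deterministic policies:", n_actions ** n_states)     # 3**10 = 59,049
print("DP work per sweep ~   :", n_states ** 2 * n_actions)  # |S|^2 * |A| = 300

joints, positions = 10, 100
print("robot joint states    :", positions ** joints)        # 100**10 = 10**20
```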
Comparison with Alternatives
| Method | Time Complexity | Requires Model? |
|---|---|---|
| Policy Iteration | O(|A| · |S|²) per iter | Yes |
| Value Iteration | O(|A| · |S|²) per iter | Yes |
| Linear Programming | O(|S|³) one-shot | Yes |
| Direct Search | O(|A|^|S|) | Yes |
Key insight: even though DP visits every state, it cleverly reuses computation. Each state's value informs its neighbors, so we don't redo work unnecessarily.
In practice, asynchronous methods and good state ordering can dramatically speed up convergence. Real-world DP implementations often converge much faster than worst-case bounds suggest.
Summary
Key takeaways from Chapter 4.
Quick Reference Glossary
Dynamic Programming
A method for solving complex problems by breaking them into simpler subproblems
Policy Evaluation
Computing v_π: how good is this policy?
Policy Improvement
Making a policy better by acting greedily w.r.t. current values
Policy Iteration
Alternating full evaluation and improvement until optimal
Value Iteration
One sweep of evaluation + improvement combined using max
GPI (Generalized Policy Iteration)
The unifying framework: evaluation and improvement interacting
Bootstrapping
Updating estimates based on other estimates (not final answers)
Bellman Equation
The recursive relationship that DP exploits
Formula Cheat Sheet
Policy Evaluation (iterative update): v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_k(s')]
Policy Improvement (greedy action): π'(s) = argmax_a Σ_{s',r} p(s',r|s,a) [r + γ v_π(s')]
Value Iteration (one-step update): v_{k+1}(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γ v_k(s')]
Key difference: Policy Evaluation uses Σ_a π(a|s) (average over the policy); Value Iteration uses max_a (pick the best).
What We Learned in This Chapter
Dynamic Programming is like having a superpower for planning, but only when you have complete knowledge of how the world works. We learned three main algorithms:
- Policy Evaluation: figure out how good a strategy is
- Policy Iteration: evaluate → improve → repeat until optimal
- Value Iteration: a faster version that combines both steps
The GPI framework unifies all of these, showing that RL is fundamentally about evaluation and improvement working together.
DP methods require a complete model of the environment (p(s',r|s,a))
Policy evaluation computes v_π by iteratively applying the Bellman equation
Policy improvement makes a policy greedy with respect to its value function
Policy iteration alternates evaluation and improvement until convergence
Value iteration combines both into a single update using the max operator
Asynchronous DP allows flexible update ordering while maintaining convergence
GPI is the unifying principle: evaluation and improvement chase each other to optimality
DP is polynomial in |S| and |A|, but state spaces can be exponentially large
The Catch: You Need a Model!
Everything we learned in this chapter requires knowing p(s',r|s,a), the complete dynamics of the environment. But in the real world, we often don't know this!
Next up: Methods that learn from experience without needing a model!
Looking Ahead
DP requires knowing the environment model. In the next chapters, we'll explore Monte Carlo and Temporal-Difference methods that learn directly from experience without a model. These are the foundations of modern RL.