Chapter 4

Dynamic Programming

When we know the environment model, we can compute optimal policies exactly. DP methods bootstrap: they update estimates based on other estimates.

Think of it like having a perfect map!

Imagine you're planning a road trip with a complete map that shows every road, every intersection, and exactly how long each route takes. With this perfect information, you can sit down and calculate the best route before you even start driving. That's what Dynamic Programming does β€” when you know everything about how the environment works, you can compute the optimal strategy through pure calculation!

Model-Based

Requires p(s',r|s,a)

Bootstrapping

Update from estimates

Guaranteed

Converges to optimal

The Big Picture: What is Dynamic Programming?

Dynamic Programming (DP) is a method for solving complex problems by breaking them down into simpler subproblems. In RL, we use DP to find optimal policies when we have complete knowledge of the environment.

πŸ”‘ The Key Requirement

DP needs a model β€” complete knowledge of p(s',r|s,a), meaning we know exactly what happens for every action in every state.
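
To make this requirement concrete, here is a minimal sketch (not from the book) of how a tabular model might be stored in code. The two-state example, the state and action names, and the expected_update helper are all invented for illustration:

```python
# Hypothetical tabular model of p(s', r | s, a).
# model[(s, a)] is a list of (probability, next_state, reward) outcomes
# whose probabilities sum to 1.
model = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],  # "go" usually reaches s1
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def expected_update(s, a, V, gamma=0.9):
    """One expected backup: sum over (s', r) of p(s',r|s,a) * [r + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])

V = {"s0": 0.0, "s1": 0.0}
print(expected_update("s0", "go", V))  # 0.8*(1.0 + 0.9*0.0) + 0.2*(0.0 + 0.9*0.0) = 0.8
```

Any representation works, as long as for every state-action pair you can enumerate the possible (s', r) outcomes and their probabilities.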

🎯 The Two Main Problems DP Solves

  1. Prediction (Policy Evaluation): "How good is this policy?" β€” Compute vΟ€
  2. Control (Finding Optimal Policy): "What's the best policy?" β€” Compute Ο€* and v*

Section 4.1

Policy Evaluation (Prediction)

Computing the state-value function for a given policy.

Policy Evaluation

Given a policy Ο€, compute the state-value function vΟ€. This is the prediction problem β€” predicting expected returns under the policy.

Think of it like rating neighborhoods!

Imagine you're a delivery driver with a fixed route plan (your policy). You want to know: "If I follow this route plan, how much will I earn starting from each neighborhood?"

Policy Evaluation answers this question β€” it tells you the "value" of each starting point when you follow a specific plan.

Iterative Policy Evaluation

v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]

Start with arbitrary v0, iterate until convergence. The sequence v0, v1, v2, ... converges to vΟ€ as k β†’ ∞.

πŸ“– What Each Symbol Means

vk+1(s) = new value estimate for state s at iteration k+1 (what we're calculating)
Ξ£a = sum over all possible actions a (try every action)
Ο€(a|s) = probability that policy Ο€ takes action a in state s (how likely is this action?)
Ξ£s',r = sum over all possible next states s' and rewards r
p(s',r|s,a) = dynamics function β€” probability of landing in s' with reward r given we're in s and take a
r = immediate reward received
Ξ³ (gamma) = discount factor (0 to 1) β€” how much we care about future vs present
vk(s') = our current estimate (from iteration k) of the next state's value

πŸ“ Where This Formula Comes From

Step 1: Start with the Bellman equation for vΟ€

v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)[r + \gamma v_\pi(s')]

Step 2: We don't know vΟ€, so use our current estimate vk

v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)[r + \gamma v_k(s')]

Step 3: Repeat until vk+1 β‰ˆ vk (convergence!)

When nothing changes, we've found vΟ€ βœ“

Reading the Formula Like a Story

For each action a: Consider how likely Ο€ is to pick it β†’ Ο€(a|s)

For each outcome: Consider how likely it is β†’ p(s',r|s,a)

For each outcome: Calculate reward now + discounted future β†’ [r + Ξ³vk(s')]

Combine: Weighted average of all these = new value estimate

🎯 In plain English: "The value of being in a state equals the weighted average of (immediate reward + discounted future value), across all the actions you might take and all the places you might end up."
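
As a rough sketch of how this update loop might look in code (assuming the tabular model format from the earlier sketch, a policy stored as per-state action probabilities, and a small threshold theta for detecting convergence):

```python
def policy_evaluation(model, policy, states, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: sweep all states applying the Bellman
    expectation update until the largest change falls below theta.

    model[(s, a)] -> list of (probability, next_state, reward) outcomes
    policy[s]     -> dict mapping each action a to pi(a|s)
    """
    V = {s: 0.0 for s in states}       # arbitrary initial estimate v_0
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi_a * sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new               # in-place update: new values used immediately
        if delta < theta:
            return V                   # v_k has (approximately) converged to v_pi
```

Note that this sketch updates V in place, so later states in a sweep already see the newest values; a strictly synchronous version would write each sweep into a fresh array. Both converge, as discussed in Section 4.5.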

Interactive Policy Evaluation Demo

[Interactive demo: a 4x4 gridworld with the terminal states (TERM) in two corners. All non-terminal state values V(s) start at 0.0 under the random policy (equal probability for all 4 actions); the display tracks the number of iterations and the maximum change Ξ” per sweep.]

Iterative Policy Evaluation
V(s) ← Ξ£a Ο€(a|s) Ξ£s',r p(s',r|s,a)[r + Ξ³V(s')]

Each iteration updates all state values simultaneously using the Bellman equation. Under the random policy, corner states have the lowest values (furthest from terminals).

Why Does Iteration Work?

Think of it like a game of telephone, but backwards and with numbers:

  1. First iteration: Values are random guesses
  2. Each update: States "ask their neighbors" what they're worth
  3. Over time: True values "propagate" from terminal states backwards
  4. Eventually: All values stabilize at their true values

Expected Update

DP uses expected updates β€” we average over all possible next states weighted by their probabilities. This requires knowing the model p(s',r|s,a).

For the Curious Mind

DP Powered the Space Race!

NASA used Dynamic Programming to compute optimal trajectories for the Apollo missions. The same math that evaluates your video game policy helped land humans on the moon!

What if the state space is infinite?

DP needs to store a value for every state. With infinite states (like continuous positions), this is impossible! Chapter 9 introduces function approximation, where neural networks learn to generalize across states.

Connection to Google Maps!

Dijkstra's algorithm for shortest paths is actually a special case of value iteration with Ξ³ = 1 and rewards = -1 per step. When you use GPS navigation, you're benefiting from DP!

Section 4.2

Policy Improvement

Making a policy better using its value function.

Think of it like upgrading your strategy!

Now that we know how good each state is under our current plan (from policy evaluation), we can ask: "What if I made different choices?" If switching to a different action would lead to a better outcome, we should switch! That's policy improvement β€” using what we've learned to make better decisions.

Policy Improvement Theorem

If qΟ€(s, Ο€'(s)) β‰₯ vΟ€(s) for all states s, then Ο€' is at least as good as Ο€.

v_{\pi'}(s) \geq v_\pi(s) \quad \text{for all } s \in S

What This Really Means

In everyday language:

qΟ€(s, a) = "If I'm in state s and take action a, then follow my usual policy Ο€, what's my expected return?"

vΟ€(s) = "If I'm in state s and just follow policy Ο€ from the start, what's my expected return?"

The insight: If trying a different action first gives you at least as much value as your current choice, then you should switch to that action!

Greedy Improvement

\pi'(s) = \arg\max_a q_\pi(s, a)

Select the action that maximizes the action-value under the current value function.

🎯 In plain English: "Always pick the action that looks best right now."
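
A hedged sketch of greedy improvement under the same assumed model format: compute qΟ€(s, a) for each action with one expected backup, then take the argmax. The function names are invented for illustration:

```python
def q_from_v(model, V, s, actions, gamma=0.9):
    """q_pi(s, a) for each action a, computed from the model and v_pi."""
    return {a: sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])
            for a in actions}

def greedy_policy(model, V, states, actions, gamma=0.9):
    """pi'(s) = argmax_a q_pi(s, a): a deterministic greedy policy."""
    new_policy = {}
    for s in states:
        q = q_from_v(model, V, s, actions, gamma)
        new_policy[s] = max(q, key=q.get)   # action with the highest action-value
    return new_policy
```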

Why It Works

By acting greedily w.r.t. vΟ€, we ensure the new policy Ο€' is at least as good in every state. If strictly better in any state, the overall policy improves.

🎯 Think of it as: "If you can't make things worse, and might make them better, do it!"

Deterministic Policies

For finite MDPs, there always exists a deterministic optimal policy. Policy improvement always produces a deterministic greedy policy.

Section 4.3

Policy Iteration

Alternating evaluation and improvement until convergence.

Think of it like iterative recipe testing!

Imagine you're a chef trying to perfect a recipe:

  1. Evaluate: Taste your current dish and rate every ingredient's contribution
  2. Improve: Replace any ingredient with a better alternative you discovered
  3. Repeat: Keep evaluating and improving until you can't make it better

That's policy iteration β€” evaluate your strategy, improve it, repeat!

Policy Iteration Algorithm

  1. Initialize: V(s) and Ο€(s) arbitrarily for all s ∈ S
     πŸ’‘ Just start with random guesses β€” they'll get better!
  2. Policy Evaluation: Compute VΟ€ exactly (or approximately)
     πŸ’‘ Figure out how good each state is under your current strategy
  3. Policy Improvement: Ο€(s) ← argmaxa Ξ£s',r p(s',r|s,a)[r + Ξ³V(s')]
     πŸ’‘ For each state, switch to whichever action looks best now
  4. Check Convergence: If the policy is unchanged, stop. Otherwise, go to step 2.
     πŸ’‘ If nothing changed, you've found the optimal policy!
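
Putting the four steps together, a minimal policy iteration loop might look like the sketch below. It reuses the policy_evaluation and greedy_policy functions sketched earlier and represents a deterministic policy as a dict from state to action:

```python
def policy_iteration(model, states, actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}            # 1. arbitrary initial policy
    while True:
        # 2. evaluate: a deterministic policy puts probability 1 on its chosen action
        as_probs = {s: {policy[s]: 1.0} for s in states}
        V = policy_evaluation(model, as_probs, states, gamma)
        # 3. improve: act greedily with respect to V
        improved = greedy_policy(model, V, states, actions, gamma)
        # 4. check convergence: stop when no state changes its action
        if improved == policy:
            return policy, V
        policy = improved
```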

Interactive Policy Iteration Demo

[Interactive demo: a 4x4 gridworld with GOAL states in two corners. State values start at 0.0; the display shows the current greedy policy, the number of policy iterations, and the current phase (evaluation or improvement).]

Policy Iteration Algorithm
  1. Evaluate current policy (find VΟ€)
  2. Improve policy greedily w.r.t. VΟ€
  3. Repeat until policy is stable

Policy iteration alternates between evaluating the current policy and improving it. The arrows show the greedy action under the current policy.

The Policy Iteration Dance

Evaluate ("How good is Ο€?") β†’ Improve ("Make Ο€ better") β†’ Evaluate β†’ Improve β†’ ... β†’ Optimal! (Ο€ can't improve)

Repeat until convergence.

Finite Convergence

Policy iteration converges in a finite number of steps because there are only finitely many deterministic policies. Each improvement is strict (or the policy is optimal).

Section 4.4

Value Iteration

Combining evaluation and improvement into a single update.

Think of it like speed-dating for algorithms!

Policy Iteration does full evaluation before each improvement β€” like spending months getting to know someone before deciding if you should date them.

Value Iteration is more impatient: after just ONE update, it immediately asks "what's the best action?" It's faster per iteration, though it might need more iterations overall.

Value Iteration

Instead of fully evaluating a policy, do just one sweep of evaluation and immediately improve. This is equivalent to turning the Bellman optimality equation into an update rule.

Value Iteration Update

v_{k+1}(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]

πŸ“– What Each Symbol Means:

vk+1(s) = new value estimate for state s at iteration k+1
maxa = take the maximum over all possible actions a (pick the best!)
Ξ£s',r = sum over all possible next states and rewards (expected value)
p(s',r|s,a) = probability of getting (s', r) when taking action a in state s
[r + Ξ³vk(s')] = immediate reward + discounted future value

πŸ“ Where This Comes From:

Start with the Bellman Optimality Equation:

v_*(s) = \max_a \sum_{s',r} p(s',r|s,a)[r + \gamma v_*(s')]

Turn it into an iterative update (use vk instead of v*):

v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)[r + \gamma v_k(s')]

As k β†’ ∞, vk β†’ v* (converges to the optimal value function!)

This single update combines one sweep of policy evaluation with policy improvement. Converges to v* under the same conditions as policy iteration.
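
As a sketch in code (same assumed tabular model format as the earlier examples), the whole algorithm is one loop plus a final policy-extraction step:

```python
def value_iteration(model, states, actions, gamma=0.9, theta=1e-6):
    """Value iteration: each sweep applies the Bellman optimality update,
    i.e. one sweep of evaluation fused with improvement via max."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # extract a greedy policy from the (approximately) optimal values
    policy = {}
    for s in states:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])
             for a in actions}
        policy[s] = max(q, key=q.get)
    return V, policy
```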

The Key Difference: max vs Ξ£

Policy Evaluation (uses Ξ£)

Ξ£a Ο€(a|s) Γ— [...]

"Average over all actions according to the policy"

Value Iteration (uses max)

maxa [...]

"Take the best action" β€” implicitly improving!

🎯 The insight: By using max instead of averaging, we're simultaneously figuring out values AND picking the best actions!

Interactive Value Iteration Demo

[Interactive demo: a 4x4 gridworld with GOAL states in two corners. The optimal values V*(s) start at 0.0; parameters are discount Ξ³ = 0.9 and a reward of -1 per step. The display tracks the number of sweeps and the maximum change Ξ”.]

Value Iteration Update
V(s) ← maxa Ξ£s',r p(s',r|s,a)[r + Ξ³V(s')]

Combines one sweep of policy evaluation with policy improvement.

Value iteration finds V* directly by taking the max over actions at each state. The arrows show the optimal policy extracted from V*.

Policy Iteration
  β€’ Full evaluation each iteration
  β€’ Fewer outer iterations
  β€’ More work per iteration

Like doing thorough research before each decision

Value Iteration
  β€’ One sweep per iteration
  β€’ More outer iterations
  β€’ Less work per iteration

Like making quick decisions and adjusting on the fly

Common Mistakes (Avoid These!)

Wrong: "DP can be used when we don't know the model"

Right: DP requires knowing p(s',r|s,a). Without a model, use Monte Carlo or TD methods instead.

Wrong: "Value Iteration and Policy Iteration give different answers"

Right: Both converge to the same optimal solution! They're just different paths to get there.

Wrong: "We need to initialize V(s) correctly for DP to work"

Right: You can initialize V(s) with any values (only terminal states, if any, must be set to 0). DP will still converge, though good initialization can speed things up.

For the Curious Mind

Why does DP matter if it needs the model?

Even though RL often doesn't know the model, DP provides the theoretical foundation. Algorithms like Q-learning are essentially doing DP "in the dark" β€” sampling the model rather than knowing it. Understanding DP helps you understand what all RL algorithms are trying to approximate!

The Contraction Mapping Theorem

Why does DP always converge? There's beautiful math behind it! The Bellman operator is a contraction β€” each iteration brings values closer to the true answer. Like repeatedly pressing "=" on a calculator when computing √2 β€” it always converges!
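
For readers who want the precise statement (a standard result that this box only gestures at): with discount Ξ³ < 1, the Bellman operator T is a Ξ³-contraction in the max norm,

\| T v - T u \|_\infty \leq \gamma \, \| v - u \|_\infty

so each application shrinks the distance between any two value functions by a factor of Ξ³, and repeated application converges to the unique fixed point from any starting values.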

AlphaGo Used DP!

AlphaGo combined Monte Carlo Tree Search (a form of simulation-based DP) with neural networks. During self-play, it performed millions of "mini DP updates" to evaluate board positions. The fundamentals you're learning here powered the AI that beat the world champion!

Section 4.5

Asynchronous Dynamic Programming

Updating states in any order, not necessarily in sweeps.

Think of it like painting a house!

Synchronous DP is like painting every room at once, waiting for all rooms to dry, then applying a second coat to every room.

Asynchronous DP is like painting rooms one by one, or focusing on the rooms that need the most work first. As long as you eventually paint every room, you'll get the job done β€” and sometimes faster!

Flexibility in Updates

Asynchronous DP algorithms update states individually, in any order. They can focus computation on the most relevant states and still converge to the optimal solution.

In-Place DP

Update values immediately, using new values as soon as they're computed. Often converges faster than synchronous sweeps.

πŸ’‘ Like using your latest test score immediately rather than waiting for all tests
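
To make the contrast concrete, here is a sketch of the two sweep styles under the same assumed model format; the synchronous sweep reads only the old values, while the in-place sweep lets later states see values updated earlier in the same pass:

```python
def synchronous_sweep(model, V, states, actions, gamma=0.9):
    """Classic synchronous sweep: every update reads the OLD value estimates."""
    V_new = dict(V)                                   # write into a separate copy
    for s in states:
        V_new[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])
                       for a in actions)
    return V_new

def in_place_sweep(model, V, states, actions, gamma=0.9):
    """In-place sweep: updates overwrite V immediately, so later states in the
    sweep already benefit from the newest estimates (often faster convergence)."""
    for s in states:
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])
                   for a in actions)
    return V
```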

Prioritized Sweeping

Update states with the largest Bellman errors first. Focuses computation where it's most needed.

πŸ’‘ Like studying the subjects you're worst at first
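
Here is a deliberately simplified sketch of the idea. A real implementation keeps a priority queue keyed by Bellman error and re-prioritizes only the predecessors of the state just updated; this version rescans all states each step purely for clarity:

```python
def prioritized_updates(model, V, states, actions, gamma=0.9,
                        theta=1e-6, max_updates=10_000):
    """Repeatedly update the single state with the largest Bellman error."""
    def backup(s):
        # max_a of the expected one-step return from s under the current V
        return max(sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])
                   for a in actions)

    for _ in range(max_updates):
        errors = {s: abs(backup(s) - V[s]) for s in states}
        s = max(errors, key=errors.get)         # most "inconsistent" state first
        if errors[s] < theta:
            break                               # all states are (nearly) consistent
        V[s] = backup(s)
    return V
```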

Real-Time DP

Update only states the agent actually visits. Interleaves computation with action selection.

πŸ’‘ Like learning about places only when you travel there

Convergence Guarantee

Asynchronous DP converges as long as all states continue to be updated (no state is "starved" of updates indefinitely). The order doesn't matter for correctness.

Section 4.6

Generalized Policy Iteration

The unifying principle behind all DP and RL algorithms.

This is one of the most important ideas in RL!

GPI (Generalized Policy Iteration) is the big picture that ties everything together. No matter what specific algorithm you use, they all share the same fundamental pattern:

  • Evaluation: Estimate how good your current strategy is
  • Improvement: Make your strategy greedier based on those estimates

These two processes "dance" together, pushing each other toward the optimal solution!

Generalized Policy Iteration (GPI)

GPI refers to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

The GPI Dance

[Diagram: the policy Ο€ moves through policy space and the value function V through value-function space, converging together on Ο€* and V*.]

Policy (Ο€)

Moves toward being greedy w.r.t. current V

Value (V)

Moves toward being consistent with Ο€

The Dance of GPI
  β€’ Evaluation: V moves toward VΟ€
  β€’ Improvement: Ο€ moves toward greedy(V)
  β€’ They spiral together toward the optimal fixed point

Watch how policy and value chase each other, spiraling toward the optimal solution. This interaction is the essence of almost all RL algorithms.

Key Insight

Evaluation and improvement are both driving toward the same goal from different directions:

Evaluation

Makes V consistent with current Ο€

"Given this strategy, how good is each state?"

Improvement

Makes Ο€ greedy w.r.t. current V

"Given these values, what's the best action?"

When both processes stabilize, we have found Ο€* and v*. This is the fixed point where both conditions are satisfied simultaneously.

Why GPI Matters for the Rest of the Book

Every RL algorithm you'll learn follows this GPI pattern:

Monte Carlo β†’ evaluates by averaging complete episode returns
TD Learning β†’ evaluates by bootstrapping from the next state
Q-Learning β†’ combines evaluation and improvement in one update

They differ in how they evaluate and improve, but the fundamental dance is always the same!

Section 4.7

Efficiency of Dynamic Programming

DP is surprisingly efficient compared to alternatives.

Is DP slow? Actually, it's remarkably fast!

You might think: "Iterating over every state sounds slow!" But compare it to the alternative β€” trying every possible policy. With 10 states and 3 actions, there are 3^10 = 59,049 deterministic policies to check! DP runs in polynomial time, which is far faster than enumerating every possibility.

Polynomial Time

DP methods find optimal policies in time polynomial in |S| and |A|. This is much better than the exponential time required by direct search.

πŸ’‘ O(|S|Β² Γ— |A|) vs O(|A|^|S|) β€” a massive difference!

Curse of Dimensionality

The main limitation is the size of the state space, which grows exponentially with the number of state variables. This is unavoidable for any method.

πŸ’‘ A robot with 10 joints, each with 100 positions = 100^10 states!

Comparison with Alternatives

Method               Time Complexity                Requires Model?
Policy Iteration     O(|A| Β· |S|Β²) per iteration    Yes
Value Iteration      O(|A| Β· |S|Β²) per iteration    Yes
Linear Programming   O(|S|Β³) one-shot               Yes
Direct Search        O(|A|^|S|)                     Yes

🎯 Key insight: Even though DP visits every state, it cleverly reuses computation. Each state's value informs its neighbors, so we don't redo work unnecessarily.

Practical Considerations

In practice, asynchronous methods and good state ordering can dramatically speed up convergence. Real-world DP implementations often converge much faster than worst-case bounds suggest.

Section 4.8

Summary

Key takeaways from Chapter 4.

Quick Reference Glossary

Dynamic Programming

A method for solving complex problems by breaking them into simpler subproblems

Policy Evaluation

Computing vΟ€ β€” how good is this policy?

Policy Improvement

Making a policy better by acting greedily w.r.t. current values

Policy Iteration

Alternating full evaluation and improvement until optimal

Value Iteration

One sweep of evaluation + improvement combined using max

GPI (Generalized Policy Iteration)

The unifying framework: evaluation and improvement interacting

Bootstrapping

Updating estimates based on other estimates (not final answers)

Bellman Equation

The recursive relationship that DP exploits

Formula Cheat Sheet

Policy Evaluation (iterative update):

v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]

Policy Improvement (greedy action):

\pi'(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)[r + \gamma v_\pi(s')]

Value Iteration (one-step update):

v_{k+1}(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]

🎯 Key difference: Policy Evaluation uses Ξ£a Ο€(a|s) (average over policy), Value Iteration uses maxa (pick the best).
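
If you have been following the code sketches in this chapter, the whole toolkit can be exercised on the hypothetical two-state model from the start of the chapter in a few lines (all names are the ones assumed in those sketches, not from the book):

```python
states = ["s0", "s1"]
actions = ["stay", "go"]
random_policy = {s: {a: 1.0 / len(actions) for a in actions} for s in states}

V_pi = policy_evaluation(model, random_policy, states)        # prediction: v_pi
pi_star, V_star = policy_iteration(model, states, actions)    # control via policy iteration
V_star2, pi_star2 = value_iteration(model, states, actions)   # control via value iteration
# Both control methods should agree (up to a small numerical tolerance).
```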

What We Learned in This Chapter

Dynamic Programming is like having a superpower for planning β€” but only when you have complete knowledge of how the world works. We learned three main algorithms:

  • πŸ“Š Policy Evaluation β€” Figure out how good a strategy is
  • πŸ“ˆ Policy Iteration β€” Evaluate β†’ Improve β†’ Repeat until optimal
  • ⚑ Value Iteration β€” A faster version that combines both steps

The GPI framework unifies all of these, showing that RL is fundamentally about evaluation and improvement working together.

DP methods require a complete model of the environment (p(s',r|s,a))

Policy evaluation computes v_Ο€ by iteratively applying the Bellman equation

Policy improvement makes a policy greedy with respect to its value function

Policy iteration alternates evaluation and improvement until convergence

Value iteration combines both into a single update using the max operator

Asynchronous DP allows flexible update ordering while maintaining convergence

GPI is the unifying principle: evaluation and improvement chase each other to optimality

DP is polynomial in |S| and |A|, but state spaces can be exponentially large

The Catch: You Need a Model!

Everything we learned in this chapter requires knowing p(s',r|s,a) β€” the complete dynamics of the environment. But in the real world, we often don't know this!

❌ A robot doesn't know exactly how slippery each surface is
❌ A trader doesn't know exactly how the market will react
❌ A game-playing agent doesn't always know the opponent's strategy

Next up: Methods that learn from experience without needing a model!

Looking Ahead

DP requires knowing the environment model. In the next chapters, we'll explore Monte Carlo and Temporal-Difference methods that learn directly from experience without a model. These are the foundations of modern RL.