Chapter 8

Planning and Learning

Combine model-based planning with model-free learning. Use experience to build models, then use models to accelerate learning.

What if you could practice in your imagination?

So far, we've learned from real experience — actually trying things and seeing what happens. But what if you could build a mental model of how the world works, then practice in your imagination? That's planning! This chapter shows how to combine learning (from real experience) with planning (from simulated experience) to learn faster than either approach alone.

  • Models — predict the environment
  • Dyna architecture — learn and plan together
  • MCTS — AlphaGo's secret

The Big Picture: Two Ways to Improve a Policy

Model-Free (Direct RL)

Learn directly from real experience. Trial and error in the actual environment.

Chapters 5-7: MC, TD, n-step

Model-Based (Indirect RL)

Build a model, then plan using simulated experience from the model.

Chapter 4: DP (with known model)

This chapter's insight: Why choose? We can do both! Learn a model from experience AND use that model to plan. This is the Dyna architecture.

Section 8.1

Models and Planning

What is a model, and how do we use it?

Think of it like a mental simulation!

A model is your internal representation of how the world works. If you throw a ball, you can imagine where it will land before you throw it. That's your model of physics! In RL, a model predicts what state and reward you'll get if you take a certain action. Planning means using this model to figure out a good policy — like mentally rehearsing before you act.

What is a Model?

A model of the environment is anything an agent can use to predict how the environment will respond to its actions.

\text{Model}: (S, A) \rightarrow (S', R)

Two Types of Models:

Distribution Model

Gives the full probability distribution: p(s', r | s, a)

Example: "From state 5 with action LEFT, there's 80% chance of state 4 and 20% chance of state 5"

Sample Model

Generates sample transitions: (s, a) → one (s', r) according to probabilities

Example: "From state 5 with action LEFT → state 4, reward 0" (one sample)

Distribution Models vs. Sample Models

Distribution Model
  • Gives probabilities p(s', r | s, a)
  • Can do expected updates
  • More powerful but harder to learn
  • Needed for dynamic programming

Sample Model
  • Produces sample transitions
  • Can only do sample updates
  • Easier to learn and use
  • Often sufficient in practice
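To make the distinction concrete, here is a minimal Python sketch (the class names `DistributionModel` and `SampleModel` are illustrative, not from the text): a distribution model hands back the full set of outcomes and their probabilities, while a sample model hands back one draw.

import random

class DistributionModel:
    """Maps (s, a) to a list of (prob, next_state, reward) outcomes."""
    def __init__(self):
        self.p = {}                      # (s, a) -> [(prob, s', r), ...]

    def outcomes(self, s, a):
        return self.p[(s, a)]            # full distribution: usable for expected updates

class SampleModel:
    """Maps (s, a) to a single sampled (next_state, reward)."""
    def __init__(self, distribution_model):
        self.dist = distribution_model

    def sample(self, s, a):
        outcomes = self.dist.outcomes(s, a)
        probs = [prob for prob, _, _ in outcomes]
        _, s2, r = random.choices(outcomes, weights=probs, k=1)[0]
        return s2, r                     # one draw: usable for sample updates

# Example from the text: from state 5 with LEFT, 80% -> state 4, 20% -> stay in 5
dist = DistributionModel()
dist.p[(5, "LEFT")] = [(0.8, 4, 0.0), (0.2, 5, 0.0)]
samp = SampleModel(dist)
print(samp.sample(5, "LEFT"))            # e.g. (4, 0.0)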
What is Planning?

Planning is using a model to improve a policy without interacting with the real environment.

The Key Insight:

Planning and learning are the same computation applied to different data sources:

Learning: real experience → update values

Planning: simulated experience → update values
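A quick sketch of this insight (the `q_update` helper, the Q dictionary, and `actions` are illustrative assumptions): the very same Q-learning update can be fed either a real transition or a simulated one from a model.

ALPHA, GAMMA = 0.1, 0.95

def q_update(Q, s, a, r, s2, actions):
    """One Q-learning update; the caller decides where (s, a, r, s2) came from."""
    target = r + GAMMA * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Learning: (s, a, r, s2) observed in the real environment
#   q_update(Q, s, a, r, s2, actions)

# Planning: (s, a, r, s2) generated by a model
#   s2, r = model.sample(s, a)
#   q_update(Q, s, a, r, s2, actions)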

For the Curious Mind

Humans Plan Constantly!

When you consider whether to take an umbrella, you're running a mental simulation: "If it rains and I don't have an umbrella, I'll get wet." Your brain has a model of weather and its consequences. Chess grandmasters "look ahead" many moves — that's planning!

Model Learning in Deep RL

Modern systems like MuZero learn a model from scratch and use it to plan, achieving superhuman performance in Go, chess, and Atari games — without even knowing the rules! World models are a hot research topic.

Section 8.2

Dyna: Integrated Planning, Acting, and Learning

The architecture that combines everything.

The best of both worlds!

Dyna is an elegant architecture that does three things at once: (1) learns from real experience, (2) builds a model from that experience, and (3) plans using the model. After each real step, it does many simulated steps, essentially getting "free" practice. This makes learning much faster!

The Dyna Architecture

Real experience (S, A → R, S') is used in two ways at once:

  • Direct RL — update Q(S,A) from the real transition
  • Model learning — store the observed outcome R, S' in Model(S,A)

The learned model is then used for planning: simulated transitions generate additional Q updates.

After each real step, Dyna does n simulated steps (often n=5 to n=50), dramatically accelerating learning.

Dyna-Q Algorithm

Initialize Q(s,a) and Model(s,a) for all s, a
Loop forever:
    (a) S ← current state
    (b) A ← ε-greedy(S, Q)
    (c) Take action A; observe R, S'
    (d) Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]    # direct RL
    (e) Model(S,A) ← R, S'                                    # model learning
    (f) Repeat n times:                                       # planning
            S ← random previously observed state
            A ← random action previously taken in S
            R, S' ← Model(S,A)
            Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]
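Below is a minimal tabular Dyna-Q sketch in Python. It assumes an `env` object with `reset()` returning a state and `step(a)` returning `(next_state, reward, done)`, plus a discrete `actions` list; those names and signatures are assumptions for illustration, not part of the text.

import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                 # Q[(s, a)], defaults to 0
    model = {}                             # model[(s, a)] = (r, s')

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def update(s, a, r, s2):
        # terminal states are never updated, so bootstrapping there contributes 0
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            s2, r, done = env.step(a)      # (c) act in the real environment
            update(s, a, r, s2)            # (d) direct RL
            model[(s, a)] = (r, s2)        # (e) model learning
            for _ in range(n_planning):    # (f) planning from simulated experience
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return Q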

Why Dyna Works So Well

Each real experience is used multiple times:

  1. Once for direct RL (the immediate update)
  2. Once to update the model (stored for future use)
  3. Many times via planning (replayed from the model)

Result: Learning is much more sample efficient — you need fewer real interactions!

Example: Dyna in a Maze

In a simple maze task, Dyna-Q with n=50 planning steps learns to find the goal in about 3 episodes. Without planning (n=0), it takes about 25 episodes!

  • n = 0 (no planning): ~25 episodes
  • n = 5 (modest planning): ~6 episodes
  • n = 50 (heavy planning): ~3 episodes

For the Curious Mind

Is Dyna like dreaming?

Maybe! Some neuroscientists believe sleep helps consolidate memories by "replaying" experiences — similar to Dyna's planning phase. Your brain might be doing model-based RL while you dream!

Experience Replay is Everywhere!

The idea of storing and replaying experience is central to modern deep RL. DQN (DeepMind's Atari-playing agent) uses a "replay buffer" — essentially Dyna without an explicit model. Prioritized Experience Replay takes this further, as we'll see in Section 8.4!

Section 8.3

When the Model Is Wrong

Dealing with inaccurate or changing environments.

What if the world changes?

Models are learned from past experience. But what if the environment changes? A new shortcut might open up, or an old path might become blocked. If your model is outdated, planning from it can actually hurt you! This section explores how to detect and handle model errors.

Blocking Maze

The environment becomes worse — a path gets blocked.

The agent has learned the optimal path. Then a wall appears! The old model still says that path works. The agent must unlearn the old route and find a new one.

Real experience quickly corrects the model — exploration helps!

Shortcut Maze

The environment becomes better — a shortcut opens up.

The agent has learned a good path. Then a better shortcut opens! The model doesn't know about it. The agent must explore to discover the improvement.

This is harder — ε-greedy might never find it!

Dyna-Q+ Solution

Dyna-Q+ adds a bonus for visiting states that haven't been tried recently.

R' = R + \kappa\sqrt{\tau}

Term Breakdown:

R' = modified reward used in planning
R = actual reward from the model
κ = small constant (curiosity bonus weight)
τ = time steps since (s, a) was last tried in the real environment

States that haven't been visited for a long time get a curiosity bonus, encouraging the agent to explore and discover if things have changed.
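A sketch of how this bonus slots into Dyna-Q's planning step (the `kappa` value and the `last_tried` bookkeeping are illustrative assumptions):

import math

kappa = 0.001                                  # small curiosity-bonus weight

# During real interaction, record when each (s, a) was last tried:
#     last_tried[(s, a)] = t                   # t = current time step

# During planning, add the bonus to the model's reward before the Q update:
def planning_reward(model_reward, s, a, t, last_tried):
    tau = t - last_tried.get((s, a), 0)        # steps since (s, a) was last tried for real
    return model_reward + kappa * math.sqrt(tau)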

The Exploration-Exploitation Dilemma Returns!

This is another form of the exploration-exploitation tradeoff! Should you exploit your current (possibly outdated) model, or explore to see if things have changed? Dyna-Q+ adds systematic exploration by making long-untried transitions more attractive.

Section 8.4

Prioritized Sweeping

Focus planning where it matters most.

Not all updates are equally useful!

Basic Dyna picks random state-action pairs to update during planning. But that's wasteful! If we just discovered a big reward, we should prioritize updating states that lead to it. Prioritized Sweeping keeps a queue of updates sorted by their expected impact, so we always work on the most important ones first.

Prioritized Sweeping

Maintain a priority queue of state-action pairs based on the magnitude of their TD errors. Update the highest-priority pairs first, then add predecessors to the queue.

Priority = |TD Error|

P = |R + \gamma \max_a Q(S', a) - Q(S, A)|

Big TD error → big priority → update first!

How Prioritized Sweeping Works

  1. After a real transition (S, A → R, S'), compute the TD error.
  2. If the error is above a threshold θ, add (S, A) to the priority queue.
  3. Pop the highest-priority pair, update it, then check all its predecessors.
  4. If a predecessor would have a large TD error, add it to the queue too.

Why Update Predecessors?

If the value of state S' just changed significantly, then states that lead to S' are probably also wrong! By working backwards, we propagate value changes efficiently.

Think of it like this: you discover treasure at location X. Now all paths to X become more valuable. Prioritized sweeping automatically updates them!
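Here is a compact sketch of the queue logic using Python's heapq (a min-heap, so priorities are negated). The `predecessors` mapping from a state to the (s, a) pairs known to lead to it, and the other names, are illustrative assumptions, not a definitive implementation.

import heapq
from itertools import count

def prioritized_sweeping_step(Q, model, predecessors, s, a, r, s2,
                              actions, alpha=0.1, gamma=0.95,
                              theta=1e-4, n_planning=10):
    def td_error(s, a, r, s2):
        return r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)]

    queue, tie = [], count()                           # min-heap of (-|error|, tiebreak, s, a)
    p = abs(td_error(s, a, r, s2))
    if p > theta:
        heapq.heappush(queue, (-p, next(tie), s, a))

    for _ in range(n_planning):
        if not queue:
            break
        _, _, s, a = heapq.heappop(queue)              # highest-priority pair
        r, s2 = model[(s, a)]
        Q[(s, a)] += alpha * td_error(s, a, r, s2)     # update it
        for (ps, pa) in predecessors.get(s, ()):       # states predicted to lead into s
            pr, _ = model[(ps, pa)]
            p = abs(td_error(ps, pa, pr, s))
            if p > theta:
                heapq.heappush(queue, (-p, next(tie), ps, pa))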

For the Curious Mind

Prioritized Experience Replay in Deep RL

This idea evolved into Prioritized Experience Replay (PER), used in modern deep RL. Instead of sampling uniformly from a replay buffer, PER samples transitions with high TD errors more often. It's a key component of state-of-the-art systems like Rainbow DQN!

Section 8.5

Expected vs. Sample Updates

When to use each type of update.

Average over all outcomes vs. sample one

When updating a value, we have two choices: expected updates average over all possible next states (like DP), while sample updates use just one sampled next state (like TD). Expected updates are more accurate but expensive. Sample updates are cheap but noisy. Which is better?

Expected Update

Q(s,a) \leftarrow \sum_{s',r} p(s',r|s,a)\,[r + \gamma \max_{a'} Q(s',a')]
  • Averages over all possible outcomes
  • No sampling error (zero variance)
  • Requires a distribution model
  • Cost: O(branching factor)

Sample Update

Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]
  • Uses one sampled transition
  • Has sampling variance
  • Only needs a sample model
  • Cost: O(1)

When to Use Each

Expected Updates Win When:

  • Low branching factor (few possible next states)
  • You have a distribution model anyway
  • You want to minimize total updates

Sample Updates Win When:

  • High branching factor (many possible outcomes)
  • State space is huge (can't afford full sweeps)
  • You only have a sample model
Key Insight

For large branching factors (b), sample updates become more efficient. With b possible next states, one sample update costs 1/b of an expected update, so you can do b sample updates for the same computation — often better!
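A sketch of the two update forms side by side, given a distribution model `p[(s, a)]` listing (prob, s', r) outcomes (the dictionary layout and function names are illustrative):

import random

def expected_update(Q, p, s, a, actions, gamma=0.95):
    """Average over all outcomes: cost grows with the branching factor."""
    Q[(s, a)] = sum(prob * (r + gamma * max(Q[(s2, b)] for b in actions))
                    for prob, s2, r in p[(s, a)])

def sample_update(Q, p, s, a, actions, alpha=0.1, gamma=0.95):
    """One sampled outcome: constant cost per update, but noisy."""
    outcomes = p[(s, a)]
    prob, s2, r = random.choices(outcomes, weights=[o[0] for o in outcomes], k=1)[0]
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])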

Section 8.6

Trajectory Sampling

Focusing computation on likely states.

Why update states you'll never visit?

Traditional DP does exhaustive sweeps over all states. But in a huge state space, most states might never be reached! Trajectory sampling focuses on states that the current policy actually visits, ignoring irrelevant ones. This can be much more efficient!

Exhaustive Sweeps

Update every state in some order. Good for small state spaces, but wastes time on unreachable states.

Trajectory Sampling

Simulate trajectories from the start state. Only update states that are actually visited under the policy.

The On-Policy Distribution Advantage

Trajectory sampling naturally focuses on states according to their on-policy distribution — states you're likely to actually encounter. This means:

  • Common states get many updates (important to get right!)
  • Rare states get few updates (less critical)
  • Unreachable states get zero updates (no waste!)
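A sketch of planning by trajectory sampling, assuming a sample model `model.sample(s, a)`, a start state, and an `is_terminal` check (all hypothetical names): updates land only on states that simulated trajectories under the current policy actually reach.

import random

def plan_by_trajectory_sampling(Q, model, start_state, actions, is_terminal,
                                n_trajectories=100, max_steps=200,
                                alpha=0.1, gamma=0.95, epsilon=0.1):
    for _ in range(n_trajectories):
        s = start_state
        for _ in range(max_steps):
            if is_terminal(s):
                break
            # follow a soft version of the current greedy policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r = model.sample(s, a)                 # simulated transition
            target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # only visited states get updated
            s = s2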
Section 8.7

Real-time Dynamic Programming

DP that only updates visited states.

DP, but only where you actually go!

Real-time Dynamic Programming (RTDP) is value iteration applied only to states visited during real-time interaction. Instead of sweeping all states, we do asynchronous DP updates only at states we actually encounter. This can find optimal policies much faster when we don't need values for all states.

RTDP Algorithm

Repeat (for each episode):
    (a) S ← start state
    (b) While S is not terminal:
            V(S) ← max_a Σ_{s',r} p(s',r|S,a)[r + γ V(s')]    # DP update at the current state
            A ← greedy action from S
            Take A, observe S'
            S ← S'
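A minimal RTDP sketch in Python, assuming a distribution model `p[(s, a)]` of (prob, s', r) outcomes and an environment whose `step(a)` returns `(next_state, reward, done)` (assumed names and signatures):

def rtdp_episode(V, p, env, actions, gamma=1.0):
    def backed_up_value(s, a):
        return sum(prob * (r + gamma * V.get(s2, 0.0))
                   for prob, s2, r in p[(s, a)])

    s, done = env.reset(), False
    while not done:
        # full expected (DP) update, but only at the state we are actually in
        V[s] = max(backed_up_value(s, a) for a in actions)
        a = max(actions, key=lambda b: backed_up_value(s, b))   # act greedily
        s, _, done = env.step(a)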

Key Properties of RTDP

  • Eventually focuses its updates on states reachable under optimal policies (for stochastic shortest-path problems)
  • Converges to optimal policy without visiting all states
  • Natural way to interleave planning and acting
Section 8.8

Planning at Decision Time

Using planning to select individual actions.

Two ways to use a model

So far, we've used models for background planning — building up a value function or policy over time. But there's another approach: decision-time planning. Here, we plan only when we need to act, searching ahead from the current state to find the best action. This is what chess engines do!

Background Planning

Gradually improves a policy/value function over many states.

  • Dyna, RTDP
  • Good when computation is cheap
  • Value function can be reused

Decision-Time Planning

Deep search from current state only. Results typically discarded after.

  • Heuristic search, MCTS
  • Good when focus on the current state matters
  • Fresh computation each step
When to Use Decision-Time Planning
  • When you need high-quality decisions in specific states (like games)
  • When there's time to think between actions
  • When states rarely repeat (no point caching values)
  • When the action space is small enough to search
Section 8.9

Heuristic Search

Classical AI planning meets RL.

Look ahead in a tree of possibilities!

Heuristic search builds a tree of possible futures from the current state: "If I do X, then Y might happen, then I could do Z..." At the leaves, we use a heuristic (estimate) of the value. Then we back up these values to find the best action now. This is the basis of chess engines!

How Heuristic Search Works

  1. From the current state, build a tree of possible action sequences.
  2. At leaf nodes (the depth limit), evaluate with a heuristic value V(s).
  3. Back up values toward the root: max over actions, weighted sum over next states.
  4. Choose the action at the root with the highest backed-up value.

Connection to RL

Heuristic search is essentially decision-time planning with expected updates:

  • The "heuristic" at the leaves is our value function V(s)
  • Backing up is just applying Bellman updates in the search tree
  • The deeper we search, the less we rely on the heuristic
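A depth-limited search sketch in the same spirit, assuming a distribution model `p[(s, a)]` of (prob, s', r) outcomes and a heuristic value table `V` used at the leaves (illustrative names, not a chess-engine implementation):

def search_value(s, depth, p, V, actions, gamma=0.95):
    """Max over actions, expectation over next states; heuristic V at the depth limit."""
    if depth == 0 or (s, actions[0]) not in p:      # leaf (or unmodeled state): use heuristic
        return V.get(s, 0.0)
    return max(sum(prob * (r + gamma * search_value(s2, depth - 1, p, V, actions, gamma))
                   for prob, s2, r in p[(s, a)])
               for a in actions)

def choose_action(s, depth, p, V, actions, gamma=0.95):
    """Pick the root action with the highest backed-up value."""
    def backed_up(a):
        return sum(prob * (r + gamma * search_value(s2, depth - 1, p, V, actions, gamma))
                   for prob, s2, r in p[(s, a)])
    return max(actions, key=backed_up)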
Section 8.10

Rollout Algorithms

Monte Carlo estimation for decision-time planning.

Simulate games to estimate action values!

A rollout simulates a complete episode from the current state using some simple policy (the "rollout policy"). By doing many rollouts for each action and averaging the returns, we get Monte Carlo estimates of action values. This is simpler than building a full search tree!

Rollout Algorithm

# At state s, estimate Q(s, a) for each action a:
For each action a:
    Simulate k episodes that start with action a
    Follow the rollout policy π for all subsequent actions
    Q(s, a) ← average return over the k episodes
Choose the action with the highest Q(s, a)
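A minimal rollout sketch, assuming a sample model `model.sample(s, a)`, a `rollout_policy(s)` function, and an `is_terminal` check (all hypothetical names):

def rollout_value(model, s, a, rollout_policy, is_terminal, gamma=1.0, max_steps=200):
    """Return of one simulated episode starting with action a, then following the rollout policy."""
    G, discount = 0.0, 1.0
    for _ in range(max_steps):
        s, r = model.sample(s, a)
        G += discount * r
        discount *= gamma
        if is_terminal(s):
            break
        a = rollout_policy(s)
    return G

def rollout_action(model, s, actions, rollout_policy, is_terminal, k=50):
    """Average k rollouts per action and pick the best: one step of policy improvement."""
    def q(a):
        return sum(rollout_value(model, s, a, rollout_policy, is_terminal)
                   for _ in range(k)) / k
    return max(actions, key=q)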

Key Properties

  • Policy improvement: Rollout policy + decision-time search ≥ rollout policy alone
  • Simple: No tree structure needed, just simulate and average
  • Parallelizable: Different rollouts can run on different processors

For the Curious Mind

How can averaging over a mediocre policy help?

Great question! The rollout policy might be weak, but by trying each action first and then following the rollout policy, we're essentially doing one step of policy improvement. This is guaranteed to be at least as good as the rollout policy! Do this at every step, and you get significant improvement.

Section 8.11

Monte Carlo Tree Search

The algorithm behind AlphaGo.

The algorithm that conquered Go!

Monte Carlo Tree Search (MCTS) combines the best of heuristic search and rollouts. It builds a search tree gradually, focusing computation on the most promising parts. Each iteration: (1) select a path through the tree, (2) expand by adding a new node, (3) simulate a rollout from there, (4) back up the result. This is the core of AlphaGo and AlphaZero!

The Four Phases of MCTS

1. Selection

Start at root, follow the tree by selecting actions (using UCB or similar) until reaching a leaf or unexpanded node.

2. Expansion

Add one or more children to the tree, representing new states that haven't been explored yet.

3. Simulation

From the new node, do a rollout to episode end using a fast rollout policy. Get a return.

4. Backup

Propagate the result back up the tree, updating visit counts and value estimates for all nodes on the path.

Selection with UCB

During selection, MCTS often uses Upper Confidence Bound (UCB):

\text{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}

Term Breakdown:

Q(s, a) = exploitation term (average value so far)
c = exploration constant (hyperparameter)
N(s) = number of times state s has been visited
N(s, a) = number of times action a has been tried from s

This balances exploitation (high Q) with exploration (rarely-tried actions get bonus).
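A sketch of UCB-based selection over a node's children (the node structure with `N`, `Q`, and a `children` dict keyed by action is an illustrative assumption):

import math

def ucb_score(node, a, c=1.4):
    child = node.children[a]
    if child.N == 0:
        return float("inf")                    # untried actions get explored first
    return child.Q + c * math.sqrt(math.log(node.N) / child.N)

def select_action(node, c=1.4):
    """Selection phase: pick the child action maximizing the UCB score."""
    return max(node.children, key=lambda a: ucb_score(node, a, c))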

Why MCTS is Powerful

  • Asymmetric tree: Spends more effort on promising branches
  • Anytime: Can stop anytime and return best action so far
  • Improves with time: More iterations = better decisions
  • Simple rollout policy: the rollout policy can be fast and weak; the search only needs a sample model (a simulator) to run it

For the Curious Mind

AlphaGo and AlphaZero

DeepMind's AlphaGo (2016) used MCTS with deep neural networks as the value function and rollout policy. AlphaZero (2017) simplified this by dropping the rollout entirely — the neural network is so good it replaces simulation! This mastered Go, Chess, and Shogi from scratch.

MuZero: Learning Without Rules

MuZero (2019) went even further: it learns its own model of the environment dynamics! It doesn't even need to know the rules of the game. It learns to predict relevant aspects of the future, then uses MCTS to plan. This works on Atari games, Go, chess, and more — all with the same algorithm!

Section 8.12

Summary

Key takeaways from Chapter 8.

Quick Reference Glossary

Model — anything an agent can use to predict (s, a) → (s', r)
Planning — using a model to improve the policy without real interaction
Dyna — architecture combining direct RL, model learning, and planning
Prioritized Sweeping — focus updates on states with high TD error
RTDP — real-time DP that only updates visited states
Decision-Time Planning — search from the current state to select an action
Rollout — simulate episodes with a rollout policy to estimate action values
MCTS — Select → Expand → Simulate → Backup (AlphaGo)

Formula Cheat Sheet

Dyna-Q Update (Direct RL):

Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_a Q(S',a) - Q(S,A)]

Dyna-Q+ Curiosity Bonus:

R' = R + \kappa\sqrt{\tau}

Priority for Prioritized Sweeping:

P = |R + \gamma \max_a Q(S', a) - Q(S, A)|

UCB for MCTS Selection:

\text{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}

Common Mistakes (Avoid These!)

Wrong: "Model-based RL is always better because it's more sample efficient"

Right: Model-based is more sample efficient, but learning accurate models can be very hard in complex domains. Model errors can hurt! Choose based on your problem.

Wrong: "MCTS replaces the value function"

Right: MCTS uses a value function (or rollouts) at the leaves! In AlphaZero, the neural network provides both policy and value estimates. MCTS refines these through search.

What We Learned in This Chapter

We can combine learning (from real experience) with planning (from simulated experience) for faster, more efficient RL. Models let us "practice in our imagination."

  • 🏗️ Models — Predict what happens when we act
  • 🔄 Dyna — Learn model + use it for planning
  • 📊 Prioritized Sweeping — Update what matters most
  • 🌳 MCTS — Build search tree with rollouts (AlphaGo!)

  • Models predict environment responses; planning uses models to improve policies
  • Dyna integrates direct RL, model learning, and planning in one architecture
  • Model errors can hurt; Dyna-Q+ encourages exploration to find changes
  • Prioritized sweeping focuses computation on high-impact updates
  • Expected updates are accurate but expensive; sample updates are cheap but noisy
  • Trajectory sampling focuses on states that actually matter to the policy
  • Decision-time planning (heuristic search, MCTS) searches at action selection
  • MCTS builds asymmetric trees, focusing on promising branches
  • Rollout algorithms improve any base policy through Monte Carlo estimation

The Key Insight

Planning and learning are the same computation applied to different data sources. Learning uses real experience; planning uses simulated experience from a model. By doing both, we get the sample efficiency of model-based methods with the robustness of model-free learning.

Looking Ahead

This chapter completes Part I of the book — tabular methods! In Part II, we'll extend these ideas to handle continuous state spaces using function approximation. The same principles apply, but now our value functions will be neural networks, linear functions, and more!