Chapter 5

Monte Carlo Methods

Learn from complete episodes without knowing the environment model. MC methods average sample returns to estimate value functions.

Think of it like learning from experience!

Remember how Dynamic Programming needed a complete map of the environment? That's not realistic in the real world. Monte Carlo methods are different: they learn by actually doing things and seeing what happens. It's like learning to cook by actually cooking, not by reading a recipe book! You try something, see the result, and update your understanding.

Model-Free

No dynamics needed

Episodic

Complete episodes

Sample Averages

Unbiased estimates

The Big Picture: Learning from Episodes

Monte Carlo (MC) methods learn by simulating or experiencing complete episodes, then using the actual returns to update value estimates.

🎲 Why "Monte Carlo"?

The name comes from the famous casino in Monaco! Just like gambling involves randomness and averaging outcomes, MC methods use random sampling to estimate values.

🎯 The Core Idea

1. Play through a complete episode
2. Record what states you visited and what return you got
3. Update each state's value as the average of observed returns
Section 5.1

Monte Carlo Prediction

Estimating state values by averaging sample returns.

Monte Carlo Prediction

Given a policy π, estimate vπ(s) by averaging returns observed after visiting state s. No model of the environment is needed, just sample episodes.

Think of it like tracking restaurant visits!

Imagine you're trying to estimate which neighborhood has the best restaurants. Instead of reading reviews (the way DP uses a model), you actually eat at restaurants and track your satisfaction:

  • Visit neighborhood A, have a great meal: +10 satisfaction
  • Visit again later, mediocre meal: +5 satisfaction
  • After many visits, average = (10+5+8+...)/n = 7.5

That average (7.5) is your value estimate for that state!

First-Visit MC

Average returns only from the first time s is visited in each episode.

Unbiased, simpler analysis

💡 Like counting only your first impression of a restaurant

Every-Visit MC

Average returns from every visit to s in each episode.

Uses more data, also converges

💡 Like counting every meal, even if you visit twice in one day
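To make the two definitions concrete, here is a minimal first-visit MC prediction sketch in Python. It assumes the environment is wrapped in a `generate_episode(policy)` helper (an illustrative name, not from any particular library) that returns one complete episode as a list of (state, action, reward) triples.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate v_pi(s) by averaging returns from the first visit to each state."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)      # [(S0, A0, R1), (S1, A1, R2), ...]
        states = [s for (s, a, r) in episode]
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each time step.
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:             # first-visit check
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```

Dropping the `if s not in states[:t]` check (so every occurrence updates the average) turns this into every-visit MC.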

Interactive First-Visit MC Demo

State Values (First-Visit MC)

[Interactive demo: a row of states between two terminal cells; each state shows its value estimate (initially 0.0) and visit count (n=0), alongside counters for episodes run and maximum visits.]
First-Visit MC Update
V(s) ← average(Returns(s))

Each state's value is the average of all returns observed after first visiting that state.


Watch how values converge as more episodes are sampled. States visited more often have more accurate estimates.

MC vs DP

Dynamic Programming
  • Requires complete model
  • Bootstraps (uses estimates)
  • Updates all states

Like planning with a map

Monte Carlo
  • Model-free (learns from samples)
  • No bootstrapping (uses actual returns)
  • Updates visited states only

Like learning by exploring

Why This Works (Law of Large Numbers)

As you collect more and more samples, the average of those samples converges to the true expected value. It's the same principle behind opinion polls: ask enough people, and you'll get a good estimate of what everyone thinks!

For the Curious Mind

Monte Carlo Methods Predate Computers!

The Monte Carlo method was invented during the Manhattan Project in the 1940s. Scientists like Stanislaw Ulam and John von Neumann used random sampling to simulate neutron diffusion, a problem too complex for analytical solutions.

Why "Monte Carlo"?

Named after the famous casino in Monaco! The method involves random sampling, like rolls of dice or spins of a roulette wheel. Ulam's uncle frequently gambled there, inspiring the name.

MC + Neural Networks = Modern Game AI

Monte Carlo Tree Search (MCTS) powered AlphaGo. It combines MC sampling with tree search: "Simulate many random games from this position, then pick the move that wins most often." This is the heart of game-playing AI!

Section 5.2

Monte Carlo Estimation of Action Values

When we don't have a model, we need action values for control.

Why do we need action values?

In DP, knowing state values was enough because we could "look ahead" with the model. But without a model, we're stuck! We need to know: "How good is it to take action A in state S?" That is exactly what Q(s,a) tells us directly.

Why Action Values?

Without a model, state values alone are not enough for control. We need q(s,a) to know which action to take without computing expected values over unknown transitions.

The Exploration Problem

If the policy is deterministic, many state-action pairs may never be visited. We need to ensure exploration.

🎯 The Problem: If you always eat at your favorite restaurant, you'll never discover an even better one! You need to explore sometimes to learn about all your options.

Exploring Starts

Every state-action pair has a nonzero probability of being selected as the start of an episode.

💡 Like randomly teleporting to different restaurants to start

Soft Policies

All actions have nonzero probability: π(a|s) > 0 for all s, a.

💡 Like occasionally trying a random menu item

Blackjack Example

Blackjack is a perfect example for MC methods! We don't need to know the exact probabilities of each card; we just play many hands and average the results.

Blackjack

[Interactive demo: click to deal and play hands; counters track episodes played, win rate, and wins/losses/draws.]
State Space
  • Player sum: 12-21 (10 values)
  • Dealer showing: 1-10 (10 values)
  • Usable ace: yes/no (2 values)
  • Total: 200 states

MC methods learn the value of each state by averaging returns from many episodes. The state is (player sum, dealer showing, usable ace).
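The sketch below shows one way to set this up in Python: a simplified blackjack hand (infinite deck, dealer hits below 17, no special payout for naturals) played under the fixed "stick on 20 or 21" policy, with returns averaged per state. It is an illustrative simplification, not the code behind the demo above.

```python
import random
from collections import defaultdict

CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]   # ace = 1, face cards = 10

def hand_value(cards):
    """Return (total, usable_ace), counting one ace as 11 when that doesn't bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_hand(stick_on=20):
    """Play one hand with a 'stick at stick_on or above' policy; return (states, reward)."""
    player = [random.choice(CARDS), random.choice(CARDS)]
    dealer = [random.choice(CARDS), random.choice(CARDS)]
    states = []
    while True:
        total, usable = hand_value(player)
        if total > 21:
            return states, -1                          # player busts
        if total >= 12:                                # sums below 12: always hit, not tracked
            states.append((total, dealer[0], usable))  # (player sum, dealer showing, usable ace)
        if total >= stick_on:
            break
        player.append(random.choice(CARDS))
    dealer_total, _ = hand_value(dealer)
    while dealer_total < 17:                           # dealer hits below 17
        dealer.append(random.choice(CARDS))
        dealer_total, _ = hand_value(dealer)
    player_total, _ = hand_value(player)
    if dealer_total > 21 or player_total > dealer_total:
        return states, 1
    return (states, 0) if player_total == dealer_total else (states, -1)

# Average the final reward per visited state (gamma = 1, single terminal reward).
totals = defaultdict(lambda: [0.0, 0])
for _ in range(100_000):
    visited, reward = play_hand()
    for s in visited:
        totals[s][0] += reward
        totals[s][1] += 1
V = {s: g / n for s, (g, n) in totals.items()}
print(V.get((20, 10, False)))   # value of (player 20, dealer shows 10, no usable ace)
```

Because γ = 1 and the only reward arrives at the end of the hand, the return from every visited state is just the final outcome, which is why the loop simply averages that reward per state.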

Section 5.3

Monte Carlo Control

Using MC for both prediction and improvement.

From prediction to control!

So far we've learned to evaluate policies. Now we want to improve them. The recipe is the same as DP's GPI: evaluate → improve → repeat. The difference is we use sampled returns instead of expected values.

MC Control with Exploring Starts

1

Generate Episode

Starting from random (s, a), follow π

💡 Randomly pick a starting state and action, then play it out

2

Update Q Values

Q(s,a) ← average(Returns(s,a))

💡 For each state-action pair visited, update its average return

3

Improve Policy

π(s) ← argmax_a Q(s,a)

💡 In each state, switch to the action with the highest Q value
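A compact sketch of this loop, assuming a `sample_episode_from(s0, a0, policy)` helper (hypothetical name) that starts an episode with the given state-action pair and then follows `policy`:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(states, actions, sample_episode_from,
                                num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit, incremental averages)."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # 1. Exploring start: a random state-action pair, then follow the current policy.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = sample_episode_from(s0, a0, lambda s: policy[s])

        # 2. Update Q(s, a) toward the observed return (first-visit).
        pairs = [(s, a) for (s, a, r) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]   # incremental mean
                # 3. Greedy policy improvement at s.
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return Q, policy
```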

GPI in Monte Carlo

MC control follows the same GPI pattern as DP: evaluate the policy, then improve it greedily. The difference is we use sampled returns instead of expected values.

The MC Control Loop

Generate

Play episode

Evaluate

Update Q

Improve

Greedy π

Repeat many times

Convergence

MC control with exploring starts converges to the optimal policy π* under mild conditions. The key is that all state-action pairs must be visited infinitely often.

⚠️ The catch: "Exploring starts" is unrealistic in many real applications. You can't always teleport to arbitrary states!

Section 5.4

Monte Carlo Control without Exploring Starts

Using soft policies to ensure exploration.

A more practical solution!

Since we can't always start from random states, we need another way to explore. The solution: sometimes take random actions! This is called an ε-greedy policy: mostly do what you think is best, but occasionally try something random.

Ξ΅-Soft Policies

A policy is ε-soft if π(a|s) ≥ ε/|A| for all states and actions. This guarantees all actions are tried with at least probability ε/|A|.

ε-Greedy Policy

\pi(a|s) = \begin{cases} 1 - \varepsilon + \varepsilon/|A| & \text{if } a = \arg\max_a Q(s,a) \\ \varepsilon/|A| & \text{otherwise} \end{cases}

📖 What Each Symbol Means:

π(a|s) = probability of taking action a in state s
ε (epsilon) = exploration rate (e.g., 0.1 means 10% exploration)
|A| = number of possible actions (e.g., 4 for up/down/left/right)
argmax_a = the action with the highest Q-value (the "greedy" action)

🔢 Concrete Example (ε = 0.1, 4 actions):

Best action: 1 - 0.1 + 0.1/4 = 0.925 (92.5% chance)

Other actions: 0.1/4 = 0.025 each (2.5% chance)

Check: 0.925 + 3×0.025 = 1.0 ✓ (probabilities sum to 1)

With probability 1-ε, take the greedy action. With probability ε, take a random action. This ensures continued exploration while mostly exploiting.
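Here is the same rule as a minimal sketch: Q is assumed to be a dict keyed by (state, action), and ties in the argmax are broken however `max` orders them.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample an action: explore uniformly with prob epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def epsilon_greedy_prob(Q, state, action, actions, epsilon=0.1):
    """Probability pi(a|s) this policy assigns, matching the formula above."""
    greedy = max(actions, key=lambda a: Q.get((state, a), 0.0))
    base = epsilon / len(actions)
    return (base + 1.0 - epsilon) if action == greedy else base

# Example: with epsilon = 0.1 and 4 actions, the greedy action gets 0.925, the others 0.025.
```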

In Plain English

Let's say ε = 0.1 (10%) and you have 4 possible actions:

90% of the time

Take the action that currently looks best (exploit your knowledge)

10% of the time

Pick a completely random action (explore alternatives)

This way, every action gets tried eventually, but you still mostly act intelligently!

Exploration

Random actions (ε) ensure we keep visiting all state-action pairs.

Exploitation

Greedy actions (1-ε) use our current knowledge to maximize reward.

Suboptimality

On-policy ε-greedy MC converges to the best ε-soft policy, not the optimal policy. We trade optimality for guaranteed exploration. To find the true optimal policy, we need off-policy methods.

The Trade-off

We can't find the truly optimal policy with ε-greedy because we always have that ε chance of doing something random. But we will find the best policy among all policies that have at least ε exploration. Often this is good enough in practice!

For the Curious Mind

The Explore-Exploit Dilemma is Everywhere!

Should you go to your favorite restaurant (exploit) or try a new one (explore)? Should a company invest in existing products or R&D? This dilemma appears in dating, job searching, A/B testing, and even how bacteria search for food!

Better Exploration: UCB and Thompson Sampling

ε-greedy is simple but "dumb": it explores randomly even for actions it's sure about. Advanced methods like UCB (Upper Confidence Bound) and Thompson Sampling explore smartly, with more exploration for uncertain actions and less for well-known ones.

Section 5.5

Off-policy Prediction via Importance Sampling

Learning about one policy while following another.

The elegant solution: separate learning from exploring!

What if we could learn about the optimal policy (which might be greedy) while following an exploratory policy? That's exactly what off-policy learning does! We use one policy to collect data (behavior policy) and learn about a different policy (target policy).

On-Policy

Learn from own actions


Same policy is used for both:

  • Generating experience (behavior)
  • Being evaluated/improved (target)
  • Simpler implementation
  • Lower variance
  • Must explore (ε-greedy)
  • Cannot learn from other agents

Example: SARSA, ε-greedy MC Control

Off-Policy

Learn from others' actions


Different policies:

  • Behavior (b): generates data
  • Target (π): being evaluated
  • Learn optimal policy directly
  • Can reuse experience from any policy
  • Learn from human demonstrations
  • Higher variance (importance sampling)

Example: Q-learning, Off-policy MC Control

Coverage Requirement

For off-policy learning to work, the behavior policy must have coverage: it must give nonzero probability to every action the target policy might take. Otherwise, we can't learn about those actions.

π(a|s) > 0 ⟹ b(a|s) > 0
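A tiny sketch of this check, assuming both policies are stored as nested dicts of action probabilities (an illustrative representation):

```python
def has_coverage(pi, b, states, actions):
    """True if b(a|s) > 0 for every (s, a) where pi(a|s) > 0."""
    return all(b[s][a] > 0
               for s in states
               for a in actions
               if pi[s][a] > 0)
```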

The Key Idea

We want to estimate vπ or qπ (target policy), but we only have data from b (behavior policy). We use importance sampling to reweight the returns.

Behavior Policy (b)

The policy we actually follow to collect data. Usually exploratory (like ε-greedy).

Target Policy (π)

The policy we want to learn about. Can be greedy (optimal)!

Real-World Analogy: Learning from Others

Imagine you're learning to trade stocks, but you're too cautious to make risky trades yourself. Instead, you watch a friend who takes more risks (behavior policy) and learn what would have happened if you had made those trades following your strategy (target policy). You're learning about your policy using someone else's experiences!

Section 5.6

Importance Sampling

Correcting for the difference between behavior and target policies.

Correcting for "wrong" experiences!

The problem: data from policy b doesn't directly tell us about policy π. Some trajectories that b often follows might be rare under π (and vice versa). Importance sampling fixes this by reweighting each experience based on how likely it would be under the target policy versus the behavior policy.

Importance Sampling Ratio

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}

📖 What Each Symbol Means:

ρt:T-1 = importance sampling ratio from time t to T-1 (a single number)
Π (product) = multiply all the terms together (like Σ but with multiplication)
π(Ak|Sk) = probability that target policy π would take action Ak in state Sk
b(Ak|Sk) = probability that behavior policy b took action Ak (what actually happened)
k = t to T-1 = from the current time step until one step before the terminal state

πŸ“ Expanding the Product (for 3 steps):

ρ = Ο€(At|St)/b(At|St) Γ— Ο€(At+1|St+1)/b(At+1|St+1) Γ— Ο€(At+2|St+2)/b(At+2|St+2)

πŸ”’ Concrete Example:

Step 1: Ο€ would choose this action with prob 0.8, b chose with prob 0.4 β†’ ratio = 2.0

Step 2: Ο€ would choose with prob 0.6, b chose with prob 0.3 β†’ ratio = 2.0

Step 3: Ο€ would choose with prob 0.1, b chose with prob 0.5 β†’ ratio = 0.2

Total ρ = 2.0 Γ— 2.0 Γ— 0.2 = 0.8

This trajectory is slightly less likely under Ο€ than under b (ρ < 1), so we down-weight its return.

The ratio tells us: "How much more or less likely would the target policy Ο€ have been to generate this exact sequence of actions compared to behavior policy b?"

Breaking It Down

For each action along a trajectory:

π(A|S) / b(A|S) = How much more or less likely would target π take this action compared to behavior b?

∏ (product) = Multiply all these ratios along the trajectory

🎯 Intuition: If π is more likely to take the same actions as b, the ratio is > 1 (upweight). If less likely, the ratio is < 1 (downweight).
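The product is just a running multiplication over the trajectory. This short sketch recomputes the three-step example above (the helper name is illustrative):

```python
def importance_ratio(pi_probs, b_probs):
    """Multiply the per-step ratios pi(A_k|S_k) / b(A_k|S_k) along one trajectory."""
    rho = 1.0
    for p_pi, p_b in zip(pi_probs, b_probs):
        rho *= p_pi / p_b
    return rho

# The concrete example above: per-step ratios 2.0, 2.0, 0.2.
print(importance_ratio([0.8, 0.6, 0.1], [0.4, 0.3, 0.5]))   # ~0.8
```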

Interactive Importance Sampling Demo

Behavior policy b: the policy we use to collect data. Target policy π: the policy we want to evaluate.

Importance Sampling Ratio
ρ = π(a|s) / b(a|s)

The ratio corrects for the difference between the two policies. When π(a) > b(a), we upweight that sample. When π(a) < b(a), we downweight it.

Example readout from the demo: true value 0.800, ordinary IS estimate 1.280, weighted IS estimate 0.941. Sample data (10 episodes): action A with r=1 and ρ=1.60 observed 8 times; action B with r=0 and ρ=0.40 observed 2 times.
Ordinary IS

Simple average of weighted returns. Unbiased but high variance.

Weighted IS

Normalize by sum of ratios. Biased but lower variance.

Try setting the policies very different (e.g., b=0.2, π=0.8) to see high variance in ordinary IS. Weighted IS is more stable but slightly biased.

Ordinary IS
V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}

T(s) = set of all times we visited state s

|T(s)| = count of visits (number of samples)

Σ ρG = sum of (ratio × return) for each visit

Unbiased but can have high variance

🎯 "Average the weighted returns": divide by the sample count

Weighted IS
V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}

Σ ρG = sum of (ratio × return), same as ordinary

Σ ρ = sum of the ratios themselves

Biased but lower variance

🎯 "Weighted average": divide by the sum of weights, not the count
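Both estimators are a few lines once you have the returns and their ratios. The sketch below recomputes the sample readout shown above (8 samples with ρ=1.6 and G=1, 2 samples with ρ=0.4 and G=0); the function names are illustrative.

```python
def ordinary_is(returns, ratios):
    """Ordinary importance sampling: average of rho * G over all samples."""
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)

def weighted_is(returns, ratios):
    """Weighted importance sampling: rho-weighted average of G."""
    total_weight = sum(ratios)
    if total_weight == 0:
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / total_weight

returns = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
ratios  = [1.6, 1.6, 0.4, 1.6, 1.6, 1.6, 1.6, 1.6, 0.4, 1.6]
print(ordinary_is(returns, ratios))   # ~1.28, overshoots the true value of 0.8
print(weighted_is(returns, ratios))   # ~0.941, closer and more stable
```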

πŸ“ Why These Formulas Work

The key insight is from probability theory. If we have samples from distribution b but want the expected value under Ο€:

EΟ€[G]=Eb[Ο€(trajectory)b(trajectory)β‹…G]=Eb[ρ⋅G]\mathbb{E}_\pi[G] = \mathbb{E}_b\left[\frac{\pi(trajectory)}{b(trajectory)} \cdot G\right] = \mathbb{E}_b[\rho \cdot G]

Multiplying by ρ "converts" expectations from b to Ο€!
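A quick numerical check of this identity, using a two-action setup in the spirit of the demo (assumed numbers: π picks action A with probability 0.8, b picks it with probability 0.5, and the return is 1 for A and 0 for B):

```python
import random

p_pi, p_b, n = 0.8, 0.5, 100_000
total = 0.0
for _ in range(n):
    took_a = random.random() < p_b                       # sample the action from b
    g = 1.0 if took_a else 0.0                           # return for this action
    rho = (p_pi / p_b) if took_a else ((1 - p_pi) / (1 - p_b))
    total += rho * g                                     # weight the return by rho
print(total / n)   # ~0.8 = E_pi[G], estimated from data generated by b alone
```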

Bias vs. Variance Trade-off

Ordinary IS
  • ✅ Unbiased (correct on average)
  • ❌ High variance (estimates can swing wildly)
  • → Better when you have lots of data
Weighted IS
  • ❌ Biased (systematically off)
  • ✅ Lower variance (more stable estimates)
  • → Better when data is limited
Section 5.7

Off-policy Monte Carlo Control

Learning the optimal policy while following an exploratory policy.

The best of both worlds!

Off-policy MC control lets us explore freely (using ε-greedy or any soft policy) while learning the truly optimal policy (which can be greedy/deterministic). We get guaranteed exploration AND can find the real optimal solution!

Off-policy MC Control Algorithm

1

Behavior Policy b

Any soft policy (e.g., ε-greedy)

💡 This is what we actually do, exploring included!

2

Target Policy π

Greedy with respect to Q

💡 This is what we learn about: the optimal policy!

3

Update Q with Importance Sampling

Weight returns by ρ to estimate qπ

💡 Correct our experiences to learn about the target policy
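Putting the three steps together, here is a sketch of off-policy MC control with incremental weighted importance sampling. For simplicity the behavior policy here is uniformly random (so b(a|s) = 1/|A| exactly), and `generate_episode_with_b` is an assumed helper that plays one full episode by calling the behavior policy at each state.

```python
import random
from collections import defaultdict

def off_policy_mc_control(states, actions, generate_episode_with_b,
                          num_episodes, gamma=1.0):
    """Learn a greedy target policy from episodes generated by a random behavior policy."""
    Q = defaultdict(float)
    C = defaultdict(float)                                   # cumulative IS weights
    target = {s: random.choice(actions) for s in states}     # greedy target policy

    def behavior(_state):                                    # uniform random: soft, has coverage
        return random.choice(actions)

    for _ in range(num_episodes):
        episode = generate_episode_with_b(behavior)          # [(s, a, r), ...]
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])   # weighted-IS update
            target[s] = max(actions, key=lambda x: Q[(s, x)])
            if a != target[s]:           # pi(a|s) = 0 for a greedy target: ratio hits 0
                break
            W *= len(actions)            # pi(a|s) = 1 divided by b(a|s) = 1/|A|
    return Q, target
```

Because the target is greedy, the ratio becomes zero as soon as the behavior deviates from it, so the backward loop stops early; off-policy MC therefore learns only from the tails of episodes, which is one reason for the slow learning noted below.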

Advantage of Off-policy

Off-policy MC control can learn the optimal policy directly, even while following an exploratory behavior policy. The target policy can be deterministic (greedy) because exploration is handled by the behavior policy.

Challenges

High Variance

Importance ratios can be very large, especially for long episodes.

💡 If π and b differ a lot, ρ can explode!

Slow Learning

Many episodes may be needed for stable estimates.

💡 Waiting for episodes to finish is inefficient

Spoiler: TD Methods Will Fix This!

The main drawback of MC is waiting for complete episodes. In the next chapter, you'll learn Temporal-Difference (TD) learning, which combines MC's model-free nature with DP's bootstrapping. Updates happen after every step, not just at episode end!

Section 5.8

Summary

Key takeaways from Chapter 5.

Quick Reference Glossary

Monte Carlo

Learning from complete episodes by averaging actual returns

First-Visit MC

Only use the first return for each state per episode

Every-Visit MC

Use every return from all visits to each state

ε-Greedy Policy

Pick the best action with prob 1-ε, a random action with prob ε

On-Policy

Learn about the policy you're following

Off-Policy

Learn about one policy while following another

Behavior Policy (b)

The policy we follow to collect data

Target Policy (π)

The policy we want to learn about

Importance Sampling

Reweighting returns to correct for policy mismatch

Exploring Starts

Every state-action pair has nonzero starting probability

Formula Cheat Sheet

ε-Greedy Policy:

\pi(a|s) = \begin{cases} 1 - \varepsilon + \varepsilon/|A| & a = \arg\max Q \\ \varepsilon/|A| & \text{otherwise} \end{cases}

Importance Sampling Ratio:

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}

Ordinary IS Value Estimate:

V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_t G_t}{|\mathcal{T}(s)|}

Weighted IS Value Estimate:

V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_t G_t}{\sum_{t \in \mathcal{T}(s)} \rho_t}

Common Mistakes (Avoid These!)

Wrong: "MC can be used for continuing tasks"

Right: MC requires episodes to end. For continuing tasks, use TD methods instead.

Wrong: "On-policy MC finds the optimal policy"

Right: On-policy Ξ΅-greedy MC finds the best Ξ΅-soft policy. For the true optimal policy, use off-policy methods.

What We Learned in This Chapter

Monte Carlo methods are our first model-free approach! We can learn optimal behavior just by experiencing episodes, with no need to know the environment's dynamics.

  • 🎲 MC Prediction: average returns to estimate values
  • 🎯 MC Control: use GPI with sampled returns
  • 🔄 ε-Greedy: balance exploration and exploitation
  • 📊 Importance Sampling: learn about one policy while following another

Monte Carlo methods learn from complete episodes without a model

MC prediction averages sample returns to estimate value functions

First-visit and every-visit MC both converge to the true values

MC control requires exploration: exploring starts or soft policies

On-policy methods learn about the policy being followed

Off-policy methods learn about one policy using data from another

Importance sampling corrects for the difference between policies

Weighted IS has lower variance but is biased; ordinary IS is unbiased

The Limitation: Waiting for Episodes to End

MC methods must wait until the end of an episode to update values. This is problematic when:

⏳ Episodes are very long (slow learning)
♾️ Tasks are continuing (no natural end point)
📉 Importance sampling ratios become huge for long trajectories

Looking Ahead

MC methods require complete episodes, which can be slow. In the next chapter, we'll explore Temporal-Difference learning, which combines the model-free nature of MC with the bootstrapping of DP, allowing updates after every step.