Monte Carlo Methods
Learn from complete episodes without knowing the environment model. MC methods average sample returns to estimate value functions.
Think of it like learning from experience!
Remember how Dynamic Programming needed a complete map of the environment? That's not realistic in the real world. Monte Carlo methods are different: they learn by actually doing things and seeing what happens. It's like learning to cook by actually cooking, not by reading a recipe book! You try something, see the result, and update your understanding.
Model-Free
No dynamics needed
Episodic
Complete episodes
Sample Averages
Unbiased estimates
The Big Picture: Learning from Episodes
Monte Carlo (MC) methods learn by simulating or experiencing complete episodes, then using the actual returns to update value estimates.
🎲 Why "Monte Carlo"?
The name comes from the famous casino in Monaco! Just like gambling involves randomness and averaging outcomes, MC methods use random sampling to estimate values.
🎯 The Core Idea
Monte Carlo Prediction
Estimating state values by averaging sample returns.
Given a policy π, estimate vπ(s) by averaging returns observed after visiting state s. No model of the environment is needed, just sample episodes.
Think of it like tracking restaurant visits!
Imagine you're trying to estimate which neighborhood has the best restaurants. Instead of reading reviews (the way DP uses a model), you actually eat at restaurants and track your satisfaction:
- Visit neighborhood A, have a great meal: +10 satisfaction
- Visit again later, mediocre meal: +5 satisfaction
- After many visits, average = (10+5+8+...)/n = 7.5
That average (7.5) is your value estimate for that state!
First-Visit MC
Average returns only from the first time s is visited in each episode.
💡 Like counting only your first impression of a restaurant
Every-Visit MC
Average returns from every visit to s in each episode.
💡 Like counting every meal, even if you visit twice in one day
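To make first-visit MC concrete, here is a minimal prediction sketch in Python. The episode format (a list of (state, reward) pairs) and the gamma argument are assumptions made for illustration, not part of any particular library.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns that follow the FIRST visit to s
    in each episode. `episodes` is assumed to be a list of episodes, each a list
    of (state, reward) pairs, where reward is received after leaving state."""
    returns_sum = defaultdict(float)   # sum of recorded returns per state
    returns_count = defaultdict(int)   # how many returns were recorded per state

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G is the return from each step onward.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: skip if this state already appeared earlier.
            if all(s != state for s, _ in episode[:t]):
                returns_sum[state] += G
                returns_count[state] += 1

    # Value estimate = sample mean of the recorded returns.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Every-visit MC: simply drop the first-visit check and record G at every step.
# Tiny example with two made-up episodes over states "A" and "B":
episodes = [[("A", 0), ("B", 2)], [("B", -1), ("A", 1)]]
print(first_visit_mc_prediction(episodes))   # {'B': 1.0, 'A': 1.5}
```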
Interactive First-Visit MC Demo
State Values (First-Visit MC)
First-Visit MC Update
Each state's value is the average of all returns observed after first visiting that state.
Recent Episodes
Watch how values converge as more episodes are sampled. States visited more often have more accurate estimates.
MC vs DP
Dynamic Programming
- Requires complete model
- Bootstraps (uses estimates)
- Updates all states
Like planning with a map
Monte Carlo
- Model-free (learns from samples)
- No bootstrapping (uses actual returns)
- Updates visited states only
Like learning by exploring
Why This Works (Law of Large Numbers)
As you collect more and more samples, the average of those samples converges to the true expected value. It's the same principle behind opinion polls: ask enough people, and you'll get a good estimate of what everyone thinks!
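The law of large numbers is easy to watch in action: draw noisy sampled returns around a known mean and print the running average. The return distribution below (mean 7.5, echoing the restaurant example) is made up purely for illustration.

```python
import random

random.seed(0)
true_mean = 7.5            # the "true" expected return we hope to recover
running_sum, n = 0.0, 0

for i in range(1, 10_001):
    sample = random.gauss(true_mean, 3.0)   # one noisy sampled return
    running_sum += sample
    n += 1
    if i in (10, 100, 1_000, 10_000):
        print(f"after {i:>6} samples: estimate = {running_sum / n:.3f}")

# The printed estimates drift closer and closer to 7.5 as n grows.
```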
For the Curious Mind
Monte Carlo Methods Were Born in the Manhattan Project!
The Monte Carlo method was developed during the Manhattan Project in the 1940s. Scientists like Stanislaw Ulam and John von Neumann used random sampling to simulate neutron diffusion, problems too complex for analytical solutions.
Why "Monte Carlo"?
Named after the famous casino in Monaco! The method involves random sampling, like rolls of dice or spins of a roulette wheel. Ulam's uncle frequently gambled there, inspiring the name.
MC + Neural Networks = Modern Game AI
Monte Carlo Tree Search (MCTS) powered AlphaGo. It combines MC sampling with tree search: "Simulate many random games from this position, then pick the move that wins most often." This is the heart of game-playing AI!
Monte Carlo Estimation of Action Values
When we don't have a model, we need action values for control.
Why do we need action values?
In DP, knowing state values was enough because we could "look ahead" with the model. But without a model, we're stuck! We need to know: "How good is it to take action A in state S?" That's what Q(s,a) tells us directly.
Without a model, state values alone are not enough for control. We need q(s,a) to know which action to take without computing expected values over unknown transitions.
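The practical payoff of having q(s,a): choosing an action is just an argmax over numbers we have already estimated, with no transition probabilities in sight. A minimal sketch, assuming Q is stored as a plain dict keyed by (state, action); the state labels below are hypothetical and only for illustration.

```python
def greedy_action(Q, state, actions):
    """Pick the action with the highest estimated value, no model required.
    Q is assumed to map (state, action) -> estimated return; missing entries
    default to 0.0."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example: estimates learned from episodes say "hit" looks better in this state.
Q = {(("player_14", "dealer_6"), "hit"): 0.12,
     (("player_14", "dealer_6"), "stick"): -0.21}
print(greedy_action(Q, ("player_14", "dealer_6"), ["hit", "stick"]))  # -> hit
```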
The Exploration Problem
If the policy is deterministic, many state-action pairs may never be visited. We need to ensure exploration.
🎯 The Problem: If you always eat at your favorite restaurant, you'll never discover an even better one! You need to explore sometimes to learn about all your options.
Exploring Starts
Every state-action pair has a nonzero probability of being selected as the start of an episode.
💡 Like randomly teleporting to different restaurants to start
Soft Policies
All actions have nonzero probability: π(a|s) > 0 for all s, a.
💡 Like occasionally trying a random menu item
Blackjack Example
Blackjack is a perfect example for MC methods! We don't need to know the exact probabilities of each card; we just play many hands and average the results.
Blackjack
State Space
- Player sum: 12-21 (10 values)
- Dealer showing: 1-10 (10 values)
- Usable ace: yes/no (2 values)
- Total: 200 states
MC methods learn the value of each state by averaging returns from many episodes. The state is (player sum, dealer showing, usable ace).
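A rough sketch of how this looks in code, using simplified blackjack rules (no doubling, splitting, or natural-blackjack bonus) and a fixed policy that sticks only on 20 or 21; those simplifications are assumptions made for illustration.

```python
import random
from collections import defaultdict

def draw_card():
    # 2-9 at face value, 10/J/Q/K all count as 10, ace drawn as 1 (upgraded below).
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (total, usable_ace): one ace counts as 11 if that doesn't bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(policy):
    """Play one simplified hand; return (visited states, final reward)."""
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    visited = []
    while True:
        total, usable = hand_value(player)
        if total > 21:
            return visited, -1                     # player busts
        state = (total, dealer[0], usable)         # (player sum, dealer showing, usable ace)
        if total >= 12:
            visited.append(state)                  # sums below 12 are trivial: always hit
        if total < 12 or policy(state) == "hit":
            player.append(draw_card())
        else:
            break                                  # player sticks
    while hand_value(dealer)[0] < 17:              # dealer hits until reaching 17
        dealer.append(draw_card())
    p, d = hand_value(player)[0], hand_value(dealer)[0]
    return visited, (1 if d > 21 or p > d else -1 if p < d else 0)

def stick_on_20(state):
    player_sum, _, _ = state
    return "stick" if player_sum >= 20 else "hit"

# Evaluate the fixed policy by averaging sampled returns per state.
returns = defaultdict(list)
for _ in range(50_000):
    visited, reward = play_episode(stick_on_20)
    for s in visited:                              # each state appears at most once per hand
        returns[s].append(reward)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V.get((20, 10, False)))   # clearly positive: holding 20 against a dealer showing 10
```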
Monte Carlo Control
Using MC for both prediction and improvement.
From prediction to control!
So far we've learned to evaluate policies. Now we want to improve them. The recipe is the same as DP's GPI: evaluate → improve → repeat. The difference is we use sampled returns instead of expected values.
MC Control with Exploring Starts
Generate Episode
Starting from a random (s, a), follow π
💡 Randomly pick a starting state and action, then play it out
Update Q Values
Q(s,a) ← average(Returns(s,a))
💡 For each state-action pair visited, update its average return
Improve Policy
π(s) ← argmax_a Q(s,a)
💡 In each state, switch to the action with the highest Q value
MC control follows the same GPI pattern as DP: evaluate the policy, then improve it greedily. The difference is we use sampled returns instead of expected values.
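Putting the three steps together, here is a schematic sketch of MC control with exploring starts. The env object and its methods (random_state_action, step, actions) are hypothetical placeholders for this sketch, not a real API.

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(env, num_episodes, gamma=1.0):
    """MC control with exploring starts: GPI driven by sampled returns.
    Assumed interface: env.random_state_action() -> (state, action),
    env.step(s, a) -> (next_state, reward, done), env.actions(s) -> legal actions."""
    Q = defaultdict(float)     # Q[(s, a)]: running average of observed returns
    N = defaultdict(int)       # visit counts for the running average
    policy = {}                # greedy policy, filled in as states are seen

    for _ in range(num_episodes):
        # 1. Generate an episode starting from a random state-action pair.
        s, a = env.random_state_action()
        episode, done = [], False
        while not done:
            s_next, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = s_next
            if not done:
                a = policy.get(s, random.choice(env.actions(s)))  # follow current policy

        # 2. Evaluate: update Q toward the return following each first visit.
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if all((s_t, a_t) != (x, y) for x, y, _ in episode[:t]):
                N[(s_t, a_t)] += 1
                Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / N[(s_t, a_t)]  # running mean
                # 3. Improve: make the policy greedy in this state.
                policy[s_t] = max(env.actions(s_t), key=lambda b: Q[(s_t, b)])

    return policy, Q
```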
The MC Control Loop
Generate
Play episode
Evaluate
Update Q
Improve
Greedy π
Repeat many times
Convergence
MC control with exploring starts converges to the optimal policy π* under mild conditions. The key is that all state-action pairs must be visited infinitely often.
⚠️ The catch: "Exploring starts" is unrealistic in many real applications. You can't always teleport to arbitrary states!
Monte Carlo Control without Exploring Starts
Using soft policies to ensure exploration.
A more practical solution!
Since we can't always start from random states, we need another way to explore. The solution: sometimes take random actions! This is called an ε-greedy policy: mostly do what you think is best, but occasionally try something random.
A policy is ε-soft if π(a|s) ≥ ε/|A| for all states and actions. This guarantees all actions are tried with at least probability ε/|A|.
ε-Greedy Policy
π(a|s) = 1 - ε + ε/|A| for the greedy action (the argmax of Q(s,a)), and ε/|A| for every other action.
What Each Symbol Means:
- ε = the exploration rate (a small number like 0.1)
- |A| = the number of available actions
- 1 - ε + ε/|A| = probability of the currently-best action (exploit, plus the chance the random pick lands on it anyway)
Concrete Example (ε = 0.1, 4 actions):
Best action: 1 - 0.1 + 0.1/4 = 0.925 (92.5% chance)
Other actions: 0.1/4 = 0.025 each (2.5% chance)
Check: 0.925 + 3×0.025 = 1.0 ✓ (probabilities sum to 1)
With probability 1-ε, take the greedy action. With probability ε, take a random action. This ensures continued exploration while mostly exploiting.
In Plain English
Let's say ε = 0.1 (10%) and you have 4 possible actions:
90% of the time
Take the action that currently looks best (exploit your knowledge)
10% of the time
Pick a completely random action (explore alternatives)
This way, every action gets tried eventually, but you still mostly act intelligently!
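An ε-greedy chooser is only a few lines. The dict-based Q and the actions list are assumptions made for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick the action with the highest estimated value (exploit).
    Q is assumed to map (state, action) -> estimated return."""
    if random.random() < epsilon:
        return random.choice(actions)                         # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0)) # exploit

# Note: the random branch can also land on the greedy action, so the greedy
# action's total probability is 1 - epsilon + epsilon/|A| (0.925 in the example
# above), which matches the epsilon-soft definition.
```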
Exploration
Random actions (ε) ensure we keep visiting all state-action pairs.
Exploitation
Greedy actions (1-ε) use our current knowledge to maximize reward.
On-policy ε-greedy MC converges to the best ε-soft policy, not the optimal policy. We trade optimality for guaranteed exploration. To find the true optimal policy, we need off-policy methods.
The Trade-off
We can't find the truly optimal policy with ε-greedy because we always have that ε chance of doing something random. But we'll find the best policy among all policies that have at least ε exploration. Often this is good enough in practice!
For the Curious Mind
The Explore-Exploit Dilemma is Everywhere!
Should you go to your favorite restaurant (exploit) or try a new one (explore)? Should a company invest in existing products or R&D? This dilemma appears in dating, job searching, A/B testing, and even how bacteria search for food!
Better Exploration: UCB and Thompson Sampling
ε-greedy is simple but "dumb": it explores randomly, even for actions it's already sure about. Advanced methods like UCB (Upper Confidence Bound) and Thompson Sampling explore more intelligently, spending more exploration on uncertain actions and less on well-known ones.
Off-policy Prediction via Importance Sampling
Learning about one policy while following another.
The elegant solution: separate learning from exploring!
What if we could learn about the optimal policy (which might be greedy) while following an exploratory policy? That's exactly what off-policy learning does! We use one policy to collect data (behavior policy) and learn about a different policy (target policy).
On-Policy
Learn from own actions
Same policy is used for both:
- Generating experience (behavior)
- Being evaluated/improved (target)
Example: SARSA, ε-greedy MC Control
Off-Policy
Learn from others' actions
Different policies:
- Behavior (b): generates data
- Target (π): being evaluated
Example: Q-learning, Off-policy MC Control
Coverage Requirement
For off-policy learning to work, the behavior policy must have coverage: it must take all actions that the target policy might take. Otherwise, we can't learn about those actions.
The Key Idea
We want to estimate vπ or qπ (target policy) but we only have data from b (behavior policy). We use importance sampling to reweight the returns.
Behavior Policy (b)
The policy we actually follow to collect data. Usually exploratory (like ε-greedy).
Target Policy (π)
The policy we want to learn about. Can be greedy (optimal)!
Real-World Analogy: Learning from Others
Imagine you're learning to trade stocks, but you're too cautious to make risky trades yourself. Instead, you watch a friend who takes more risks (behavior policy) and learn what would have happened if you had made those trades following your strategy (target policy). You're learning about your policy using someone else's experiences!
Importance Sampling
Correcting for the difference between behavior and target policies.
Correcting for "wrong" experiences!
The problem: data from policy b doesn't directly tell us about policy Ο. Some trajectories that b often follows might be rare under Ο (and vice versa). Importance sampling fixes this by reweighting each experience based on how likely it would be under the target policy vs. behavior policy.
Importance Sampling Ratio
ρ = ∏ over every step k of the episode of π(Ak|Sk) / b(Ak|Sk)
What Each Symbol Means:
- ρ = the importance sampling ratio for one trajectory
- π(Ak|Sk) = probability the target policy would take the action actually taken at step k
- b(Ak|Sk) = probability the behavior policy took that action
Expanding the Product (for 3 steps):
ρ = π(At|St)/b(At|St) × π(At+1|St+1)/b(At+1|St+1) × π(At+2|St+2)/b(At+2|St+2)
Concrete Example:
Step 1: π would choose this action with prob 0.8, b chose it with prob 0.4 → ratio = 2.0
Step 2: π would choose with prob 0.6, b chose with prob 0.3 → ratio = 2.0
Step 3: π would choose with prob 0.1, b chose with prob 0.5 → ratio = 0.2
Total ρ = 2.0 × 2.0 × 0.2 = 0.8
This trajectory is slightly less likely under π than under b (ρ < 1), so we down-weight its return.
The ratio tells us: "How much more or less likely would the target policy π have been to generate this exact sequence of actions compared to behavior policy b?"
Breaking It Down
For each action along a trajectory:
π(A|S) / b(A|S) = how much more (or less) likely the target π is to take this action compared to the behavior b
∏ (product) = multiply all these per-step ratios along the trajectory
🎯 Intuition: If π is more likely than b to take the actions that were actually taken, the ratio is > 1 (upweight). If less likely, the ratio is < 1 (downweight).
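Computing the ratio for one trajectory is just a running product of per-step probabilities. A small sketch, with the three-step example from above wired in.

```python
def importance_ratio(target_probs, behavior_probs):
    """Product of per-step ratios pi(A_t|S_t) / b(A_t|S_t) along one trajectory.
    Both arguments hold the probabilities each policy assigns to the actions
    that were actually taken."""
    rho = 1.0
    for p_target, p_behavior in zip(target_probs, behavior_probs):
        rho *= p_target / p_behavior
    return rho

# The three-step example from the text:
print(importance_ratio([0.8, 0.6, 0.1], [0.4, 0.3, 0.5]))   # 2.0 * 2.0 * 0.2 = 0.8
```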
Interactive Importance Sampling Demo
Policy we use to collect data
Policy we want to evaluate
Importance Sampling Ratio
The ratio corrects for the difference between the two policies. When π(a) > b(a), we upweight that sample. When π(a) < b(a), we downweight it.
Sample Data
Ordinary IS
Simple average of weighted returns. Unbiased but high variance.
Weighted IS
Normalize by sum of ratios. Biased but lower variance.
Try setting the policies very different (e.g., b=0.2, π=0.8) to see high variance in ordinary IS. Weighted IS is more stable but slightly biased.
Ordinary IS
V(s) = Σ ρG / |T(s)|, summing over the visits t ∈ T(s)
T(s) = set of all times we visited state s
|T(s)| = count of visits (number of samples)
Σ ρG = sum of (ratio × return) for each visit
Unbiased but can have high variance
🎯 "Average the weighted returns": divide by the sample count
Weighted IS
V(s) = Σ ρG / Σ ρ
Σ ρG = sum of (ratio × return), same as ordinary
Σ ρ = sum of the ratios themselves
Biased but lower variance
🎯 "Weighted average": divide by the sum of the weights, not the count
Why These Formulas Work
The key insight comes from probability theory. If we have samples collected under b but want the expected value under π:
E_b[ρ·G] = E_π[G]
Multiplying by ρ "converts" expectations from b to π!
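Given a list of (ρ, G) pairs collected for one state, the two estimators differ only in the denominator. A short sketch with made-up numbers.

```python
def ordinary_is(samples):
    """Ordinary importance sampling: average of rho * G over all samples.
    `samples` is a list of (rho, G) pairs for one state. Unbiased, high variance."""
    return sum(rho * G for rho, G in samples) / len(samples)

def weighted_is(samples):
    """Weighted importance sampling: divide by the sum of the ratios instead of
    the sample count. Biased, but much lower variance."""
    total_weight = sum(rho for rho, _ in samples)
    return sum(rho * G for rho, G in samples) / total_weight if total_weight else 0.0

# Hypothetical returns observed under b, paired with their importance ratios:
samples = [(0.8, 1.0), (2.0, -1.0), (0.2, 1.0), (1.5, 1.0)]
print(ordinary_is(samples))   # (0.8 - 2.0 + 0.2 + 1.5) / 4   = 0.125
print(weighted_is(samples))   # 0.5 / 4.5                     ~ 0.111
```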
Bias vs. Variance Trade-off
Ordinary IS
- ✓ Unbiased (correct on average)
- ✗ High variance (estimates can swing wildly)
- ✓ Better when you have lots of data
Weighted IS
- ✗ Biased (systematically off)
- ✓ Lower variance (more stable estimates)
- ✓ Better when data is limited
Off-policy Monte Carlo Control
Learning the optimal policy while following an exploratory policy.
The best of both worlds!
Off-policy MC control lets us explore freely (using ε-greedy or any soft policy) while learning the truly optimal policy (which can be greedy/deterministic). We get guaranteed exploration AND can find the real optimal solution!
Off-policy MC Control Algorithm
Behavior Policy b
Any soft policy (e.g., ε-greedy)
💡 This is what we actually do, exploring included!
Target Policy π
Greedy with respect to Q
💡 This is what we learn about: the optimal policy!
Update Q with Importance Sampling
Weight returns by ρ to estimate qπ
💡 Correct our experiences to learn about the target policy
Off-policy MC control can learn the optimal policy directly, even while following an exploratory behavior policy. The target policy can be deterministic (greedy) because exploration is handled by the behavior policy.
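A schematic sketch of this algorithm using the incremental weighted-importance-sampling update (a cumulative weight C(s,a) rather than storing every return). For simplicity the behavior policy here is uniformly random, which is one valid soft policy; generate_episode and actions are hypothetical placeholders assumed for this sketch.

```python
import random
from collections import defaultdict

def off_policy_mc_control(generate_episode, actions, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling, incremental form.
    `generate_episode(policy)` is assumed to play one episode with the given
    behavior policy and return a list of (state, action, reward) triples;
    `actions(s)` lists the legal actions in state s."""
    Q = defaultdict(float)    # estimates for the greedy target policy
    C = defaultdict(float)    # cumulative importance weights per (state, action)
    target = {}               # the deterministic (greedy) target policy

    def behavior(s):
        return random.choice(actions(s))          # b(a|s) = 1/|A(s)| for every action

    for _ in range(num_episodes):
        episode = generate_episode(behavior)
        G, W = 0.0, 1.0                           # return and importance weight
        for s, a, r in reversed(episode):         # work backwards from the end
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(actions(s), key=lambda x: Q[(s, x)])
            if a != target[s]:
                break                             # pi(a|s) = 0, so rho = 0 from here back
            W *= len(actions(s))                  # pi(a|s)=1, b(a|s)=1/|A|: ratio = |A|
    return target, Q
```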
Challenges
High Variance
Importance ratios can be very large, especially for long episodes.
💡 If π and b differ a lot, ρ can explode!
Slow Learning
Many episodes may be needed for stable estimates.
💡 Waiting for episodes to finish is inefficient
Spoiler: TD Methods Will Fix This!
The main drawback of MC is waiting for complete episodes. In the next chapter, you'll learn Temporal-Difference (TD) learning, which combines MC's model-free nature with DP's bootstrapping. Updates happen after every step, not just at episode end!
Summary
Key takeaways from Chapter 5.
Quick Reference Glossary
Monte Carlo
Learning from complete episodes by averaging actual returns
First-Visit MC
Only use the first return for each state per episode
Every-Visit MC
Use every return from all visits to each state
ε-Greedy Policy
Pick best action with prob 1-ε, random with prob ε
On-Policy
Learn about the policy you're following
Off-Policy
Learn about one policy while following another
Behavior Policy (b)
The policy we follow to collect data
Target Policy (π)
The policy we want to learn about
Importance Sampling
Reweighting returns to correct for policy mismatch
Exploring Starts
Every state-action pair has nonzero starting probability
Formula Cheat Sheet
ε-Greedy Policy: π(a|s) = 1 - ε + ε/|A| for the greedy action, ε/|A| for every other action
Importance Sampling Ratio: ρ = ∏ over steps of π(Ak|Sk) / b(Ak|Sk)
Ordinary IS Value Estimate: V(s) = Σ ρG / |T(s)|
Weighted IS Value Estimate: V(s) = Σ ρG / Σ ρ
Common Mistakes (Avoid These!)
Wrong: "MC can be used for continuing tasks"
Right: MC requires episodes to end. For continuing tasks, use TD methods instead.
Wrong: "On-policy MC finds the optimal policy"
Right: On-policy Ξ΅-greedy MC finds the best Ξ΅-soft policy. For the true optimal policy, use off-policy methods.
What We Learned in This Chapter
Monte Carlo methods are our first model-free approach! We can learn optimal behavior just by experiencing episodes β no need to know the environment's dynamics.
- 🎲 MC Prediction: Average returns to estimate values
- 🎯 MC Control: Use GPI with sampled returns
- ε-Greedy: Balance exploration and exploitation
- Importance Sampling: Learn about one policy while following another
Monte Carlo methods learn from complete episodes without a model
MC prediction averages sample returns to estimate value functions
First-visit and every-visit MC both converge to the true values
MC control requires exploration: exploring starts or soft policies
On-policy methods learn about the policy being followed
Off-policy methods learn about one policy using data from another
Importance sampling corrects for the difference between policies
Weighted IS has lower variance but is biased; ordinary IS is unbiased
The Limitation: Waiting for Episodes to End
MC methods must wait until the end of an episode to update values. This is problematic when episodes are very long, or when the task is continuing and never ends.
Looking Ahead
MC methods require complete episodes, which can be slow. In the next chapter, we'll explore Temporal-Difference learning, which combines the model-free nature of MC with the bootstrapping of DP, allowing updates after every step.