n-step Bootstrapping
Bridge the gap between TD(0) and Monte Carlo. Choose how many steps to look ahead, trading off bias and variance.
What if we could get the best of both worlds?
TD(0) updates after 1 step — fast but potentially biased. Monte Carlo waits for the entire episode — unbiased but slow and high variance. What if we could pick any number in between? That's exactly what n-step methods do! They let you choose how far to look ahead: 2 steps, 5 steps, 10 steps... it's like having a dial that lets you tune the bias-variance tradeoff.
n-step Returns
Flexible lookahead
Bias-Variance
Tunable tradeoff
Unified View
TD ↔ MC spectrum
The Big Picture: A Spectrum of Algorithms
TD(0) (n = 1): high bias, low variance
n-step (the sweet spot): balanced bias and variance
Monte Carlo (n = ∞): no bias, high variance
n-step TD Prediction
Using multiple steps of reward before bootstrapping.
Think of it like checking your GPS!
TD(0): You check your GPS every block to see if you're on track. Quick updates, but you might overcorrect.
Monte Carlo: You wait until you arrive at your destination to see how the trip went. Very accurate, but you don't learn until you're done.
n-step: You check every n blocks — maybe every 3 or 5 blocks. Not too frequent, not too late. Just right!
G_{t:t+n} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})
📖 What Each Symbol Means:
G_{t:t+n} = the n-step return starting at time t
R_{t+1}, ..., R_{t+n} = the n rewards actually observed
γ = the discount factor (γ^n discounts the bootstrap by n steps)
V(S_{t+n}) = the current value estimate of the state reached n steps later
📐 For n=3, This Expands To:
G_{t:t+3} = R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³V(S_{t+3})
3 actual rewards + discounted bootstrap from 3 steps ahead
Use the first n rewards, then bootstrap from V(S_{t+n}). When n = 1, this is TD(0). When n = ∞, this is Monte Carlo.
Breaking Down the n-step Return
R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} = the sum of the actual rewards for n steps
γ^n V(S_{t+n}) = then bootstrap (estimate) from where you end up
🎯 In plain English: "Add up the actual rewards for n steps, then use your estimate of how much more you'll get from that point on."
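To make the formula concrete, here is a minimal Python sketch of the n-step return. The function name and arguments (`n_step_return`, `rewards`, `v_bootstrap`) are illustrative choices, not notation from the book; it simply sums the n discounted rewards and adds the discounted bootstrap term.

```python
def n_step_return(rewards, v_bootstrap, gamma, n):
    """G_{t:t+n}: n discounted rewards plus a discounted bootstrap.

    rewards     -- the n observed rewards R_{t+1}, ..., R_{t+n}
    v_bootstrap -- V(S_{t+n}), the value estimate of the state reached after n steps
    gamma       -- discount factor
    n           -- how many steps to look ahead
    """
    g = sum(gamma**k * rewards[k] for k in range(n))  # actual rewards
    return g + gamma**n * v_bootstrap                 # bootstrap term

# The n=3 case from above: three rewards, then bootstrap from V(S_{t+3})
print(n_step_return([1.0, 0.0, 2.0], v_bootstrap=0.5, gamma=0.9, n=3))  # 2.9845
```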
The n-step Spectrum
Bias: with small n, most of the target comes from the bootstrapped estimate V, so bias is high.
Variance: with small n, few random rewards enter the target, so variance is low and updates are stable.
n-step Return Target: as n → ∞, the bootstrap term vanishes (the episode reaches a terminal state first), and the target becomes the pure Monte Carlo return.
Concrete Examples
n=1 (TD(0))
G = R_1 + γV(S_1)
Just one reward, then bootstrap immediately
n=3
G = R_1 + γR_2 + γ²R_3 + γ³V(S_3)
Three actual rewards, then bootstrap
n=∞ (Monte Carlo)
G = R_1 + γR_2 + γ²R_3 + ... (until the episode ends)
All rewards, no bootstrapping
Interactive n-step TD Demo
(Interactive demo: adjust n to see how the n-step TD values and the n-step return change; shown here for n = 3.)
Try different n values: n = 1 is TD(0), and larger n approaches Monte Carlo. Higher n has more variance but less bias.
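If you want to experiment outside the demo, here is a rough sketch of tabular n-step TD prediction. It assumes a hypothetical `env` with `reset()` returning a state and `step(action)` returning `(state, reward, done)`, plus a `policy(state)` function; those names are assumptions for illustration, not a real library API.

```python
from collections import defaultdict

def n_step_td_prediction(env, policy, n=3, alpha=0.1, gamma=0.9, episodes=500):
    """Tabular n-step TD prediction sketch.

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (state, reward, done), and policy(state) -> action.
    """
    V = defaultdict(float)                       # value estimates, default 0
    for _ in range(episodes):
        states = [env.reset()]                   # S_0
        rewards = [0.0]                          # dummy entry so rewards[i] = R_i
        T = float('inf')                         # episode length, unknown at first
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # the time whose value gets updated
            if tau >= 0:
                # n-step return: up to n real rewards ...
                G = sum(gamma**(i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                # ... plus a bootstrap, unless the episode ended first
                if tau + n < T:
                    G += gamma**n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

Note how the update for time tau happens n steps later, and how the bootstrap term is dropped when the episode terminates within the n-step window.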
For the Curious Mind
The "Sweet Spot" Depends on Your Problem!
In Atari games, n=3 to n=5 often works best. In robotics with delayed rewards, larger n helps. There's no universal answer — that's why hyperparameter tuning matters!
What if we use ALL values of n at once?
Great question! That's exactly what TD(λ) does in Chapter 12. It computes a weighted average of 1-step, 2-step, 3-step... returns. The parameter λ controls the weighting. It's one of the most elegant ideas in RL!
n-step Ideas in Modern Deep RL
Google DeepMind's A3C algorithm uses n-step returns (typically n=5 or n=20). OpenAI's PPO uses Generalized Advantage Estimation (GAE), which is essentially a weighted combination of n-step estimates. These fundamentals power state-of-the-art AI!
n-step Sarsa
Extending Sarsa to use n-step returns.
Same idea, but for action values!
Remember SARSA from Chapter 6? It updated Q(s, a) after each step. n-step Sarsa is the same idea, but now we wait n steps before making an update. This gives us the same bias-variance tradeoff, but for learning which actions are good.
n-step Sarsa Update
Q(S_t, A_t) ← Q(S_t, A_t) + α [G_{t:t+n} − Q(S_t, A_t)]
📖 What Each Symbol Means:
Q(S_t, A_t) = the value estimate for the state-action pair being updated
α = the step size (learning rate)
G_{t:t+n} = the n-step return for action values
where the n-step return for action values is:
G_{t:t+n} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})
📐 Comparing to Regular (1-step) Sarsa:
1-step: G = R_{t+1} + γQ(S_{t+1}, A_{t+1})
n-step: G = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})
More actual rewards, and the bootstrap happens further into the future!
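As a rough sketch (assuming a tabular Q stored in a plain dict keyed by (state, action); the function names are illustrative), the n-step Sarsa target and update look like this:

```python
def n_step_sarsa_target(rewards, q_bootstrap, gamma):
    """G = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n Q(S_{t+n}, A_{t+n}).

    rewards     -- the n observed rewards
    q_bootstrap -- Q(S_{t+n}, A_{t+n}), the action-value estimate n steps ahead
    """
    n = len(rewards)
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    return g + gamma**n * q_bootstrap

def n_step_sarsa_update(Q, s, a, target, alpha=0.1):
    """Move Q(s, a) a step of size alpha toward the n-step return."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

In a full control loop you would collect n transitions (acting, say, ε-greedily with respect to Q), compute this target, update the state-action pair from n steps ago, and slide the window forward one step at a time.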
Standard Sarsa (n = 1): low variance, high bias. Updates quickly, but might be wrong.
n-step Sarsa (intermediate n): balanced. Often works well; the sweet spot for many problems!
Monte Carlo (n = ∞): no bias, high variance. Accurate but slow to converge.
The optimal n depends on the problem. Intermediate values (n=4 to n=8) often work best, balancing the benefits of multiple reward samples against increasing variance.
Practical Advice
- • Start with n=4 or n=8 — they work well for many problems
- • If learning is unstable, try smaller n (more bias, but more stable)
- • If learning is slow, try larger n (less bias, faster learning)
- • You can even try different n values and average them (coming up in Chapter 12!)
n-step Off-policy Learning
Importance sampling for n-step methods.
Combining n-step with off-policy learning
Remember how off-policy learning lets us learn about one policy while following another? We can do that with n-step methods too! But there's a catch: the importance sampling ratios get multiplied over n steps, which can make them really big or really small.
Off-policy n-step TD
V(S_t) ← V(S_t) + α ρ_{t:t+n−1} [G_{t:t+n} − V(S_t)]
📖 What Each Symbol Means:
ρ_{t:t+n−1} = the importance sampling ratio over the n actions taken
π = the target policy (the one we want to learn about)
b = the behavior policy (the one actually choosing actions)
The importance sampling ratio for n steps is:
ρ_{t:t+n−1} = ∏ π(A_k | S_k) / b(A_k | S_k), taken over k = t, ..., t+n−1
📐 For n=3, This Expands To:
ρ = [π(A_t|S_t)/b(A_t|S_t)] × [π(A_{t+1}|S_{t+1})/b(A_{t+1}|S_{t+1})] × [π(A_{t+2}|S_{t+2})/b(A_{t+2}|S_{t+2})]
Product of 3 ratios, one for each action in the n-step trajectory
⚠️ Why This Can Be Problematic:
Each ratio might be 2 or 0.5 or any number. When you multiply n of them together, the result can become very large (ρ = 32) or very small (ρ = 0.03). This causes high variance!
The Variance Problem
When we multiply n importance sampling ratios together, things can get out of hand:
Example with n=5, where each ratio is 2:
ρ = 2 × 2 × 2 × 2 × 2 = 32
The update gets scaled by 32! This causes huge variance.
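A tiny sketch of the ratio product (the policy functions `pi(a, s)` and `b(a, s)` are hypothetical callables that return action probabilities):

```python
import math

def n_step_is_ratio(actions, states, pi, b):
    """ρ_{t:t+n-1}: product of per-step ratios π(A_k | S_k) / b(A_k | S_k)."""
    return math.prod(pi(a, s) / b(a, s) for a, s in zip(actions, states))

# The n=5 example from above: every per-step ratio equals 2
print(math.prod([2.0] * 5))   # 32.0    -> the update is scaled 32x
print(math.prod([0.5] * 5))   # 0.03125 -> the update nearly vanishes
```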
The Problem
Importance sampling ratios can grow exponentially with n, leading to very high variance. The product of n ratios can be extremely large or small.
Solutions
- • Per-decision importance sampling
- • Control variates
- • Tree backup (no IS needed!)
Per-decision Methods with Control Variates
Reducing variance in off-policy learning.
A smarter way to apply importance sampling!
Instead of multiplying all the ratios together at once (which causes huge variance), we can be smarter: apply each ratio only to the rewards that come after that action. This way, later ratios don't affect earlier rewards, and variance stays under control.
Instead of applying a single ratio ρ to the entire return, apply the appropriate ratio at each step. This reduces variance because later ratios only affect later rewards.
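Here is a sketch of that per-decision form, written as a recursion of the shape ρ_k (R_{k+1} + γ G_next) + (1 − ρ_k) V(S_k); the container names and indexing convention are assumptions made for this example, not the book's exact notation.

```python
def per_decision_return(k, n, rewards, states, rhos, V, gamma):
    """Per-decision importance-sampled return (with a control variate), recursive form:
    G_k = ρ_k (R_{k+1} + γ G_{k+1}) + (1 - ρ_k) V(S_k).

    Assumed indexing: states[k] = S_{t+k}, rewards[k] = R_{t+k+1},
    rhos[k] = π(A_{t+k} | S_{t+k}) / b(A_{t+k} | S_{t+k}); V is a dict of estimates.
    """
    if k == n:
        return V[states[n]]                  # bootstrap at the horizon
    g_next = per_decision_return(k + 1, n, rewards, states, rhos, V, gamma)
    return rhos[k] * (rewards[k] + gamma * g_next) + (1.0 - rhos[k]) * V[states[k]]
```

Notice that the ratio at step k never multiplies rewards earlier than R_{k+1}, and when some ρ_k = 0 the return falls back to V(S_k) instead of collapsing the whole target to zero.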
Control Variates
We can further reduce variance by using the action-value function as a control variate: we add a correction term whose expected value is zero, so it introduces no bias, but it cancels much of the random variation.
Standard IS
ρ × (full return)
High variance when ρ is large
With Control Variate
ρ × (return - Q) + Q
Lower variance (ρ only scales the deviation)
Why Control Variates Help
Think of it like this: instead of scaling the entire return by ρ, we only scale the difference between the return and our estimate Q. When Q is accurate, this difference is small, so even a large ρ doesn't cause huge updates.
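A small numeric sketch of the difference (the function and variable names are illustrative; `q` stands for the current estimate used as the control variate):

```python
def is_update(q, G, rho, alpha=0.1):
    """Ordinary importance sampling: the whole return is scaled by ρ."""
    return q + alpha * (rho * G - q)

def is_update_control_variate(q, G, rho, alpha=0.1):
    """Control-variate form: only the deviation (G - q) is scaled by ρ."""
    target = rho * (G - q) + q      # equals ρ·G + (1 - ρ)·q
    return q + alpha * (target - q)

# With an accurate estimate (q ≈ G), a large ρ barely moves the control-variate update
print(is_update(q=5.0, G=5.1, rho=32.0))                  # 20.82 -- a huge jump
print(is_update_control_variate(q=5.0, G=5.1, rho=32.0))  # 5.32  -- a sensible step
```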
Tree Backup Algorithm
Off-policy learning without importance sampling.
What if we could avoid importance sampling entirely?
The Tree Backup algorithm is clever: instead of sampling actions and then correcting with importance sampling, it takes expectations over all possible actions at each step. Since we're averaging rather than sampling, we don't need to correct for anything — no importance sampling required!
Tree backup avoids importance sampling entirely by taking expectations over all possible actions at each step, weighted by the target policy. Only the state transitions are sampled.
Tree Backup Visualization
3-step Tree Backup Diagram
Tree Backup Algorithm
Tree backup doesn't require importance sampling for off-policy learning. Instead, it backs up the expected value of all actions at each state, weighted by the target policy probabilities.
Key Insight
- ●Only sample state transitions (environment)
- ●Average over all actions (policy expectation)
- ●No importance sampling needed!
3-step Tree Backup Target
The tree-backup return is defined recursively:
G_{t:t+n} = R_{t+1} + γ Σ_{a ≠ A_{t+1}} π(a | S_{t+1}) Q(S_{t+1}, a) + γ π(A_{t+1} | S_{t+1}) G_{t+1:t+n}
with a plain expected backup over all actions at the final step. Unrolled for three steps, the backup extends to the full tree of possible futures, using the target policy to weight each branch.
Why No Importance Sampling?
Tree backup uses the target policy π to weight all action values at each state. Since we're taking an expectation over actions (not sampling), we don't need to correct for the behavior policy. The variance is much lower than n-step Sarsa with importance sampling.
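Here is a sketch of the tree-backup target built recursively from the last step inward. The containers (`rewards`, `states`, `actions`), the dict-style `Q`, and the probability function `pi(a, s)` are assumed names for one sampled n-step segment.

```python
def tree_backup_target(k, n, rewards, states, actions, Q, pi, action_space, gamma):
    """Tree-backup return G_{t+k:t+n}, built recursively.

    Assumed indexing: rewards[i] = R_{t+i+1}, states[i] = S_{t+i+1},
    actions[i] = A_{t+i+1}; Q is a dict keyed by (state, action);
    pi(a, s) gives the target-policy probability. Call with k = 0 for the full target.
    """
    r, s = rewards[k], states[k]
    if k == n - 1:
        # Last step: a plain expected backup over every action.
        return r + gamma * sum(pi(a, s) * Q[(s, a)] for a in action_space)
    a_taken = actions[k]
    # Untaken actions contribute their current estimates, weighted by π ...
    expected_rest = sum(pi(a, s) * Q[(s, a)]
                        for a in action_space if a != a_taken)
    # ... while the taken action's branch continues along the sampled trajectory.
    continued = pi(a_taken, s) * tree_backup_target(
        k + 1, n, rewards, states, actions, Q, pi, action_space, gamma)
    return r + gamma * (expected_rest + continued)
```

Because the weights come from the target policy π itself, no importance sampling correction is needed anywhere in this computation.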
Comparison: Sampling vs. Expectation
n-step Sarsa (Sampling)
- • Uses the action we actually took
- • Needs importance sampling for off-policy
- • Can have high variance
Tree Backup (Expectation)
- • Averages over all possible actions
- • No importance sampling needed!
- • Lower variance
A Unifying Algorithm: n-step Q(σ)
Interpolating between Sarsa and Tree Backup.
The ultimate unification!
What if you could smoothly blend between sampling (Sarsa-style) and taking expectations (Tree Backup-style)? Q(σ) does exactly that! The parameter σ at each step controls how much to sample vs. average. It's like having yet another dial to tune your algorithm.
The σ Parameter
At each step, σ ∈ [0, 1] controls the degree of sampling vs. expectation:
σ = 1: Full sampling (Sarsa). Use the action we actually took.
0 < σ < 1: Mixture. Blend of both approaches.
σ = 0: Full expectation (Tree Backup). Average over all actions.
Unification
Q(σ) unifies several algorithms:
| Algorithm | σ | Behavior at each step |
|---|---|---|
| n-step Sarsa | 1 at every step | Sample the action actually taken |
| n-step Tree Backup | 0 at every step | Take an expectation over all actions |
| n-step Expected Sarsa | 1, ..., 1, then 0 | Sample every step, expectation at the last |
| Q(σ) | any value per step | Flexible per-step choice |
The Elegance of Q(σ)
Q(σ) shows that all these algorithms are really the same algorithm with different settings of σ. You can even vary σ step by step, mixing sampling and expectations within a single episode!
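To see the blending at a single step, here is a simplified on-policy, one-step sketch (the names are illustrative, and this is not the full off-policy n-step Q(σ) recursion): σ = 1 reproduces the Sarsa target, σ = 0 reproduces the expectation-style target.

```python
def q_sigma_one_step_target(r, s_next, a_next, Q, pi, action_space, sigma, gamma=0.9):
    """Blend the sampled (Sarsa) backup with the expected (tree-backup-style) backup."""
    sampled = Q[(s_next, a_next)]                                  # the action we took
    expected = sum(pi(a, s_next) * Q[(s_next, a)] for a in action_space)
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
```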
Summary
Key takeaways from Chapter 7.
Quick Reference Glossary
n-step Return
Sum of n rewards + bootstrap from step n
n-step TD
TD learning that looks n steps ahead before bootstrapping
n-step Sarsa
Sarsa with n-step returns for action values
Tree Backup
Off-policy n-step using expectations (no importance sampling)
Importance Sampling
Correcting returns from behavior to target policy
Q(σ)
Unified algorithm: σ=1 is Sarsa, σ=0 is Tree Backup
Bias-Variance Tradeoff
Small n = high bias, low variance; Large n = low bias, high variance
Bootstrapping
Using current estimates to update estimates
Formula Cheat Sheet
n-step Return: G_{t:t+n} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})
n-step TD Update: V(S_t) ← V(S_t) + α [G_{t:t+n} − V(S_t)]
n-step Sarsa: Q(S_t, A_t) ← Q(S_t, A_t) + α [G_{t:t+n} − Q(S_t, A_t)], with G_{t:t+n} bootstrapping from Q(S_{t+n}, A_{t+n})
Importance Sampling Ratio: ρ_{t:t+n−1} = ∏ π(A_k | S_k) / b(A_k | S_k) over k = t, ..., t+n−1
🎯 Key pattern: All n-step methods use n actual rewards + γ^n × bootstrap
Common Mistakes (Avoid These!)
Wrong: "Larger n is always better"
Right: Larger n reduces bias but increases variance. The optimal n depends on your problem — often n=4 to n=8 works well.
Wrong: "n-step methods must wait n steps to update"
Right: At episode end, you can use shorter returns for the last n-1 states. You don't "lose" those updates!
What We Learned in This Chapter
n-step methods give us a tunable dial between TD(0) and Monte Carlo. We can choose how many steps to look ahead, balancing bias and variance for our specific problem.
- 📏 n-step returns — Look ahead n steps, then bootstrap
- ⚖️ Bias-variance tradeoff — Small n = high bias, large n = high variance
- 🌳 Tree Backup — Off-policy without importance sampling
- 🎛️ Q(σ) — Unifies everything with one parameter
n-step methods bridge TD(0) and Monte Carlo, offering a spectrum of algorithms
Larger n means less bias but more variance; smaller n means more bias but less variance
n-step Sarsa extends Sarsa by using n-step returns instead of 1-step
Off-policy n-step methods use importance sampling, which can have high variance
Per-decision importance sampling and control variates reduce variance
Tree backup avoids importance sampling by using expectations over actions
Q(σ) unifies sampling and expectation-based methods with a continuous parameter
The Key Insight
We now have two dials to tune our RL algorithms:
n (number of steps)
How far to look ahead before bootstrapping
σ (sampling degree)
How much to sample vs. take expectations
Looking Ahead
In Chapter 12, we'll explore eligibility traces, which provide an even more elegant way to unify TD and MC. Traces allow efficient implementation of n-step methods and introduce λ-returns, which are weighted averages of different n-step returns.