Chapter 7

n-step Bootstrapping

Bridge the gap between TD(0) and Monte Carlo. Choose how many steps to look ahead, trading off bias and variance.

What if we could get the best of both worlds?

TD(0) updates after 1 step — fast but potentially biased. Monte Carlo waits for the entire episode — unbiased but slow and high variance. What if we could pick any number in between? That's exactly what n-step methods do! They let you choose how far to look ahead: 2 steps, 5 steps, 10 steps... it's like having a dial that lets you tune the bias-variance tradeoff.

  • n-step Returns: flexible lookahead
  • Bias-Variance: a tunable tradeoff
  • Unified View: the TD ↔ MC spectrum

The Big Picture: A Spectrum of Algorithms

  • TD(0) (n = 1): high bias, low variance
  • n-step (the sweet spot): balanced bias and variance
  • Monte Carlo (n = ∞): no bias, high variance

Section 7.1

n-step TD Prediction

Using multiple steps of reward before bootstrapping.

Think of it like checking your GPS!

TD(0): You check your GPS every block to see if you're on track. Quick updates, but you might overcorrect.

Monte Carlo: You wait until you arrive at your destination to see how the trip went. Very accurate, but you don't learn until you're done.

n-step: You check every n blocks — maybe every 3 or 5 blocks. Not too frequent, not too late. Just right!

n-step Return
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

📖 What Each Symbol Means:

  • Gt:t+n = the n-step return starting at time t (the subscript shows the time range)
  • Rt+1, ..., Rt+n = the actual rewards received over the next n steps
  • γ, γ², ..., γn-1 = the discount factors applied to each reward (rewards further away are discounted more)
  • γnV(St+n) = the bootstrapped estimate of everything after n steps

📐 For n=3, This Expands To:

Gt:t+3 = Rt+1 + γRt+2 + γ²Rt+3 + γ³V(St+3)

3 actual rewards + discounted bootstrap from 3 steps ahead

Use the first n rewards, then bootstrap from V(St+n). When n=1, this is TD(0). When n=∞, this is Monte Carlo.

Breaking Down the n-step Return

Rt+1 + γRt+2 + ... + γn-1Rt+n = Sum of the actual rewards for n steps

γnV(St+n) = Then bootstrap (estimate) from where you end up

🎯 In plain English: "Add up the actual rewards for n steps, then use your estimate of how much more you'll get from that point on."
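As a small, hypothetical Python sketch (the helper name, the V lookup, and the bootstrap_state argument are all illustrative choices, not anything from this chapter), the plain-English rule translates directly into code:

```python
def n_step_return(rewards, V, bootstrap_state, gamma):
    """Sum the n actual rewards, then bootstrap from the value estimate.

    rewards         -- the observed rewards R_{t+1}, ..., R_{t+n}
    bootstrap_state -- S_{t+n}, or None if the episode ended within n steps
    V               -- a dict (or array) of state-value estimates
    """
    G = 0.0
    for k, r in enumerate(rewards):                  # k = 0 .. n-1
        G += (gamma ** k) * r                        # gamma^k * R_{t+k+1}
    if bootstrap_state is not None:                  # gamma^n * V(S_{t+n})
        G += (gamma ** len(rewards)) * V[bootstrap_state]
    return G
```

For example, n_step_return([1.0, 0.0, 0.0], V, s3, 0.9) adds the three discounted rewards and then 0.9³ · V[s3], matching the n=3 expansion above.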

The n-step Spectrum

[Interactive spectrum: choose n from TD(0) through n = 2, 3, 5, 10 up to MC.]

Bias: highest for TD(0), which bootstraps from imperfect estimates, and lowest for MC.

Variance: lowest for TD(0), which makes small, stable updates, and highest for MC.

n-step Return Target: Rt+1 + γRt+2 + ... + γn-1Rt+n (the sampled rewards) + γnV(St+n) (the bootstrap).

As n → ∞, the bootstrap term vanishes (the return reaches the terminal state) and the target becomes the pure Monte Carlo return.

Concrete Examples

n=1 (TD(0))

G = R1 + γV(S1)

Just one reward, then bootstrap immediately

n=3

G = R1 + γR2 + γ²R3 + γ³V(S3)

Three actual rewards, then bootstrap

n=∞ (Monte Carlo)

G = R1 + γR2 + γ²R3 + ... (until episode ends)

All rewards, no bootstrapping

Interactive n-step TD Demo

[Interactive demo: "n-step TD Values (n=3)", a small gridworld with a GOAL cell whose state values start at 0.0 and are updated by n-step TD as episodes run. A slider sets n from 1 (TD) up to 10 (→ MC), with readouts for the number of episodes, the steps of lookahead, and the n-step return Gt:t+n = Rt+1 + γRt+2 + ... + γn-1Rt+n + γnV(St+n).]

Try different n values. n=1 is TD(0), larger n approaches Monte Carlo. Higher n has more variance but less bias.
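If you want to reproduce this kind of experiment offline, the sketch below follows the standard tabular n-step TD prediction loop; the env.reset() / env.step() / policy() interface and the defaultdict value table are assumptions made for illustration, not the demo's actual code:

```python
from collections import defaultdict

def n_step_td_prediction(env, policy, n=3, gamma=0.95, alpha=0.1, episodes=500):
    """Tabular n-step TD prediction (a sketch under the assumed env interface).

    Assumed interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), policy(state) -> action.
    """
    V = defaultdict(float)                      # value estimates, default 0.0
    for _ in range(episodes):
        states = [env.reset()]                  # states[t]  holds S_t
        rewards = [0.0]                         # rewards[t] holds R_t (index 0 unused)
        T = float('inf')                        # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                next_state, reward, done = env.step(policy(states[t]))
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1                     # the time whose state is updated now
            if tau >= 0:
                # n actual (discounted) rewards, truncated at the end of the episode...
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                # ...plus the bootstrap term, unless the episode ended first
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```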

For the Curious Mind

The "Sweet Spot" Depends on Your Problem!

In Atari games, n=3 to n=5 often works best. In robotics with delayed rewards, larger n helps. There's no universal answer — that's why hyperparameter tuning matters!

What if we use ALL values of n at once?

Great question! That's exactly what TD(λ) does in Chapter 12. It computes a weighted average of 1-step, 2-step, 3-step... returns. The parameter λ controls the weighting. It's one of the most elegant ideas in RL!

n-step Ideas in Modern Deep RL

Google DeepMind's A3C algorithm uses n-step returns (typically n=5 or n=20). OpenAI's GAE (Generalized Advantage Estimation) in PPO is essentially a sophisticated n-step method. These fundamentals power state-of-the-art AI!

Section 7.2

n-step Sarsa

Extending Sarsa to use n-step returns.

Same idea, but for action values!

Remember Sarsa from Chapter 6? It updated Q(s, a) after each step. n-step Sarsa is the same idea, but now we wait n steps before making an update. This gives us the same bias-variance tradeoff, but for learning which actions are good.

n-step Sarsa Update

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [G_{t:t+n} - Q(S_t, A_t)]

📖 What Each Symbol Means:

  • Q(St, At) = the current estimate of the action value (what we're updating)
  • α = the learning rate (step size)
  • Gt:t+n = the n-step return (the target we're moving toward)
  • [G - Q] = the n-step TD error (how far off we were)

where the n-step return for action values is:

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})

📐 Comparing to Regular (1-step) Sarsa:

1-step: G = Rt+1 + γQ(St+1, At+1)

n-step: G = Rt+1 + γRt+2 + ... + γn-1Rt+n + γnQ(St+n, At+n)

More actual rewards, bootstrap further into the future!
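As a rough sketch, only the bootstrap term changes when we move from state values to action values; Q is assumed here to be a nested dict Q[state][action], which is an implementation choice rather than anything prescribed above:

```python
def n_step_sarsa_return(rewards, Q, bootstrap_sa, gamma):
    """n-step return for action values: n rewards, then bootstrap from Q.

    rewards      -- R_{t+1}, ..., R_{t+n}
    bootstrap_sa -- the pair (S_{t+n}, A_{t+n}), or None if the episode ended
    Q            -- nested dict of action-value estimates, Q[state][action]
    """
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    if bootstrap_sa is not None:
        s, a = bootstrap_sa
        G += gamma ** len(rewards) * Q[s][a]    # gamma^n * Q(S_{t+n}, A_{t+n})
    return G

# The update then nudges Q toward this target, exactly as in 1-step Sarsa:
#   Q[s_t][a_t] += alpha * (G - Q[s_t][a_t])
```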

n=1

Standard Sarsa

Low variance, high bias

Updates quickly, but might be wrong

n=5

Balanced

Often works well

Sweet spot for many problems!

n=∞

Monte Carlo

No bias, high variance

Accurate but slow to converge

Choosing n

The optimal n depends on the problem. Intermediate values (n=4 to n=8) often work best, balancing the benefits of multiple reward samples against increasing variance.

Practical Advice

  • Start with n=4 or n=8 — they work well for many problems
  • If learning is unstable, try smaller n (more bias, but more stable)
  • If learning is slow, try larger n (less bias, faster learning)
  • You can even try different n values and average them (coming up in Chapter 12!)
Section 7.3

n-step Off-policy Learning

Importance sampling for n-step methods.

Combining n-step with off-policy learning

Remember how off-policy learning lets us learn about one policy while following another? We can do that with n-step methods too! But there's a catch: the importance sampling ratios get multiplied over n steps, which can make them really big or really small.

Off-policy n-step TD

V(S_t) \leftarrow V(S_t) + \alpha \rho_{t:t+n-1} [G_{t:t+n} - V(S_t)]

📖 What Each Symbol Means:

  • V(St) = the state value being updated
  • α = the learning rate
  • ρt:t+n-1 = the importance sampling ratio over n steps (corrects for the policy difference)
  • Gt:t+n = the n-step return

The importance sampling ratio for n steps is:

\rho_{t:t+n-1} = \prod_{k=t}^{t+n-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}

📐 For n=3, This Expands To:

ρt:t+2 = π(At|St)/b(At|St) × π(At+1|St+1)/b(At+1|St+1) × π(At+2|St+2)/b(At+2|St+2)

Product of 3 ratios — one for each action in the n-step trajectory

⚠️ Why This Can Be Problematic:

Each ratio might be 2 or 0.5 or any number. When you multiply n of them together, the result can become very large (ρ = 32) or very small (ρ = 0.03). This causes high variance!

The Variance Problem

When we multiply n importance sampling ratios together, things can get out of hand:

Example with n=5, where each ratio is 2:
ρ = 2 × 2 × 2 × 2 × 2 = 32

The update gets scaled by 32! This causes huge variance.
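A tiny sketch makes the blow-up easy to reproduce; target_pi and behavior_b are hypothetical callables that return action probabilities:

```python
def importance_ratio(states, actions, target_pi, behavior_b):
    """Product of per-step ratios pi(A_k|S_k) / b(A_k|S_k) over the n-step window."""
    rho = 1.0
    for s, a in zip(states, actions):
        rho *= target_pi(a, s) / behavior_b(a, s)
    return rho

# The example above: five steps, each with a per-step ratio of 2,
# so the whole update gets scaled by 2**5 = 32.
print(importance_ratio(range(5), range(5),
                       target_pi=lambda a, s: 1.0,
                       behavior_b=lambda a, s: 0.5))    # -> 32.0
```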

The Problem

Importance sampling ratios can grow exponentially with n, leading to very high variance. The product of n ratios can be extremely large or small.

Solutions

  • Per-decision importance sampling
  • Control variates
  • Tree backup (no importance sampling needed!)
Section 7.4

Per-decision Methods with Control Variates

Reducing variance in off-policy learning.

A smarter way to apply importance sampling!

Instead of multiplying all the ratios together at once (which causes huge variance), we can be smarter: apply each ratio only to the rewards that come after that action. This way, later ratios don't affect earlier rewards, and variance stays under control.

Per-decision Importance Sampling

Instead of applying a single ratio ρ to the entire return, apply the appropriate ratio at each step. This reduces variance because later ratios only affect later rewards.

Control Variates

We can further reduce variance by using the action-value function as a control variate. The idea is to subtract a baseline that has expectation zero but reduces variance.

Standard IS

ρ × (full return)

High variance when ρ is large

With Control Variate

ρ × (return - Q) + Q

Lower variance (ρ only scales the deviation)

Why Control Variates Help

Think of it like this: instead of scaling the entire return by ρ, we only scale the difference between the return and our estimate Q. When Q is accurate, this difference is small, so even a large ρ doesn't cause huge updates.
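For the prediction (state-value) case, this idea can be written as a short recursion; the sketch below is one possible form, with rewards[k] holding R_{k+1} and rho[k] the single-step ratio at time k (both indexing conventions are assumptions made here):

```python
def cv_return(t, h, rewards, states, V, rho, gamma):
    """Per-decision off-policy return with a control variate (state-value form).

    Recursive form:  G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h})
                               + (1 - rho_t) * V(S_t)
    rewards[k] holds R_{k+1}; rho[k] is the single-step ratio pi/b at time k.
    """
    if t == h:                                  # horizon reached: bootstrap from V
        return V[states[h]]
    later = cv_return(t + 1, h, rewards, states, V, rho, gamma)
    return rho[t] * (rewards[t] + gamma * later) + (1.0 - rho[t]) * V[states[t]]
```

Notice that when rho[t] is 0, the return falls back to V(St) instead of collapsing to zero, which mirrors the variance-reduction idea described above.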

Section 7.5

Tree Backup Algorithm

Off-policy learning without importance sampling.

What if we could avoid importance sampling entirely?

The Tree Backup algorithm is clever: instead of sampling actions and then correcting with importance sampling, it takes expectations over all possible actions at each step. Since we're averaging rather than sampling, we don't need to correct for anything — no importance sampling required!

Tree Backup

Tree backup avoids importance sampling entirely by taking expectations over all possible actions at each step, weighted by the target policy. Only the state transitions are sampled.

Tree Backup Visualization

3-step Tree Backup Diagram

[Diagram: alternating states and taken actions along the sampled trajectory (S, A, S', A', S''); at each state, the untaken actions branch off and contribute their Q-values, weighted by π, while the taken action continues the trajectory.]
Tree Backup Algorithm

Tree backup doesn't require importance sampling for off-policy learning. Instead, it backs up the expected value of all actions at each state, weighted by the target policy probabilities.

Key Insight
  • Only sample state transitions (environment)
  • Average over all actions (policy expectation)
  • No importance sampling needed!
3-step Tree Backup Target
Gt:t+3 = Rt+1 + γ Σa≠At+1 π(a|St+1) Q(St+1, a) + γ π(At+1|St+1) [Rt+2 + ...]

The tree backup extends to the full tree of possible futures, using the target policy to weight each branch.

Why No Importance Sampling?

Tree backup uses the target policy π to weight all action values at each state. Since we're taking an expectation over actions (not sampling), we don't need to correct for the behavior policy. The variance is much lower than n-step Sarsa with importance sampling.
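One way to compute the tree-backup target is the recursion sketched below; representing Q as a dict of dicts and pi(a, s) as a callable are illustrative assumptions, not anything fixed by the text:

```python
def tree_backup_return(k, horizon, rewards, states, actions, Q, pi, gamma):
    """Recursive sketch of the n-step tree-backup target.

    rewards[k] holds R_{k+1}; actions[k+1] is the action actually taken at S_{k+1};
    pi(a, s) returns the target-policy probability; Q[state][action] holds estimates.
    """
    r, s_next = rewards[k], states[k + 1]
    if k + 1 == horizon:                        # last step: expectation over ALL actions
        return r + gamma * sum(pi(a, s_next) * Q[s_next][a] for a in Q[s_next])
    a_taken = actions[k + 1]
    # untaken actions contribute their current Q-values, weighted by pi...
    exp_untaken = sum(pi(a, s_next) * Q[s_next][a]
                      for a in Q[s_next] if a != a_taken)
    # ...while the taken action's branch continues down the sampled trajectory
    later = tree_backup_return(k + 1, horizon, rewards, states, actions, Q, pi, gamma)
    return r + gamma * (exp_untaken + pi(a_taken, s_next) * later)
```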

Comparison: Sampling vs. Expectation

n-step Sarsa (Sampling)
  • Uses the action we actually took
  • Needs importance sampling for off-policy
  • Can have high variance
Tree Backup (Expectation)
  • Averages over all possible actions
  • No importance sampling needed!
  • Lower variance
Section 7.6

A Unifying Algorithm: n-step Q(σ)

Interpolating between Sarsa and Tree Backup.

The ultimate unification!

What if you could smoothly blend between sampling (Sarsa-style) and taking expectations (Tree Backup-style)? Q(σ) does exactly that! The parameter σ at each step controls how much to sample vs. average. It's like having yet another dial to tune your algorithm.

The σ Parameter

At each step, σ ∈ [0, 1] controls the degree of sampling vs. expectation:

σ = 1

Full sampling (Sarsa)

Use the action we actually took

σ = 0.5

Mixture

Blend of both approaches

σ = 0

Full expectation (Tree Backup)

Average over all actions

Unification

Q(σ) unifies several algorithms:

  • n-step Sarsa: σ = 1 at every step (sample the action actually taken at each step)
  • n-step Tree Backup: σ = 0 at every step (take the expectation over all actions at each step)
  • n-step Expected Sarsa: σ = 1 at every step except the last, where σ = 0 (sample, then expect at the end)
  • Q(σ): any per-step choice of σ (fully flexible)

The Elegance of Q(σ)

Q(σ) shows that all these algorithms are really the same algorithm with different settings of σ. You can even vary σ step by step, mixing sampling and expectations within a single episode!
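A hedged sketch of the off-policy Q(σ) target shows how the blend works in code; the per-step arrays sigma and rho, the callable pi(a, s), and the nested Q table are all illustrative names, not anything fixed by the text:

```python
def q_sigma_return(k, horizon, rewards, states, actions, Q, pi, rho, sigma, gamma):
    """Recursive sketch of the n-step Q(sigma) target (off-policy form).

    sigma[k+1] in [0, 1] picks sampling (1) vs. expectation (0) at step k+1;
    rho[k+1] is that step's importance ratio; pi(a, s) is the target policy;
    rewards[k] holds R_{k+1} and Q[state][action] holds the current estimates.
    """
    r, s_next, a_next = rewards[k], states[k + 1], actions[k + 1]
    v_bar = sum(pi(a, s_next) * Q[s_next][a] for a in Q[s_next])  # expected value under pi
    if k + 1 == horizon:
        return r + gamma * v_bar
    later = q_sigma_return(k + 1, horizon, rewards, states, actions,
                           Q, pi, rho, sigma, gamma)
    # sigma blends the sampled correction (Sarsa-style) with the expected one (tree backup)
    weight = sigma[k + 1] * rho[k + 1] + (1 - sigma[k + 1]) * pi(a_next, s_next)
    return r + gamma * (weight * (later - Q[s_next][a_next]) + v_bar)
```

Setting sigma to all ones recovers n-step Sarsa with importance sampling, while all zeros recovers tree backup.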

Section 7.7

Summary

Key takeaways from Chapter 7.

Quick Reference Glossary

n-step Return

Sum of n rewards + bootstrap from step n

n-step TD

TD learning that looks n steps ahead before bootstrapping

n-step Sarsa

Sarsa with n-step returns for action values

Tree Backup

Off-policy n-step using expectations (no importance sampling)

Importance Sampling

Correcting returns from behavior to target policy

Q(σ)

Unified algorithm: σ=1 is Sarsa, σ=0 is Tree Backup

Bias-Variance Tradeoff

Small n = high bias, low variance; Large n = low bias, high variance

Bootstrapping

Using current estimates to update estimates

Formula Cheat Sheet

n-step Return:

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

n-step TD Update:

V(S_t) \leftarrow V(S_t) + \alpha [G_{t:t+n} - V(S_t)]

n-step Sarsa:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [G_{t:t+n} - Q(S_t, A_t)]

Importance Sampling Ratio:

\rho_{t:t+n-1} = \prod_{k=t}^{t+n-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}

🎯 Key pattern: All n-step methods use n actual rewards + γn × bootstrap

Common Mistakes (Avoid These!)

Wrong: "Larger n is always better"

Right: Larger n reduces bias but increases variance. The optimal n depends on your problem — often n=4 to n=8 works well.

Wrong: "n-step methods must wait n steps to update"

Right: At episode end, you can use shorter returns for the last n-1 states. You don't "lose" those updates!
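For instance (made-up numbers), with n = 3 but only two rewards left before termination, the target simply truncates and there is no bootstrap term:

```python
gamma = 0.9
remaining_rewards = [1.0, 0.0]          # R_{T-1}, R_T: all that is left of the episode
G = sum(gamma ** k * r for k, r in enumerate(remaining_rewards))
print(G)                                # 1.0 -- a shorter return, no bootstrap needed
```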

What We Learned in This Chapter

n-step methods give us a tunable dial between TD(0) and Monte Carlo. We can choose how many steps to look ahead, balancing bias and variance for our specific problem.

  • 📏 n-step returns — Look ahead n steps, then bootstrap
  • ⚖️ Bias-variance tradeoff — Small n = high bias, large n = high variance
  • 🌳 Tree Backup — Off-policy without importance sampling
  • 🎛️ Q(σ) — Unifies everything with one parameter

n-step methods bridge TD(0) and Monte Carlo, offering a spectrum of algorithms

Larger n means less bias but more variance; smaller n means more bias but less variance

n-step Sarsa extends Sarsa by using n-step returns instead of 1-step

Off-policy n-step methods use importance sampling, which can have high variance

Per-decision importance sampling and control variates reduce variance

Tree backup avoids importance sampling by using expectations over actions

Q(σ) unifies sampling and expectation-based methods with a continuous parameter

The Key Insight

We now have two dials to tune our RL algorithms:

n (number of steps)

How far to look ahead before bootstrapping

σ (sampling degree)

How much to sample vs. take expectations

Looking Ahead

In the next chapter, we'll explore eligibility traces, which provide an even more elegant way to unify TD and MC. Traces allow efficient implementation of n-step methods and introduce λ-returns for weighted averages of different n-step returns.