Chapter 6

Temporal-Difference Learning

The best of both worlds: learn from experience like MC, but update immediately using bootstrapping like DP. TD is the foundation of modern RL.

The breakthrough that made modern RL possible!

Remember the main problem with Monte Carlo? You had to wait until the end of an episode to learn. And DP needed a complete model. TD learning solves both problems: it learns from experience (like MC) but updates immediately after each step (like DP)! This is the idea behind algorithms like Q-learning and DQN — the algorithms that learned to play Atari games.

Online

Update every step

Bootstrapping

Uses current estimates

Model-Free

No dynamics needed

The Big Picture: Combining the Best of Both Worlds

Dynamic Programming

  • ✅ Updates immediately
  • ✅ Uses bootstrapping
  • ❌ Needs a model

Monte Carlo

  • ✅ Model-free
  • ✅ Uses actual returns
  • ❌ Waits for episode end

TD Learning ⭐

  • ✅ Model-free
  • ✅ Updates immediately
  • ✅ Works for continuing tasks
Section 6.1

TD Prediction

Learning value functions from individual transitions.

Think of it like updating your GPS estimate!

Imagine you're driving to a destination. Your GPS says "1 hour remaining." After 10 minutes, you've traveled further than expected and the GPS updates to "45 minutes." It didn't wait until you arrived — it updated its estimate immediately based on how the trip is going so far. That's exactly what TD learning does!

TD(0) Update Rule
V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]

📖 What Each Symbol Means:

V(St)= our current estimate of state St's value (what we're updating)
α (alpha)= learning rate (0 to 1) — how big a step we take (e.g., 0.1)
Rt+1= actual reward received after leaving St
γ (gamma)= discount factor — how much future rewards are worth now
V(St+1)= estimated value of the next state (bootstrapping!)
= "update to" (assignment, not equals)

📐 Understanding the Update Structure:

TD Target = Rt+1 + γV(St+1) — what we're moving toward

TD Error = Target - V(St) — how wrong we were

Update = Old + α × Error — move a small step toward target

Update V(S) toward the TD target: R + γV(S'). The TD error δ = R + γV(S') - V(S) measures how surprising the transition was.

Breaking Down the TD Update

V(St) = Our current estimate of state S's value

α (alpha) = Learning rate (how much to adjust, typically 0.1 to 0.5)

Rt+1 + γV(St+1) = TD Target (what we're updating toward)

[Target - Current] = TD Error (how much we were "surprised")

🎯 In plain English: "I thought this state was worth X, but based on what just happened, maybe it should be a bit closer to Y."
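
To make the update concrete, here is a minimal tabular TD(0) prediction sketch in Python. The env.reset()/env.step() interface and the policy function are assumptions for illustration; adapt them to whatever environment you're using.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.95):
    """Tabular TD(0): estimate V for a fixed policy from raw experience."""
    V = defaultdict(float)  # every state's value starts at 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)   # assumed interface
            # TD target bootstraps from the current estimate of the next state
            target = reward + gamma * (0.0 if done else V[next_state])
            td_error = target - V[state]       # delta: the "surprise" signal
            V[state] += alpha * td_error       # nudge V(S) toward the target
            state = next_state
    return V
```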

Interactive TD(0) Demo

TD(0) Value Estimation

[Interactive demo: a small grid world whose state values all start at 0.0, with the agent at state A, two terminal states (TERM), a total-step counter, and learning rate α = 0.1.]
TD(0) Update Rule
V(S) ← V(S) + α[R + γV(S') - V(S)]
δ = R + γV(S') - V(S) is the TD error
Key Insight

TD learns after every step, not waiting for episode end. It bootstraps by using V(S') as an estimate of future returns.

Watch how values update immediately after each transition. Unlike MC, TD doesn't wait for the episode to end.

The TD Error — The Core of RL

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

📖 Breaking Down the TD Error:

δt= TD error at time t (delta) — the "surprise" signal
Rt+1 + γV(St+1)= what we actually got (reward + estimated future)
V(St)= what we expected (our prediction)

🎯 In Plain English:

δ = (What I got) - (What I expected)

Positive δ = pleasant surprise, Negative δ = disappointment

If δ > 0, the outcome was better than expected—increase V(S). If δ < 0, the outcome was worse—decrease V(S). The TD error is the core learning signal in RL.

δ > 0 (Positive Surprise)

"This was better than I thought! Maybe this state is more valuable."

δ < 0 (Negative Surprise)

"This was worse than expected! Maybe this state is less valuable."

Fun Fact: Your Brain Does This Too!

Neuroscientists discovered that dopamine neurons in your brain fire in a pattern that looks exactly like the TD error! When something is better than expected, dopamine spikes. When worse, it dips. TD learning might be how your brain actually learns from experience.

For the Curious Mind

TD Learning Has a Nobel Prize Connection!

The 2017 Nobel Memorial Prize in Economics went to Richard Thaler for behavioral economics. His work built on studies of how humans deviate from "optimal" decisions, and TD learning offers one way to see many of these "biases" as features, not bugs, of learning from prediction errors!

Why is TD Better Than MC in Practice?

TD can learn from every step, not just episode ends. Imagine learning to drive — with MC, you'd only learn after completing a whole trip. With TD, you learn from each turn, each lane change, each stop. Much faster!

TD Powers All Modern Deep RL!

DQN (Atari), A3C, PPO, SAC — every famous deep RL algorithm uses TD learning at its core. The simple idea of "update based on prediction error" scales to neural networks with millions of parameters!

Section 6.2

Advantages of TD Prediction Methods

Why TD methods are preferred in practice.

Why is TD so popular?

TD methods have become the dominant approach in RL. They combine the best features of DP and MC while avoiding their main limitations. Let's see why!

Dynamic Programming

Full model knowledge, expected updates over all successors

V(s) ← E[r + γV(s')]

Monte Carlo

Model-free, sample complete episodes, average returns

V(s) ← average(Gt)

Temporal-Difference

Model-free, bootstraps from current estimate, online

V(s) ← V(s) + α[r + γV(s') - V(s)]
Feature                          DP    MC    TD
Model-free                       ❌     ✅     ✅
Bootstraps                       ✅     ❌     ✅
Online learning                  ❌     ❌     ✅
Works with incomplete episodes   ✅     ❌     ✅
Unbiased estimates               ✅     ✅     ❌
Low variance                     ✅     ❌     ✅
Guaranteed convergence           ✅     ✅     ✅
DP Backup
●━━●━━●

Full-width (all successors)

MC Backup
●━━●━━●━━■

Sample path to terminal

TD Backup
●━━●

One-step sample

TD combines the best of both: model-free like MC, but bootstraps like DP for faster learning.

Advantages Over MC

  • Learn before episode ends
  • Works with continuing tasks
  • Lower variance updates
  • Often faster convergence

💡 No more waiting! Learn from every single step.

Advantages Over DP

  • No model required
  • Learn from experience
  • Focus on visited states
  • Scales to large problems

💡 Just interact with the environment — no perfect knowledge needed!

The Bias-Variance Trade-off

Monte Carlo
  • ✅ Unbiased (uses actual returns)
  • ❌ High variance (returns vary a lot)
TD Learning
  • ❌ Biased (uses estimated V(S'))
  • ✅ Lower variance (one-step estimates are more stable)

In practice, TD's lower variance often leads to faster, more stable learning despite the bias!

Section 6.3

Optimality of TD(0)

TD converges to different solutions than MC under batch training.

TD and MC find different answers!

Here's something interesting: if you show TD and MC the same data, they might converge to different values. Why? Because they're answering slightly different questions!

Batch TD vs Batch MC

When trained repeatedly on a fixed batch of experience:

Monte Carlo

Converges to values that minimize MSE on observed returns. Best fit to the data we've seen.

💡 "What average return did I see from each state?"

TD(0)

Converges to the certainty-equivalence estimate: the correct values for the MDP that best explains the data.

💡 "What values are consistent with the transitions I've seen?"

Why TD is Often Better

TD exploits the Markov property: it builds an implicit model and finds values consistent with that model. MC ignores structure and just averages returns. When the MDP assumption holds, TD is more data-efficient.

An Intuitive Example

Suppose γ = 1 and you observe three episodes: A → B (reward 0) then B → end (reward 0); B → end (reward 1); and B → end (reward 1).

MC says V(A) = 0

The only episode that started in A had total return 0 + 0 = 0, so the average return from A is 0. (And V(B) = (0 + 1 + 1) / 3 ≈ 0.67.)

TD says V(A) ≈ 0.67

Every observed transition from A went to B with reward 0, and the transitions from B average out to V(B) ≈ 0.67. The values consistent with those transitions are V(B) ≈ 0.67 and V(A) = 0 + γV(B) ≈ 0.67. If the environment really is Markov, TD's answer generalizes better to future episodes.
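
The divergence is easy to check numerically. Below is a small, self-contained sketch (plain Python, γ = 1) that runs batch MC and batch TD(0) on exactly these three episodes; the episode encoding and the sweep count are illustrative choices, not a prescribed setup.

```python
# The three episodes, as (state, reward, next_state) transitions; gamma = 1.
episodes = [
    [("A", 0, "B"), ("B", 0, None)],   # A -> B -> end
    [("B", 1, None)],                  # B -> end
    [("B", 1, None)],                  # B -> end
]

# Batch MC: average the observed return (sum of remaining rewards) per state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for (_, r, _) in ep]
    for i, (s, _, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Batch TD(0): sweep the same transitions repeatedly until values settle.
V = {"A": 0.0, "B": 0.0}
alpha = 0.05
for _ in range(5000):
    for ep in episodes:
        for s, r, s_next in ep:
            target = r + (V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])

print("Batch MC:", mc)  # {'A': 0.0,  'B': ~0.667}
print("Batch TD:", V)   # {'A': ~0.667, 'B': ~0.667}
```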

Section 6.4

Sarsa: On-policy TD Control

Learning Q-values while following an ε-greedy policy.

TD for control — learning action values!

Now we're getting to the good stuff! SARSA applies TD learning to learn Q(s, a) — the value of taking action a in state s. This lets us improve our policy!

SARSA Update
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]

📖 What Each Symbol Means:

Q(St, At)= value of taking action At in state St
α= learning rate (how fast to update)
Rt+1= reward received after taking action
γ= discount factor
Q(St+1, At+1)= value of the next state-action pair (what we actually do next!)

📐 Same Structure as TD(0):

Just like TD(0), but with Q(s,a) instead of V(s):

TD(0): V(S) ← V(S) + α[R + γV(S') - V(S)]

SARSA: Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]

State, Action, Reward, State, Action — hence the name SARSA.

Why "SARSA"?

The name comes from the sequence of things we need for the update:

S

State

A

Action

R

Reward

S'

Next State

A'

Next Action
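
Before the demo, here's a minimal tabular SARSA sketch in Python. As with the TD(0) sketch above, the env.reset()/env.step() interface and the action set are illustrative assumptions rather than a fixed API.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps):
    """Pick the greedy action with probability 1 - eps, else explore randomly."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    """Tabular SARSA (on-policy TD control), assumed env.reset()/env.step() interface."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, eps)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            # A' is chosen by the SAME epsilon-greedy policy we follow (on-policy)
            next_action = epsilon_greedy(Q, next_state, actions, eps)
            target = reward + gamma * (0.0 if done else Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```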

Cliff Walking with SARSA

Watch how SARSA learns to navigate! Notice how it finds a "safe" path away from the cliff.

[Interactive demo: a cliff-walking grid world with a row of cliff cells between the start and the goal G, plus episode and reward counters and ε = 0.1 exploration.]
SARSA Update
Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]

On-policy: Learns Q for the ε-greedy policy being followed. Takes the "safe" path away from the cliff.

On-policy Behavior

SARSA learns the value of the policy it's actually following (including exploration). It learns the "safe" path because it accounts for occasional random actions near the cliff.

🎯 Key insight: SARSA is on-policy — it learns about the policy it follows, including its exploration mistakes!

Section 6.5

Q-learning: Off-policy TD Control

Learning the optimal Q-function directly.

The most famous RL algorithm!

Q-learning is arguably the most important algorithm in RL history. It directly learns the optimal Q-function, regardless of what policy you follow! This is the foundation of Deep Q-Networks (DQN) that learned to play Atari games at superhuman levels.

Q-learning Update
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]

📖 What Each Symbol Means:

Q(St, At)= value of current state-action pair (what we're updating)
α= learning rate
Rt+1= immediate reward
maxa Q(S', a)= value of the BEST action in next state (the key difference!)

🔑 The Key Difference from SARSA:

SARSA: ... + γQ(S', A') — uses action we actually take

Q-learning: ... + γmaxa Q(S', a) — uses BEST action

Q-learning always assumes we'll act optimally in the future, even if we don't!

Uses max instead of the actual next action. This learns Q* directly, regardless of the behavior policy.

The Magic of "max"

SARSA uses Q(S', A')

Uses the action we actually take next (might be exploratory)

Q-learning uses max Q(S', a)

Uses the best action, regardless of what we actually do

🎯 Result: Q-learning learns the optimal policy even while exploring randomly! That's why it's called off-policy.
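
For comparison, here is the same control loop with the Q-learning target. It reuses the epsilon_greedy helper from the SARSA sketch above, and the environment interface is again an assumption for illustration.

```python
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    """Tabular Q-learning (off-policy TD control); uses epsilon_greedy() from above."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave with epsilon-greedy exploration...
            action = epsilon_greedy(Q, state, actions, eps)
            next_state, reward, done = env.step(action)
            # ...but bootstrap from the BEST next action, whatever we actually do next
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * (0.0 if done else best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```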

Cliff Walking with Q-learning

Compare this to SARSA! Q-learning finds the "optimal" (but risky) path along the cliff edge.

[Interactive demo: the same cliff-walking grid world (cliff cells, goal G, episode and reward counters, ε = 0.1 exploration), now trained with Q-learning.]
Q-learning Update
Q(S,A) ← Q(S,A) + α[R + γ maxa Q(S',a) - Q(S,A)]

Off-policy: Learns the optimal Q* directly. Takes the "optimal" path along the cliff edge (risky during exploration!).

SARSA (On-policy)
  • Learns Q for the policy being followed
  • Takes the safer path (accounts for exploration)
  • Better online performance

Best when you care about performance while learning

Q-learning (Off-policy)
  • Learns the optimal Q*
  • Takes the optimal path (risky during learning!)
  • Better final policy

Best when you only care about the final policy

Section 6.6

Expected Sarsa

Taking the expected value over next actions instead of sampling.

Why not average instead of sample?

SARSA uses the actual next action A', which adds randomness. Expected Sarsa asks: "What if we used the expected value over all possible next actions?" This gives us lower variance updates!

Expected Sarsa Update

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)]

📖 What Each Symbol Means:

Σa= sum over all possible actions a
π(a|St+1)= probability of taking action a in next state under policy π
Q(St+1, a)= value of each action in the next state

📐 What This Calculates:

The term Σ π(a|S') Q(S', a) is the expected value of the next state:

If π = ε-greedy with ε=0.1 and 4 actions:

Expected = 0.925×Q(best) + 0.025×Q(a₂) + 0.025×Q(a₃) + 0.025×Q(a₄)

Instead of using the sampled A', take the expected value over all actions weighted by their probability under the policy.
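
As a sketch, here is how that expected target could be computed for an ε-greedy policy in Python. The function name and signature are illustrative, not a standard API.

```python
def expected_sarsa_target(Q, next_state, actions, reward, gamma=1.0, eps=0.1):
    """Expected SARSA target under an epsilon-greedy policy (illustrative sketch)."""
    q_values = [Q[(next_state, a)] for a in actions]
    # Every action gets eps/|A| probability; the greedy one also gets 1 - eps
    probs = [eps / len(actions)] * len(actions)
    probs[q_values.index(max(q_values))] += 1.0 - eps
    expected_value = sum(p * q for p, q in zip(probs, q_values))
    return reward + gamma * expected_value

# The update is then: Q[(s, a)] += alpha * (expected_sarsa_target(...) - Q[(s, a)])
```

With ε = 0.1 and 4 actions, this reproduces the 0.925 / 0.025 weighting shown above.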

Comparing All Three

SARSA: Uses Q(S', A') where A' is the action we actually take

Q-learning: Uses maxa Q(S', a) — the best action

Expected SARSA: Uses Σ π(a|S') Q(S', a) — weighted average

Lower Variance

Averaging reduces noise from action selection

Better Performance

Often outperforms both Sarsa and Q-learning

Flexible

Can be on-policy or off-policy

Special Cases

With a greedy target policy (π(a|s) = 1 for argmax), Expected Sarsa becomes Q-learning. Expected Sarsa generalizes both algorithms.

Section 6.7

Maximization Bias and Double Learning

Q-learning can overestimate values due to the max operator.

There's a subtle problem with "max"!

Using max to pick the best action sounds great, but it has a hidden flaw: if our Q-value estimates have noise (they always do!), the max operation tends to pick the action with the luckiest overestimate, not the truly best action.

Maximization Bias

When we use max to both select and evaluate an action, we get upward bias. Random noise in Q estimates gets amplified by always picking the maximum.

The Problem: A Simple Example

Consider a state where all actions have true value 0, but estimates have random error. maxa Q(s,a) will be positive (we pick the lucky overestimate), even though the true max is 0.

Example: True values are all 0, but estimates are [-0.1, +0.2, -0.05, +0.15]. The max is +0.2, but the true max is 0. We're overestimating!
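
You can see the bias with a tiny simulation: draw noisy estimates around a true value of 0 and average their maximum. The noise level (σ = 0.1) is an arbitrary choice for illustration.

```python
import random

random.seed(0)
num_actions, num_trials = 4, 100_000
avg_max = 0.0
for _ in range(num_trials):
    # True value of every action is 0; the estimates carry Gaussian noise
    noisy_estimates = [random.gauss(0.0, 0.1) for _ in range(num_actions)]
    avg_max += max(noisy_estimates) / num_trials

print(f"True max = 0.0, but average max over noisy estimates ≈ {avg_max:.3f}")
# Comes out around +0.10: the max consistently overestimates.
```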

Double Q-learning Solution

Maintain two Q-functions. Use one to select the best action, the other to evaluate it:

Q_1(S, A) \leftarrow Q_1(S, A) + \alpha [R + \gamma Q_2(S', \arg\max_a Q_1(S', a)) - Q_1(S, A)]

📖 Breaking Down the Formula:

Q1, Q2= two separate Q-tables with independent estimates
argmaxa Q1= Q₁ selects which action looks best
Q2(S', ...)= Q₂ evaluates that action (independent opinion!)

Q1 selects the action: argmaxa Q1(S', a)

Q2 evaluates it: Q2(S', that action)

🎯 Why Two Q-tables Fix the Bias:

Regular Q-learning: If Q₁ overestimates action A, we pick A AND use that overestimate → double trouble!

Double Q-learning: If Q₁ overestimates action A, we pick A BUT Q₂ gives an independent (likely more accurate) estimate.

Randomly update either Q1 or Q2 on each step. This decouples selection from evaluation, eliminating the bias.

Why This Works

Even if Q1 overestimates an action, Q2 has independent noise and will give a more accurate evaluation. The errors don't compound!
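
A minimal sketch of one Double Q-learning step, assuming two tabular Q-tables stored as dictionaries; the function name and the simplified terminal handling are illustrative choices, not the canonical implementation.

```python
import random
from collections import defaultdict

def double_q_update(Q1, Q2, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=1.0):
    """One Double Q-learning update: decouple action selection from evaluation."""
    # Flip a coin: update Q1 using Q2 as the evaluator, or vice versa
    select, evaluate = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
    # Selection: the table being updated picks the apparently best action
    best = max(actions, key=lambda a: select[(next_state, a)])
    # Evaluation: the OTHER table supplies an independent estimate of that action
    target = reward + gamma * evaluate[(next_state, best)]
    select[(state, action)] += alpha * (target - select[(state, action)])

# Usage sketch: Q1, Q2 = defaultdict(float), defaultdict(float); act
# epsilon-greedily with respect to Q1 + Q2, then call double_q_update
# after every transition.
```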

🎯 Fun fact: Double DQN (Double Deep Q-Network) uses this idea and significantly outperforms regular DQN on Atari games!

Section 6.8

Summary

Key takeaways from Chapter 6.

Quick Reference Glossary

Temporal-Difference (TD)

Learning from transitions using bootstrapping, no model needed

TD Error (δ)

The "surprise" signal: R + γV(S') - V(S)

TD Target

What we update toward: R + γV(S')

Bootstrapping

Using current estimates to update estimates

SARSA

On-policy TD control: State-Action-Reward-State-Action

Q-learning

Off-policy TD control using max over next actions

Expected SARSA

Average over next actions weighted by policy

Maximization Bias

Overestimation from using max on noisy estimates

Double Learning

Two Q-tables to decouple selection from evaluation

Learning Rate (α)

How big a step to take toward the target (0 to 1)

Formula Cheat Sheet

TD(0) Update (for state values):

V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]

SARSA (on-policy):

Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma Q(S',A') - Q(S,A)]

Q-learning (off-policy):

Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_a Q(S',a) - Q(S,A)]

Expected SARSA:

Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \sum_a \pi(a|S') Q(S',a) - Q(S,A)]

🎯 Pattern: All TD methods follow: Q ← Q + α × (target - Q)

Common Mistakes (Avoid These!)

Wrong: "SARSA and Q-learning always give the same result"

Right: They can differ significantly! SARSA is safer (avoids cliffs), Q-learning is more optimal but riskier during learning.

Wrong: "TD is biased so MC is always better"

Right: TD's lower variance often makes it learn faster! The bias shrinks as estimates improve.

What We Learned in This Chapter

TD learning is the foundation of modern RL! It combines the best of DP (bootstrapping) and MC (model-free learning).

  • TD(0) — Update after every step using the TD error
  • 🎯 SARSA — On-policy control (learns what it follows)
  • 🏆 Q-learning — Off-policy control (learns the optimal policy)
  • 📊 Expected SARSA — Lower variance by averaging
  • 🔄 Double Learning — Fix maximization bias

TD learning combines MC (model-free) and DP (bootstrapping)

TD(0) updates after every step using the TD error: δ = R + γV(S') - V(S)

TD has lower variance than MC but introduces bias through bootstrapping

SARSA is on-policy TD control: learns Q for the policy being followed

Q-learning is off-policy: learns the optimal Q* directly

Expected Sarsa reduces variance by taking expectation over next actions

Maximization bias affects Q-learning; Double Q-learning fixes it

TD methods are the foundation of modern deep RL algorithms

Quick Algorithm Comparison

Algorithm         Target                        Policy Type
SARSA             R + γ Q(S', A')               On-policy
Q-learning        R + γ maxa Q(S', a)           Off-policy
Expected SARSA    R + γ Σa π(a|S') Q(S', a)     Both

Looking Ahead

We've covered the tabular foundations of RL. The next chapters explore n-step methods that bridge TD and MC, eligibility traces for more efficient credit assignment, and eventually function approximation for scaling to large state spaces.