Temporal-Difference Learning
The best of both worlds: learn from experience like MC, but update immediately using bootstrapping like DP. TD is the foundation of modern RL.
The breakthrough that made modern RL possible!
Remember the main problem with Monte Carlo? You had to wait until the end of an episode to learn. And DP needed a complete model. TD learning solves both problems: it learns from experience (like MC) but updates immediately after each step (like DP)! This is the idea behind algorithms like Q-learning and DQN — the algorithms that learned to play Atari games.
Online
Update every step
Bootstrapping
Uses current estimates
Model-Free
No dynamics needed
The Big Picture: Combining the Best of Both Worlds
Dynamic Programming
- ✅ Updates immediately
- ✅ Uses bootstrapping
- ❌ Needs a model
Monte Carlo
- ✅ Model-free
- ✅ Uses actual returns
- ❌ Waits for episode end
TD Learning ⭐
- ✅ Model-free
- ✅ Updates immediately
- ✅ Works for continuing tasks
TD Prediction
Learning value functions from individual transitions.
Think of it like updating your GPS estimate!
Imagine you're driving to a destination. Your GPS says "1 hour remaining." After 10 minutes, you've traveled further than expected and the GPS updates to "45 minutes." It didn't wait until you arrived — it updated its estimate immediately based on how the trip is going so far. That's exactly what TD learning does!
📐 Understanding the Update Structure:
TD Target = Rₜ₊₁ + γV(Sₜ₊₁) — what we're moving toward
TD Error = Target - V(Sₜ) — how wrong we were
Update = Old + α × Error — move a small step toward the target
Update V(S) toward the TD target: R + γV(S'). The TD error δ = R + γV(S') - V(S) measures how surprising the transition was.
Breaking Down the TD Update
V(Sₜ) = Our current estimate of state Sₜ's value
α (alpha) = Learning rate (how much to adjust, typically 0.1 to 0.5)
Rₜ₊₁ + γV(Sₜ₊₁) = TD Target (what we're updating toward)
[Target - Current] = TD Error (how much we were "surprised")
🎯 In plain English: "I thought this state was worth X, but based on what just happened, maybe it should be a bit closer to Y."
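To make this concrete, here's a minimal Python sketch of tabular TD(0) prediction. It's a sketch under assumptions: a Gym-style env with reset() and step(action) returning (next_state, reward, done, info), a policy given as a plain function, and illustrative hyperparameters.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0): estimate V(s) for a fixed policy from experience."""
    V = defaultdict(float)  # value estimates, default 0.0 for unseen states

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)

            # TD target: reward plus discounted estimate of the next state.
            # Terminal states are worth 0, so drop the bootstrap term there.
            target = reward + (0.0 if done else gamma * V[next_state])

            # TD error: how "surprised" we were by this transition.
            td_error = target - V[state]

            # Move the old estimate a small step (alpha) toward the target.
            V[state] += alpha * td_error

            state = next_state
    return V
```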
Interactive TD(0) Demo
TD(0) Value Estimation
TD(0) Update Rule
Key Insight
TD learns after every step, not waiting for episode end. It bootstraps by using V(S') as an estimate of future returns.
Watch how values update immediately after each transition. Unlike MC, TD doesn't wait for the episode to end.
The TD Error — The Core of RL
🎯 In Plain English:
δ = (What I got) - (What I expected)
Positive δ = pleasant surprise, Negative δ = disappointment
If δ > 0, the outcome was better than expected—increase V(S). If δ < 0, the outcome was worse—decrease V(S). The TD error is the core learning signal in RL.
δ > 0 (Positive Surprise)
"This was better than I thought! Maybe this state is more valuable."
δ < 0 (Negative Surprise)
"This was worse than expected! Maybe this state is less valuable."
Fun Fact: Your Brain Does This Too!
Neuroscientists discovered that dopamine neurons in your brain fire in a pattern that closely tracks the TD error: when something is better than expected, dopamine activity spikes; when it's worse, it dips. TD-style prediction-error learning may be part of how your brain actually learns from experience.
For the Curious Mind
A Nobel Prize Connection
The 2017 Nobel Memorial Prize in Economics went to Richard Thaler for behavioral economics, the study of how humans systematically deviate from "optimal" decisions. Prediction-error learning of the kind TD formalizes offers one account of where some of those "biases" come from: they may be features, not bugs, of how we learn from experience.
Why is TD Better Than MC in Practice?
TD can learn from every step, not just episode ends. Imagine learning to drive — with MC, you'd only learn after completing a whole trip. With TD, you learn from each turn, each lane change, each stop. Much faster!
TD Powers All Modern Deep RL!
DQN (Atari), A3C, PPO, SAC — every famous deep RL algorithm uses TD learning at its core. The simple idea of "update based on prediction error" scales to neural networks with millions of parameters!
Advantages of TD Prediction Methods
Why TD methods are preferred in practice.
Why is TD so popular?
TD methods have become the dominant approach in RL. They combine the best features of DP and MC while avoiding their main limitations. Let's see why!
Dynamic Programming
Full model knowledge, expected updates over all successors
Monte Carlo
Model-free, sample complete episodes, average returns
Temporal-Difference
Model-free, bootstraps from current estimate, online
| Feature | DP | MC | TD |
|---|---|---|---|
| Model-free | ❌ | ✅ | ✅ |
| Bootstraps | ✅ | ❌ | ✅ |
| Online learning | ❌ | ❌ | ✅ |
| Works with incomplete episodes | ✅ | ❌ | ✅ |
| Unbiased estimates | ✅ | ✅ | ❌ |
| Low variance | ✅ | ❌ | ✅ |
| Guaranteed convergence | ✅ | ✅ | ✅ |
DP Backup
Full-width (all successors)
MC Backup
Sample path to terminal
TD Backup
One-step sample
TD combines the best of both: model-free like MC, but bootstraps like DP for faster learning.
Advantages Over MC
- • Learn before episode ends
- • Works with continuing tasks
- • Lower variance updates
- • Often faster convergence
💡 No more waiting! Learn from every single step.
Advantages Over DP
- • No model required
- • Learn from experience
- • Focus on visited states
- • Scales to large problems
💡 Just interact with the environment — no perfect knowledge needed!
The Bias-Variance Trade-off
Monte Carlo
- ✅ Unbiased (uses actual returns)
- ❌ High variance (returns vary a lot)
TD Learning
- ❌ Biased (uses estimated V(S'))
- ✅ Lower variance (one-step estimates are more stable)
In practice, TD's lower variance often leads to faster, more stable learning despite the bias!
Optimality of TD(0)
TD converges to different solutions than MC under batch training.
TD and MC find different answers!
Here's something interesting: if you show TD and MC the same data, they might converge to different values. Why? Because they're answering slightly different questions!
Batch TD vs Batch MC
When trained repeatedly on a fixed batch of experience:
Monte Carlo
Converges to values that minimize MSE on observed returns. Best fit to the data we've seen.
💡 "What average return did I see from each state?"
TD(0)
Converges to the certainty-equivalence estimate: the correct values for the MDP that best explains the data.
💡 "What values are consistent with the transitions I've seen?"
TD exploits the Markov property: it builds an implicit model and finds values consistent with that model. MC ignores structure and just averages returns. When the MDP assumption holds, TD is more data-efficient.
An Intuitive Example
Suppose you observe eight episodes (with γ = 1): one episode goes A → B (reward 0), then B → end (reward 0); six episodes go B → end (reward 1); and one goes B → end (reward 0).
MC says V(A) = ?
The only return ever observed from A is 0, so MC sets V(A) = 0 (and V(B) = 6/8 = 0.75).
TD says V(A) = ?
Every visit to A led to B with reward 0, and the data say V(B) = 0.75, so batch TD(0) converges to V(A) = 0 + V(B) = 0.75. TD uses what it learned about B, even though the one episode that started in A ended with no reward.
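If you want to check those numbers yourself, here is a tiny Python verification of the batch setting described above. The episode encoding, step size, and sweep count are illustrative choices, and γ = 1 as in the example.

```python
# Each episode is a list of (state, reward, next_state) transitions;
# next_state None marks termination. gamma = 1 throughout.
episodes = [[("A", 0, "B"), ("B", 0, None)]]             # the one episode through A
episodes += [[("B", 1, None)]] * 6 + [[("B", 0, None)]]  # seven episodes starting in B

# Batch Monte Carlo: average the observed returns from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    G = 0
    for state, reward, _ in reversed(ep):  # accumulate the return backwards
        G += reward
        returns[state].append(G)
V_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Batch TD(0): sweep the same transitions repeatedly until values settle.
V_td = {"A": 0.0, "B": 0.0}
for _ in range(2000):
    for ep in episodes:
        for state, reward, nxt in ep:
            target = reward + (V_td[nxt] if nxt is not None else 0.0)
            V_td[state] += 0.01 * (target - V_td[state])

print(V_mc)  # roughly {'A': 0.0, 'B': 0.75}
print(V_td)  # roughly {'A': 0.75, 'B': 0.75}
```

Same data, different answers: MC fits the returns it saw, while TD fits the Markov model implied by the transitions.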
Sarsa: On-policy TD Control
Learning Q-values while following an ε-greedy policy.
TD for control — learning action values!
Now we're getting to the good stuff! SARSA applies TD learning to learn Q(s, a) — the value of taking action a in state s. This lets us improve our policy!
📐 Same Structure as TD(0):
Just like TD(0), but with Q(s,a) instead of V(s):
TD(0): V(S) ← V(S) + α[R + γV(S') - V(S)]
SARSA: Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
State, Action, Reward, next State, next Action (S, A, R, S', A') — hence the name SARSA.
Why "SARSA"?
The name comes from the sequence of things we need for the update:
S
State
A
Action
R
Reward
S'
Next State
A'
Next Action
Cliff Walking with SARSA
Watch how SARSA learns to navigate! Notice how it finds a "safe" path away from the cliff.
SARSA Update
On-policy: Learns Q for the ε-greedy policy being followed. Takes the "safe" path away from the cliff.
On-policy Behavior
SARSA learns the value of the policy it's actually following (including exploration). It learns the "safe" path because it accounts for occasional random actions near the cliff.
🎯 Key insight: SARSA is on-policy — it learns about the policy it follows, including its exploration mistakes!
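Here is a minimal tabular SARSA sketch in Python. The Gym-style reset()/step() interface and an env.actions list are assumptions for illustration; the key point is that the same ε-greedy policy both picks the actions and supplies A' in the update target.

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular SARSA: on-policy TD control with an epsilon-greedy policy."""
    Q = defaultdict(float)  # Q[(state, action)] -> value estimate

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(env.actions)                  # explore
        return max(env.actions, key=lambda a: Q[(state, a)])   # exploit

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)                   # choose A from S
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(next_state)     # choose A' from S' (same policy!)

            # On-policy target: bootstraps from the action we will actually take.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            state, action = next_state, next_action      # S <- S', A <- A'
    return Q
```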
Q-learning: Off-policy TD Control
Learning the optimal Q-function directly.
The most famous RL algorithm!
Q-learning is arguably the most important algorithm in RL history. It directly learns the optimal Q-function, regardless of what policy you follow! This is the foundation of Deep Q-Networks (DQN) that learned to play Atari games at superhuman levels.
🔑 The Key Difference from SARSA:
SARSA: ... + γQ(S', A') — uses action we actually take
Q-learning: ... + γ maxₐ Q(S', a) — uses BEST action
Q-learning always assumes we'll act optimally in the future, even if we don't!
Uses max instead of the actual next action. This learns Q* directly, regardless of the behavior policy.
The Magic of "max"
SARSA uses Q(S', A')
Uses the action we actually take next (might be exploratory)
Q-learning uses max Q(S', a)
Uses the best action, regardless of what we actually do
🎯 Result: Q-learning learns the optimal policy even while exploring randomly! That's why it's called off-policy.
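For comparison, a minimal Q-learning sketch under the same assumed environment interface as the SARSA example above. Behavior is still ε-greedy, but the target bootstraps from the best next action rather than the one that will actually be taken.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning: off-policy TD control."""
    Q = defaultdict(float)  # Q[(state, action)] -> value estimate

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave with epsilon-greedy exploration...
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done, _ = env.step(action)

            # ...but learn about the greedy policy: bootstrap from the BEST
            # next action, regardless of what we actually do next.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            state = next_state
    return Q
```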
Cliff Walking with Q-learning
Compare this to SARSA! Q-learning finds the "optimal" (but risky) path along the cliff edge.
Q-learning Update
Off-policy: Learns the optimal Q* directly. Takes the "optimal" path along the cliff edge (risky during exploration!).
SARSA (On-policy)
- • Learns Q for policy being followed
- • Takes safer path (accounts for exploration)
- • Better online performance
Best when you care about performance while learning
Q-learning (Off-policy)
- • Learns optimal Q*
- • Takes optimal path (risky during learning!)
- • Better final policy
Best when you only care about the final policy
Expected Sarsa
Taking the expected value over next actions instead of sampling.
Why not average instead of sample?
SARSA uses the actual next action A', which adds randomness. Expected Sarsa asks: "What if we used the expected value over all possible next actions?" This gives us lower variance updates!
Expected Sarsa Update
📐 What This Calculates:
The term Σ π(a|S') Q(S', a) is the expected value of the next state:
If π = ε-greedy with ε=0.1 and 4 actions:
Expected = 0.925×Q(best) + 0.025×Q(a₂) + 0.025×Q(a₃) + 0.025×Q(a₄)
Instead of using the sampled A', take the expected value over all actions weighted by their probability under the policy.
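As a concrete sketch, here is how the Expected SARSA target can be computed for an ε-greedy policy. The function name and the dictionary layout of Q are illustrative assumptions, and the target shown is for a non-terminal next state.

```python
def expected_sarsa_target(Q, next_state, actions, reward, gamma=1.0, epsilon=0.1):
    """Expected SARSA target: R + gamma * sum_a pi(a|S') Q(S', a) for epsilon-greedy pi."""
    greedy = max(actions, key=lambda a: Q[(next_state, a)])
    expected_value = 0.0
    for a in actions:
        # Every action gets epsilon/|A| probability; the greedy action
        # gets an extra (1 - epsilon) on top of that.
        prob = epsilon / len(actions) + ((1.0 - epsilon) if a == greedy else 0.0)
        expected_value += prob * Q[(next_state, a)]
    return reward + gamma * expected_value
```

With ε = 0.1 and 4 actions, this reproduces the 0.925 / 0.025 / 0.025 / 0.025 weighting shown above.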
Comparing All Three
SARSA: Uses Q(S', A') where A' is the action we actually take
Q-learning: Uses maxₐ Q(S', a) — the best action
Expected SARSA: Uses Σ π(a|S') Q(S', a) — weighted average
Lower Variance
Averaging reduces noise from action selection
Better Performance
Often outperforms both Sarsa and Q-learning
Flexible
Can be on-policy or off-policy
With a greedy target policy (π(a|s) = 1 for argmax), Expected Sarsa becomes Q-learning. Expected Sarsa generalizes both algorithms.
Maximization Bias and Double Learning
Q-learning can overestimate values due to the max operator.
There's a subtle problem with "max"!
Using max to pick the best action sounds great, but it has a hidden flaw: if our Q-value estimates have noise (they always do!), the max operation tends to pick the action with the luckiest overestimate, not the truly best action.
When we use max to both select and evaluate an action, we get upward bias. Random noise in Q estimates gets amplified by always picking the maximum.
The Problem: A Simple Example
Consider a state where all actions have true value 0, but the estimates have random error. maxₐ Q(s,a) will tend to be positive (we pick the lucky overestimate), even though the true max is 0.
Example: True values are all 0, but estimates are [-0.1, +0.2, -0.05, +0.15]. The max is +0.2, but the true max is 0. We're overestimating!
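You can see the bias with a few lines of Python: draw noisy estimates around a true value of 0 and look at the average of their maximum. The noise scale and trial count are arbitrary illustrative choices.

```python
import random

random.seed(0)
trials, num_actions, noise = 10_000, 4, 0.2

# All actions have true value 0; each estimate carries zero-mean Gaussian noise.
# The max of the noisy estimates is systematically above the true max of 0.
avg_max = sum(
    max(random.gauss(0.0, noise) for _ in range(num_actions))
    for _ in range(trials)
) / trials
print(avg_max)  # around +0.2, even though the true maximum is 0
```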
Double Q-learning Solution
Maintain two Q-functions. Use one to select the best action, the other to evaluate it:
📖 Breaking Down the Formula:
Q₁ selects the action: argmaxₐ Q₁(S', a)
Q₂ evaluates it: Q₂(S', that action)
🎯 Why Two Q-tables Fix the Bias:
Regular Q-learning: If Q₁ overestimates action A, we pick A AND use that overestimate → double trouble!
Double Q-learning: If Q₁ overestimates action A, we pick A BUT Q₂ gives an independent (likely more accurate) estimate.
Randomly update either Q₁ or Q₂ on each step. This decouples selection from evaluation and removes the upward bias.
Why This Works
Even if Q₁ overestimates an action, Q₂ has independent noise and will give a more accurate evaluation. The errors don't compound!
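A minimal sketch of the Double Q-learning update, with the helper name and dictionary-based Q-tables as illustrative assumptions (when choosing actions to act with, you would typically be ε-greedy with respect to Q₁ + Q₂):

```python
import random

def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                    actions, alpha=0.5, gamma=1.0):
    """One Double Q-learning step. Q1 and Q2 are tables such as
    collections.defaultdict(float), mapping (state, action) -> value."""
    # Flip a coin: update one table, using the other one for evaluation.
    if random.random() < 0.5:
        select, evaluate = Q1, Q2
    else:
        select, evaluate = Q2, Q1

    if done:
        target = reward
    else:
        # One table SELECTS the best next action...
        best = max(actions, key=lambda a: select[(next_state, a)])
        # ...and the other, with independent errors, EVALUATES it.
        target = reward + gamma * evaluate[(next_state, best)]

    select[(state, action)] += alpha * (target - select[(state, action)])
```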
🎯 Fun fact: Double DQN (Double Deep Q-Network) uses this idea and significantly outperforms regular DQN on Atari games!
Summary
Key takeaways from Chapter 6.
Quick Reference Glossary
Temporal-Difference (TD)
Learning from transitions using bootstrapping, no model needed
TD Error (δ)
The "surprise" signal: R + γV(S') - V(S)
TD Target
What we update toward: R + γV(S')
Bootstrapping
Using current estimates to update estimates
SARSA
On-policy TD control: State-Action-Reward-State-Action
Q-learning
Off-policy TD control using max over next actions
Expected SARSA
Average over next actions weighted by policy
Maximization Bias
Overestimation from using max on noisy estimates
Double Learning
Two Q-tables to decouple selection from evaluation
Learning Rate (α)
How big a step to take toward the target (0 to 1)
Formula Cheat Sheet
TD(0) Update (for state values): V(S) ← V(S) + α[R + γV(S') - V(S)]
SARSA (on-policy): Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
Q-learning (off-policy): Q(S,A) ← Q(S,A) + α[R + γ maxₐ Q(S',a) - Q(S,A)]
Expected SARSA: Q(S,A) ← Q(S,A) + α[R + γ Σₐ π(a|S')Q(S',a) - Q(S,A)]
🎯 Pattern: All TD methods follow: estimate ← estimate + α × (target - estimate)
Common Mistakes (Avoid These!)
Wrong: "SARSA and Q-learning always give the same result"
Right: They can differ significantly! SARSA is safer (avoids cliffs), Q-learning is more optimal but riskier during learning.
Wrong: "TD is biased so MC is always better"
Right: TD's lower variance often makes it learn faster! The bias shrinks as estimates improve.
What We Learned in This Chapter
TD learning is the foundation of modern RL! It combines the best of DP (bootstrapping) and MC (model-free learning).
- ⚡ TD(0) — Update after every step using the TD error
- 🎯 SARSA — On-policy control (learns what it follows)
- 🏆 Q-learning — Off-policy control (learns the optimal policy)
- 📊 Expected SARSA — Lower variance by averaging
- 🔄 Double Learning — Fix maximization bias
TD learning combines MC (model-free) and DP (bootstrapping)
TD(0) updates after every step using the TD error: δ = R + γV(S') - V(S)
TD has lower variance than MC but introduces bias through bootstrapping
SARSA is on-policy TD control: learns Q for the policy being followed
Q-learning is off-policy: learns the optimal Q* directly
Expected Sarsa reduces variance by taking expectation over next actions
Maximization bias affects Q-learning; Double Q-learning fixes it
TD methods are the foundation of modern deep RL algorithms
Quick Algorithm Comparison
| Algorithm | Target | Policy Type |
|---|---|---|
| SARSA | R + γQ(S', A') | On-policy |
| Q-learning | R + γ max Q(S', a) | Off-policy |
| Expected SARSA | R + γ Σ π(a)Q(S', a) | Both |
Looking Ahead
We've covered the tabular foundations of RL. The next chapters explore n-step methods that bridge TD and MC, eligibility traces for more efficient credit assignment, and eventually function approximation for scaling to large state spaces.