Multi-armed Bandits
The simplest form of RL: balancing exploration and exploitation when choosing between multiple actions with unknown rewards.
Exploration
Try new actions to learn their values
Exploitation
Choose the best known action
Value Estimation
Learn action values from experience
The k-armed Bandit Problem
A simplified RL problem focusing on a single situation with k possible actions.
Think of it like choosing restaurants!
Imagine you just moved to a new city with 10 restaurants. You don't know which ones are good. Each night, you pick a restaurant to try and get a "reward" (how much you enjoyed the meal). Your goal: maximize your total enjoyment over many dinners. Do you keep going back to the first decent place, or do you keep trying new ones?
You are faced with k different actions (like k slot machine levers). After each action, you receive a numerical reward from a stationary probability distribution. Your goal is to maximize total reward over some time period.
Why "bandit"? Slot machines are nicknamed "one-armed bandits" because they "rob" your money. Now imagine a machine with k arms (levers)!
The Slot Machine Analogy
Imagine a row of slot machines. Each machine has a different probability distribution of payouts — some pay out more on average than others. You don't know which machines pay best. How do you maximize your winnings?
Each arm has an unknown true value q*(a). Your job is to figure out which arm is best by trying them!
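To make this concrete, here is a minimal Python sketch of such a bandit, loosely following the common testbed setup where each arm's true value q*(a) is drawn from a normal distribution and a pull returns that value plus noise. The class name, arm count, and distributions are illustrative assumptions rather than part of the problem definition.

```python
import random

class KArmedBandit:
    """A k-armed bandit: each arm hides a true value q*(a) the agent must discover."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        # True action values q*(a), unknown to the agent (standard normal draws here).
        self.q_star = [self.rng.gauss(0, 1) for _ in range(k)]
        self.k = k

    def pull(self, a):
        # Reward = true value of the chosen arm plus unit-variance Gaussian noise.
        return self.rng.gauss(self.q_star[a], 1)

bandit = KArmedBandit()
print("Best arm (hidden from the agent):", max(range(bandit.k), key=lambda a: bandit.q_star[a]))
print("One noisy reward from arm 0:", round(bandit.pull(0), 2))
```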
Key Terminology (Don't Worry, It's Simple!)
q*(a): the true value of action a
In plain English: "The expected reward when I choose action a"
Restaurant example: If restaurant A's meals average 8/10 enjoyment, then q*(A) = 8.
Qt(a): your current estimate of that value at time step t
In plain English: "What I currently think action a is worth"
Restaurant example: After 3 visits to restaurant A (ratings: 7, 8, 9), your estimate Q(A) = 8.
Goal: Qt(a) → q*(a) as you gain experience
You don't know q*(a) — that's the whole point! You have to estimate it from experience. The challenge is learning accurate estimates while also maximizing reward along the way.
Action-value Methods
Methods for estimating action values and using them to make decisions.
Now that we understand the problem, how do we actually solve it? We need two things: (1) a way to estimate action values, and (2) a way to choose actions based on those estimates.
Step 1: Estimate Action Values (Sample-Average Method)
The simplest way to estimate an action's value: average all the rewards you've received from that action.
🎯 Example:
You've tried restaurant B three times with ratings: 6, 8, 7.
Your estimate: Q(B) = (6 + 8 + 7) / 3 = 7.0
Why this works: By the law of large numbers, as you try an action more times, your average will get closer and closer to the true value q*(a).
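As a tiny sketch, the sample-average estimate is nothing more than the mean of the rewards seen so far (the ratings below are the made-up restaurant B example from above):

```python
# Sample-average method: Q(a) = (sum of rewards received from a) / (times a was tried)
rewards_from_B = [6, 8, 7]                      # three visits to "restaurant B"
Q_B = sum(rewards_from_B) / len(rewards_from_B)
print(Q_B)                                      # 7.0, matching the example above
```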
Step 2: Choose Actions (The Exploration-Exploitation Dilemma Returns!)
Now you have estimates. How do you use them? Here are two main strategies:
Greedy Selection
Always pick the action with the highest estimated value.
⚠️ Problem:
If restaurant A was good on your first visit, you might never try restaurants B, C, D... even if they're actually better!
- Pure exploitation, no exploration
- Risk: may get stuck on suboptimal actions
- Fast convergence... possibly to the wrong answer
ε-Greedy Selection
• With probability (1-ε): pick the best action (exploit)
• With probability ε: pick a random action (explore)
✓ Solution:
Even if restaurant A seems best, you occasionally try random restaurants. This way, you might discover that restaurant D is actually amazing!
- Balances exploration and exploitation
- ε = 0.1 means 10% random exploration
- Guarantees you'll try all actions eventually
- ε = 0: Pure greedy (no exploration) — risky!
- ε = 0.01: Mostly exploits, occasional exploration — good for later learning
- ε = 0.1: 10% exploration — popular choice for balancing learning and performance
- ε = 1: Pure random — never uses what you've learned!
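Here is a minimal sketch of ε-greedy selection in Python. The estimates in Q and the value of ε are made up for illustration, and ties between equally good actions are broken at random:

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """Return an action index chosen epsilon-greedily from the estimates Q."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(Q))
    # Exploit: pick among the actions tied for the highest estimate.
    best = max(Q)
    return rng.choice([a for a, q in enumerate(Q) if q == best])

Q = [7.0, 6.5, 8.2, 5.9]                     # current estimates for 4 actions (made up)
print("Chosen action:", epsilon_greedy(Q, epsilon=0.1))
```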
The 10-armed Testbed
An interactive simulation comparing different action-selection strategies.
Theory is great, but let's see it in action! Below is an interactive 10-armed bandit simulation. Try different values of ε and watch how different strategies perform.
[Interactive simulation: plots of Average Reward and % Optimal Action over time for each strategy]
What Did We Learn?
Greedy (ε=0)
Gets stuck early! Finds the optimal action only ~33% of the time because it commits too quickly without exploring other options.
ε=0.1
Explores well! Finds optimal ~91% of the time. But keeps exploring even after finding the best action (wastes 10% of actions).
ε=0.01
Slower start, but eventually reaches the highest performance! Only wastes 1% on exploration once values are learned.
Early vs Late: High ε (like 0.1) learns faster initially, but low ε (like 0.01) eventually performs better because it exploits more once learning is done.
Pro tip: In practice, many algorithms start with high ε and gradually reduce it over time — getting the best of both worlds!
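One simple way to implement that pro tip is a decay schedule for ε. The exponential decay with a floor shown below is just one reasonable choice, not the only one:

```python
# Exponential decay of epsilon with a floor: explore a lot early, mostly exploit later.
eps, eps_min, decay = 1.0, 0.01, 0.995

for step in range(1, 1001):
    # ... select an action epsilon-greedily and update estimates here ...
    eps = max(eps_min, eps * decay)
    if step in (1, 100, 500, 1000):
        print(f"step {step:4d}: epsilon = {eps:.3f}")
```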
Incremental Implementation
Efficiently updating estimates without storing all past rewards.
The Memory Problem
With our sample-average method, we'd need to store every single reward we've ever received and recompute the average each time. After millions of actions, this becomes impractical!
Good news: there's a clever math trick that lets us update our estimates incrementally — we only need to store the current estimate and update it with each new reward!
Incremental Update Rule
NewEstimate = OldEstimate + StepSize × [Target - OldEstimate]
| n | Qn | Rn | Error (Rn - Qn) | Step (1/n) | Qn+1 |
|---|---|---|---|---|---|
| 1 | 0.000 | 1 | +1.000 | 1.000 | 1.000 |
| 2 | 1.000 | 3 | +2.000 | 0.500 | 2.000 |
| 3 | 2.000 | 2 | +0.000 | 0.333 | 2.000 |
| 4 | 2.000 | 4 | +2.000 | 0.250 | 2.500 |
| 5 | 2.500 | 2 | -0.500 | 0.200 | 2.400 |
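The table can be reproduced in a few lines using the incremental form Qn+1 = Qn + (1/n) × (Rn - Qn); the reward sequence is taken straight from the Rn column:

```python
# Incremental sample average: store only the current estimate Q and the count n.
Q = 0.0
rewards = [1, 3, 2, 4, 2]            # the Rn column from the table above

for n, R in enumerate(rewards, start=1):
    step = 1 / n                     # step size 1/n reproduces the exact sample average
    Q = Q + step * (R - Q)           # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    print(f"n={n}: Q = {Q:.3f}")
# Prints 1.000, 2.000, 2.000, 2.500, 2.400, matching the Qn+1 column.
```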
The General Update Rule
Target - OldEstimate
This is the error in your prediction
StepSize
How much to adjust (also called learning rate)
NewEstimate
We move toward the target by a small step
🎯 Concrete Example:
You estimate restaurant A at Q(A) = 7.0. You visit and get reward R = 9.
With step size α = 0.1:
New Q(A) = 7.0 + 0.1 × (9 - 7.0) = 7.0 + 0.2 = 7.2
Your estimate moved a little bit toward the actual reward you received!
This update rule is fundamental to almost all of reinforcement learning. You'll see it again in TD learning, Q-learning, and many other algorithms. The idea is always the same: adjust your estimate a little bit toward a target.
Tracking Nonstationary Problems
When the true action values change over time.
The Real World Keeps Changing!
What if restaurant quality changes over time? Restaurant A used to be great, but they got a new chef and now it's mediocre. Your old ratings don't reflect the current quality!
When we average all past rewards equally, old information has the same weight as new information. This is a problem when the environment changes — old rewards become misleading!
If restaurant A averaged 9/10 over your first 100 visits but has dropped to 4/10, then even after ten more visits your average is still around 8.5 — way off from the current reality!
Solution: Constant Step-Size (Weighted Recency)
Using a constant α (like 0.1) instead of 1/n gives more weight to recent rewards!
Step-size = 1/n (Sample Average)
Every reward gets equal weight. Good for stationary problems where nothing changes.
Step-size = α (Constant)
Recent rewards have more weight. Good for nonstationary problems where things change over time.
How Does This Work? Exponential Decay!
With a constant step size α, the estimate works out to a weighted average of all past rewards: Qn+1 = (1-α)^n × Q1 + Σi=1..n α(1-α)^(n-i) × Ri
🎯 What this means:
- Old rewards decay exponentially — they "fade away" over time
- The weight on reward Ri is α(1-α)^(n-i) — smaller the older the reward is
- Recent rewards have much more influence on your estimate
- Environment never changes? Use 1/n — you'll eventually converge to the true average.
- Environment might change? Use constant α — you'll adapt to changes (at the cost of never fully converging).
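Here is a small sketch comparing the two update rules on the restaurant scenario above, where quality drops from 9 to 4 partway through. The visit counts and α = 0.1 are illustrative assumptions:

```python
# Compare sample average (step 1/n) with constant step size alpha when the world changes.
alpha = 0.1
Q_avg, Q_const, n = 0.0, 0.0, 0

rewards = [9.0] * 100 + [4.0] * 20   # quality drops from 9 to 4 after visit 100

for R in rewards:
    n += 1
    Q_avg += (1 / n) * (R - Q_avg)     # every reward weighted equally
    Q_const += alpha * (R - Q_const)   # recent rewards weighted more heavily

print(f"sample average estimate: {Q_avg:.2f}")    # about 8.2, barely moved
print(f"constant-alpha estimate: {Q_const:.2f}")  # about 4.6, tracking the change
```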
Optimistic Initial Values
Encouraging exploration through initialization.
A Clever Trick: Start Optimistic!
What if we initialize all Q values to a very high number (like +5), even though actual rewards are much lower (like 0-2)? When we try any action and get a "disappointing" reward, we'll be encouraged to try other actions!
This technique is called optimistic initial values. It's a simple way to encourage exploration without using ε-greedy.
How It Works (Step by Step)
Initialize all Q(a) = +5 (very optimistic — actual rewards are 0-2)
Greedy selects action a (they all look equally great at +5)
Get reward = 1.5 (disappointing compared to our +5 estimate!)
Q(a) drops to ~3.5 — now other actions (still at +5) look better!
Greedy now tries a different action! Natural exploration emerges.
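A minimal sketch of this walkthrough, assuming a constant step size of α = 0.1 and rewards drawn uniformly between 0 and 2, far below the optimistic +5 start:

```python
import random

rng = random.Random(0)
k = 5
Q = [5.0] * k                  # optimistic initial values (real rewards are roughly 0 to 2)
alpha = 0.1

for t in range(10):
    a = Q.index(max(Q))        # pure greedy selection, no epsilon
    R = rng.uniform(0.0, 2.0)  # every reward is "disappointing" next to the +5 estimate
    Q[a] += alpha * (R - Q[a])
    print(f"t={t}: tried arm {a}, Q = {[round(q, 2) for q in Q]}")
# Each disappointment pushes the tried arm's estimate below the untried ones,
# so greedy selection cycles through all the arms early on by itself.
```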
Advantages
- Simple to implement — just change initial values
- Encourages early exploration automatically
- Can outperform ε-greedy in some cases
- No random actions — all exploration is "informed"
Limitations
- Only works at the beginning — exploration fades as estimates converge
- Bad for nonstationary problems — can't adapt to changes later
- Need to know the reward scale to set good initial values
- Doesn't provide ongoing exploration
Upper-Confidence-Bound Action Selection
Exploration based on uncertainty in value estimates.
The Problem with Random Exploration
ε-greedy explores randomly — it might try an action you've already tried 100 times instead of one you've never tried! UCB is smarter: it prioritizes actions we're uncertain about.
The key insight: actions we haven't tried much might be better than we think! UCB adds a "bonus" for uncertainty.
UCB Action Selection
At = argmax over all actions a of [ Qt(a) + c√(ln t / Nt(a)) ]
Exploitation term: Qt(a), the estimated value
Exploration bonus: c√(ln t / Nt(a)), the uncertainty/confidence bound
Arms with fewer pulls (smaller Nt(a)) have larger exploration bonuses, encouraging the agent to try them.
Understanding the UCB Formula
Qt(a)
What we know: Our current estimate of action a's value (exploitation)
c√(ln t / Nt(a))
Uncertainty bonus: Gets bigger when we haven't tried action a much (exploration)
🎯 Intuition:
- Tried action 100 times? Small bonus — we're pretty confident about its value
- Only tried action 2 times? Big bonus — it might be much better than we think!
- Over time, all actions get tried, but uncertain ones get priority
- ε-greedy: Random exploration — might waste time re-exploring well-known actions
- UCB: Directed exploration — systematically explores uncertain actions first
- Result: UCB often outperforms ε-greedy, especially early on
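A minimal UCB sketch based on the formula above, using the common convention that any never-tried action is chosen immediately; the values of Q, N, t, and c are made up for illustration:

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Return the action maximizing Q(a) + c * sqrt(ln t / N(a))."""
    # Any action never tried yet is treated as maximally uncertain and chosen first.
    for a, count in enumerate(N):
        if count == 0:
            return a
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return scores.index(max(scores))

Q = [1.2, 0.8, 1.1]    # estimated values (made up)
N = [50, 3, 20]        # how many times each action has been tried
print(ucb_action(Q, N, t=73))   # picks action 1: rarely tried, so its bonus is large
```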
Gradient Bandit Algorithms
Learning action preferences instead of values.
A Different Approach: Learn Preferences!
Instead of estimating how good each action is, what if we just learned which actions we prefer? Higher preference = higher probability of choosing that action.
Gradient bandits maintain a preference H(a) for each action. These preferences get converted to probabilities using the softmax function.
Step 1: Convert Preferences to Probabilities
πt(a) = e^(Ht(a)) / Σb e^(Ht(b))  (the softmax function)
🎯 What this means:
Actions with higher preference H(a) get higher probability of being selected. If H(a) = 2 and H(b) = 1, action a is more likely to be chosen than action b.
Step 2: Update Preferences Based on Rewards
For the action we took (At):  Ht+1(At) = Ht(At) + α × (Rt - R̄t) × (1 - πt(At))
For all other actions (a ≠ At):  Ht+1(a) = Ht(a) - α × (Rt - R̄t) × πt(a)
🎯 Breaking it down:
- R̄t = average of all rewards so far (the "baseline")
- Rt > R̄t = "better than average!" → increase preference for the action we took
- Rt < R̄t = "worse than average!" → decrease preference for the action we took
The baseline (R̄t) is crucial! Without it:
- If all rewards are positive, ALL preferences would increase
- If all rewards are negative, ALL preferences would decrease
The baseline ensures we only increase preferences for better-than-average actions and decrease for worse-than-average ones. It's about relative performance!
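Putting the pieces together, here is a small gradient-bandit sketch that applies the softmax and preference updates above, with step size α and a running average of rewards as the baseline. The arm values and parameters are illustrative assumptions:

```python
import math
import random

rng = random.Random(1)
k = 4
H = [0.0] * k                           # preferences, not value estimates
alpha = 0.1
baseline, t = 0.0, 0
true_values = [0.2, 1.0, 0.5, -0.3]     # hidden q*(a), made up for this sketch

def softmax(H):
    exps = [math.exp(h) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

for _ in range(1000):
    t += 1
    pi = softmax(H)
    a = rng.choices(range(k), weights=pi)[0]   # sample an action from the softmax probabilities
    R = rng.gauss(true_values[a], 1)           # noisy reward
    baseline += (R - baseline) / t             # running average of all rewards (the baseline)
    for b in range(k):
        if b == a:
            H[b] += alpha * (R - baseline) * (1 - pi[b])   # taken action: push up if R beat the baseline
        else:
            H[b] -= alpha * (R - baseline) * pi[b]         # other actions: push the opposite way

print([round(p, 3) for p in softmax(H)])   # probability mass should concentrate on arm 1
```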
Summary
Key takeaways from Chapter 2.
Congratulations on completing Chapter 2! You've learned the fundamentals of balancing exploration and exploitation. Here's what you should remember:
The k-armed Bandit Problem
The simplest RL problem: one state, k actions, maximize total reward by learning which action is best
Action-value Estimation
Q(a) estimates the true value q*(a) by averaging rewards — sample averages converge to true values
ε-greedy Exploration
Explore randomly with probability ε, exploit with probability 1-ε — balances learning and earning
Incremental Update Rule
NewEstimate = OldEstimate + StepSize × [Target - OldEstimate] — fundamental to all RL
Constant α for Nonstationary
Use constant step-size to give more weight to recent rewards when the environment changes
Optimistic Initialization
Start with high estimates to encourage early exploration — simple but limited to stationary problems
UCB Selection
Explore based on uncertainty, not randomness — prioritize actions we know less about
Gradient Bandits
Learn preferences instead of values — compare rewards to a baseline to update probabilities
What's Next?
In Chapter 3, we'll move beyond bandits to Markov Decision Processes (MDPs) — problems with multiple states where your actions affect not just rewards but also where you end up next. This is where RL gets really interesting!