Chapter 2

Multi-armed Bandits

The simplest form of RL: balancing exploration and exploitation when choosing between multiple actions with unknown rewards.

Exploration

Try new actions to learn their values

Exploitation

Choose the best known action

Value Estimation

Learn action values from experience

Section 2.1

The k-armed Bandit Problem

A simplified RL problem focusing on a single situation with k possible actions.

Think of it like choosing restaurants!

Imagine you just moved to a new city with 10 restaurants. You don't know which ones are good. Each night, you pick a restaurant to try and get a "reward" (how much you enjoyed the meal). Your goal: maximize your total enjoyment over many dinners. Do you keep going back to the first decent place, or do you keep trying new ones?

k-armed Bandit Problem

You are faced with k different actions (like k slot machine levers). After each action, you receive a numerical reward drawn from a stationary probability distribution that depends on the action you chose. Your goal is to maximize total reward over some time period.

Why "bandit"? Slot machines are nicknamed "one-armed bandits" because they "rob" your money. Now imagine a machine with k arms (levers)!

The Slot Machine Analogy

Imagine a row of slot machines. Each machine has a different probability distribution of payouts — some pay out more on average than others. You don't know which machines pay best. How do you maximize your winnings?

[Five slot machines — Arm 1 through Arm 5 — each labeled with an unknown true value q*(a) = ?]

Each arm has an unknown true value q*(a). Your job is to figure out which arm is best by trying them!
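
To make this concrete, here is a minimal sketch of such a bandit in Python. The Gaussian payout distributions and their means are illustrative assumptions; the essential property is just that each arm has a fixed but unknown expected reward q*(a).

```python
import random

class KArmedBandit:
    """k arms, each paying a noisy reward around a hidden mean q*(a)."""

    def __init__(self, k=5, seed=0):
        rng = random.Random(seed)
        # Hidden true values q*(a) -- the learner never gets to see these.
        self.q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, a):
        """Take action a; the reward is drawn from a fixed distribution for that arm."""
        return random.gauss(self.q_star[a], 1.0)

bandit = KArmedBandit(k=5)
print(bandit.pull(0))   # noisy -- averages out to q*(0) over many pulls
```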

Key Terminology (Don't Worry, It's Simple!)

q*(a) — True Action Value
The actual average reward you'd get if you always picked action a.
q_*(a) = \mathbb{E}[R_t \mid A_t = a]

In plain English: "The expected reward when I choose action a"

Restaurant example: If restaurant A's meals average 8/10 enjoyment, then q*(A) = 8.

Qt(a) — Estimated Value
Your current guess of what q*(a) is, based on your experience so far.

Goal: Qt(a) → q*(a) as you gain experience

In plain English: "What I currently think action a is worth"

Restaurant example: After 3 visits to restaurant A (ratings: 7, 8, 9), your estimate Q(A) = 8.

The Core Challenge

You don't know q*(a) — that's the whole point! You have to estimate it from experience. The challenge is learning accurate estimates while also maximizing reward along the way.

Section 2.2

Action-value Methods

Methods for estimating action values and using them to make decisions.

Now that we understand the problem, how do we actually solve it? We need two things: (1) a way to estimate action values, and (2) a way to choose actions based on those estimates.

Step 1: Estimate Action Values (Sample-Average Method)

The simplest way to estimate an action's value: average all the rewards you've received from that action.

Q_t(a) = \frac{\text{sum of rewards when } a \text{ was taken}}{\text{number of times } a \text{ was taken}}

🎯 Example:

You've tried restaurant B three times with ratings: 6, 8, 7.
Your estimate: Q(B) = (6 + 8 + 7) / 3 = 7.0

Why this works: By the law of large numbers, as you try an action more times, your average will get closer and closer to the true value q*(a).
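
As a minimal sketch, the sample-average estimate is just this (the ratings are the example values above):

```python
def sample_average(rewards):
    """Q(a) = (sum of rewards received from action a) / (number of times a was taken)."""
    return sum(rewards) / len(rewards)

ratings_B = [6, 8, 7]             # three visits to restaurant B
print(sample_average(ratings_B))  # 7.0 -- drifts toward q*(B) as more visits accumulate
```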

Step 2: Choose Actions (The Exploration-Exploitation Dilemma Returns!)

Now you have estimates. How do you use them? Here are two main strategies:

Greedy Selection
A_t = \underset{a}{\operatorname{argmax}}\, Q_t(a)

Always pick the action with the highest estimated value.

⚠️ Problem:

If restaurant A was good on your first visit, you might never try restaurants B, C, D... even if they're actually better!

  • Pure exploitation, no exploration
  • Risk: may get stuck on suboptimal actions
  • Fast convergence... to a possibly wrong answer
ε-Greedy Selection

• With probability (1-ε): pick the best action (exploit)

• With probability ε: pick a random action (explore)

✓ Solution:

Even if restaurant A seems best, you occasionally try random restaurants. This way, you might discover that restaurant D is actually amazing!

  • Balances exploration and exploitation
  • ε = 0.1 means 10% random exploration
  • Guarantees you'll try all actions eventually
Choosing ε (Epsilon)
  • ε = 0: Pure greedy (no exploration) — risky!
  • ε = 0.01: Mostly exploits, occasional exploration — good for later learning
  • ε = 0.1: 10% exploration — popular choice for balancing learning and performance
  • ε = 1: Pure random — never uses what you've learned!
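
Here is a minimal sketch of both selection rules in Python (the estimate values are made up; greedy ties are broken by lowest index here):

```python
import random

def greedy(Q):
    """Always pick the action with the highest current estimate."""
    return max(range(len(Q)), key=lambda a: Q[a])

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon explore (uniformly random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(Q))   # explore
    return greedy(Q)                      # exploit

Q = [0.2, 1.4, 0.9, 1.1]                  # hypothetical current estimates
print(greedy(Q))                          # always arm 1 (the current best)
print(epsilon_greedy(Q, epsilon=0.1))     # usually arm 1, occasionally any arm at random
```
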
Section 2.3

The 10-armed Testbed

An interactive simulation comparing different action-selection strategies.

Theory is great, but let's see it in action! Below is an interactive 10-armed bandit simulation. Try different values of ε and watch how different strategies perform.

[Interactive 10-armed bandit simulation (1,000 steps): plots of average reward and % optimal action for each strategy]

What Did We Learn?

Greedy (ε=0)

Gets stuck early! Finds the optimal action only ~33% of the time because it commits too quickly without exploring other options.

ε=0.1

Explores well! Finds optimal ~91% of the time. But keeps exploring even after finding the best action (wastes 10% of actions).

ε=0.01

Slower start, but eventually reaches the highest performance! Only wastes 1% on exploration once values are learned.

Key Insight from the Experiment

Early vs Late: High ε (like 0.1) learns faster initially, but low ε (like 0.01) eventually performs better because it exploits more once learning is done.

Pro tip: In practice, many algorithms start with high ε and gradually reduce it over time — getting the best of both worlds!
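
If you want to reproduce the flavor of this experiment offline, here is a compact sketch assuming the standard testbed setup (10 arms, true values drawn from a unit normal, unit-variance Gaussian rewards, sample-average estimates). It runs a single 1,000-step bandit per ε, so any one seed can deviate from the percentages quoted above, which reflect behavior averaged over many runs.

```python
import random

def run_bandit(epsilon, steps=1000, k=10, seed=0):
    """One epsilon-greedy run on a randomly generated k-armed testbed problem."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0, 1) for _ in range(k)]       # hidden true action values
    best = max(range(k), key=lambda a: q_star[a])      # the truly optimal arm
    Q = [0.0] * k                                      # estimates
    N = [0] * k                                        # pull counts
    optimal = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # explore
        else:
            a = max(range(k), key=lambda i: Q[i])      # exploit (ties -> lowest index)
        r = rng.gauss(q_star[a], 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental sample average
        optimal += (a == best)
    return optimal / steps

for eps in (0.0, 0.01, 0.1):
    print(f"epsilon={eps}: optimal arm chosen {run_bandit(eps):.0%} of the time")
```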

Section 2.4

Incremental Implementation

Efficiently updating estimates without storing all past rewards.

The Memory Problem

With our sample-average method, we'd need to store every single reward we've ever received and recompute the average each time. After millions of actions, this becomes impractical!

Good news: there's a clever math trick that lets us update our estimates incrementally — we only need to store the current estimate and update it with each new reward!

Incremental Update Rule

Q_{n+1} = Q_n + \frac{1}{n}\,[R_n - Q_n]

NewEstimate = OldEstimate + StepSize × [Target - OldEstimate]

Reward sequence: R1 = 1, R2 = 3, R3 = 2, R4 = 4, R5 = 2

n | Qn    | Rn | Error (Rn - Qn) | Step (1/n) | Qn+1
1 | 0.000 | 1  | +1.000          | 1.000      | 1.000
2 | 1.000 | 3  | +2.000          | 0.500      | 2.000
3 | 2.000 | 2  |  0.000          | 0.333      | 2.000
4 | 2.000 | 4  | +2.000          | 0.250      | 2.500
5 | 2.500 | 2  | -0.500          | 0.200      | 2.400

Current estimate: Q = 2.400 (true average of the five rewards: 2.400)
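
As a minimal sketch, the same table can be produced without ever storing the reward list, keeping only the running estimate:

```python
rewards = [1, 3, 2, 4, 2]

Q = 0.0                          # current estimate -- the only thing we need to remember
for n, R in enumerate(rewards, start=1):
    Q = Q + (1 / n) * (R - Q)    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    print(f"n={n}  R={R}  Q -> {Q:.3f}")

# Final Q is 2.400, exactly the average of the five rewards.
```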

The General Update Rule

\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \times [\text{Target} - \text{OldEstimate}]

Target - OldEstimate

This is the error in your prediction

StepSize

How much to adjust (also called learning rate)

NewEstimate

We move toward the target by a small step

🎯 Concrete Example:

You estimate restaurant A at Q(A) = 7.0. You visit and get reward R = 9.
With step size α = 0.1:
New Q(A) = 7.0 + 0.1 × (9 - 7.0) = 7.0 + 0.2 = 7.2

Your estimate moved a little bit toward the actual reward you received!

Why This Matters

This update rule is fundamental to almost all of reinforcement learning. You'll see it again in TD learning, Q-learning, and many other algorithms. The idea is always the same: adjust your estimate a little bit toward a target.

Section 2.5

Tracking Nonstationary Problems

When the true action values change over time.

Real World is Changing!

What if restaurant quality changes over time? Restaurant A used to be great, but they got a new chef and now it's mediocre. Your old ratings don't reflect the current quality!

Problem with Simple Averaging

When we average all past rewards equally, old information has the same weight as new information. This is a problem when the environment changes — old rewards become misleading!

If restaurant A was 9/10 for the first 100 visits but is now 4/10, your average is still around 8.5 — way off from the current reality!

Solution: Constant Step-Size (Weighted Recency)

Q_{n+1} = Q_n + \alpha\,[R_n - Q_n]

Using a constant α (like 0.1) instead of 1/n gives more weight to recent rewards!

Step-size = 1/n (Sample Average)

Every reward gets equal weight. Good for stationary problems where nothing changes.

Step-size = α (Constant)

Recent rewards have more weight. Good for nonstationary problems where things change over time.

How Does This Work? Exponential Decay!
Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i

🎯 What this means:

  • Old rewards decay exponentially — they "fade away" over time
  • The weight on reward Ri is α(1-α)^(n-i) — smaller as the reward gets older
  • Recent rewards have much higher influence on your estimate
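
If you want to convince yourself of this, here is a small sketch (α = 0.1 and the rewards are arbitrary) that checks two things: the weighted form above equals the result of applying the constant-α update step by step, and the weights, including the (1-α)^n placed on Q1, sum to 1.

```python
alpha, Q1 = 0.1, 0.0
rewards = [2.0, 5.0, 1.0, 4.0]     # arbitrary rewards
n = len(rewards)

# Apply the constant step-size update one reward at a time.
Q = Q1
for R in rewards:
    Q = Q + alpha * (R - Q)

# Evaluate the closed-form exponentially weighted average.
weighted = (1 - alpha) ** n * Q1 + sum(
    alpha * (1 - alpha) ** (n - i) * R for i, R in enumerate(rewards, start=1)
)

print(Q, weighted)   # same value (up to floating-point rounding): 1.0408
# The weights form a proper average: together they sum to 1.
print((1 - alpha) ** n + sum(alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)))
```
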
When to Use Which
  • Environment never changes? Use 1/n — you'll eventually converge to the true average.
  • Environment might change? Use constant α — you'll adapt to changes (at the cost of never fully converging).
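
To see the difference in practice, here is a small sketch with made-up numbers: an arm pays about 9 for 100 steps, then drops to about 4 (the new chef). The sample average ends up averaging the two eras, while the constant-α estimate settles near the new value.

```python
import random

random.seed(1)
true_value = 9.0               # restaurant A's current quality
Q_avg, n = 0.0, 0              # sample-average estimate (step size 1/n)
Q_const, alpha = 0.0, 0.1      # constant step-size estimate

for t in range(1, 201):
    if t == 101:
        true_value = 4.0       # new chef: quality drops
    R = random.gauss(true_value, 0.5)
    n += 1
    Q_avg += (R - Q_avg) / n           # every reward weighted equally
    Q_const += alpha * (R - Q_const)   # recent rewards weighted more heavily

print(f"sample average: {Q_avg:.2f}   constant alpha: {Q_const:.2f}")
# The sample average lands around 6.5 (stuck between the two eras);
# the constant-alpha estimate settles near the new value of 4.
```
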
Section 2.6

Optimistic Initial Values

Encouraging exploration through initialization.

A Clever Trick: Start Optimistic!

What if we initialize all Q values to a very high number (like +5), even though actual rewards are much lower (like 0-2)? When we try any action and get a "disappointing" reward, we'll be encouraged to try other actions!

This technique is called optimistic initial values. It's a simple way to encourage exploration without using ε-greedy.

How It Works (Step by Step)

1

Initialize all Q(a) = +5 (very optimistic — actual rewards are 0-2)

2

Greedy selects action a (they all look equally great at +5)

3

Get reward = 1.5 (disappointing compared to our +5 estimate!)

4

Q(a) drops (to ~3.5, say, depending on the step size) — now other actions (still at +5) look better!

5

Greedy now tries a different action! Natural exploration emerges.
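
A minimal sketch of this walkthrough, with assumed numbers: three arms whose rewards land roughly between 0 and 2, estimates initialized to +5, a constant step size α = 0.1, and purely greedy selection. Each arm disappoints in turn, so the greedy choice rotates through all of them early on.

```python
import random

random.seed(0)
q_star = [1.0, 1.5, 2.0]      # hidden true values; rewards land roughly in 0-2
Q = [5.0, 5.0, 5.0]           # wildly optimistic initial estimates
alpha = 0.1

for step in range(10):
    a = max(range(3), key=lambda i: Q[i])    # pure greedy -- no epsilon at all
    R = random.gauss(q_star[a], 0.2)         # "disappointing" next to a +5 estimate
    Q[a] += alpha * (R - Q[a])               # the chosen arm's estimate drops
    print(f"step {step}: chose arm {a}, Q = {[round(q, 2) for q in Q]}")
# Every arm gets tried early on, simply because each one disappoints in turn.
```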

Advantages
  • Simple to implement — just change initial values
  • Encourages early exploration automatically
  • Can outperform ε-greedy in some cases
  • No random actions — all exploration is "informed"
Limitations
  • Only works at the beginning — exploration fades as estimates converge
  • Bad for nonstationary problems — can't adapt to changes later
  • Need to know the reward scale to set good initial values
  • Doesn't provide ongoing exploration
Section 2.7

Upper-Confidence-Bound Action Selection

Exploration based on uncertainty in value estimates.

The Problem with Random Exploration

ε-greedy explores randomly — it might try an action you've already tried 100 times instead of one you've never tried! UCB is smarter: it prioritizes actions we're uncertain about.

The key insight: actions we haven't tried much might be better than we think! UCB adds a "bonus" for uncertainty.

UCB Action Selection

A_t = \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
Arm | N(a) | Q(a) | Bonus | UCB
 1  |  5   | 1.20 | 1.55  | 2.75
 2  |  3   | 0.80 | 2.00  | 2.80
 3  |  8   | 1.50 | 1.22  | 2.72
 4  |  2   | 0.50 | 2.45  | 2.95
 5  |  2   | 1.00 | 2.45  | 3.45  ← selected

Exploitation term Q(a): the estimated value.

Exploration bonus: the uncertainty (confidence-bound) term.

Arms with fewer pulls (smaller N) have larger exploration bonuses, encouraging the agent to try them.

Understanding the UCB Formula

A_t = \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]

Qt(a)

What we know: Our current estimate of action a's value (exploitation)

c√(ln t / Nt(a))

Uncertainty bonus: Gets bigger when we haven't tried action a much (exploration)

🎯 Intuition:

  • Tried action 100 times? Small bonus — we're pretty confident about its value
  • Only tried action 2 times? Big bonus — it might be much better than we think!
  • Over time, all actions get tried, but uncertain ones get priority
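
Below is a minimal sketch of UCB selection. Plugging in the counts and estimates from the example table above with c = 2 and t = 20 (both values inferred here for illustration) reproduces the bonus and UCB scores to two decimal places.

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Pick the arm maximizing Q(a) + c * sqrt(ln(t) / N(a)).
    (Arms with N(a) == 0 are normally treated as maximal and tried first.)"""
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    best = max(range(len(Q)), key=lambda a: scores[a])
    return best, scores

# Estimates and counts from the example above; c = 2 and t = 20 are assumptions.
Q = [1.20, 0.80, 1.50, 0.50, 1.00]
N = [5, 3, 8, 2, 2]
arm, scores = ucb_select(Q, N, t=sum(N), c=2.0)
print([round(s, 2) for s in scores])   # [2.75, 2.8, 2.72, 2.95, 3.45]
print("selected:", f"arm {arm + 1}")   # arm 5: only a middling estimate, but high uncertainty
```
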
UCB vs ε-greedy
  • ε-greedy: Random exploration — might waste time re-exploring well-known actions
  • UCB: Directed exploration — systematically explores uncertain actions first
  • Result: UCB often outperforms ε-greedy, especially early on
Section 2.8

Gradient Bandit Algorithms

Learning action preferences instead of values.

A Different Approach: Learn Preferences!

Instead of estimating how good each action is, what if we just learned which actions we prefer? Higher preference = higher probability of choosing that action.

Gradient bandits maintain a preference H(a) for each action. These preferences get converted to probabilities using the softmax function.

Step 1: Convert Preferences to Probabilities

\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}

🎯 What this means:

Actions with higher preference H(a) get higher probability of being selected. If H(a) = 2 and H(b) = 1, action a is more likely to be chosen than action b.

Step 2: Update Preferences Based on Rewards

For the action we took:

H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))

For all other actions:

H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\,\pi_t(a) \quad \text{for } a \neq A_t

🎯 Breaking it down:

  • R̄t = the average of all rewards so far (the "baseline")
  • Rt > R̄t: "better than average!" → increase the preference for the action we took
  • Rt < R̄t: "worse than average!" → decrease the preference for the action we took
Why Use a Baseline?

The baseline (R̄t) is crucial! Without it:

  • If all rewards are positive, ALL preferences would increase
  • If all rewards are negative, ALL preferences would decrease

The baseline ensures we only increase preferences for better-than-average actions and decrease for worse-than-average ones. It's about relative performance!
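
A minimal sketch of the whole loop, assuming a small 3-armed problem, α = 0.1, and a running average of rewards as the baseline; it applies the softmax and both preference updates exactly as written above. The arm values and step count are illustrative.

```python
import math
import random

def softmax(H):
    """Convert preferences H(a) into selection probabilities pi(a)."""
    exps = [math.exp(h) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
q_star = [0.2, 1.0, 0.5]    # hidden true values (illustrative)
H = [0.0, 0.0, 0.0]         # preferences: start out indifferent
alpha = 0.1
baseline, t = 0.0, 0        # running average of all rewards so far

for _ in range(500):
    pi = softmax(H)
    A = random.choices(range(3), weights=pi)[0]   # sample an action from pi
    R = random.gauss(q_star[A], 0.5)
    t += 1
    baseline += (R - baseline) / t                # incremental average (the baseline)
    for a in range(3):
        if a == A:
            H[a] += alpha * (R - baseline) * (1 - pi[a])  # chosen action: pushed up if R beat the baseline
        else:
            H[a] -= alpha * (R - baseline) * pi[a]        # other actions: pushed the opposite way

print([round(p, 2) for p in softmax(H)])   # the best arm (index 1) should end up most probable
```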

Section 2.10

Summary

Key takeaways from Chapter 2.

Congratulations on completing Chapter 2! You've learned the fundamentals of balancing exploration and exploitation. Here's what you should remember:

1

The k-armed Bandit Problem

The simplest RL problem: one state, k actions, maximize total reward by learning which action is best

2

Action-value Estimation

Q(a) estimates the true value q*(a) by averaging rewards — sample averages converge to true values

3

ε-greedy Exploration

Explore randomly with probability ε, exploit with probability 1-ε — balances learning and earning

4

Incremental Update Rule

NewEstimate = OldEstimate + StepSize × [Target - OldEstimate] — fundamental to all RL

5

Constant α for Nonstationary

Use constant step-size to give more weight to recent rewards when the environment changes

6

Optimistic Initialization

Start with high estimates to encourage early exploration — simple but limited to stationary problems

7

UCB Selection

Explore based on uncertainty, not randomness — prioritize actions we know less about

8

Gradient Bandits

Learn preferences instead of values — compare rewards to a baseline to update probabilities

What's Next?

In Chapter 3, we'll move beyond bandits to Markov Decision Processes (MDPs) — problems with multiple states where your actions affect not just rewards but also where you end up next. This is where RL gets really interesting!