Baby Steps to Epsilon Greedy, UCB1 and Thompson Sampling — Introduction to Reinforcement Learning

Alex Yeo
Nov 29, 2020


Introduction to:
- Epsilon Greedy
- UCB1 (Explanation + Proof)
- Thompson Sampling — Bayesian Bandit (Explanation + Proof)

What is reinforcement learning (RL) and why is it needed? These are the questions we need to ask ourselves when we plan an RL solution. One of the reasons is that an RL model can adapt to its current environment faster thanks to the balance between exploration and exploitation.

To demonstrate the concept of exploration and exploitation, we will use the classical multi-armed bandit problem as our example. For those who are not familiar with multi-armed bandits, let me give you a real-life example — the English Premier League. There are many teams in the league and you have 100 dollars to bet. Which team will you bet on, assuming the reward is the same for each team?

The first thing you would do is calculate the winning rate of each team in the league and place the bet on the team with the highest winning rate. Right?

Now suppose you are lazy: you only want to update the winning rate of one team each time you do the calculation.

Which team's winning rate should you update?

Logically, we would update the team with the highest winning rate (team A), and this is known as exploitation. Yet we are also curious about the winning rates of the other teams (team B, C, D, etc.). Therefore, we will update the winning rate of another team after we have updated the highest-winning-rate team (team A) 10 times. This is known as exploration.

Say team D is randomly selected to have its winning rate updated after 10 rounds of updating team A's winning rate.

def update(self, x):
    self.N += 1.0
    self.previous_winning_rate = ((self.N - 1) * self.previous_winning_rate + x) / self.N

Epsilon Greedy
The update strategy described above is better known as the Epsilon Greedy approach. Basically, (self.N - 1) * self.previous_winning_rate recovers the historical number of wins. The formula is very straightforward: we take the historical win count, add the win/loss result (x) of this round, and then divide by the total number of calculations for this team to get the updated winning rate.
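Putting the update rule into a full epsilon-greedy loop, a minimal sketch might look like the following. Note that the Team class, the true winning probabilities, and epsilon = 0.1 (i.e. exploring about 10% of the time instead of the fixed "1 exploration per 10 exploitations" schedule above) are assumptions for illustration, not part of the original example.

```python
import random

class Team:
    def __init__(self, true_win_prob):
        self.true_win_prob = true_win_prob   # hidden "real" winning rate
        self.previous_winning_rate = 0.0     # our running estimate
        self.N = 0.0                         # number of updates so far

    def play(self):
        # 1 if the team wins this match, 0 otherwise
        return 1 if random.random() < self.true_win_prob else 0

    def update(self, x):
        self.N += 1.0
        self.previous_winning_rate = ((self.N - 1) * self.previous_winning_rate + x) / self.N

def epsilon_greedy(teams, epsilon=0.1, trials=10_000):
    for _ in range(trials):
        if random.random() < epsilon:
            team = random.choice(teams)  # exploration: pick any team at random
        else:                            # exploitation: pick the current best team
            team = max(teams, key=lambda t: t.previous_winning_rate)
        team.update(team.play())
    return teams

random.seed(0)
teams = epsilon_greedy([Team(0.3), Team(0.5), Team(0.7)])
```

After enough trials, the team with the highest true winning rate usually accumulates the most updates, and its estimated winning rate approaches the true one.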

This is a good method, but we need to predefine the exploration schedule. In our example, we defined exploration to start after 10 rounds of exploitation. If we don't want to set the exploration schedule ourselves, we can use UCB1.

In UCB1, we update the team with the highest winning rate plus an error term:

winning_rate + sqrt(2 * log(total_trial) / team_trial)

From this formula, we know that the error term = sqrt(2 * log(total_trial) / team_trial), where:

team_trial = number of winning-rate updates for this team
total_trial = number of winning-rate updates across all teams

And from the formula, we notice that as team_trial increases, the error term decreases. This makes sense: the more times a team has been played and updated, the smaller the uncertainty in its estimated winning rate. This bound can be derived from Hoeffding's Inequality.
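Here is a minimal sketch of a UCB1 loop using the formula above. The Team class, the true winning probabilities, and the 10,000-trial budget are assumptions for illustration; each team is played once first so team_trial is never zero in the formula.

```python
import math
import random

class Team:
    def __init__(self, true_win_prob):
        self.true_win_prob = true_win_prob  # hidden "real" winning rate
        self.winning_rate = 0.0             # estimated winning rate
        self.N = 0.0                        # team_trial: plays for this team

    def play(self):
        # 1 if the team wins this match, 0 otherwise
        return 1 if random.random() < self.true_win_prob else 0

    def update(self, x):
        self.N += 1.0
        self.winning_rate = ((self.N - 1) * self.winning_rate + x) / self.N

def ucb1(teams, trials=10_000):
    # play every team once so team_trial (t.N) is never zero
    for team in teams:
        team.update(team.play())
    for total_trial in range(len(teams) + 1, trials + 1):
        # pick the team with the highest winning_rate + error term
        team = max(teams, key=lambda t:
                   t.winning_rate + math.sqrt(2 * math.log(total_trial) / t.N))
        team.update(team.play())
    return teams

random.seed(1)
teams = ucb1([Team(0.3), Team(0.5), Team(0.7)])
```

Notice there is no predefined exploration schedule: a rarely played team has a large error term, so it eventually gets picked, while frequently played teams are chosen mostly on their winning rate.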

Thompson Sampling (Bayesian Bandit)

For Thompson Sampling, the update is very simple: we just aggregate the success and failure counts so far and build a Beta distribution from them. To pick a team, we draw one sample from each team's Beta distribution and select the team with the highest sampled winning rate. Even though the update is very simple, the concept behind it is elegantly designed.
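A minimal Thompson Sampling sketch could look like the following. The Team class, the true winning probabilities, and the uniform Beta(1, 1) prior are assumptions for illustration.

```python
import random

class Team:
    def __init__(self, true_win_prob):
        self.true_win_prob = true_win_prob  # hidden "real" winning rate
        self.successes = 1  # Beta parameter a, starting from a uniform prior
        self.failures = 1   # Beta parameter b, starting from a uniform prior

    def play(self):
        # 1 if the team wins this match, 0 otherwise
        return 1 if random.random() < self.true_win_prob else 0

    def sample(self):
        # draw a plausible winning rate from this team's Beta distribution
        return random.betavariate(self.successes, self.failures)

    def update(self, x):
        # the whole update: just aggregate success and failure counts
        if x:
            self.successes += 1
        else:
            self.failures += 1

def thompson_sampling(teams, trials=10_000):
    for _ in range(trials):
        # bet on the team whose sampled winning rate is the highest
        team = max(teams, key=lambda t: t.sample())
        team.update(team.play())
    return teams

random.seed(2)
teams = thompson_sampling([Team(0.3), Team(0.5), Team(0.7)])
```

A team with few observations has a wide Beta distribution, so it occasionally samples a high winning rate and gets explored; as its counts grow, the distribution narrows around its true rate, and exploration fades out naturally.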

These are methods that can be used for exploration. Though these methods are simple, they can be applied in many more sophisticated algorithms such as Q-Learning.

Hopefully, this blog helps you understand the logic behind Epsilon Greedy, UCB1 and Thompson Sampling. Please share and like if you feel this blog can help others. Your help is much appreciated. Thanks!