Continuous Adaptive Learning — Reinforcement Learning with Normal Distribution
As humans, we constantly adjust ourselves to adapt to new environments. One example is adapting to seasonality: during summer we go swimming or surfing, while during winter we go skating or snowboarding. By adapting to the new environment, we are able to survive or find happiness.
Similarly, we want our products to adapt to new environments in order to generate more income or serve customers better. For example, we want our website to automatically sell sweaters during winter and swimsuits during summer. This sounds good, but how can we achieve it?
Thanks to reinforcement learning, our products can learn to adapt to a new environment. The concept of reinforcement learning is simple and has been well explained in this blog. Let's quickly go through the concept of reinforcement learning by covering these terms: state, action, reward, policy, and Q-table.
Policy : consists of a state and an action. For example, we are selling swimsuits today but want to change to selling sweaters tomorrow; or we are selling swimsuits today and want to keep selling swimsuits tomorrow. Let's define this in array form: the first case is written as [swimsuit, change, sweater] and the second as [swimsuit, keep, swimsuit]. After writing the two cases in array form, we can identify the states and actions. In our case, swimsuit and sweater are the states, while "keep" and "change" are the actions.
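To make this concrete, here is a minimal sketch of the two policies in array form (a small illustration of the definition above, not code from the original post):

# a policy written as [current_state, action, next_state]
policy_change = ["swimsuit", "change", "sweater"]
policy_keep = ["swimsuit", "keep", "swimsuit"]

states = ["swimsuit", "sweater"]
actions = ["keep", "change"]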
Reward : the score given to a policy. A positive score is given if the policy increases our sales or customer satisfaction; otherwise, a negative score is given.
from statistics import mean

# historical daily sales used as the baseline
historical_sales = [100, 90, 120, 110, 80, 100, 70]
sales_avg = mean(historical_sales)
today_sales = 150
reward = 0
print("average sales : ", sales_avg)

# reward is +1 if today's sales beat the historical average, otherwise -1
if today_sales > sales_avg:
    reward = 1
else:
    reward = -1
print("reward :", reward)
Q-table : consists of a list of policies. The reinforcement learning algorithm will choose the best policy among the policies in the Q-table.
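In code, one simple way to hold the Q-table is a nested dictionary keyed by state and action. This is a minimal sketch with illustrative starting scores (an assumption for this example, not taken from the original post):

# Q[state][action] = current score of that policy
Q = {
    "swimsuit": {"keep": 0.0, "change": 0.0},
    "sweater": {"keep": 0.0, "change": 0.0},
}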
As we can see from the Q-table, we want to choose the best policy when the season transitions from winter to summer. If we implement the reinforcement learning algorithm correctly, the best policy will change from [sweater, keep, sweater] to [swimsuit, keep, swimsuit]. This process of choosing the best policy is known as exploitation. However, in a reinforcement learning algorithm, we also want to introduce some randomness to explore opportunities. For example, we could try selling sweaters during summer; our sales might increase because sweater prices drop during summer due to low demand. Similarly, we could try selling swimsuits during winter because swimsuit prices drop due to low demand. This process of trying other actions is known as exploration.
Exploitation
So now we know that there are two processes (exploitation and exploration) in reinforcement learning. In the exploitation process, we update the score of the chosen policy using the formula below.
Q[s][a] = Q[s][a] + ALPHA * (reward + GAMMA * max(Q[s_next][A] for all actions A) - Q[s][a])
Let's assume we have chosen to change from selling sweaters to swimsuits and we have increased today's sales. This means we can fill in the formula with a positive reward due to the increase in sales.
ALPHA = 0.2  # how fast you want to update your chosen policy
GAMMA = 0.1  # how much information you want to borrow from the next state when scoring the current state
reward = 1   # positive reward (e.g. +1) because today's sales increased
Q["sweater"]["change"] = Q["sweater"]["change"] + ALPHA * (reward + GAMMA * max(Q["swimsuit"]["keep"], Q["swimsuit"]["change"]) - Q["sweater"]["change"])
Once we update the policy Q["sweater"]["change"], we need to choose what to sell tomorrow. We now have two options: keep selling swimsuits, Q["swimsuit"]["keep"], or change to selling sweaters, Q["swimsuit"]["change"]. This is the exploration part. The objective is to maximize our sales by choosing the best policy as often as possible while still giving other policies a chance to be tried. There are a few exploration strategies, such as epsilon-greedy, Thompson sampling, and the normal distribution. For this example, we will go through the technique of using a normal distribution of rewards to choose the product to sell tomorrow.
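For comparison, epsilon-greedy exploration simply picks a random action with a small probability epsilon and otherwise picks the action with the highest Q score. A rough sketch, assuming the Q-table above (the epsilon value is an arbitrary choice for illustration):

import random

EPSILON = 0.1  # probability of exploring a random action

def epsilon_greedy(Q, state, epsilon=EPSILON):
    if random.random() < epsilon:
        return random.choice(list(Q[state].keys()))  # explore: random action
    return max(Q[state], key=Q[state].get)           # exploit: best-scoring action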
Exploration
For the normal distribution technique, we will look into the rewards assigned to each (state, action) pair.
# reward[state][action] = list of rewards observed for that (state, action) pair
reward = {"swimsuit": {}, "sweater": {}}
reward["swimsuit"]["keep"] = [0, 1, 1, 1, 1, 1, 1, -1]
reward["swimsuit"]["change"] = [0, -1, -1, 1, -1]
reward["sweater"]["keep"] = [0, -1, -1, -1]
reward["sweater"]["change"] = [0, 1, 1, 1, 1]
For the rewards of the (state, action) pairs, we will calculate the mean and standard deviation. The distribution will be similar to the chart below.
As we can see from the chart, the reward distribution of the policy to change and sell sweaters is shifting toward a reward of -1. This is because that policy reduces the company's sales compared to historical sales. On the other hand, the reward distribution of the policy to keep selling swimsuits is shifting toward a reward of +1. This is in line with our expectation, as we expect swimsuit sales to increase during summer.
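As a concrete illustration, here is a minimal sketch that computes the mean and standard deviation of each policy's rewards from the reward lists above (using numpy; the variable names are our own):

import numpy as np

# mean and standard deviation of the observed rewards for each (state, action) pair
mean_reward = {s: {a: np.mean(r) for a, r in acts.items()} for s, acts in reward.items()}
std_reward = {s: {a: np.std(r) for a, r in acts.items()} for s, acts in reward.items()}

print(mean_reward["swimsuit"]["keep"], std_reward["swimsuit"]["keep"])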
Since we have the reward distributions of the two policies, we can draw one sample from each reward distribution. By comparing the samples from the two distributions, we can decide on the policy for tomorrow's sale.
import numpy as np

# example: draw one sample from each reward distribution
mean = {"swimsuit": {"keep": 0.5, "change": 0.5}}
std = {"swimsuit": {"keep": 0.05, "change": 0.05}}

policy_change = np.random.normal(mean["swimsuit"]["change"], std["swimsuit"]["change"])
policy_keep = np.random.normal(mean["swimsuit"]["keep"], std["swimsuit"]["keep"])

# the policy whose sample is higher decides what we sell tomorrow
next_state = ""
if policy_change > policy_keep:
    next_state = "sweater"
else:
    next_state = "swimsuit"
If the sample drawn from the reward distribution of the policy to change to selling sweaters is higher than the sample drawn from the reward distribution of the policy to keep selling swimsuits, we will start selling sweaters tomorrow. This is the mechanism we use for exploration.
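Putting both processes together, one pass of the adaptive loop could look like the sketch below. It reuses the illustrative Q-table, reward lists, and update_q helper from the earlier sketches and is only meant to show the flow, not the exact implementation behind this post:

import numpy as np

def choose_tomorrow(reward, state):
    # exploration: sample each action's reward distribution and pick the larger sample
    samples = {a: np.random.normal(np.mean(r), np.std(r)) for a, r in reward[state].items()}
    action = max(samples, key=samples.get)
    next_state = state if action == "keep" else ("sweater" if state == "swimsuit" else "swimsuit")
    return action, next_state

# exploitation: score today's policy, then exploration: pick tomorrow's product
update_q(Q, "sweater", "change", "swimsuit", reward=1)
action, next_state = choose_tomorrow(reward, "swimsuit")
print("tomorrow we sell :", next_state)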
We have now walked through both the exploitation and exploration processes. Please give us a clap if you found this explanation helpful! See you in our next blog.