Create an A/B Test from Scratch with Confidence Using the P-value: A Fair Experimentation Platform

Alex Yeo
8 min read · Aug 11, 2023


Who will win?

How can we tell whether the new value added to a product actually comes from a change we made? The change could be an increase in the size of the "Buy" button on a website, or a replacement of the product pictures. All of these changes are arbitrary. Some people will argue that they bring profit, because customers find the new look refreshing and buy the product; others will argue that they bring losses, because customers cannot adapt to the new look and stop purchasing. Therefore, we need a mechanism to determine the impact of the change. If the change brings a positive impact, we are all happy with it; if it brings a negative impact, we need to revert it quickly and conduct further analysis. This is crucial because a negative impact could cost us money, increase logged complaints, and reduce customer satisfaction.

The solution is to set up an A/B test for the change we have made. The "A" in A/B test refers to the current implementation (before the change), while the "B" refers to the new implementation (after the change). Essentially, we compare the value before and after the change. Now that we know the meaning and purpose of an A/B test, the next step is to implement one. There are a few steps involved; don't worry, we will go through them one by one with brief explanations.

  • Determine the evaluation metric for the A/B test.
  • Decide the A/B group splitting/allocation factor.
  • Conduct an AA analysis to make sure the splitting/allocation is fair.
  • Calculate the confidence of the A/B test using the z-score/p-value.

Determine the evaluation metric for the A/B test

For this step, we need to define a metric relevant to our change. Usually, this metric is tied to business value. For example, on a hotel/homestay booking platform, the number of bookings could be the evaluation metric; on a property-selling platform, it could be the number of clicks. Most of the time, the Product Owner works with the team to determine the evaluation metric. For illustration purposes, let us take the number of bookings as our evaluation metric.

Decide the A/B traffic splitting/allocation factor

For this step, we need a way to split customers into two groups: one allocated to the A side (current implementation) and one to the B side (new implementation with the bigger button). There are multiple ways to do this. We can split by customer ID: customers with an ODD ID are allocated to group A, while customers with an EVEN ID are allocated to group B. Then we take the difference in bookings between group A (ODD IDs) and group B (EVEN IDs) to calculate the additional value contributed by group B. However, some customers might not have customer IDs because they have not registered as members of the platform. In that case, we can use time-based allocation: customers who book on our website during an ODD hour form group A, and customers who book during an EVEN hour form group B. In addition, we can make the allocation less biased by assigning bookings on (ODD day, ODD hour) and (EVEN day, EVEN hour) to group A, and bookings on (ODD day, EVEN hour) and (EVEN day, ODD hour) to group B, as shown in the chart below.

Booking Difference between A and B
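
As a minimal sketch of the two allocation rules above (the function names and inputs are illustrative, not part of any production system), the splits can be expressed like this:

def allocate_by_customer_id(customer_id: int) -> str:
    # ID-based split: ODD IDs go to group A, EVEN IDs go to group B.
    return "A" if customer_id % 2 == 1 else "B"

def allocate_by_time(day: int, hour: int) -> str:
    # Time-based split: (ODD day, ODD hour) and (EVEN day, EVEN hour)
    # go to group A; the two mixed-parity combinations go to group B.
    return "A" if day % 2 == hour % 2 else "B"

print(allocate_by_customer_id(1024))      # B
print(allocate_by_time(day=3, hour=17))   # A (odd day, odd hour)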

Conduct an AA analysis to prove the splitting/allocation is fair

As mentioned earlier, we need to make sure our allocation is unbiased/fair when using time-based allocation. We can achieve this by conducting an AA analysis. In an AA analysis, we split the bookings into two groups, and both groups use the current implementation, with no change made to the website. Let us call the bookings on (ODD day, ODD hour) and (EVEN day, EVEN hour) A1, and the bookings on (ODD day, EVEN hour) and (EVEN day, ODD hour) A2. We expect the bookings made in A1 and A2 to be approximately the same; in other words, the booking difference between A1 and A2 should be approximately 0.

Booking difference between A1 and A2 (AA test)

We can verify this using historical bookings. The code below calculates the booking differences (BD) and their standard deviation (STD) using one year of data. Since each booking difference is calculated over a 28-day window, one year of data gives us 12 data points from 12 AA tests. For example, BD1 in the chart below represents data point 1, derived as the difference between the sum of A1 and the sum of A2 (sum A1 - sum A2) over 28 days (4 weeks).

Booking Differences from 12 AA tests

Based on the 12 data points, the mean and standard deviation of the booking differences are 18 and 138 respectively. As mentioned earlier, we want the mean of the booking differences to be close to 0, and we got 18. In our case, we can claim that 18 is very close to 0, because it is the mean difference between the A1 and A2 bookings, each of which totals close to 50k bookings per window.
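
A quick back-of-the-envelope check makes this concrete (assuming roughly 50,000 bookings per group per 28-day window, as stated above):

mean_diff = 18           # mean of the 12 booking differences
group_size = 50_000      # approximate bookings per group per window
print(f"relative bias: {mean_diff / group_size:.4%}")   # ~0.0360%

A systematic bias of about 0.036% of a group's volume is negligible, so the time-based split looks fair.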

We chose 28 days because that is the duration of one A/B test cycle. We could choose a different duration; however, the shorter the duration, the higher the standard deviation relative to the effect we are trying to detect. And the higher the standard deviation, the harder it is for us to claim that the B side is winning (i.e., to reject the null hypothesis). We can show this with statistics, and it will be explained in step 4; a quick numerical check also follows the code below.

import pandas as pd
import numpy as np
from scipy import stats as stat

booking_path = r"hourly_scale_bookings.csv"

booking_content = pd.read_csv(booking_path)
# the content is in this format
# index  datadate  hour  bookings
# 3254   20220302  0     94.0
# 3336   20220302  1     79.0
# 825    20220302  2     73.0
# 566    20220302  3     73.0
# 5850   20220302  4     69.0

booking_content_sort = booking_content.sort_values(by=['datadate', 'hour'])
booking_content_sort = booking_content_sort.reset_index()

print(booking_content_sort.head())
print(booking_content_sort.dtypes)


def calculate_std(bookings, time_interval, exp_time):
    '''
    time_interval : how many hours between each A/B flip
    exp_time      : how many days one experiment (one data point) lasts
    '''
    day = 0
    bookings_diff = []
    AB_bookings = []
    a_sum = 0
    b_sum = 0
    off_site = 0   # flipped daily so the hour parity alternates between days
    a_hour = 0
    b_hour = 0
    for index, row in bookings.iterrows():
        hour = int(row["hour"])
        booking_counts = int(row["bookings"])
        if (int(hour / time_interval) + off_site) % 2 == 0:  # time_interval = flipping hours
            b_sum += booking_counts
            b_hour += 1
        else:
            a_sum += booking_counts
            a_hour += 1
        if hour == 23:                 # end of a day
            day += 1
            off_site = (off_site + 1) % 2
            if day == exp_time:        # end of one experiment window
                day = 0
                bookings_diff.append(b_sum - a_sum)
                AB_bookings.append([b_sum, a_sum])
                b_sum = 0
                a_sum = 0
    bookings_diff = np.array(bookings_diff)
    std = np.std(bookings_diff)
    return bookings_diff.mean(), std, bookings_diff, AB_bookings


mean, std, bookings_diff, AB_bookings = calculate_std(booking_content_sort, 1, 28)
print('mean:', mean, 'std:', std)


# calculate z-score for the observed lift
# z = (x - mean) / std, where x = 501 is the observed booking
# difference after the 28-day A/B test
z = (501 - mean) / std
print(z, mean, std)

# calculate p_value, right-tail test
p_value = 1.0 - stat.norm(0, 1).cdf(z)
print(p_value)
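
As a quick sketch of the duration claim made earlier (the exact numbers depend on your own booking data, and this assumes day-to-day noise is roughly independent), we can rerun calculate_std with shorter windows:

# Shorter windows accumulate less signal relative to noise: the expected
# lift grows with the number of days, while the std only grows with its
# square root, so the per-day std rises as the windows shrink.
for exp_days in (7, 14, 28):
    m, s, diffs, _ = calculate_std(booking_content_sort, 1, exp_days)
    print(f"{exp_days}-day cycles: mean diff = {m:.1f}, "
          f"std = {s:.1f}, std per day = {s / exp_days:.1f}")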

Calculate the confidence of the A/B test using the z-score/p-value

By now, we have verified that the time-based allocation is unbiased. Next, we run the A/B test for 4 weeks. Let's refresh our memory: during (ODD day, ODD hour) and (EVEN day, EVEN hour) we use the current button size, while during (ODD day, EVEN hour) and (EVEN day, ODD hour) we use the "BIG" button. To determine whether the change on the B side (the increased button size) brings significant additional value to the booking platform, we need to see a large increase in bookings on the B side. Assume we observe roughly 500 additional bookings (501, to be exact, as used in the code above) after the 28-day A/B run.

Standard deviation of the A/B test

Here, we need to understand a few statistical terms: mean, standard deviation, and the null hypothesis.

mean : the average booking difference across the 12 data points.
standard deviation : measures the spread of the 12 data points. One standard deviation means that about 68% of the data points (roughly 8 of the 12) fall in the range (18 - 138, 18 + 138).
null hypothesis : the change on the B side ("BIG" button) performs the same as the A side (no change).

In order to say that the change in B significantly adds new value to our product, we need to reject the null hypothesis by calculating the z-score and performing a right-tail test to get the p-value. We will start with an explanation of the z-score.

The concept of the z-score is easy to understand: we subtract the historical mean (which is close to 0) from the new value and divide by the standard deviation. This tells us how far the new data point (the additional bookings from the "BIG" button) is from our historical data. In our example, the new value is about 3.5 standard deviations above the historical mean, which shows that the change on the B side is very significant. However, we often need to convey this to non-technical people, who might not know the meaning of standard deviation or z-score. Therefore, we convert the z-score into a p-value; the conversion formula is shown in the code above. The p-value gives us the confidence in our change. In our case, p-value = 0.00023, which means there is only a 0.023% chance of seeing a booking lift this large if the button change actually had no effect. Put another way, we can tell non-technical people that we are 99.977% confident that the change in button size brings us additional bookings.
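
Putting the numbers together in one self-contained snippet (using the mean of 18 and standard deviation of 138 from the AA analysis, and the observed lift of 501 bookings):

from scipy import stats

x, mu, sigma = 501, 18, 138
z = (x - mu) / sigma          # about 3.5 standard deviations above the mean
p_value = stats.norm.sf(z)    # right-tail p-value, equivalent to 1 - cdf(z)
print(f"z = {z:.2f}, p = {p_value:.5f}, confidence = {1 - p_value:.3%}")
# z = 3.50, p = 0.00023, confidence = 99.977%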

We have gone through the four steps to create an unbiased experimentation platform. I hope this explanation is helpful; give this blog a like if it brought you some value. Next, we will apply this A/B test to reinforcement learning for continuous improvement. I hope we can go through that together in the next blog. See you there!
