A shoe company launches a premium shoe model SC-X01. The sales for the first month (in USD) are given below:
[54, 69, 50, 57, 70, 69, 52, 46, 60, 42, 57, 64, 58, 60, 40, 63, 60, 70, 57, 71, 66, 67, 68, 42, 46, 50, 48, 65, 45, 45]
Ignoring all other effects (e.g. holidays, seasonality, the newness factor), can you provide an estimate of the average sales the company should expect?
What is the variability around that average estimate that you are providing?
The marketing manager has heard about the term confidence interval. Can you provide one for the estimate and explain to the marketing manager what it means?
In [1]:
import numpy as np
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline
In [2]:
# Get the data into a numpy array
sales_1 = np.array([54, 69, 50, 57, 70, 69, 52, 46, 60, 42, 57, 64, 58,
60, 40, 63, 60, 70, 57, 71, 66, 67, 68, 42, 46, 50,
48, 65, 45, 45])
In [3]:
# Find mean of sales_1
np.mean(sales_1)
Out[3]:
But the above is based on one sample. How do I get an estimate for the population?
Resampling/Bootstrapping
In [4]:
# Let's create a bootstrap sample - we will use np.random.choice
?np.random.choice
In [5]:
# Create a bootstrap sample (sampling with replacement) from sales_1
sample_1 = np.random.choice(sales_1, 30)
In [6]:
sample_1
Out[6]:
In [7]:
#mean of sample_1
np.mean(sample_1)
Out[7]:
In [8]:
# Now, let's write a function that returns mean of n bootstrap samples
def bootstrap_sample_mean(data, num_samples):
    """
    Generate bootstrap samples and compute the mean of each.

    Arguments:
        data : the input array from which bootstrap samples are drawn
        num_samples : number of bootstrap samples to generate
    Output:
        bootstrap_mean : the mean of each bootstrap sample, as a numpy array
    """
    bootstrap_mean = np.zeros([num_samples, 1])
    for i in np.arange(num_samples):
        # Sample with replacement, same size as the original data
        sample = np.random.choice(data, data.shape[0])
        bootstrap_mean[i] = np.mean(sample)
    return bootstrap_mean
In [9]:
bootstrap_mean = bootstrap_sample_mean(sales_1, 5)
In [10]:
sns.distplot(bootstrap_mean, kde=False)
Out[10]:
In [11]:
# What is the variability around this estimate?
# This is referred to as the standard error of the estimate
In [12]:
np.std(bootstrap_mean)
Out[12]:
In [13]:
# Of course, 5 is a very small number of samples. So, generate the means for a large
# number of bootstrap samples (say 100k), and report the estimate and its standard error
In [14]:
# Mean
In [15]:
# Standard Error
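One way to approach the exercise above (a sketch, using numpy's newer Generator API; the seed is arbitrary) is to draw all 100k bootstrap samples at once as rows of a matrix and then summarise the row means:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
sales_1 = np.array([54, 69, 50, 57, 70, 69, 52, 46, 60, 42, 57, 64, 58,
                    60, 40, 63, 60, 70, 57, 71, 66, 67, 68, 42, 46, 50,
                    48, 65, 45, 45])

# 100,000 bootstrap samples, one per row, each the same size as the data
boot = rng.choice(sales_1, size=(100_000, sales_1.shape[0]), replace=True)
boot_means = boot.mean(axis=1)

print("Estimate:", boot_means.mean())
print("Standard error:", boot_means.std())
```

The estimate should land very close to the single-sample mean, and the standard error close to the analytic s/√n value.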
In [16]:
# Bonus - this is how the standard error is obtained using scipy
from scipy import stats
stats.sem(sales_1)
Out[16]:
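For reference, `stats.sem` is just the textbook formula s/√n (using the sample standard deviation with ddof=1), so the bootstrap standard error above should land close to it. A quick check with plain numpy:

```python
import numpy as np

sales_1 = np.array([54, 69, 50, 57, 70, 69, 52, 46, 60, 42, 57, 64, 58,
                    60, 40, 63, 60, 70, 57, 71, 66, 67, 68, 42, 46, 50,
                    48, 65, 45, 45])

# Standard error of the mean: sample std (ddof=1) divided by sqrt(n)
n = sales_1.shape[0]
analytic_sem = sales_1.std(ddof=1) / np.sqrt(n)
print(analytic_sem)
```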
The shoe company, after a month of sales, got endorsed by the current MVP of the NBA.
The sales for the second month after launch, i.e. the first month after the endorsement, are as follows:
[64, 60, 52, 51, 69, 71, 71, 68, 56, 63, 71, 61, 63, 67, 60, 62, 72, 42, 54, 45, 59, 44, 64, 63, 59, 40, 55, 61, 57, 63]
Did the endorsement impact the sales positively?
In [39]:
# Load 2nd month sales data
sales_2 = np.array([64, 60, 52, 51, 69, 71, 71, 68, 56, 63, 71, 61, 63, 67, 60, 62, 72,
42, 54, 45, 59, 44, 64, 63, 59, 40, 55, 61, 57, 63])
In [40]:
# find mean of sales_2
np.mean(sales_2)
Out[40]:
In [90]:
# The difference between the two means:
shoe_sales_diff = np.mean(sales_2) - np.mean(sales_1)
shoe_sales_diff
Out[90]:
In [26]:
# So, sales improved, on average, by 2.53 USD. But is the difference statistically significant?
This is the approach we are going to take:
1. Pool the two months of sales into one array, with a label for each observation (1 = before, 2 = after)
2. Randomly shuffle the labels, destroying any real association between label and sales
3. Compute the difference between the mean of the shuffled "after" group and the shuffled "before" group
4. Repeat many times to build the distribution of the difference under the null hypothesis (no endorsement effect)
5. Check how often the shuffled difference is at least as large as the observed 2.53
In [42]:
sales_before = np.array([sales_1, np.repeat(1, sales_1.shape[0])]).T
In [43]:
sales_after = np.array([sales_2, np.repeat(2, sales_2.shape[0])]).T
In [63]:
# Combine them into one ndarray
shoe_sales = np.concatenate((sales_before, sales_after), axis=0)
In [139]:
shoe_sales
In [65]:
# Randomly shuffle the labels by drawing a new random label (1 or 2) for each row
experiment_label = np.random.randint(1, 3, shoe_sales.shape[0])
In [140]:
experiment_label
In [77]:
# create the experiment data combining actual sales with the shuffled label
experiment_data = np.transpose(np.array([shoe_sales[:, 0], experiment_label]))
In [141]:
experiment_data
In [79]:
# Compute the difference between the two group means now.
# Again - take a minute to think about what these two groups mean.
# Note: select only the sales column (column 0), not the label column
experiment_diff_mean = experiment_data[experiment_data[:,1]==2, 0].mean() \
                        - experiment_data[experiment_data[:,1]==1, 0].mean()
In [80]:
experiment_diff_mean
Out[80]:
In [110]:
# Let's repeat this process many times. Let's write a function to do this
def mean_diff_shuffle_labels(data, num_times):
    """
    Shuffle the labels, compute the difference between the group means,
    and return the array of mean differences.

    Arguments:
        data : input array with two columns: the sales in the first column and the label in the second
        num_times : number of times the labels are shuffled
    Output:
        mean_diff_shuffle_mean : the mean difference of each shuffled sample, as a numpy array
    """
    mean_diff_shuffle_mean = np.zeros([num_times, 1])
    for i in np.arange(num_times):
        experiment_label = np.random.randint(1, 3, data.shape[0])
        experiment_data = np.transpose(np.array([data[:, 0], experiment_label]))
        # Use only the sales column (column 0) when computing each group's mean
        experiment_diff_mean = experiment_data[experiment_data[:,1]==2, 0].mean() \
                                - experiment_data[experiment_data[:,1]==1, 0].mean()
        mean_diff_shuffle_mean[i] = experiment_diff_mean
    return mean_diff_shuffle_mean
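A side note, not in the original notebook: `np.random.randint` assigns each row to a group independently, so the two groups end up with random sizes. The classic permutation test instead shuffles the pooled values and splits them back into two fixed-size groups; a minimal sketch of that variant (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
sales_1 = np.array([54, 69, 50, 57, 70, 69, 52, 46, 60, 42, 57, 64, 58,
                    60, 40, 63, 60, 70, 57, 71, 66, 67, 68, 42, 46, 50,
                    48, 65, 45, 45])
sales_2 = np.array([64, 60, 52, 51, 69, 71, 71, 68, 56, 63, 71, 61, 63,
                    67, 60, 62, 72, 42, 54, 45, 59, 44, 64, 63, 59, 40,
                    55, 61, 57, 63])

pooled = np.concatenate([sales_1, sales_2])
n1 = sales_1.shape[0]

def permuted_mean_diffs(pooled, n1, num_times, rng):
    """Shuffle the pooled sales and split into two fixed-size groups each time."""
    diffs = np.zeros(num_times)
    for i in range(num_times):
        shuffled = rng.permutation(pooled)
        # First n1 values play the 'before' group, the rest the 'after' group
        diffs[i] = shuffled[n1:].mean() - shuffled[:n1].mean()
    return diffs

diffs = permuted_mean_diffs(pooled, n1, 10_000, rng)
```

Under the null hypothesis these differences should centre on zero, just like the shuffled-label version.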
In [101]:
shoe_mean_diff_shuffle = mean_diff_shuffle_labels(shoe_sales, 5)
In [102]:
# Let's plot this
In [103]:
sns.distplot(shoe_mean_diff_shuffle, kde=False)
Out[103]:
In [105]:
#Finding % of times difference of means is greater than observed
print("Data: Difference in mean greater than observed:", \
shoe_mean_diff_shuffle[shoe_mean_diff_shuffle>shoe_sales_diff])
In [106]:
print("Number of times diff in mean greater than observed:", \
shoe_mean_diff_shuffle[shoe_mean_diff_shuffle>shoe_sales_diff].shape[0])
In [107]:
print("% of times diff in mean greater than observed:", \
shoe_mean_diff_shuffle[shoe_mean_diff_shuffle>shoe_sales_diff].shape[0]/float(shoe_mean_diff_shuffle.shape[0])*100)
What does the above mean?
Exercise: Repeat the above for 100,000 runs and report the results
In [95]:
shoe_mean_diff_shuffle = mean_diff_shuffle_labels(shoe_sales, 100000)
In [ ]:
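One possible way to report the 100k-run exercise; this sketch vectorises the label shuffling (one row of random labels per run) instead of calling the loop-based function, but it implements the same idea (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
sales_1 = np.array([54, 69, 50, 57, 70, 69, 52, 46, 60, 42, 57, 64, 58,
                    60, 40, 63, 60, 70, 57, 71, 66, 67, 68, 42, 46, 50,
                    48, 65, 45, 45])
sales_2 = np.array([64, 60, 52, 51, 69, 71, 71, 68, 56, 63, 71, 61, 63,
                    67, 60, 62, 72, 42, 54, 45, 59, 44, 64, 63, 59, 40,
                    55, 61, 57, 63])
observed_diff = sales_2.mean() - sales_1.mean()

pooled = np.concatenate([sales_1, sales_2])
num_times = 100_000

# One row of random labels (1 or 2) per run, as in mean_diff_shuffle_labels
labels = rng.integers(1, 3, size=(num_times, pooled.shape[0]))
# Mask each group with NaN and take row-wise means of the remaining values
group_2 = np.where(labels == 2, pooled, np.nan)
group_1 = np.where(labels == 1, pooled, np.nan)
diffs = np.nanmean(group_2, axis=1) - np.nanmean(group_1, axis=1)

pct_greater = (diffs > observed_diff).mean() * 100
print("% of times diff in mean greater than observed:", pct_greater)
```

If the observed difference sits well inside the shuffled distribution, the 2.53 improvement is plausibly just noise.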
What do you infer?
Effect Size
Find % increase in sales. Does it make sense?
In [109]:
(np.mean(sales_2) - np.mean(sales_1)) / np.mean(sales_1) * 100
Out[109]:
Does 4.4% make sense for the business?
A confidence interval is the range of plausible values for the metric we are estimating.
An example would be: 90% of the time, the increase in average sales (after the endorsement vs. before) would fall between 3.4 and 6.7. (These numbers are illustrative; we will derive the actual interval below.)
What is the hacker's way of doing it? We will do the following steps:
1. Draw a bootstrap sample (sampling with replacement) from each month's sales
2. Compute the difference between the two bootstrap sample means
3. Repeat many times to build a distribution of the mean difference
4. Read the interval off the percentiles of that distribution (e.g. the 5th and 95th percentiles for a 90% interval)
Does this process sound familiar?
In [119]:
# Function to generate mean differences for many bootstrap samples
def mean_diff_bootstrap(before_sales, after_sales, num_times):
    """
    Compute the difference between group means on bootstrap samples
    and return the array of mean differences.

    Arguments:
        before_sales : sales before the endorsement
        after_sales : sales after the endorsement
        num_times : number of bootstrap samples to draw
    Output:
        mean_diff : the mean difference of each pair of bootstrap samples, as a numpy array
    """
    mean_diff = np.zeros([num_times, 1])
    for i in np.arange(num_times):
        # Resample each month with replacement, same size as the original data
        bootstrap_before = np.random.choice(before_sales, before_sales.shape[0])
        bootstrap_after = np.random.choice(after_sales, after_sales.shape[0])
        mean_diff[i] = np.mean(bootstrap_after) - np.mean(bootstrap_before)
    return mean_diff
In [134]:
# Note the argument order: sales before the endorsement first, then sales after
mean_diff_bootstrap_experiment = mean_diff_bootstrap(sales_1, sales_2, 50)
In [135]:
#Let's plot this
In [136]:
sns.distplot(mean_diff_bootstrap_experiment, kde=False)
Out[136]:
In [137]:
mean_diff_bootstrap_experiment_sorted = np.sort(mean_diff_bootstrap_experiment, axis=0)
In [138]:
# The 5th and 95th percentiles bound a 90% interval
np.percentile(mean_diff_bootstrap_experiment_sorted, [5, 95])
Out[138]:
Run the above for 500k times and report the interval
In [ ]:
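For the 500k exercise, drawing each resample in a Python loop is slow; one possible vectorised sketch (seed arbitrary) draws all resamples at once and reads the 90% interval off the [5, 95] percentiles:

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed
sales_1 = np.array([54, 69, 50, 57, 70, 69, 52, 46, 60, 42, 57, 64, 58,
                    60, 40, 63, 60, 70, 57, 71, 66, 67, 68, 42, 46, 50,
                    48, 65, 45, 45])
sales_2 = np.array([64, 60, 52, 51, 69, 71, 71, 68, 56, 63, 71, 61, 63,
                    67, 60, 62, 72, 42, 54, 45, 59, 44, 64, 63, 59, 40,
                    55, 61, 57, 63])

num_times = 500_000
n = sales_1.shape[0]

# One bootstrap resample per row for each month, then the row-wise mean
boot_before = rng.choice(sales_1, size=(num_times, n), replace=True).mean(axis=1)
boot_after = rng.choice(sales_2, size=(num_times, n), replace=True).mean(axis=1)
mean_diff = boot_after - boot_before

# 90% confidence interval for the improvement in average sales
interval = np.percentile(mean_diff, [5, 95])
print("90% interval:", interval)
```

An interval that straddles zero would mean the endorsement effect is not distinguishable from noise at the 90% level.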