A shoe company sells two models: UA101 and UA102. To improve sales, they ran an aggressive campaign.
The sales data before, during, and after the campaign are provided.
Can you help them figure out whether the campaign was successful? What additional insights can you provide them?
In [1]:
#import the necessary libraries
import pandas as pd
import numpy as np
In [3]:
pd.__version__
Out[3]:
In [4]:
#install an Excel reader engine for pd.read_excel
#(newer pandas versions use openpyxl for .xlsx instead of xlrd)
!pip install xlrd
In [5]:
#Read the dataset
shoes_before = pd.read_excel("data/shoe_sales_before.xlsx")
shoes_during = pd.read_excel("data/shoe_sales_during.xlsx")
shoes_after = pd.read_excel("data/shoe_sales_after.xlsx")
In [6]:
shoes_before.head()
Out[6]:
In [7]:
#What were the mean sales
#before, during and after the campaign for the two shoe models?
#Hint: use the df.mean function
print("Before Campaign:")
print(shoes_before.mean())
print()
print("During Campaign:")
print(shoes_during.mean())
print()
print("After Campaign:")
print(shoes_after.mean())
Two statisticians, 4 feet and 5 feet tall, have to cross a river of AVERAGE depth 3 feet. A passerby says, "What are you waiting for? You can easily cross the river." The moral: the mean alone says nothing about the spread of the values, which is why we need measures of dispersion.
Standard deviation: loosely, it is the average distance of the data values from the mean.
Formally, it is the square root of the variance, so it has the same units as the data and the mean.
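To make these definitions concrete, here is a minimal sketch computing both by hand (assuming shoes_before is loaded as above); the results should match the pandas calls below:

import numpy as np
x = shoes_before.UA101
#Variance: the average squared deviation from the mean
#(n - 1 in the denominator matches the sample variance pandas uses)
variance = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
#Standard deviation: the square root of the variance, in the same units as the data
std_dev = np.sqrt(variance)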
In [9]:
shoes_before.head()
Out[9]:
In [13]:
import matplotlib.pyplot as plt
%matplotlib inline
In [8]:
#Find the standard deviation of the sales
print("Before Campaign:")
print(shoes_before.std())
print()
print("During Campaign:")
print(shoes_during.std())
print()
print("After Campaign:")
print(shoes_after.std())
Covariance is a measure of the (average) co-variation between two variables, say x and y: it captures how much the two variables change together, and its sign indicates the direction of their relationship. Compare this to variance, which describes how much a single variable varies on its own.
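From the definition, covariance can be computed by hand as the averaged product of deviations; a minimal sketch, assuming shoes_before is loaded as above, whose result should match df.cov() below:

x, y = shoes_before.UA101, shoes_before.UA102
#cov(x, y) = sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
cov_xy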
In [14]:
#Find the covariance of the sales
#Use the cov function
print("Before Campaign:")
print(shoes_before.cov())
print()
print("During Campaign:")
print(shoes_during.cov())
print()
print("After Campaign:")
print(shoes_after.cov())
In [15]:
import seaborn as sns
In [17]:
sns.jointplot(x= "UA101", y ="UA102", data=shoes_before)
Out[17]:
In [19]:
sns.jointplot(x= "UA101", y ="UA102", data=shoes_during)
Out[19]:
In [18]:
#Find correlation between sales
print("Before Campaign:")
print(shoes_before.corr())
print()
print("During Campaign:")
print(shoes_during.corr())
print()
print("After Campaign:")
print(shoes_after.corr())
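Correlation is covariance scaled by the two standard deviations, which confines it to [-1, 1]. A quick check against that definition (a sketch, assuming shoes_before as above):

x, y = shoes_before.UA101, shoes_before.UA102
#Pearson correlation = cov(x, y) / (std(x) * std(y)); should match corr() above
x.cov(y) / (x.std() * y.std())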
In [20]:
#Let's do some analysis on UA101 now
In [21]:
#Find difference between mean sales before and after the campaign
np.mean(shoes_after.UA101) - np.mean(shoes_before.UA101)
Out[21]:
On average, sales after the campaign are higher than sales before the campaign. But is the difference real, or could it be due to chance?
Classical method: t-test
Hacker's method: provided in notebook 2c
In [22]:
#Find % increase in mean sales, relative to the mean before the campaign
(np.mean(shoes_after.UA101) - np.mean(shoes_before.UA101))/np.mean(shoes_before.UA101) * 100
Out[22]:
Would the business feel comfortable spending millions of dollars if the increase is going to be around 25%?
Does it work for the company?
Maybe yes: if margins are good and this increase is considered good. But if the returns from the campaign do not let the company break even, it makes no sense to take that path.
Someone tells you the result is statistically significant. The first question you should ask: how large is the effect?
To answer such a question, we will make use of the concept of a confidence interval.
In plain English, a confidence interval is the range of values the metric being measured is plausibly going to take.
An example would be: 90% of the time, the increase in average sales (after vs. before the campaign) would fall between 3.4 and 6.7.
(These numbers are illustrative. We will derive those numbers below)
Hacker's way to do this: Bootstrapping
(We will use the library here though)
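To show the idea, here is a minimal sketch of a bootstrap confidence interval for the mean; the function name and resample count are illustrative, and shoes_before is assumed to be loaded as above. The notebook itself uses library routines below.

import numpy as np

def bootstrap_ci(data, n_resamples=10_000, alpha=0.05):
    #Draw resamples with replacement and record each resample's mean
    means = [np.mean(np.random.choice(data, size=len(data), replace=True))
             for _ in range(n_resamples)]
    #The middle (1 - alpha) mass of the resampled means forms the interval
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

bootstrap_ci(shoes_before.UA101.values)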
Before going further, there are two points to be made.
The first concerns sample size:
Suppose sales in the month after the campaign were 80 and in the month before were 40. The difference is 40, and a bootstrap confidence interval built by resampling those observations with replacement would always produce 40. Significance testing with shuffled labels, on the other hand, would conclude the sales are equally likely to occur in both groups, and hence that there is no difference. Both answers miss the real issue: the data is far too small to support meaningful inference.
The second concerns distributional assumptions:
Traditional statistical derivations assume a normal distribution. But what if the underlying distribution isn't normal? Resampling makes no such assumption, and people relate to it much better :-)
Standard error is a measure of how far the estimate is likely to be off, on average. More technically, it is the standard deviation of the sampling distribution of a statistic (most often the mean); for the mean it is estimated as s / sqrt(n), where s is the sample standard deviation and n the sample size. Please do not confuse it with the standard deviation: standard deviation measures the variability of the observed data, while standard error describes the variability of the estimate.
In [23]:
from scipy import stats
In [24]:
#Standard error for mean sales before campaign
stats.sem(shoes_before.UA101)
Out[24]:
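As a sanity check, the same number follows directly from the definition (stats.sem uses ddof=1 by default, matching pandas' sample standard deviation):

#standard error of the mean = sample std / sqrt(n)
shoes_before.UA101.std() / np.sqrt(len(shoes_before))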
(We are covering what is referred to as the frequentist method of hypothesis testing.)
We would like to know if the effects we see in the sample (the observed data) are likely to be present in the population.
The way classical hypothesis testing works is by conducting a statistical test to answer the following question:
Given the sample and an effect, what is the probability of seeing that effect just by chance?
Here are the steps we would follow: (1) assume the effect is absent (the null hypothesis); (2) compute the probability of seeing the observed effect under that assumption (the p-value); (3) decide based on how small that probability is.
If the p-value is very low (more often than not, below 0.05 is the threshold used), the effect is considered statistically significant. That means the effect is unlikely to have occurred by chance. The inference? The effect is likely to be seen in the population too.
This process is very similar to proof by contradiction. We first assume that the effect is absent; that's the null hypothesis. The next step is to compute the probability of obtaining the observed effect under that assumption (the p-value). If the p-value is very low (<0.05 as a rule of thumb), we reject the null hypothesis.
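To make "probability of seeing that effect just by chance" concrete, here is a minimal label-shuffling (permutation) sketch of the hacker's method mentioned earlier; the function name and iteration count are illustrative, and the data frames are assumed to be loaded as above:

import numpy as np

def permutation_p_value(a, b, n_iter=10_000):
    #Observed effect: difference in group means
    observed = np.mean(b) - np.mean(a)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_iter):
        np.random.shuffle(pooled)            #shuffle the group labels
        diff = np.mean(pooled[len(a):]) - np.mean(pooled[:len(a)])
        if abs(diff) >= abs(observed):       #effect at least as extreme, by chance
            count += 1
    return count / n_iter

permutation_p_value(shoes_before.UA101.values, shoes_after.UA101.values)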
In [25]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib as mpl
%matplotlib inline
In [26]:
import seaborn as sns
sns.set(color_codes=True)
In [27]:
#Mean of sales before campaign
shoes_before.UA101.mean()
Out[27]:
In [28]:
#95% confidence interval on the mean of sales before the campaign
stats.norm.interval(0.95, loc=shoes_before.UA101.mean(),
scale = shoes_before.UA101.std()/np.sqrt(len(shoes_before)))
Out[28]:
In [37]:
#Find 80% Confidence interval
In [ ]:
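One possible answer, as a sketch: the same normal approximation as above, with 0.80 coverage instead of 0.95.

stats.norm.interval(0.80, loc=shoes_before.UA101.mean(),
                    scale=shoes_before.UA101.std()/np.sqrt(len(shoes_before)))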
In [42]:
#Find mean and 95% CI on mean of sales after campaign
print(shoes_after.UA101.mean())
stats.norm.interval(0.95, loc=shoes_after.UA101.mean(),
scale = shoes_after.UA101.std()/np.sqrt(len(shoes_after)))
Out[42]:
In [38]:
#What does a confidence interval mean?
#If we repeated the sampling many times and computed a 95% interval each time,
#roughly 95% of those intervals would contain the true mean.
In [ ]:
In [29]:
#Effect size
In [40]:
print("Effect size:", shoes_after.UA101.mean()
- shoes_before.UA101.mean() )
Null hypothesis: the mean sales before and after the campaign are not different.
Perform a t-test and determine the p-value.
In [29]:
stats.ttest_ind(shoes_before.UA101,
shoes_after.UA101, equal_var=True)
Out[29]:
The p-value is the probability of seeing an effect this large just by chance, under the null hypothesis. And here, the p-value is almost 0.
Conclusion: the sales difference is significant.
One assumption of the t-test is that the data came from a normal distribution.
The Shapiro-Wilk test checks for normality: if its p-value is less than 0.05, there is little support for the claim that the distribution is normal.
In [30]:
stats.shapiro(shoes_before.UA101)
Out[30]:
In [51]:
?stats.shapiro
What if one sample has a higher variance than the other, so the equal-variance assumption behind equal_var=True fails?
Answer: (Homework :-) )
In [ ]: