by Alejandro Correa Bahnsen & Iván Torroledo
version 1.2, Feb 2018
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
It is generally accepted that Machine Learning algorithms are powerful tools for predicting and describing unseen data based on observed data. However, Machine Learning is generally not concerned with statistical inference, such as the significance of predictions or of estimated parameters. Significance is instead the focus of traditional quantitative fields such as econometrics and psychometrics, which use it as an evaluation metric for models.
The following data is a sample of demographic and banking information for a group of clients.
In [2]:
import pandas as pd
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Credit.csv', index_col=0)
data.head(10)
Out[2]:
This kind of data is commonly used to build scoring models. With the tools already studied, we could achieve this task easily. This time, however, we would like to know which variables are important for explaining the account Balance of a given client. In other words, we would like to know whether there is a statistical relationship between Balance and the other variables. For now, take Gender to test this hypothesis.
Question: Is Gender statistically relevant for explaining a client's account Balance? To answer this question, we can check whether there is a difference in Balance between males and females.
But first, let's look at the data visually to get a sense of it:
In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.pairplot(data,hue='Gender',palette="husl",markers="+")
Out[3]:
It seems that the distribution of Balance does not change across Gender. But what happens if we calculate the mean Balance for males and females?
In [3]:
# splitting data
male_= data[data.Gender==' Male'].Balance
female_ = data[data.Gender=='Female'].Balance
fig = plt.figure(figsize=(14,7))
n, bins, patches = plt.hist(male_, bins =50, facecolor='blue', alpha=0.5,label='Male')
n, bins, patches = plt.hist(female_, bins =50,facecolor='red', alpha=0.5,label='Female')
plt.axvline(male_.mean(),linestyle='--',color='blue',)
plt.axvline(female_.mean(),linestyle='--',color='red',)
plt.xlabel('Balance')
plt.legend();
In [4]:
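# numeric_only=True averages only the numeric columns; recent pandas versions raise an error on text columns otherwise
Gender_differences = data.groupby('Gender').mean(numeric_only=True)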
Gender_differences
Out[4]:
In [5]:
print('The mean difference in Balance by Gender (Male - Female) is: ' + str(Gender_differences.loc[' Male','Balance'] - Gender_differences.loc['Female','Balance']))
So, are we done? Is this difference between male and female Balance enough to answer the initial question?
Short Answer: No!
Long Answer: No. We calculated a mean difference, but we have not yet checked whether this value is statistically significant.
To check the statistical significance of the mean difference estimated above, we can postulate the following hypotheses:
Ho: There is no difference in Balance between males and females
Ha: There is a difference in Balance between males and females
We want to calculate the p-value of our estimate and compare it with an accepted significance threshold that we choose:
$\alpha = (1\%,5\%,10\%)$
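The decision rule behind these thresholds can be sketched in a few lines. The function name `reject_null` and the p-value used below are hypothetical, chosen only to show the mechanics; they are not results from the Credit data.

```python
def reject_null(p_value, alphas=(0.01, 0.05, 0.10)):
    """Return, for each significance level, whether Ho would be rejected."""
    return {alpha: p_value < alpha for alpha in alphas}

# Hypothetical p-value, used only for illustration
decisions = reject_null(0.07)
for alpha, rejected in decisions.items():
    verdict = 'reject Ho' if rejected else 'fail to reject Ho'
    print(f"alpha = {alpha:.2f}: {verdict}")
```

The same p-value can lead to different conclusions depending on the threshold, which is why the threshold should be chosen before looking at the result.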
(1) Use the traditional method of statistics: assume a distribution for the data and calculate a test statistic, such as the t statistic.
(2) Use the data and some sampling techniques to compute the empirical distribution of the statistic, and check the probability associated with our estimate (the p-value).
Since we already know the traditional method (1), let's take the less common, sampling-based approach. We will see that it is easy to implement and very powerful in more complicated tasks.
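For comparison, the traditional route might look roughly like the sketch below. It computes Welch's t statistic by hand with the standard library and, as a simplifying assumption, approximates the p-value with a normal distribution instead of Student's t, which is only reasonable for large samples. The helper names and the synthetic samples are invented for illustration and are not the Credit data.

```python
import math
import random
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means (a minus b)."""
    var_a = variance(sample_a) / len(sample_a)
    var_b = variance(sample_b) / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)

def two_sided_p_normal(t_stat):
    """Two-sided p-value from a normal approximation (an assumption:
    the exact test would use Student's t distribution instead)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

# Synthetic samples, NOT the Credit data: both groups share one distribution
rng = random.Random(0)
group_a = [rng.gauss(500, 450) for _ in range(200)]
group_b = [rng.gauss(500, 450) for _ in range(200)]

t_stat = welch_t(group_a, group_b)
p_approx = two_sided_p_normal(t_stat)
print(f"t = {t_stat:.3f}, approximate p-value = {p_approx:.3f}")
```

In practice a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` performs the exact version of this test.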
The shuffling algorithm is a sampling technique commonly used to simulate empirical distributions from data.
The basic idea is to simulate the distribution by shuffling the labels (Male and Female) repeatedly and computing a desired statistic each time. In our case, the chosen statistic is the mean difference.
If the labels (Male and Female) really don't matter for explaining Balance, then switching them randomly should not systematically change the result we got.
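Before applying the idea to the Credit data, here is a minimal, dependency-free sketch of it on a handful of made-up numbers. The function names, the toy values, and the 0/1 labels are assumptions for illustration only.

```python
import random
from statistics import mean

def mean_diff(values, labels):
    """Mean of the values labelled 0 minus mean of the values labelled 1."""
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    return mean(group0) - mean(group1)

def shuffled_differences(values, labels, n_shuffles=1000, seed=0):
    """Recompute the statistic after repeatedly shuffling the labels."""
    rng = random.Random(seed)
    shuffled = list(labels)      # copy so the caller's labels are untouched
    diffs = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)    # group sizes are preserved by shuffling
        diffs.append(mean_diff(values, shuffled))
    return diffs

# Toy numbers, invented for illustration only
values = [120, 340, 560, 90, 410, 230, 600, 150]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

observed = mean_diff(values, labels)
null_dist = shuffled_differences(values, labels, n_shuffles=500)
print("observed difference:", observed)
print("mean of shuffled differences:", round(mean(null_dist), 2))
```

The list `null_dist` plays the role of the empirical distribution: it shows which differences arise when the labels carry no information.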
Steps:
In [6]:
# Building features and target variable
X = data.Gender.map({' Male': 1, 'Female':0})
Y = data.Balance
First, calculate the statistic (the mean difference) in the data.
In [7]:
original_difference = female_.mean() - male_.mean()
print('The difference in Balance by Gender (in the data) is: '+ str(original_difference))
In [8]:
# Create a DataFrame with the desired variables
dataframe = pd.DataFrame(X)
dataframe['Balance'] = Y
dataframe.head()
Out[8]:
In [9]:
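import numpy as np
# Step 1 & 2
def shuffle_data(frame):
    # Work on a copy so the caller's DataFrame keeps its original labels
    frame = frame.copy()
    n = frame.Gender.count()
    n_ones = int(frame.Gender.sum())  # number of rows labelled 1
    vec = np.zeros(n)
    # Assign label 1 to n_ones randomly chosen rows (without replacement)
    vec[np.random.choice(n, n_ones, replace=False)] = 1
    frame['Gender'] = vec
    return frame
# Step 3
def mean_difference(frame):
    # Mean Balance of label 0 minus mean Balance of label 1
    means = frame.groupby('Gender')['Balance'].mean()
    return means.loc[0] - means.loc[1]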
In [10]:
import numpy as np
def simulate_distribution(frame, N=100):
    a = []
    for i in range(N):
        # Use the frame passed in rather than the global `dataframe`
        a.append(mean_difference(shuffle_data(frame)))
    return a
def plot_distribution(dist, data, color='blue', bins=50, orig=True):
    fig = plt.figure(figsize=(10,6))
    # density=True replaces the `normed` argument, which newer matplotlib versions no longer accept
    n, bins, patches = plt.hist(dist, bins=bins, density=True, facecolor=color, alpha=0.5)
    if orig:
        # Mark the statistic observed in the original data
        plt.axvline(np.mean(data), color=color, linestyle='dashed', linewidth=2, label='Original data')
        plt.legend()
    plt.title('Mean difference')
In [11]:
## Simulation
N = 1000
distribution = simulate_distribution(dataframe,N)
In [12]:
plot_distribution(distribution,original_difference,'blue',100)
In [13]:
# Calculating P-Value
def pvalue(dist,estimation):
    # One-sided p-value: share of shuffled differences greater than the observed one
    return float(sum(np.array(dist)>estimation))/len(dist)
In [14]:
p_value = pvalue(distribution,original_difference)
p_value
Out[14]:
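The `pvalue` function above is one-sided: it only counts shuffled differences larger than the observed one. Since Ha speaks of a difference in either direction, a two-sided variant may also be of interest. The sketch below is one possible version, not the method used in this notebook; the name `two_sided_pvalue`, the +1 correction, and the toy numbers are all assumptions for illustration.

```python
def two_sided_pvalue(dist, estimation):
    """Share of shuffled statistics at least as extreme as the observed one.

    The +1 in numerator and denominator counts the observed statistic
    itself, so the estimate can never be exactly zero.
    """
    extreme = sum(1 for d in dist if abs(d) >= abs(estimation))
    return (extreme + 1) / (len(dist) + 1)

# Hypothetical shuffled differences, invented for illustration only
toy_dist = [-30.0, -12.5, -4.0, 0.0, 3.5, 9.0, 18.0, 27.5, 41.0]
p_two_sided = two_sided_pvalue(toy_dist, estimation=-25.0)
print(p_two_sided)
```

Comparing absolute values assumes the shuffled distribution is roughly symmetric around zero, which is plausible for a difference in means under random relabelling.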