This notebook provides a walk-through of the standard analysis done before modelling a binary target. The purpose of this analysis is to get familiar with the data set, clean up the features, and get an initial understanding of the features associated with the target.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Local packages
import stats
# Set up formatting
pd.set_option('display.float_format', lambda x: '%.3f' % x)
plt.style.use('ggplot')
In [3]:
# Load data
example_df = pd.read_csv('bank.csv', na_values=[np.nan])
# Print high-level summary of the data frame
stats.df_stats(example_df)
# Preview data
print('\nPreview of Data:')
print(example_df.head(1))
# print(example_df.tail(1))
In [4]:
# Drop features that are not meaningful for the analysis
drop_elements = ['ID', 'Unnamed: 0']
example_df = example_df.drop(drop_elements, axis=1)
# Group features by data types
categorical_cols = ['Marital_2', 'Education_2']
numerical_cols = [col for col in example_df.columns if col not in categorical_cols]
# Basic NA cleaning
example_df.fillna({x: 'unknown' for x in categorical_cols}, inplace=True)
example_df.fillna({x: -1 for x in numerical_cols}, inplace=True)
# Correct dtypes
for col in categorical_cols:
    example_df[col] = example_df[col].astype(object)   # np.object was removed in NumPy 1.24
for col in numerical_cols:
    example_df[col] = example_df[col].astype(float)    # np.float was removed in NumPy 1.24
print('Corrected Data Types:\n{}'.format(example_df.dtypes))
# Review target
target = example_df['Y_2']
print('\nTarget Summary:')
stats.analyse_target(target=target)
# Review key stats for features
print('\nFeature Summary:')
print(example_df.describe(include='all'))
# Review NA values
data_na = stats.count_na(example_df)
print('\nMissing values:')
print(data_na)
# Drop features with > 90% missing
high_na = data_na.loc[data_na['Percent'] > 0.9].index.tolist()
print('\nFeatures with > 90% missing values to be dropped: {}'.format(len(high_na)))
example_df = example_df.drop(high_na, axis=1)
# Group infrequent values in categorical features to 'other' for visualisation
example_df = stats.clean_cats(example_df)
Correlation measures the linear (straight-line) relationship between two numerical features. The correlation coefficient measures the strength and direction of that linear relationship, with values ranging from -1 to 1. A negative coefficient indicates a negative relationship (as one goes up, the other goes down), whereas a positive coefficient indicates a positive relationship (as one goes up, so does the other). The closer the coefficient is to 1 or -1, the stronger the relationship.
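As a quick illustration of both directions (toy numbers, not columns from bank.csv), the coefficient can be computed directly with pandas:

# Toy illustration of the Pearson correlation coefficient
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                    'y': [2, 4, 5, 4, 5],    # loosely increases with x
                    'z': [10, 8, 6, 4, 2]})  # decreases exactly linearly with x
print(toy['x'].corr(toy['y']))  # ~0.77: positive, fairly strong
print(toy['x'].corr(toy['z']))  # -1.0: perfect negative linear relationship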
In [5]:
# Numerical vs Numerical Features - Correlation
high_corr_df = stats.correlated_features(df=example_df, min_corr=0.85, target_col_str='Y_2')
high_corr_unique_f = list(set(high_corr_df['feature 1'].unique().tolist()
                              + high_corr_df['feature 2'].unique().tolist()))
print('Number of highly correlated features: {}'.format(len(high_corr_unique_f)))
print(high_corr_unique_f)
If you have a large number of features (hundreds or thousands), a quick way to reduce dimensionality is to remove numerical features that are highly correlated with other numerical features. Below is a quick solution (a pandas-only sketch follows the cell for reference), but you could also use the strength of each feature's relationship with the target to decide which of a correlated pair to keep.
In [7]:
# Quick solution to reduce correlated features - drop the 2nd feature
high_corr_f2 = high_corr_df['feature 2'].unique().tolist()
example_df = example_df.drop(high_corr_f2, axis=1)
# Plot correlations between remaining features
f, ax = plt.subplots(figsize=(10, 10))
correlations = stats.correlation_plot(example_df, ax=ax)
plt.show()
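For reference, the same pair-finding and dropping can be sketched with pandas alone. This is a minimal sketch mirroring the 0.85 threshold above, not the actual implementation of the local stats helpers:

# Minimal pandas-only sketch of dropping one feature from each highly correlated pair
num_df = example_df.select_dtypes(include='number')
corr_abs = num_df.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
# example_df = example_df.drop(to_drop, axis=1)  # already done above via high_corr_f2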
In [8]:
# Remaining Numerical Features vs Categorical Target
pbs_corr = stats.point_biserial_corr(df=example_df, target_col_str='Y_2')
title = 'Numerical feature correlation with target (top 20)'
score_label = 'correlation coefficient (r)'
stats.plot_score(df=pbs_corr, score_col='corr', n=20, title=title, log_scale=False, score_label=score_label)
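The point-biserial correlation is simply Pearson's r computed between a numerical feature and a binary (0/1) target. For a single feature it can be reproduced with scipy; note that 'Balance_2' below is a hypothetical column name used purely for illustration, and this assumes Y_2 is encoded as 0/1:

from scipy.stats import pointbiserialr

# Point-biserial r between the binary target and one numerical feature
# ('Balance_2' is a made-up column name, swap in a real one)
r, p = pointbiserialr(example_df['Y_2'], example_df['Balance_2'])
print('r = {:.3f}, p = {:.4f}'.format(r, p))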
In [9]:
# Visual analysis of Top Numerical Features vs Target
top_numerical_f = pbs_corr.sort_values(by='corr_abs', ascending=False)[0:20].index.tolist()
top_numerical_f.append('Y_2')
numerical_df_p = example_df[top_numerical_f]
stats.plot_features(df=numerical_df_p, target_col_str='Y_2')
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
plt.show()
The chi-squared test is used to test for an association between each categorical feature and the target. The test calculates the expected count for each cell of a crosstab (what the count would be if the feature and target were independent: E = row total * column total / sample size) and compares it to the observed count. The p-value is the probability of seeing a difference at least as extreme as the one observed. If the p-value is low (< 0.05) we reject the null hypothesis (no association) and conclude that it is unlikely there is no association between the feature and the target.
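To make the mechanics concrete, here is a minimal sketch for a single feature using scipy, which computes the expected counts and p-value from a crosstab (the local stats.chi_squared helper presumably does something similar across all categorical features):

from scipy.stats import chi2_contingency

# Observed counts: crosstab of one categorical feature against the target
observed = pd.crosstab(example_df['Marital_2'], example_df['Y_2'])
chi2, p_value, dof, expected = chi2_contingency(observed)
# 'expected' holds the E = row total * column total / sample size counts
print('chi2 = {:.2f}, p-value = {:.4f}'.format(chi2, p_value))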
In [10]:
# Categorical vs Categorical - chi squared test of independence
print('\nChi-squared Test for Independence:')
chi_square_test = stats.chi_squared(example_df, target_col_str='Y_2')
failed_chi = chi_square_test.loc[chi_square_test['p_value'] >= 0.05]
failed_chi_f = failed_chi.index.tolist()
if len(failed_chi_f) > 0:
    print('\nFail to reject null hypothesis - no apparent association with target:')
    print(failed_chi_f)
passed_chi = chi_square_test.loc[chi_square_test['p_value'] < 0.05]
passed_chi_f = passed_chi.index.tolist()
if len(passed_chi_f) > 0:
    print('\nReject null hypothesis - apparent association with target:')
    print(passed_chi_f)
invalid_chi = chi_square_test.loc[chi_square_test['p_value'].isnull()]
invalid_chi_f = invalid_chi.index.tolist()
if len(invalid_chi_f) > 0:
    print('\nDid not meet assumptions for test (>20% of expected counts <5):')
    print(invalid_chi_f)
# These need to be diverted to another test (e.g. Fisher's exact, sketched below)
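For features that fail the expected-count assumption, Fisher's exact test is the usual fallback. A minimal sketch with scipy, noting that scipy's fisher_exact only supports 2x2 contingency tables:

from scipy.stats import fisher_exact

# Fisher's exact test for the first feature that failed the chi-squared assumptions;
# scipy's implementation only handles 2x2 tables
if invalid_chi_f:
    table = pd.crosstab(example_df[invalid_chi_f[0]], example_df['Y_2'])
    odds_ratio, p_value = fisher_exact(table)
    print('Fisher exact p-value: {:.4f}'.format(p_value))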
In [11]:
# Visual analysis of Categorical Features
top_categorical_f = passed_chi_f + invalid_chi_f + ['Y_2']
categorical_df_p = example_df[top_categorical_f]
stats.plot_features(df=categorical_df_p, target_col_str='Y_2')
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
plt.show()