In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')


In [2]:
with open('bank-additional-names.txt', encoding='utf-8') as f:
    text = f.read()

print(text)


Citation Request:
  This dataset is publicly available for research. The details are described in [Moro et al., 2014]. 
  Please include this citation if you plan to use this database:

  [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.001

  Available at: [pdf] http://dx.doi.org/10.1016/j.dss.2014.03.001
                [bib] http://www3.dsi.uminho.pt/pcortez/bib/2014-dss.txt

1. Title: Bank Marketing (with social/economic context)

2. Sources
   Created by: Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL) @ 2014
   
3. Past Usage:

  The full dataset (bank-additional-full.csv) was described and analyzed in:

  S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.
 
4. Relevant Information:

   This dataset is based on "Bank Marketing" UCI dataset (please check the description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).
   The data is enriched by the addition of five new social and economic features/attributes (national wide indicators from a ~10M population country), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb.
   This dataset is almost identical to the one used in [Moro et al., 2014] (it does not include all attributes due to privacy concerns). 
   Using the rminer package and R tool (http://cran.r-project.org/web/packages/rminer/), we found that the addition of the five new social and economic attributes (made available here) lead to substantial improvement in the prediction of a success, even when the duration of the call is not included. Note: the file can be read in R using: d=read.table("bank-additional-full.csv",header=TRUE,sep=";")
   
   The zip file includes two datasets: 
      1) bank-additional-full.csv with all examples, ordered by date (from May 2008 to November 2010).
      2) bank-additional.csv with 10% of the examples (4119), randomly selected from bank-additional-full.csv.
   The smallest dataset is provided to test more computationally demanding machine learning algorithms (e.g., SVM).

   The binary classification goal is to predict if the client will subscribe a bank term deposit (variable y).

5. Number of Instances: 41188 for bank-additional-full.csv

6. Number of Attributes: 20 + output attribute.

7. Attribute information:

   For more information, read [Moro et al., 2014].

   Input variables:
   # bank client data:
   1 - age (numeric)
   2 - job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
   3 - marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
   4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
   5 - default: has credit in default? (categorical: "no","yes","unknown")
   6 - housing: has housing loan? (categorical: "no","yes","unknown")
   7 - loan: has personal loan? (categorical: "no","yes","unknown")
   # related with the last contact of the current campaign:
   8 - contact: contact communication type (categorical: "cellular","telephone") 
   9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
  11 - duration: last contact duration, in seconds (numeric). Important note:  this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
   # other attributes:
  12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14 - previous: number of contacts performed before this campaign and for this client (numeric)
  15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
   # social and economic context attributes
  16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
  17 - cons.price.idx: consumer price index - monthly indicator (numeric)     
  18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)     
  19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
  20 - nr.employed: number of employees - quarterly indicator (numeric)

  Output variable (desired target):
  21 - y - has the client subscribed a term deposit? (binary: "yes","no")

8. Missing Attribute Values: There are several missing values in some categorical attributes, all coded with the "unknown" label. These missing values can be treated as a possible class label or using deletion or imputation techniques. 


In [3]:
data = pd.read_csv('bank-additional-full.csv', sep=';')

In [4]:
data.head()


Out[4]:
   age        job  marital    education  default  housing  loan    contact  month  day_of_week  ...  campaign  pdays  previous     poutcome  emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employed   y
0   56  housemaid  married     basic.4y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
1   57   services  married  high.school  unknown       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
2   37   services  married  high.school       no      yes    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
3   40     admin.  married     basic.6y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
4   56   services  married  high.school       no       no   yes  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no

5 rows × 21 columns
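
Every previewed row has `pdays` equal to 999, which the dataset description defines as "client was not previously contacted". Below is a minimal sketch of one way to handle that sentinel before modelling. It uses a small made-up frame rather than the real file, and the column names `previously_contacted` and `pdays_clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({'pdays': [999, 3, 999, 6]})

# 999 is a sentinel, not a real day count, so record it as a boolean flag ...
toy['previously_contacted'] = toy['pdays'] != 999

# ... and keep a numeric copy in which the sentinel becomes missing (NaN).
toy['pdays_clean'] = toy['pdays'].where(toy['pdays'] != 999, np.nan)

print(toy)
```

Whether to keep the flag, the cleaned column, or both depends on the model; a linear model would otherwise treat 999 as a very large number of days.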


In [5]:
data = data.rename(columns={c: c.replace('.', '_') for c in data.columns})

In [6]:
data.columns


Out[6]:
Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
       'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
      dtype='object')

In [7]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp_var_rate      41188 non-null float64
cons_price_idx    41188 non-null float64
cons_conf_idx     41188 non-null float64
euribor3m         41188 non-null float64
nr_employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB
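
`data.info()` reports no nulls because, as the dataset description notes, missing values in the categorical attributes are coded with the "unknown" label. The sketch below shows one way to count those labels and, optionally, convert them to real missing values. The toy frame is invented for illustration; on the notebook's data the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'services'],
    'default': ['no', 'unknown', 'unknown'],
    'age': [56, 57, 37],
})

# Count the "unknown" label in each non-numeric column.
unknown_counts = toy.select_dtypes(exclude='number').eq('unknown').sum()
print(unknown_counts)

# Optionally turn the label into NaN so that isna()/dropna() can see it.
toy_nan = toy.replace('unknown', np.nan)
print(toy_nan.isna().sum())
```

Keeping "unknown" as its own category is also a valid choice, as the description points out; the conversion is only needed for deletion or imputation.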

In [33]:
data['y'].value_counts().plot(kind='bar', title='Number of observations per label value')


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1aa4ae10b38>

In [8]:
# np.float was removed in NumPy 1.24; the builtin float matches float64 columns.
numeric_cols = data.dtypes[data.dtypes == float].index

In [9]:
numeric_cols


Out[9]:
Index(['emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m',
       'nr_employed'],
      dtype='object')

In [10]:
corrs = data.loc[:, numeric_cols].corr()

import seaborn as sns
sns.heatmap(corrs, annot=True)


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1aa4762d5f8>
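
The heatmap image is not part of this export. As a text alternative, here is a sketch for listing strongly correlated column pairs from a correlation matrix such as `corrs`. The helper name `high_corr_pairs` and the 0.9 threshold are assumptions made for illustration, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corrs, threshold=0.9):
    """Return (col_a, col_b, r) for each column pair with |r| >= threshold.

    Hypothetical helper; only the upper triangle is scanned so each pair
    is reported once.
    """
    cols = corrs.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corrs.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic data: b is almost a copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.05, size=200),
    'c': rng.normal(size=200),
})

pairs = high_corr_pairs(toy.corr())
print(pairs)
```

On the notebook's data the call would be `high_corr_pairs(corrs)`; highly correlated predictors are a common reason to leave one of them out of a regression formula.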

In [11]:
data.groupby(by='y')[numeric_cols].describe().T


Out[11]:
y = no
                  count         mean        std       min       25%       50%       75%       max
emp_var_rate    36548.0     0.248875   1.482932    -3.400    -1.800     1.100     1.400     1.400
cons_price_idx  36548.0    93.603757   0.558993    92.201    93.075    93.918    93.994    94.767
cons_conf_idx   36548.0   -40.593097   4.391155   -50.800   -42.700   -41.800   -36.400   -26.900
euribor3m       36548.0     3.811491   1.638187     0.634     1.405     4.857     4.962     5.045
nr_employed     36548.0  5176.166600  64.571979  4963.600  5099.100  5195.800  5228.100  5228.100

y = yes
                  count         mean        std       min       25%       50%       75%       max
emp_var_rate     4640.0    -1.233448   1.623626    -3.400    -1.800    -1.800    -0.100     1.400
cons_price_idx   4640.0    93.354386   0.676644    92.201    92.893    93.200    93.918    94.767
cons_conf_idx    4640.0   -39.789784   6.139668   -50.800   -46.200   -40.400   -36.100   -26.900
euribor3m        4640.0     2.123135   1.742598     0.634     0.849     1.266     4.406     5.045
nr_employed      4640.0  5095.115991  87.572641  4963.600  5017.500  5099.100  5191.000  5228.100
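
The `count` values in this output (36548 for `no`, 4640 for `yes`) show how unbalanced the target is. A short sketch of the arithmetic, using only those two counts; the variable name `majority_baseline` is introduced here for illustration.

```python
import pandas as pd

# Class counts as reported in the describe() output of this notebook.
counts = pd.Series({'no': 36548, 'yes': 4640})

# Share of each label in the full dataset.
shares = counts / counts.sum()
print(shares.round(4))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = shares.max()
print(round(majority_baseline, 4))
```

A baseline this high is one reason to look at per-class scores such as F1 instead of plain accuracy.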

In [12]:
from sklearn.feature_selection import f_classif

In [13]:
f_test_vals, p_vals = f_classif(data[numeric_cols], data['y'])

In [14]:
anova_results = pd.DataFrame(data={'F_test_statistic': f_test_vals, 'p_vals': p_vals}, index=numeric_cols)

In [15]:
anova_results.plot(kind='bar', subplots=True)
plt.xticks(rotation=45)


Out[15]:
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)
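
For readers unfamiliar with `f_classif`: it runs a one-way ANOVA F-test for each feature against the class label. The sketch below checks this on synthetic data by comparing it with `scipy.stats.f_oneway`; the data and variable names are made up and are not taken from the bank dataset.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y = np.array(['no'] * 50 + ['yes'] * 50)

# Feature 0 shifts with the label, feature 1 does not.
x0 = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
x1 = rng.normal(0.0, 1.0, 100)
X = np.column_stack([x0, x1])

# One F statistic and one p-value per column of X.
f_vals, p_vals = f_classif(X, y)

# The same statistic for feature 0, computed group by group.
f0, p0 = f_oneway(x0[y == 'no'], x0[y == 'yes'])

print(f_vals, f0)
```

A large F statistic means the class means differ by much more than the within-class spread; it says nothing about redundancy between features.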

In [16]:
data['y_numeric'] = (data['y'] == 'yes').astype(int)

In [17]:
import statsmodels.formula.api as sm
model = sm.logit(formula='y_numeric ~ nr_employed', data=data)
result = model.fit()
result.summary()


Optimization terminated successfully.
         Current function value: 0.297725
         Iterations 7
Out[17]:
Logit Regression Results
Dep. Variable:  y_numeric          No. Observations:  41188
Model:          Logit              Df Residuals:      41186
Method:         MLE                Df Model:          1
Date:           Thu, 04 May 2017   Pseudo R-squ.:     0.1543
Time:           22:13:23           Log-Likelihood:    -12263.
converged:      True               LL-Null:           -14499.
                                   LLR p-value:       0.000

                coef  std err        z  P>|z|  [0.025  0.975]
Intercept    66.0828    1.065   62.067  0.000  63.996  68.170
nr_employed  -0.0133    0.000  -63.687  0.000  -0.014  -0.013
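
Two numbers in this summary can be reproduced by hand. `Pseudo R-squ.` is McFadden's pseudo R-squared, 1 - LL_model / LL_null, and a logit coefficient becomes an odds ratio when exponentiated. The sketch below uses the rounded values printed above, so the results are approximate.

```python
import math

# Values copied from the summary (rounded as printed there).
ll_model = -12263.0
ll_null = -14499.0
coef_nr_employed = -0.0133

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1.0 - ll_model / ll_null
print(round(pseudo_r2, 4))

# Multiplicative change in the odds of y = 1 for a one-unit
# increase in nr_employed.
odds_ratio = math.exp(coef_nr_employed)
print(round(odds_ratio, 4))
```

An odds ratio slightly below 1 per unit still adds up, because `nr_employed` ranges over a few hundred units in this dataset.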

In [18]:
col_f = ' + '.join([c for c in numeric_cols if c != 'euribor3m'])
formula = ' ~ '.join(['y_numeric', col_f])
model = sm.logit(formula=formula, data=data)
result = model.fit()
result.summary()


Optimization terminated successfully.
         Current function value: 0.295556
         Iterations 10
Out[18]:
Logit Regression Results
Dep. Variable:  y_numeric          No. Observations:  41188
Model:          Logit              Df Residuals:      41183
Method:         MLE                Df Model:          4
Date:           Thu, 04 May 2017   Pseudo R-squ.:     0.1604
Time:           22:13:24           Log-Likelihood:    -12173.
converged:      True               LL-Null:           -14499.
                                   LLR p-value:       0.000

                    coef  std err       z  P>|z|   [0.025  0.975]
Intercept       -27.5512   10.456  -2.635  0.008  -48.044  -7.058
emp_var_rate     -0.4715    0.051  -9.160  0.000   -0.572  -0.371
cons_price_idx    0.6534    0.076   8.622  0.000    0.505   0.802
cons_conf_idx     0.0371    0.003  12.106  0.000    0.031   0.043
nr_employed      -0.0067    0.001  -9.187  0.000   -0.008  -0.005
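
Since the single-predictor model (`y_numeric ~ nr_employed`) is nested in this four-predictor model, the two fits can be compared with a likelihood-ratio test. This is a sketch using the rounded log-likelihoods printed in the two summaries, not something the notebook itself computes.

```python
from scipy.stats import chi2

# Log-likelihoods copied from the two summaries (rounded as printed).
ll_reduced = -12263.0   # y_numeric ~ nr_employed
ll_full = -12173.0      # four-predictor model

# The full model has three more parameters than the reduced one.
df_diff = 4 - 1

# Likelihood-ratio statistic, chi-squared distributed under the null
# hypothesis that the extra coefficients are all zero.
lr_stat = 2.0 * (ll_full - ll_reduced)
p_value = chi2.sf(lr_stat, df_diff)

print(lr_stat, p_value)
```

A tiny p-value indicates the extra predictors improve the fit beyond chance, although the gain in pseudo R-squared (0.1543 to 0.1604) is modest.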

In [19]:
data = data.drop('duration', axis=1)

In [20]:
# np.object was removed in NumPy 1.24; the builtin object matches object columns.
cat_columns = data.dtypes[data.dtypes == object].index

In [21]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, confusion_matrix

In [24]:
fold = StratifiedKFold(n_splits=5, shuffle=True)

def cv(train_index, test_index, df):
    """Fit the logit model on one training fold; return per-class F1 on the test fold."""
    train_data = df.iloc[train_index]
    test_data = df.iloc[test_index]

    # Same predictors as In [18]: every float column except euribor3m.
    col_f = ' + '.join([c for c in numeric_cols if c != 'euribor3m'])
    formula = ' ~ '.join(['y_numeric', col_f])
    result = sm.logit(formula=formula, data=train_data).fit()

    # Turn predicted probabilities into labels with a 0.5 cut-off.
    predictions = result.predict(test_data).map(lambda v: 'yes' if v >= 0.5 else 'no')
    actual = test_data['y']

    # f1_score expects (y_true, y_pred); scores come back in the order of `labels`.
    return f1_score(actual, predictions, labels=['yes', 'no'], average=None)

cv_results = [cv(train_index, test_index, data)
              for train_index, test_index in fold.split(data, data['y'])]


Optimization terminated successfully.
         Current function value: 0.296344
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.295808
         Iterations 8
d:\Anaconda3\envs\latest\lib\site-packages\sklearn\metrics\classification.py:1115: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
Optimization terminated successfully.
         Current function value: 0.294743
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.294605
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.296220
         Iterations 7
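
The `UndefinedMetricWarning` in this output is raised when a label is absent from one side of the comparison, for example when a 0.5 cut-off on an unbalanced target produces no `yes` predictions in a fold. The sketch below reproduces that situation on made-up labels and uses the `zero_division` parameter (available in scikit-learn 0.22 and later) to set the score explicitly instead of warning.

```python
from sklearn.metrics import f1_score

# Made-up labels: two positives in the truth, none in the predictions.
actual = ['no', 'no', 'yes', 'no', 'yes']
predicted = ['no'] * 5

# Precision for 'yes' is 0/0 here; zero_division=0 defines it as 0
# and suppresses the warning.
scores = f1_score(actual, predicted, labels=['yes', 'no'],
                  average=None, zero_division=0)

print(dict(zip(['yes', 'no'], scores)))
```

An F1 of zero for `yes` in a fold therefore signals that the classifier never predicted the positive class there, which a lower probability cut-off would change.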

In [28]:
results_df = pd.DataFrame(cv_results, columns=['f1_score_yes', 'f1_score_no'])

In [29]:
results_df.plot(kind='bar', subplots=True)


Out[29]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001AA4AA09940>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001AA4AB0F2E8>], dtype=object)