


In [1]:

    
import pandas as pd
import numpy as np
%pylab inline
pylab.style.use('ggplot')









    



Populating the interactive namespace from numpy and matplotlib



In [2]:

    
with open('bank-additional-names.txt', encoding='utf-8') as f:
    text = f.read()

print(text)









    



Citation Request:
  This dataset is publicly available for research. The details are described in [Moro et al., 2014]. 
  Please include this citation if you plan to use this database:

  [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.001

  Available at: [pdf] http://dx.doi.org/10.1016/j.dss.2014.03.001
                [bib] http://www3.dsi.uminho.pt/pcortez/bib/2014-dss.txt

1. Title: Bank Marketing (with social/economic context)

2. Sources
   Created by: Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL) @ 2014
   
3. Past Usage:

  The full dataset (bank-additional-full.csv) was described and analyzed in:

  S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.
 
4. Relevant Information:

   This dataset is based on "Bank Marketing" UCI dataset (please check the description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).
   The data is enriched by the addition of five new social and economic features/attributes (national wide indicators from a ~10M population country), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb.
   This dataset is almost identical to the one used in [Moro et al., 2014] (it does not include all attributes due to privacy concerns). 
   Using the rminer package and R tool (http://cran.r-project.org/web/packages/rminer/), we found that the addition of the five new social and economic attributes (made available here) lead to substantial improvement in the prediction of a success, even when the duration of the call is not included. Note: the file can be read in R using: d=read.table("bank-additional-full.csv",header=TRUE,sep=";")
   
   The zip file includes two datasets: 
      1) bank-additional-full.csv with all examples, ordered by date (from May 2008 to November 2010).
      2) bank-additional.csv with 10% of the examples (4119), randomly selected from bank-additional-full.csv.
   The smallest dataset is provided to test more computationally demanding machine learning algorithms (e.g., SVM).

   The binary classification goal is to predict if the client will subscribe a bank term deposit (variable y).

5. Number of Instances: 41188 for bank-additional-full.csv

6. Number of Attributes: 20 + output attribute.

7. Attribute information:

   For more information, read [Moro et al., 2014].

   Input variables:
   # bank client data:
   1 - age (numeric)
   2 - job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
   3 - marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
   4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
   5 - default: has credit in default? (categorical: "no","yes","unknown")
   6 - housing: has housing loan? (categorical: "no","yes","unknown")
   7 - loan: has personal loan? (categorical: "no","yes","unknown")
   # related with the last contact of the current campaign:
   8 - contact: contact communication type (categorical: "cellular","telephone") 
   9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
  11 - duration: last contact duration, in seconds (numeric). Important note:  this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
   # other attributes:
  12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14 - previous: number of contacts performed before this campaign and for this client (numeric)
  15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
   # social and economic context attributes
  16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
  17 - cons.price.idx: consumer price index - monthly indicator (numeric)     
  18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)     
  19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
  20 - nr.employed: number of employees - quarterly indicator (numeric)

  Output variable (desired target):
  21 - y - has the client subscribed a term deposit? (binary: "yes","no")

8. Missing Attribute Values: There are several missing values in some categorical attributes, all coded with the "unknown" label. These missing values can be treated as a possible class label or using deletion or imputation techniques.
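The description above mentions two encoding conventions that are easy to overlook: missing categorical values are stored as the string "unknown", and pdays uses 999 to mean "not previously contacted". A minimal sketch of one way to make both explicit is shown below. The frame `toy` is made-up illustration data, not rows from the real file, and the column name `previously_contacted` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic three columns of the real file (illustration only).
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'services', 'technician'],
    'default': ['no', 'unknown', 'unknown', 'no'],
    'pdays': [999, 3, 999, 6],
})

# Turn the "unknown" label into a real missing value so that
# isna(), dropna() and imputers can see it.
cat_cols = ['job', 'default']
cleaned = toy.copy()
cleaned[cat_cols] = cleaned[cat_cols].replace('unknown', np.nan)
missing_per_column = cleaned[cat_cols].isna().sum()

# pdays == 999 is a sentinel, not a real number of days.
# Keep the information as a flag and blank out the sentinel itself.
cleaned['previously_contacted'] = cleaned['pdays'] != 999
cleaned['pdays'] = cleaned['pdays'].where(cleaned['previously_contacted'])

print(missing_per_column)
print(cleaned)
```

Whether to delete, impute, or keep "unknown" as its own category is a modelling decision; the description lists all three as options.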



In [3]:

    
data = pd.read_csv('bank-additional-full.csv', sep=';')



In [4]:

    
data.head()









    Out[4]:






  
    
      
   age        job  marital    education  default  housing  loan    contact  month  day_of_week  ...  campaign  pdays  previous     poutcome  emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employed   y
0   56  housemaid  married     basic.4y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
1   57   services  married  high.school  unknown       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
2   37   services  married  high.school       no      yes    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
3   40     admin.  married     basic.6y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
4   56   services  married  high.school       no       no   yes  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0  no
    
  

5 rows × 21 columns



In [5]:

    
data = data.rename(columns={c: c.replace('.', '_') for c in data.columns})



In [6]:

    
data.columns









    Out[6]:





Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
       'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
      dtype='object')



In [7]:

    
data.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp_var_rate      41188 non-null float64
cons_price_idx    41188 non-null float64
cons_conf_idx     41188 non-null float64
euribor3m         41188 non-null float64
nr_employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB



In [33]:

    
data['y'].value_counts().plot(kind='bar', title='number of observations per label value.')









    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x1aa4ae10b38>
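The bar chart above shows how unbalanced the two labels are. As a small worked example, the class shares can be computed from the per-class counts that the grouped describe() output in this notebook reports (36548 "no", 4640 "yes"). The variable names are illustrative.

```python
# Class counts as reported by the per-class describe() output in this notebook.
n_no, n_yes = 36548, 4640
n_total = n_no + n_yes

# Share of positive examples, and the accuracy of a model that always says "no".
yes_share = n_yes / n_total
majority_accuracy = n_no / n_total

print(f"total rows: {n_total}")
print(f"share of 'yes': {yes_share:.4f}")
print(f"accuracy of always predicting 'no': {majority_accuracy:.4f}")
```

Because a constant "no" prediction is already right most of the time, plain accuracy says little here, which is one reason to look at per-class F1 scores instead.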



In [8]:

    
# Select the float-valued columns (the five social/economic indicators).
numeric_cols = data.dtypes[data.dtypes == np.float64].index



In [9]:

    
numeric_cols









    Out[9]:





Index(['emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m',
       'nr_employed'],
      dtype='object')



In [10]:

    
corrs = data.loc[:, numeric_cols].corr()

import seaborn as sns
sns.heatmap(corrs, annot=True)









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x1aa4762d5f8>



In [11]:

    
data.groupby(by='y')[numeric_cols].describe().T









    Out[11]:






  
    
y                    no                                                                                yes
                  count         mean        std       min       25%       50%       75%       max    count         mean        std       min       25%       50%       75%       max
emp_var_rate    36548.0     0.248875   1.482932    -3.400    -1.800     1.100     1.400     1.400   4640.0    -1.233448   1.623626    -3.400    -1.800    -1.800    -0.100     1.400
cons_price_idx  36548.0    93.603757   0.558993    92.201    93.075    93.918    93.994    94.767   4640.0    93.354386   0.676644    92.201    92.893    93.200    93.918    94.767
cons_conf_idx   36548.0   -40.593097   4.391155   -50.800   -42.700   -41.800   -36.400   -26.900   4640.0   -39.789784   6.139668   -50.800   -46.200   -40.400   -36.100   -26.900
euribor3m       36548.0     3.811491   1.638187     0.634     1.405     4.857     4.962     5.045   4640.0     2.123135   1.742598     0.634     0.849     1.266     4.406     5.045
nr_employed     36548.0  5176.166600  64.571979  4963.600  5099.100  5195.800  5228.100  5228.100   4640.0  5095.115991  87.572641  4963.600  5017.500  5099.100  5191.000  5228.100



In [12]:

    
from sklearn.feature_selection import f_classif



In [13]:

    
f_test_vals, p_vals = f_classif(data[numeric_cols], data['y'])



In [14]:

    
anova_results = pd.DataFrame(data={'F_test_statistic': f_test_vals, 'p_vals': p_vals}, index=numeric_cols)



In [15]:

    
anova_results.plot(kind='bar', subplots=True)
plt.xticks(rotation=45)









    Out[15]:





(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)
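For readers unfamiliar with f_classif: it scores each feature with a one-way ANOVA F statistic, the ratio of between-class variance to within-class variance. The sketch below recomputes that ratio by hand for a single feature on made-up numbers. The function `one_way_f` and the toy values are illustrative assumptions, not part of scikit-learn or of this dataset.

```python
import numpy as np

def one_way_f(values, labels):
    """One-way ANOVA F statistic for a single feature split by class label."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()

    # Variation of the group means around the overall mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up feature values for two classes (illustration only).
toy_values = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
toy_labels = ['no', 'no', 'no', 'yes', 'yes', 'yes']
f_stat = one_way_f(toy_values, toy_labels)
print(f_stat)
```

A large F value means the class means are far apart relative to the spread inside each class, which is what the bar chart of F statistics compares across the five indicators.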



In [16]:

    
# Encode the target as 1 for 'yes' and 0 for 'no'.
data['y_numeric'] = (data['y'] == 'yes').astype(int)



In [17]:

    
import statsmodels.formula.api as sm
model = sm.logit(formula='y_numeric ~ nr_employed', data=data)
result = model.fit()
result.summary()









    



Optimization terminated successfully.
         Current function value: 0.297725
         Iterations 7






    Out[17]:





Logit Regression Results

  Dep. Variable:      y_numeric       No. Observations:     41188 


  Model:                Logit         Df Residuals:         41186 


  Method:                MLE          Df Model:                 1 


  Date:           Thu, 04 May 2017    Pseudo R-squ.:       0.1543 


  Time:               22:13:23        Log-Likelihood:      -12263.


  converged:            True          LL-Null:             -14499.


                                    LLR p-value:          0.000 




                 coef      std err       z       P>|z|   [0.025     0.975]  


  Intercept       66.0828      1.065     62.067   0.000     63.996     68.170


  nr_employed     -0.0133      0.000    -63.687   0.000     -0.014     -0.013
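As a rough reading aid for the coefficients above, the sketch below converts the fitted slope into an odds ratio and evaluates the logistic function at nr_employed = 5191.0, the value seen in the first rows of the data. It uses the rounded coefficients from the summary table. Because the intercept and slope nearly cancel, the rounding alone can shift the result noticeably, so treat the probability as approximate.

```python
import math

# Rounded coefficients copied from the summary table above.
intercept = 66.0828
coef_nr_employed = -0.0133

# Multiplicative change in the odds of 'yes' per one-unit increase in nr_employed.
odds_ratio = math.exp(coef_nr_employed)

def predicted_probability(nr_employed):
    """Logistic function applied to the linear predictor."""
    log_odds = intercept + coef_nr_employed * nr_employed
    return 1.0 / (1.0 + math.exp(-log_odds))

p_first_row = predicted_probability(5191.0)

print(f"odds ratio per unit of nr_employed: {odds_ratio:.4f}")
print(f"approximate P(yes) at nr_employed = 5191.0: {p_first_row:.3f}")
```

An odds ratio slightly below 1 matches the negative sign of the coefficient: higher values of nr_employed go with lower odds of a subscription.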



In [18]:

    
col_f = ' + '.join([c for c in numeric_cols if c != 'euribor3m'])
formula = ' ~ '.join(['y_numeric', col_f])
model = sm.logit(formula=formula, data=data)
result = model.fit()
result.summary()









    



Optimization terminated successfully.
         Current function value: 0.295556
         Iterations 10






    Out[18]:





Logit Regression Results

  Dep. Variable:      y_numeric       No. Observations:     41188 


  Model:                Logit         Df Residuals:         41183 


  Method:                MLE          Df Model:                 4 


  Date:           Thu, 04 May 2017    Pseudo R-squ.:       0.1604 


  Time:               22:13:24        Log-Likelihood:      -12173.


  converged:            True          LL-Null:             -14499.


                                    LLR p-value:          0.000 




                    coef      std err       z       P>|z|   [0.025     0.975]  


  Intercept         -27.5512     10.456     -2.635   0.008    -48.044     -7.058


  emp_var_rate       -0.4715      0.051     -9.160   0.000     -0.572     -0.371


  cons_price_idx      0.6534      0.076      8.622   0.000      0.505      0.802


  cons_conf_idx       0.0371      0.003     12.106   0.000      0.031      0.043


  nr_employed        -0.0067      0.001     -9.187   0.000     -0.008     -0.005



In [19]:

    
data = data.drop('duration', axis=1)



In [20]:

    
# Select the string-valued (categorical) columns; this still includes the target 'y'.
cat_columns = data.dtypes[data.dtypes == object].index



In [21]:

    
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, confusion_matrix



In [24]:

    
fold = StratifiedKFold(n_splits=5, shuffle=True)

def cv(train_index, test_index, df):
    """Fit the logit model on one training fold and return per-class F1 on the test fold."""
    train_data = df.iloc[train_index]
    test_data = df.iloc[test_index]

    # Same formula as in the previous model: all float indicators except euribor3m.
    col_f = ' + '.join([c for c in numeric_cols if c != 'euribor3m'])
    formula = ' ~ '.join(['y_numeric', col_f])
    result = sm.logit(formula=formula, data=train_data).fit()

    # Convert predicted probabilities to labels with a 0.5 cut-off.
    predictions = result.predict(test_data).map(lambda v: 'yes' if v >= 0.5 else 'no')
    actual = test_data['y']

    # scikit-learn metrics expect (y_true, y_pred), in that order.
    return f1_score(actual, predictions, labels=['yes', 'no'], average=None)

# The split only needs X for its length; the labels in y drive the stratification.
cv_results = [cv(train_index, test_index, data) for train_index, test_index in fold.split(data, data['y'])]









    



Optimization terminated successfully.
         Current function value: 0.296344
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.295808
         Iterations 8






    



d:\Anaconda3\envs\latest\lib\site-packages\sklearn\metrics\classification.py:1115: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)






    



Optimization terminated successfully.
         Current function value: 0.294743
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.294605
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.296220
         Iterations 7
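The UndefinedMetricWarning in the output above is consistent with a fold in which the 0.5 cut-off produced no "yes" labels at all, which can happen on data this unbalanced. The sketch below shows how a per-class F1 score is built from counts and what it becomes in that situation. The helper `f1_for_label` and the toy label lists are illustrative assumptions, not the notebook's results.

```python
def f1_for_label(actual, predicted, label):
    """F1 = 2*TP / (2*TP + FP + FN) for one class; 0.0 when undefined."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator

# Made-up labels (illustration only).
actual = ['no', 'no', 'no', 'yes', 'yes']
predicted = ['no', 'no', 'no', 'no', 'yes']

f1_yes = f1_for_label(actual, predicted, 'yes')
f1_no = f1_for_label(actual, predicted, 'no')

# A classifier that never predicts 'yes' scores zero on that class.
f1_yes_all_no = f1_for_label(actual, ['no'] * len(actual), 'yes')

print(f1_yes, f1_no, f1_yes_all_no)
```

If the "yes" scores turn out to be near zero across folds, lowering the decision threshold below 0.5 is one option worth trying; this notebook does not explore it.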



In [28]:

    
results_df = pd.DataFrame(cv_results, columns=['f1_score_yes', 'f1_score_no'])



In [29]:

    
results_df.plot(kind='bar', subplots=True)









    Out[29]:





array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001AA4AA09940>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001AA4AB0F2E8>], dtype=object)
