Exercise with bank marketing data

Introduction

Data from the UCI Machine Learning Repository: data, data dictionary
Goal: Predict whether a customer will purchase a bank product marketed over the phone
bank-additional.csv is already in our repo (data/bank-additional.csv), so there is no need to download it from the UCI website

Step 1: Read the data into Pandas



In [2]:

    
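import pandas as pd

# the file is semicolon-delimited, so the separator must be passed explicitly
url = 'data/bank-additional.csv'
bank = pd.read_csv(url, sep=';')

# preview the first five rows
bank.head()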









    Out[2]:






  
    
      
	age	job	marital	education	default	housing	loan	contact	month	day_of_week	...	campaign	pdays	previous	poutcome	emp.var.rate	cons.price.idx	cons.conf.idx	euribor3m	nr.employed	y
0	30	blue-collar	married	basic.9y	no	yes	no	cellular	may	fri	...	2	999	0	nonexistent	-1.8	92.893	-46.2	1.313	5099.1	no
1	39	services	single	high.school	no	no	no	telephone	may	fri	...	4	999	0	nonexistent	1.1	93.994	-36.4	4.855	5191.0	no
2	25	services	married	high.school	no	yes	no	telephone	jun	wed	...	1	999	0	nonexistent	1.4	94.465	-41.8	4.962	5228.1	no
3	38	services	married	basic.9y	no	unknown	unknown	telephone	jun	fri	...	3	999	0	nonexistent	1.4	94.465	-41.8	4.959	5228.1	no
4	47	admin.	married	university.degree	no	yes	no	cellular	nov	mon	...	1	999	0	nonexistent	-0.1	93.200	-42.0	4.191	5195.8	no

5 rows × 21 columns

Step 2: Prepare at least three features

Include both numeric and categorical features
Choose features that you think might be related to the response (based on intuition or exploration)
Think about how to handle missing values (encoded as "unknown")
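
One possible way to approach these points is sketched below. It is only an illustration: the toy frame `sample` is a hypothetical stand-in built from a few columns of the five rows shown in the Step 1 output, and keeping "unknown" as its own category is an assumption, not the only reasonable choice (dropping those rows or imputing a value are alternatives).

```python
import pandas as pd

# Hypothetical toy frame: a few columns from the five rows printed in Step 1.
# In the exercise you would work with the full `bank` DataFrame instead.
sample = pd.DataFrame({
    'age': [30, 39, 25, 38, 47],
    'housing': ['yes', 'no', 'yes', 'unknown', 'yes'],
    'contact': ['cellular', 'telephone', 'telephone', 'telephone', 'cellular'],
    'y': ['no', 'no', 'no', 'no', 'no'],
})

categorical = ['housing', 'contact']

# count how often each categorical column uses the 'unknown' placeholder
unknown_counts = (sample[categorical] == 'unknown').sum()

# assumption: treat 'unknown' as a regular category, so it becomes one more
# dummy column; drop_first=True drops one level per feature as the baseline
dummies = pd.get_dummies(sample[categorical], drop_first=True)

# combine the numeric feature with the dummy columns, and encode y as 0/1
X = pd.concat([sample[['age']], dummies], axis=1)
y = sample['y'].map({'no': 0, 'yes': 1})

print(unknown_counts)
print(X.columns.tolist())
```

Checking `unknown_counts` on the full data first helps decide whether a separate "unknown" category is worth keeping for a given column.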



In [2]:

    
# list all columns (for reference)
bank.columns









    Out[2]:





Index([u'age', u'job', u'marital', u'education', u'default', u'housing',
       u'loan', u'contact', u'month', u'day_of_week', u'duration', u'campaign',
       u'pdays', u'previous', u'poutcome', u'emp.var.rate', u'cons.price.idx',
       u'cons.conf.idx', u'euribor3m', u'nr.employed', u'y'],
      dtype='object')

Step 3: Model building

Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features
Try to increase the AUC by selecting different sets of features
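
The two points above could be sketched roughly as follows. So that the snippet runs on its own, it uses a synthetic dataset from `make_classification` as a stand-in for the prepared bank features; the data, the `feature_sets` dictionary and the choice of 10 folds are all assumptions for illustration, and in the exercise you would pass your own `X` and `y` from Step 2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (an assumption, NOT the bank data);
# replace X and y with the features and response prepared in Step 2.
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9],
                           random_state=1)

logreg = LogisticRegression(max_iter=1000)

# hypothetical candidate feature sets, given as column positions in X
feature_sets = {
    'first two columns': [0, 1],
    'all columns': [0, 1, 2, 3, 4],
}

# mean cross-validated AUC for each feature set
# (for a classifier, an integer cv uses stratified folds)
results = {}
for name, cols in feature_sets.items():
    scores = cross_val_score(logreg, X[:, cols], y, cv=10, scoring='roc_auc')
    results[name] = scores.mean()

for name, auc in results.items():
    print(name, round(auc, 3))
```

Comparing the mean AUC across feature sets, rather than a single train/test split, gives a steadier basis for deciding which features to keep.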
