Exercise with bank marketing data

Introduction

  • Data: the Bank Marketing dataset from the UCI Machine Learning Repository (data files and data dictionary)
  • Goal: Predict whether a customer will purchase a bank product marketed over the phone
  • bank-additional.csv is already in our repo, so there is no need to download the data from the UCI website

Step 1: Read the data into Pandas


In [2]:
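import pandas as pd

# local copy of the dataset (a path inside the repo, not a remote URL)
path = 'data/bank-additional.csv'
# the file is semicolon-delimited, so override the default comma separator
bank = pd.read_csv(path, sep=';')
bank.head()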


Out[2]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 30 blue-collar married basic.9y no yes no cellular may fri ... 2 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1 no
1 39 services single high.school no no no telephone may fri ... 4 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 no
2 25 services married high.school no yes no telephone jun wed ... 1 999 0 nonexistent 1.4 94.465 -41.8 4.962 5228.1 no
3 38 services married basic.9y no unknown unknown telephone jun fri ... 3 999 0 nonexistent 1.4 94.465 -41.8 4.959 5228.1 no
4 47 admin. married university.degree no yes no cellular nov mon ... 1 999 0 nonexistent -0.1 93.200 -42.0 4.191 5195.8 no

5 rows × 21 columns

Step 2: Prepare at least three features

  • Include both numeric and categorical features
  • Choose features that you think might be related to the response (based on intuition or exploration)
  • Think about how to handle missing values (encoded as "unknown")
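
One way to start on the missing-values question is to count the "unknown" entries in each column before deciding whether to drop those rows, impute them, or keep "unknown" as its own category. The helper below is only a sketch: `count_unknowns` is a hypothetical name introduced here, not part of pandas or of the original exercise.

```python
import pandas as pd

def count_unknowns(df, marker='unknown'):
    """Return the number of `marker` entries per column, largest first.

    Hypothetical helper for illustration; columns that contain no
    `marker` entries are left out of the result.
    """
    # isin() yields a boolean frame; summing it counts matches per column
    counts = df.isin([marker]).sum()
    return counts[counts > 0].sort_values(ascending=False)
```

Calling `count_unknowns(bank)` on the DataFrame from Step 1 would list the affected columns, which can help in choosing a strategy per feature.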

In [2]:
# list all columns (for reference)
bank.columns


Out[2]:
Index([u'age', u'job', u'marital', u'education', u'default', u'housing',
       u'loan', u'contact', u'month', u'day_of_week', u'duration', u'campaign',
       u'pdays', u'previous', u'poutcome', u'emp.var.rate', u'cons.price.idx',
       u'cons.conf.idx', u'euribor3m', u'nr.employed', u'y'],
      dtype='object')
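
With the column list in hand, one possible way to build a feature matrix is to keep numeric columns as they are and dummy-encode the categorical ones. The sketch below makes several assumptions: the helper names `prepare_features` and `encode_response` are hypothetical, "unknown" is simply kept as its own category rather than dropped or imputed, and the feature choice is left to the caller. Note also that the UCI data dictionary cautions against using `duration` in a realistic model, because a call's duration is not known before the call is made.

```python
import pandas as pd

def prepare_features(df, numeric_cols, categorical_cols):
    """Combine numeric columns with dummy-encoded categorical columns.

    Hypothetical helper: "unknown" is treated as an ordinary category here,
    which is only one of several reasonable choices.
    """
    X_num = df[numeric_cols]
    # drop_first=True removes one level per feature to avoid redundant columns
    X_cat = pd.get_dummies(df[categorical_cols], drop_first=True, dtype=int)
    return pd.concat([X_num, X_cat], axis=1)

def encode_response(df, column='y'):
    """Map the 'yes'/'no' response to 1/0."""
    return (df[column] == 'yes').astype(int)
```

For example, `X = prepare_features(bank, ['age', 'euribor3m'], ['housing', 'contact'])` and `y = encode_response(bank)` would give one numeric-plus-categorical starting point; the specific columns named here are illustrative, not a recommendation from the exercise.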

Step 3: Model building

  • Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features
  • Try to increase the AUC by selecting different sets of features
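
The evaluation described above can be sketched with scikit-learn's `cross_val_score` and `scoring='roc_auc'`. This is a minimal example under stated assumptions, not the exercise's reference solution: `cv_auc` is a hypothetical helper name, 10 folds is an arbitrary default, and the scaling step and `max_iter=1000` are additions meant to help the solver converge on features with very different ranges.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_auc(X, y, folds=10):
    """Return the mean cross-validated AUC of a logistic regression model.

    Hypothetical helper: X is a numeric feature matrix and y a 0/1 response.
    """
    # scaling lives inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=folds, scoring='roc_auc')
    return scores.mean()
```

To compare feature sets, call `cv_auc` once per candidate feature matrix built in Step 2 (with the matching 0/1 response) and keep the set with the highest mean AUC.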