In [1]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data


--2019-05-26 17:05:55--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3974305 (3.8M) [application/x-httpd-php]
Saving to: ‘adult.data.3’

adult.data.3        100%[===================>]   3.79M  1.36MB/s    in 2.8s    

2019-05-26 17:05:58 (1.36 MB/s) - ‘adult.data.3’ saved [3974305/3974305]

Imports


In [2]:
import pandas as pd
import numpy as np

Read in our data


In [3]:
column_names = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]

I am using Pandas to read and manipulate the data. It makes it easy to work with tabular data and is the default tool for most data-science work.

The data structure it returns is a DataFrame.


In [4]:
df = pd.read_csv('adult.data', names=column_names)

In [5]:
# quick look at the top few rows
df.head()


Out[5]:
Age WorkClass fnlwgt Education EducationNum MaritalStatus Occupation Relationship Race Gender CapitalGain CapitalLoss HoursPerWeek NativeCountry Income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

In [6]:
# and get a sense of the size of the dataframe
df.shape


Out[6]:
(32561, 15)

Pandas can access columns by their string names. In this case we want to turn the Income column into a target variable (the thing we are trying to predict). Everything matching >50K becomes True and everything else becomes False. This 'binary target' is the bread and butter of classification.

As a technical note, when you use a string to grab a column from a pandas DataFrame, it returns a Series. I use the .values attribute to get a numpy array and then turn the True and False values into 1s and 0s by casting to int. Also note the leading space in ' >50K': the values in the raw file come with a leading space.


In [7]:
y = (df['Income'] == ' >50K').values.astype(int)

In [8]:
use_columns = ['Age', 'HoursPerWeek']
X = df[use_columns]

We are almost ready to fit our first model, but before we do, we need to create training and testing sets. Why is it necessary to split our data? We want to be sure we are not overfitting (see the Wikipedia article on overfitting).

A model that fits the training data perfectly may not generalize well to new data; evaluating on held-out data gives an honest estimate of performance.


In [9]:
from sklearn.model_selection import train_test_split

In [10]:
# split both X and y into training and testing sets - the `random_state` argument makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=432345)

Logistic Regression

Let's start with one of the simplest models: logistic regression. By default, this model adds "regularization", in this case "L2" regularization. In practical terms this makes a model that 1) tries to fit the data as well as possible while 2) keeping the coefficients as small as possible. This second part helps to avoid overfitting by not letting the model fit too closely to our training data.
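A minimal sketch of how you can dial that penalty up or down: in scikit-learn's LogisticRegression, the C parameter is the inverse of the regularization strength, so a smaller C shrinks the coefficients harder. The specific C values below are arbitrary, chosen only for illustration.

from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength:
# smaller C = stronger L2 penalty = coefficients pushed harder toward zero
weak_penalty = LogisticRegression(C=100.0)   # mild penalty, fits training data more closely
strong_penalty = LogisticRegression(C=0.01)  # heavy penalty, keeps coefficients small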


In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
# initialize the model by calling the class `LogisticRegression`
model = LogisticRegression()

In [13]:
# now that we have our model we can call the `fit` method
model.fit(X_train, y_train)


Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
# get the predicted classes (0 or 1)
model.predict(X_train)


Out[14]:
array([0, 0, 0, ..., 0, 0, 0])

In [15]:
# get the predicted probability of each class; this is an N x 2 array where
# the first column is the probability of class 0 and the second of class 1
model.predict_proba(X_train)


Out[15]:
array([[0.6119385 , 0.3880615 ],
       [0.51311941, 0.48688059],
       [0.91545097, 0.08454903],
       ...,
       [0.93442776, 0.06557224],
       [0.84230077, 0.15769923],
       [0.81645598, 0.18354402]])

Testing our model

We want to see how well our model fits the data. This is a classification model, so one sensible way to evaluate it is to look at the "accuracy": how often does something that we say should happen actually happen, and how often does something we say shouldn't happen not happen (see https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers for more).

Accuracy can be a misleading metric depending on the base rate of the event. If the event happens 99% of the time, a classifier could get 99% accuracy by always guessing that the event occurs.

A better metric balances the true positive rate (the fraction of actual positives we correctly flagged) against the false positive rate (the fraction of actual negatives we incorrectly flagged). A popular metric that does this is the "area under the receiver operating characteristic curve", or AUROC (sometimes just called AUC).

In practice this metric runs from 0.5 to 1.0, where 0.5 corresponds to a random classifier and 1.0 to a perfect classifier (formally it can dip below 0.5, which means the classifier is doing worse than chance).
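To make the base-rate pitfall concrete, here is a small sketch with made-up labels: a classifier that always predicts the majority class gets 99% accuracy but an AUROC of exactly 0.5.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# toy labels: the event happens 99% of the time
y_true = np.array([1] * 99 + [0])
# a "classifier" that always predicts the event, with a constant score
y_always = np.ones_like(y_true)

print(accuracy_score(y_true, y_always))  # 0.99 -- looks impressive
print(roc_auc_score(y_true, y_always))   # 0.5  -- no better than chance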


In [16]:
from sklearn.metrics import roc_curve, roc_auc_score

In [17]:
# we are going to get the false positive rates and true positive rates to plot
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
# and the total area under the curve
auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

In [18]:
import matplotlib.pyplot as plt
%matplotlib inline

In [19]:
# plot the ROC
fig, axis = plt.subplots()
axis.plot(fpr, tpr)
axis.set_xlabel('False Positive Rate')
axis.set_ylabel('True Positive Rate')

axis.plot([0, 1], [0, 1], color='black', linestyle='--')
axis.set_xlim([0, 1])
axis.set_ylim([0, 1])
axis.set_title(f'AUROC: {auroc:.2f}')


Out[19]:
Text(0.5,1,'AUROC: 0.73')

We can look at the coefficients of our model by examining the coef_ attribute.


In [20]:
model.coef_


Out[20]:
array([[0.04314066, 0.04641807]])
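The raw coef_ array is unlabeled. To see which coefficient belongs to which feature, we can pair it with the column names, as in this quick sketch:

# pair each coefficient with its feature name
for name, coef in zip(X_train.columns, model.coef_[0]):
    print(f'{name}: {coef:.4f}')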

Since we will need to make the ROC plot often, let's turn it into a function.


In [21]:
def make_auc_plot(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    """
    Make a beautiful plot of the receiver operating characteristic curve
    
    Parameters
    ----------
    y_true: ndarray
        An array of the true labels (1s for True, 0s for False)
    y_pred: ndarray
        An array of the predictions (between 0 and 1)
    
    
    Side Effects
    ------------
    Produces a matplotlib figure
    """
  
    fpr, tpr, thresholds = roc_curve(y_true, y_pred)
    auroc = roc_auc_score(y_true, y_pred)
  
    fig, axis = plt.subplots()
    axis.plot(fpr, tpr)
    axis.set_xlabel('False Positive Rate')
    axis.set_ylabel('True Positive Rate')

    axis.plot([0, 1], [0, 1], color='black', linestyle='--')
    axis.set_xlim([0, 1])
    axis.set_ylim([0, 1])
    axis.set_title(f'AUROC: {auroc:.2f}')  # using formatting codes here https://pyformat.info/

New model types

Decision Trees

These flexible models can fit complicated functions, capture interactions between features, and handle many kinds of data. Left unconstrained, they tend not to generalize well and to overfit. (See the Wikipedia article and the scikit-learn docs.)

An example of a decision tree:
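A minimal sketch that fits a deliberately tiny tree on our two features and prints its rules as text. This assumes scikit-learn 0.21 or newer, which added export_text:

from sklearn.tree import DecisionTreeClassifier, export_text

# keep the tree shallow so the printed rules stay readable
demo_tree = DecisionTreeClassifier(max_depth=2)
demo_tree.fit(X_train, y_train)
print(export_text(demo_tree, feature_names=list(X_train.columns)))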


In [22]:
# again we will import a new model type from sklearn
from sklearn.tree import DecisionTreeClassifier

In [23]:
# we initialize the model using the same structure as before
tree_model = DecisionTreeClassifier(max_depth=10, min_samples_leaf=10)

In [24]:
tree_model.fit(X_train, y_train)


Out[24]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [25]:
y_pred = tree_model.predict_proba(X_test)[:, 1]
make_auc_plot(y_test, y_pred)



In [26]:
y_pred = tree_model.predict_proba(X_train)[:, 1]
make_auc_plot(y_train, y_pred)
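Compare the two plots: if the AUROC on the training data is noticeably higher than on the test data, the tree is overfitting, which is exactly the failure mode our train/test split is designed to expose.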


Add Another Feature

Let's add a new feature, MaritalStatus, to our model. We run into a problem: MaritalStatus is a feature made up of different strings, so we need a way to 'encode' it as numeric values.

The method we will use here is called 'dummy encoding' or 'one-hot' encoding. We create a new feature for each unique string in the old feature, and give a row a 1 in that new feature if it takes that value (e.g. we create a 'Divorced' feature and place a 1 in each row that has a MaritalStatus value of 'Divorced'). Bonus: there are some 'fancier' ways to encode categorical variables, covered in various blog posts.


In [27]:
# We could also do this directly in sklearn:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
marital_status_dummies = pd.get_dummies(df['MaritalStatus'])

In [28]:
marital_status_dummies.head()


Out[28]:
Divorced Married-AF-spouse Married-civ-spouse Married-spouse-absent Never-married Separated Widowed
0 0 0 0 0 1 0 0
1 0 0 1 0 0 0 0
2 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0
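As the comment in the cell above notes, scikit-learn offers the same transformation via OneHotEncoder. A minimal sketch (the sparse=False argument applies to scikit-learn versions of this era; newer releases renamed it sparse_output):

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2D input, hence the double brackets
encoder = OneHotEncoder(sparse=False)  # newer versions: sparse_output=False
encoded = encoder.fit_transform(df[['MaritalStatus']])
encoder.categories_  # the strings behind each new 0/1 column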

In [29]:
use_columns = ['Age', 'HoursPerWeek', 'MaritalStatus']

In [30]:
X = pd.get_dummies(df[use_columns])

In [31]:
X.head()


Out[31]:
Age HoursPerWeek MaritalStatus_ Divorced MaritalStatus_ Married-AF-spouse MaritalStatus_ Married-civ-spouse MaritalStatus_ Married-spouse-absent MaritalStatus_ Never-married MaritalStatus_ Separated MaritalStatus_ Widowed
0 39 40 0 0 0 0 1 0 0
1 50 13 0 0 1 0 0 0 0
2 38 40 1 0 0 0 0 0 0
3 53 40 0 0 1 0 0 0 0
4 28 40 0 0 1 0 0 0 0

In [32]:
# now that we have new columns we need to redo our train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3523523)

In [33]:
model.fit(X_train, y_train)


Out[33]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [34]:
y_pred = model.predict_proba(X_test)[:, 1]
make_auc_plot(y_test, y_pred)


Sometimes there are lots of categories. For instance, the NativeCountry feature has 42 unique values. We want to add this information to our model, but adding 42 features seems like a lot for our simple model. Let's add just an 'indicator' for whether the person came from the USA.


In [35]:
len(np.unique(df['NativeCountry']))


Out[35]:
42

In [36]:
# we can add a new column like we would add an entry to a dictionary
X['from_USA'] = (df['NativeCountry'] == ' United-States').astype(int)

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4535234)

In [38]:
model.fit(X_train, y_train)


Out[38]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [39]:
y_pred = model.predict_proba(X_test)[:, 1]
make_auc_plot(y_test, y_pred)



In [40]:
tree_model.fit(X_train, y_train)


Out[40]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [41]:
y_pred = tree_model.predict_proba(X_test)[:, 1]
make_auc_plot(y_test, y_pred)


Hyperparameter Tuning

The last step on our journey is to conduct some hyperparameter tuning. We have several parameters that we can set on our Decision Tree, and it is not always obvious what to set them to. We can use a grid search to try every combination of a set of candidate values and keep the model that performs best under cross-validation.


In [42]:
from sklearn.model_selection import GridSearchCV

In [43]:
# initialize our model with the default parameters
tree_model = DecisionTreeClassifier()

In [44]:
# set up a dictionary with lists for the different parameter values
param_grid = {'min_samples_leaf': [1, 2, 10, 20, 100], 'min_impurity_decrease': [0.0, 0.1, 0.2]}
# initialize our grid searcher
grid_searcher = GridSearchCV(tree_model, param_grid)
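One note: by default GridSearchCV scores classifiers with accuracy, while we have been evaluating with AUROC. To make the search optimize the metric we actually plot, we can pass a scoring argument, as in this minimal sketch reusing the param_grid above:

# select parameters by AUROC instead of the default accuracy
auc_searcher = GridSearchCV(tree_model, param_grid, scoring='roc_auc')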

In [45]:
grid_searcher.fit(X_train, y_train)


Out[45]:
GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'min_samples_leaf': [1, 2, 10, 20, 100], 'min_impurity_decrease': [0.0, 0.1, 0.2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [46]:
# get our best model
tree_model = grid_searcher.best_estimator_

In [47]:
tree_model.get_params()


Out[47]:
{'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 100,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': False,
 'random_state': None,
 'splitter': 'best'}

In [48]:
y_pred = tree_model.predict_proba(X_test)[:, 1]
make_auc_plot(y_test, y_pred)


Group time!

Some ideas:

  1. Explore more feature engineering and more features
  2. Explore other model types
  3. Explore the data (or model) using visualizations
  4. Do some clustering analysis of the data
  5. Mix and match from above

A note on making plots

If there is one thing I have learned in Astro grad school, it is how to make good plots: labeling the axes, keeping a clear style, and choosing the right plot type. These small things make a big difference when presenting your work to a wide array of audiences.

So you want to make some plots...

What type of plot do you want to make? If you need inspiration or want to see different options you can check out the sites below.

https://www.data-to-viz.com/

https://datavizcatalogue.com/index.html

http://datavizproject.com/

And see some examples of common plots made in different Python libraries:

http://pythonplot.com/
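Bonus: Gradient Boosting with XGBoost

As a final example, let's fit a gradient-boosted tree ensemble from the xgboost library on a much larger, one-hot-encoded feature set and see how much the extra features and the stronger model help.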


In [49]:
from xgboost import XGBClassifier

In [50]:
use_cols = [
    'Age', 'WorkClass', 'EducationNum', 'MaritalStatus', 'Occupation', 'Relationship',
    'Race', 'Gender', 'CapitalGain', 'CapitalLoss', 'HoursPerWeek', 'NativeCountry'
]
X = pd.get_dummies(df[use_cols])

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=432345)

In [52]:
model = XGBClassifier(n_estimators=250, max_depth=5)
model.fit(X_train, y_train)


Out[52]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=250,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [53]:
y_pred = model.predict_proba(X_test)[:, 1]

In [54]:
make_auc_plot(y_test, y_pred)



In [ ]: