In [5]:
%matplotlib inline

(This notebook was updated to run in Python 3 on November 11, 2016.)

Building a Classifier from Census Data

An end-to-end machine learning example using Pandas and Scikit-Learn

One of the machine learning workshops given to students in the Georgetown Data Science Certificate is to build a classification, regression, or clustering model using one of the UCI Machine Learning Repository datasets. The idea behind the workshop is to ingest data from a website, perform some initial analyses to get a sense for what's in the data, then structure the data to fit a Scikit-Learn model and evaluate the results. Although the repository does give advice as to what types of machine learning might be applied, this workshop still poses a challenge, especially in terms of data wrangling.

In this post, I'll outline how I completed this workshop alongside my students this past weekend. For those new to machine learning or to Scikit-Learn, I hope this is a practical example that sheds light on the many challenges that crop up when developing predictive models. For more experienced readers, I hope that I can challenge you to try this workshop, and to contribute IPython notebooks with your efforts as tutorials!

Data Ingestion

The first part of the workshop is to use the UCI Machine Learning Repository to find a non-trivial dataset with which to build a model. While the example datasets included with Scikit-Learn are good examples of how to fit models, they do tend to be either trivial or overused. By exploring a novel dataset with several (more than 10) features and many instances (more than 10,000), I was hoping to conduct a predictive exercise that posed a bit more of a challenge.

There are around 350 datasets in the repository, categorized by things like task, attribute type, data type, area, or number of attributes or instances. I ended up choosing a Census Income dataset that had 14 attributes and 48,842 instances. The listed task was binary classification: build a model that could determine from census information whether or not a person made more than $50k per year.

Every dataset in the repository comes with a link to the data folder, which I simply clicked and downloaded to my computer. However, in an effort to make it easier for you to follow along, I've included a simple download_data function that uses the requests library to fetch the data.


In [ ]:
import os 
import requests 

CENSUS_DATASET = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", 
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names", 
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", 
)

def download_data(path='data', urls=CENSUS_DATASET):
    # Create the data directory if it doesn't already exist
    if not os.path.exists(path):
        os.mkdir(path) 

    # Fetch each file and write it to disk under its original name
    for url in urls:
        response = requests.get(url)
        name = os.path.basename(url) 
        with open(os.path.join(path, name), 'wb') as f: 
            f.write(response.content)

download_data()

This code also helps us start to think about how we're going to manage our data on disk. I've created a data folder in my current working directory to hold the data as it's downloaded. In the data management section, we'll add a few more files to this folder so that it can be loaded as a Bunch object.

Data Exploration

The very first thing to do is to explore the dataset and see what's inside. The three files we downloaded do not have a file extension, but they are simply text files. You can change the extension to .txt for easier exploration if that helps. By using the head and wc -l commands on the command line, our files appear to be as follows (a Python equivalent of those commands is sketched just after this list):

  • adult.data: A CSV dataset containing 32,562 rows and no header
  • adult.names: A text file containing meta information about the dataset
  • adult.test: A CSV dataset containing 16,283 rows with a weird first line
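
If you'd rather stay in Python, a rough equivalent of those shell commands looks something like this (assuming the files are in the data directory created by download_data):

for name in ('adult.data', 'adult.names', 'adult.test'):
    path = os.path.join('data', name)
    with open(path) as f:
        lines = f.readlines()
    print("{}: {} lines".format(name, len(lines)))
    print(lines[0].rstrip())   # peek at the first line of each file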

Clearly this dataset is intended to be used for machine learning, and test and training datasets have already been constructed. Similar types of split datasets are used for Kaggle competitions and academic conferences. This will save us a step when it comes to evaluation time.

Since we already have a CSV file, let's explore the dataset using Pandas:


In [1]:
import pandas as pd 
import seaborn as sns

names = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income',
]

data = pd.read_csv('data/adult.data', names=names)
data.head()


Out[1]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

In [3]:
print(data.shape)

tc = data.groupby('income').aggregate('count')
tc


(32561, 15)
Out[3]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
income
<=50K 24720 24720 24720 24720 24720 24720 24720 24720 24720 24720 24720 24720 24720 24720
>50K 7841 7841 7841 7841 7841 7841 7841 7841 7841 7841 7841 7841 7841 7841

Because the CSV data doesn't have a header row, I had to supply the names directly to the pd.read_csv function. To get these names, I manually constructed the list by reading the adult.names file. In the future, we'll store these names as a machine-readable JSON file so that we don't have to manually construct it.

By glancing at the first 5 rows of the data, we can see that we have primarily categorical data. Our target, data.income, is also currently constructed as a categorical field. Unfortunately, with categorical fields, we don't have a lot of visualization options (quite yet). However, it would be interesting to see the frequencies of each class, relative to the target of our classifier. To do this, we can use Seaborn's countplot function to count the occurrences of each data point. Let's take a look at the counts of data.occupation and data.education — two likely predictors of income in the census data:


In [6]:
sns.countplot(y='occupation', hue='income', data=data,)


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1123f0c18>

In [7]:
sns.countplot(y='education', hue='income', data=data,)


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1126476a0>

The countplot function accepts either an x or a y argument to specify whether the bars are drawn vertically or horizontally. I chose to use the y argument so that the labels were readable. The hue argument specifies a column for comparison; in this case we're concerned with the relationship of our categorical variables to the target income. Go ahead and explore other variables in the dataset, for example data.race and data.sex, to see whether those values are predictive of the level of income or not!
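
If you want to try that, the same call works for any of the categorical columns; for example:

sns.countplot(y='race', hue='income', data=data)   # swap in 'sex', 'relationship', etc.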

Data Management

Now that we've completed some initial investigation and have started to identify the possible features available in our dataset, we need to structure our data on disk in a way that we can load into Scikit-Learn in a repeatable fashion for continued analysis. My proposal is to use the sklearn.datasets.base.Bunch object to load the data into data and target attributes respectively, similar to how Scikit-Learn's toy datasets are structured. Using this object to manage our data will mirror the native API and allow us to easily copy and paste code that demonstrates classifiers and techniques with the built-in datasets. Importantly, this API will also allow us to communicate to other developers and our future selves exactly how to use the data.

In order to organize our data on disk, we'll need to add the following files:

  • README.md: a markdown file containing information about the dataset and attribution. Will be exposed by the DESCR attribute.
  • meta.json: a helper file that contains machine readable information about the dataset like target_names and feature_names.

I constructed a pretty simple README.md in Markdown that gave the title of the dataset, the link to the UCI Machine Learning Repository page that contained the dataset, as well as a citation to the author. I simply wrote this file directly using my own text editor.

The meta.json file, however, we can write using the data frame that we already have. We've already done the manual work of writing the column names into a names variable earlier; there's no point in letting that go to waste!


In [8]:
import json 


meta = {
    'target_names': list(data.income.unique()),
    'feature_names': list(data.columns),
    'categorical_features': {
        column: list(data[column].unique())
        for column in data.columns
        if data[column].dtype == 'object'
    },
}

with open('data/meta.json', 'w') as f:
    json.dump(meta, f, indent=2)

This code creates a meta.json file by inspecting the data frame that we have constructed. The target_names entry is just the two unique values in the data.income series; by using the pd.Series.unique method we're guaranteed to spot data errors if there are more or fewer than two values. The feature_names is simply the names of all the columns.

Then we get tricky — we want to store the possible values of each categorical field for lookup later, but how do we know which columns are categorical and which are not? Luckily, Pandas has already done an analysis for us, and has stored the column data type, data[column].dtype, as either int64 or object. Here I am using a dictionary comprehension to create a dictionary whose keys are the categorical columns, determined by checking whether the column's dtype is object, and whose values are a list of unique values for that field.
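
As a quick sanity check of that dtype-based detection, we can compare the frame's dtypes against the keys we just stored (purely an inspection step):

print(data.dtypes)
print(sorted(meta['categorical_features'].keys()))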

Now that we have everything we need stored on disk, we can create a load_data function, which will allow us to load the training and test datasets appropriately from disk and store them in a Bunch:


In [9]:
import os
import json

from sklearn.datasets.base import Bunch

def load_data(root='data'):
    # Load the meta data from the file 
    with open(os.path.join(root, 'meta.json'), 'r') as f:
        meta = json.load(f) 
    
    names = meta['feature_names']
    
    # Load the readme information 
    with open(os.path.join(root, 'README.md'), 'r') as f:
        readme = f.read() 
    
    # Load the training and test data, skipping the bad row in the test data 
    train = pd.read_csv(os.path.join(root, 'adult.data'), names=names)
    test  = pd.read_csv(os.path.join(root, 'adult.test'), names=names, skiprows=1)
    
    # Remove the target from the categorical features 
    meta['categorical_features'].pop('income')
    
    # Return the bunch with the appropriate data chunked apart
    return Bunch(
        data = train[names[:-1]], 
        target = train[names[-1]], 
        data_test = test[names[:-1]], 
        target_test = test[names[-1]], 
        target_names = meta['target_names'],
        feature_names = meta['feature_names'], 
        categorical_features = meta['categorical_features'], 
        DESCR = readme,
    )

dataset = load_data()

In [20]:
dataset.data.head()


Out[20]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba

The primary work of the load_data function is to locate the appropriate files on disk, given a root directory that's passed in as an argument (if you saved your data in a different directory, you can modify the root to have it look in the right place). The metadata is included with the bunch, and is also used to split the train and test datasets into data and target variables appropriately, such that we can pass them correctly to the Scikit-Learn fit and predict estimator methods.
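
As a quick check that the Bunch mirrors the Scikit-Learn dataset API, we can poke at a few of its attributes (output omitted here):

print(dataset.target_names)                        # e.g. [' <=50K', ' >50K']
print(dataset.data.shape, dataset.data_test.shape)
print(dataset.DESCR[:200])                         # the first part of the README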

Feature Extraction

Now that our data management workflow is structured a bit more like Scikit-Learn, we can start to use our data to fit models. Unfortunately, the categorical values themselves are not useful for machine learning; we need a single instance table that contains numeric values. In order to extract this from the dataset, we'll have to use Scikit-Learn transformers to transform our input dataset into something that can be fit to a model. In particular, we'll have to do the following:

  • encode the categorical labels as numeric data
  • impute missing values with data (or remove)

We will explore how to apply these transformations to our dataset, then we will create a feature extraction pipeline that we can use to build a model from the raw input data. This pipeline will apply both the imputer and the label encoders directly in front of our classifier, so that we can ensure that features are extracted appropriately in both the training and test datasets.

Label Encoding

Our first step is to get our data out of the object data type land and into a numeric type, since nearly all operations we'd like to apply to our data are going to rely on numeric types. Luckily, Scikit-Learn does provide a transformer for converting categorical labels into numeric integers: sklearn.preprocessing.LabelEncoder. Unfortunately it can only transform a single vector at a time, so we'll have to adapt it in order to apply it to multiple columns.

Like all Scikit-Learn transformers, the LabelEncoder has fit and transform methods (as well as a special all-in-one, fit_transform method) that can be used for stateful transformation of a dataset. In the case of the LabelEncoder, the fit method discovers all unique elements in the given vector, orders them lexicographically, and assigns them an integer value. These values are actually the indices of the elements inside the LabelEncoder.classes_ attribute, which can also be used to do a reverse lookup of the class name from the integer value.

For example, here's how to encode the sex column of our dataset:


In [21]:
from sklearn.preprocessing import LabelEncoder 

gender = LabelEncoder() 
gender.fit(dataset.data.sex)
print(gender.classes_)


[' Female' ' Male']

We can then transform a single vector into a numeric vector as follows:


In [22]:
print(gender.transform([
    ' Female', ' Female', ' Male', ' Female', ' Male'
]))


[0 0 1 0 1]
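
And as mentioned above, the encoder can go the other direction; inverse_transform recovers the original labels from the integer values:

print(gender.inverse_transform([0, 0, 1]))   # [' Female' ' Female' ' Male']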

Obviously this is very useful for a single column, and in fact the LabelEncoder was really intended to encode the target variable, not the categorical features expected by the classifiers.

Note: Unfortunately, it was at this point that I realized the values all had a space in front of them. I'll address what I might have done about this in the conclusion.

In order to create a multicolumn LabelEncoder, we'll have to extend the TransformerMixin in Scikit-Learn to create a transformer class of our own, then provide fit and transform methods that wrap individual LabelEncoders for our columns. My code, inspired by the StackOverflow post “Label encoding across multiple columns in scikit-learn”, is as follows:


In [31]:
from sklearn.base import BaseEstimator, TransformerMixin

class EncodeCategorical(BaseEstimator, TransformerMixin):
    """
    Encodes a specified list of columns or all columns if None. 
    """
    
    def __init__(self, columns=None):
        self.columns  = list(columns) if columns is not None else None
        self.encoders = None
    
    def fit(self, data, target=None):
        """
        Expects a data frame with named columns to encode. 
        """
        # Encode all columns if columns is None
        if self.columns is None:
            self.columns = data.columns 
        
        # Fit a label encoder for each column in the data frame
        self.encoders = {
            column: LabelEncoder().fit(data[column])
            for column in self.columns 
        }
        return self

    def transform(self, data):
        """
        Uses the encoders to transform a data frame. 
        """
        output = data.copy()
        for column, encoder in self.encoders.items():
            output[column] = encoder.transform(data[column])
            
        return output

encoder = EncodeCategorical(dataset.categorical_features.keys())
data = encoder.fit_transform(dataset.data)

data


Out[31]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 39 7 77516 9 13 4 1 1 4 1 2174 0 40 39
1 50 6 83311 9 13 2 4 0 4 1 0 0 13 39
2 38 4 215646 11 9 0 6 1 4 1 0 0 40 39
3 53 4 234721 1 7 2 6 0 2 1 0 0 40 39
4 28 4 338409 9 13 2 10 5 2 0 0 0 40 5
5 37 4 284582 12 14 2 4 5 4 0 0 0 40 39
6 49 4 160187 6 5 3 8 1 2 0 0 0 16 23
7 52 6 209642 11 9 2 4 0 4 1 0 0 45 39
8 31 4 45781 12 14 4 10 1 4 0 14084 0 50 39
9 42 4 159449 9 13 2 4 0 4 1 5178 0 40 39
10 37 4 280464 15 10 2 4 0 2 1 0 0 80 39
11 30 7 141297 9 13 2 10 0 1 1 0 0 40 19
12 23 4 122272 9 13 4 1 3 4 0 0 0 30 39
13 32 4 205019 7 12 4 12 1 2 1 0 0 50 39
14 40 4 121772 8 11 2 3 0 1 1 0 0 40 0
15 34 4 245487 5 4 2 14 0 0 1 0 0 45 26
16 25 6 176756 11 9 4 5 3 4 1 0 0 35 39
17 32 4 186824 11 9 4 7 4 4 1 0 0 40 39
18 38 4 28887 1 7 2 12 0 4 1 0 0 50 39
19 43 6 292175 12 14 0 4 4 4 0 0 0 45 39
20 40 4 193524 10 16 2 10 0 4 1 0 0 60 39
21 54 4 302146 11 9 5 8 4 2 0 0 0 20 39
22 35 1 76845 6 5 2 5 0 2 1 0 0 40 39
23 43 4 117037 1 7 2 14 0 4 1 0 2042 40 39
24 59 4 109015 11 9 0 13 4 4 0 0 0 40 39
25 56 2 216851 9 13 2 13 0 4 1 0 0 40 39
26 19 4 168294 11 9 4 3 3 4 1 0 0 40 39
27 54 0 180211 15 10 2 0 0 1 1 0 0 60 35
28 39 4 367260 11 9 0 4 1 4 1 0 0 80 39
29 49 4 193366 11 9 2 3 0 4 1 0 0 40 39
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32531 30 0 33811 9 13 4 0 1 1 0 0 0 99 39
32532 34 4 204461 10 16 2 10 0 4 1 0 0 60 39
32533 54 4 337992 9 13 2 4 0 1 1 0 0 50 24
32534 37 4 179137 15 10 0 1 4 4 0 0 0 39 39
32535 22 4 325033 2 8 4 11 3 2 1 0 0 35 39
32536 34 4 160216 9 13 4 4 1 4 0 0 0 55 39
32537 30 4 345898 11 9 4 3 1 2 1 0 0 46 39
32538 38 4 139180 9 13 0 10 4 2 0 15020 0 45 39
32539 71 0 287372 10 16 2 0 0 4 1 0 0 10 39
32540 45 7 252208 11 9 5 1 3 4 0 0 0 40 39
32541 41 0 202822 11 9 5 0 1 2 0 0 0 32 39
32542 72 0 129912 11 9 2 0 0 4 1 0 0 25 39
32543 45 2 119199 7 12 0 10 4 4 0 0 0 48 39
32544 31 4 199655 12 14 0 8 1 3 0 0 0 30 39
32545 39 2 111499 7 12 2 1 5 4 0 0 0 20 39
32546 37 4 198216 7 12 0 13 1 4 0 0 0 40 39
32547 43 4 260761 11 9 2 7 0 4 1 0 0 40 26
32548 65 6 99359 14 15 4 10 1 4 1 1086 0 60 39
32549 43 7 255835 15 10 0 1 2 4 0 0 0 40 39
32550 43 6 27242 15 10 2 3 0 4 1 0 0 50 39
32551 32 4 34066 0 6 2 6 0 0 1 0 0 40 39
32552 43 4 84661 8 11 2 12 0 4 1 0 0 45 39
32553 32 4 116138 12 14 4 13 1 1 1 0 0 11 36
32554 53 4 321865 12 14 2 4 0 4 1 0 0 40 39
32555 22 4 310152 15 10 4 11 1 4 1 0 0 40 39
32556 27 4 257302 7 12 2 13 5 4 0 0 0 38 39
32557 40 4 154374 11 9 2 7 0 4 1 0 0 40 39
32558 58 4 151910 11 9 6 1 4 4 0 0 0 40 39
32559 22 4 201490 11 9 4 1 3 4 1 0 0 20 39
32560 52 5 287927 11 9 2 4 5 4 0 15024 0 40 39

32561 rows × 14 columns


In [36]:
# Print the classes discovered for each categorical column
for col, lec in encoder.encoders.items():
    print(lec.classes_)


[' Divorced' ' Married-AF-spouse' ' Married-civ-spouse'
 ' Married-spouse-absent' ' Never-married' ' Separated' ' Widowed']
[' 10th' ' 11th' ' 12th' ' 1st-4th' ' 5th-6th' ' 7th-8th' ' 9th'
 ' Assoc-acdm' ' Assoc-voc' ' Bachelors' ' Doctorate' ' HS-grad' ' Masters'
 ' Preschool' ' Prof-school' ' Some-college']
[' Amer-Indian-Eskimo' ' Asian-Pac-Islander' ' Black' ' Other' ' White']
[' ?' ' Federal-gov' ' Local-gov' ' Never-worked' ' Private'
 ' Self-emp-inc' ' Self-emp-not-inc' ' State-gov' ' Without-pay']
[' ?' ' Adm-clerical' ' Armed-Forces' ' Craft-repair' ' Exec-managerial'
 ' Farming-fishing' ' Handlers-cleaners' ' Machine-op-inspct'
 ' Other-service' ' Priv-house-serv' ' Prof-specialty' ' Protective-serv'
 ' Sales' ' Tech-support' ' Transport-moving']
[' Female' ' Male']
[' ?' ' Cambodia' ' Canada' ' China' ' Columbia' ' Cuba'
 ' Dominican-Republic' ' Ecuador' ' El-Salvador' ' England' ' France'
 ' Germany' ' Greece' ' Guatemala' ' Haiti' ' Holand-Netherlands'
 ' Honduras' ' Hong' ' Hungary' ' India' ' Iran' ' Ireland' ' Italy'
 ' Jamaica' ' Japan' ' Laos' ' Mexico' ' Nicaragua'
 ' Outlying-US(Guam-USVI-etc)' ' Peru' ' Philippines' ' Poland' ' Portugal'
 ' Puerto-Rico' ' Scotland' ' South' ' Taiwan' ' Thailand'
 ' Trinadad&Tobago' ' United-States' ' Vietnam' ' Yugoslavia']
[' Husband' ' Not-in-family' ' Other-relative' ' Own-child' ' Unmarried'
 ' Wife']

This specialized transformer now has the ability to label encode multiple columns in a data frame, saving information about the state of the encoders. It would be trivial to add an inverse_transform method that accepts numeric data and converts it to labels, using the inverse_transform method of each individual LabelEncoder on a per-column basis.
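
For the curious, that inverse_transform might look something like the following method added to EncodeCategorical (a sketch that simply mirrors the transform method above):

    def inverse_transform(self, data):
        """
        Converts numeric labels back into their original string values.
        """
        output = data.copy()
        for column, encoder in self.encoders.items():
            output[column] = encoder.inverse_transform(data[column])
        return output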

Imputation

According to the adult.names file, unknown values are given via the "?" string. We'll have to either ignore rows that contain a "?" or impute a value for them. Scikit-Learn provides a transformer for dealing with missing values at either the column level or the row level in the sklearn.preprocessing library, called the Imputer.

The Imputer requires information about what counts as a missing value, either an integer or the string "NaN" for np.nan, along with a strategy for dealing with it. For example, the Imputer can fill in missing values with the mean, median, or most frequent value for each column. If provided an axis argument of 0, columns that contain only missing data are discarded; if provided an axis argument of 1, rows that contain only missing values raise an exception. Basic usage of the Imputer is as follows:

imputer = Imputer(missing_values='NaN', strategy='most_frequent')
imputer.fit(dataset.data)

Unfortunately, this would not work for our label encoded data, because 0 is an acceptable label value: even if we could guarantee that 0 always meant "?" in the categorical columns, treating 0 as missing would break our numeric columns that already have legitimate zeros in them. This is certainly a challenging problem, and unfortunately the best we can do is to once again create a custom Imputer.


In [38]:
from sklearn.preprocessing import Imputer 

class ImputeCategorical(BaseEstimator, TransformerMixin):
    """
    Imputes a specified list of columns or all columns if None. 
    """
    
    def __init__(self, columns=None):
        self.columns = columns 
        self.imputer = None
    
    def fit(self, data, target=None):
        """
        Expects a data frame with named columns to impute. 
        """
        # Encode all columns if columns is None
        if self.columns is None:
            self.columns = data.columns 
        
        # Fit an imputer for each column in the data frame
        self.imputer = Imputer(missing_values=0, strategy='most_frequent')
        self.imputer.fit(data[self.columns])

        return self

    def transform(self, data):
        """
        Uses the imputer to transform a data frame. 
        """
        output = data.copy()
        output[self.columns] = self.imputer.transform(output[self.columns])
        
        return output

    
imputer = ImputeCategorical(['workclass', 'native-country', 'occupation'])
data = imputer.fit_transform(data)

In [ ]:
data

Our custom imputer, like the EncodeCategorical transformer, takes a set of columns to perform imputation on. In this case we only wrap a single Imputer, since the Imputer is multicolumn — all that's required is to ensure that the correct columns are transformed. I inspected the encoders and found only three columns that had missing values in them, and passed them directly into the custom imputer.
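
That inspection can also be done programmatically: since missing values show up as the ' ?' label, we can simply ask each fitted encoder whether it saw that class. A small sketch, reusing the encoder fit above:

missing = [
    column for column, enc in encoder.encoders.items()
    if ' ?' in enc.classes_
]
print(missing)   # expect workclass, occupation, and native-country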

I had chosen to do the label encoding first, assuming that because the Imputer required numeric values, I'd be able to do the parsing in advance. However, after requiring a custom imputer, I'd say that it's probably best to deal with the missing values early, when they're still a specific value, rather than take a chance.
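
Put another way, on a second pass I would probably convert the "?" markers to NaN at load time, before any label encoding, and impute (or drop) there. A minimal sketch of that idea, assuming we reload the raw CSV:

raw = pd.read_csv('data/adult.data', names=names, skipinitialspace=True, na_values=['?'])
print(raw.isnull().sum())             # count how many values are genuinely missing
raw = raw.fillna(raw.mode().iloc[0])  # or raw.dropna() to discard incomplete rows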

Model Build

Now that we've finally achieved our feature extraction, we can continue on to the model build phase. To create our classifier, we're going to create a Pipeline that uses our feature transformers and ends in an estimator that can do classification. We can then write the entire pipeline object to disk with the pickle module, allowing us to load it up and use it to make predictions in the future.

A pipeline is a step-by-step set of transformers that takes input data and transforms it, until finally passing it to an estimator at the end. Pipelines can be constructed using a named declarative syntax so that they're easy to modify and develop. Our pipeline is as follows:


In [43]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# we need to encode our target data as well. 
yencode = LabelEncoder().fit(dataset.target)

# figure out the indices of the categorical columns 
categorical_indexes = [
    list(dataset.data.columns).index(key) for key in dataset.categorical_features.keys()
]

# construct the pipeline 
census = Pipeline([
        ('encoder',  EncodeCategorical(dataset.categorical_features.keys())),
        ('imputer', ImputeCategorical(['workclass', 'native-country', 'occupation'])), 
        ('onehot', OneHotEncoder(categorical_features=categorical_indexes)),
        ('classifier', LogisticRegression())
    ])

# fit the pipeline 
census.fit(dataset.data, yencode.transform(dataset.target))


Out[43]:
Pipeline(steps=[('encoder', EncodeCategorical(columns=['marital-status', 'occupation', 'workclass', 'race', 'education', 'sex', 'native-country', 'relationship'])), ('imputer', ImputeCategorical(columns=['workclass', 'native-country', 'occupation'])), ('onehot', OneHotEncoder(categorical_features=[5, 6, 1, 8...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

The pipeline first passes data through our encoder, then to the imputer, and finally to our classifier. In this case, I have chosen a LogisticRegression, a regularized linear model that is used to estimate a categorical dependent variable, much like the binary target we have in this case. We can then evaluate the model on the test data set using the same exact pipeline.


In [44]:
from sklearn.metrics import classification_report 

# encode test targets, and strip trailing '.' 
y_true = yencode.transform([y.rstrip(".") for y in dataset.target_test])

# use the model to get the predicted value
y_pred = census.predict(dataset.data_test)

# execute classification report 
print(classification_report(y_true, y_pred, target_names=dataset.target_names))


             precision    recall  f1-score   support

      <=50K       0.81      0.97      0.88     12435
       >50K       0.71      0.26      0.38      3846

avg / total       0.78      0.80      0.76     16281

As part of the process in encoding the target for the test data, I discovered that the classes in the test data set had a "." appended to the end of the class name, which I had to strip in order for the encoder to work! However, once done, I could predict the y values using the test dataset, passing the predicted and true values to the classifier report.

The classifier I built does an okay job, with a weighted average F1 score of 0.76, nothing to sneer at. However, it is possible that an SVM, a Naive Bayes, or a k-nearest neighbors model would do better. It is easy to construct new models using the pipeline approach that we prepared before, and I would encourage you to try it out! Furthermore, a grid search or feature analysis may lead to a higher scoring model than the one we quickly put together. Luckily, now that we've sorted out all the pipeline issues, we can get to work on inspecting and improving the model!
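
To give a sense of how little changes, here is roughly what swapping in a Naive Bayes model might look like; only the final step of the pipeline is different, and the evaluation code stays the same:

from sklearn.naive_bayes import MultinomialNB

census_nb = Pipeline([
    ('encoder',  EncodeCategorical(dataset.categorical_features.keys())),
    ('imputer', ImputeCategorical(['workclass', 'native-country', 'occupation'])),
    ('onehot', OneHotEncoder(categorical_features=categorical_indexes)),
    ('classifier', MultinomialNB()),   # or LinearSVC(), KNeighborsClassifier(), ...
])
census_nb.fit(dataset.data, yencode.transform(dataset.target))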

The last step is to save our model to disk for reuse later, with the pickle module:


In [46]:
import pickle 

def dump_model(model, path='data', name='classifier.pickle'):
    with open(os.path.join(path, name), 'wb') as f:
        pickle.dump(model, f)
        
dump_model(census)

You should also dump meta information about the date and time your model was built, who built the model, and so on. We'll skip building that into the workflow here, since this post serves as a guide.
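
If you did want to capture that information, one lightweight option is to write a small JSON file next to the pickle; the fields below are just suggestions:

import json
import getpass
from datetime import datetime

def dump_model_meta(path='data', name='classifier.json'):
    meta = {
        'built_at':  datetime.now().isoformat(),
        'built_by':  getpass.getuser(),
        'estimator': census.steps[-1][1].__class__.__name__,   # e.g. 'LogisticRegression'
    }
    with open(os.path.join(path, name), 'w') as f:
        json.dump(meta, f, indent=2)

dump_model_meta()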

Model Operation

Now it's time to explore how to use the model. To do this, we'll create a simple function that gathers input from the user on the command line, and returns a prediction with the classifier model. Moreover, this function will load the pickled model into memory to ensure the latest and greatest saved model is what's being used.


In [47]:
def load_model(path='data/classifier.pickle'):
    with open(path, 'rb') as f:
        return pickle.load(f) 


def predict(model, meta=dataset):
    data = {} # Store the input from the user
    
    for column in meta['feature_names'][:-1]:
        # Get the valid responses
        valid = meta['categorical_features'].get(column)
    
        # Prompt the user for an answer until good 
        while True:
            val = " " + input("enter {} >".format(column))
            if valid and val not in valid:
                print("Not valid, choose one of {}".format(valid))
            else:
                data[column] = val
                break
    
    # Create prediction and label 
    yhat = model.predict(pd.DataFrame([data]))
    return yencode.inverse_transform(yhat)
            
    
# Execute the interface 
model = load_model()
predict(model)


enter age >42
enter workclass >
Not valid, choose one of [' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov', ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay', ' Never-worked']
enter workclass >Federal-gov
enter fnlwgt >32133
enter education >
Not valid, choose one of [' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th', ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th', ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th', ' Preschool', ' 12th']
enter education >Some-college
enter education-num >7
enter marital-status >Maried-civ-spouse
Not valid, choose one of [' Never-married', ' Married-civ-spouse', ' Divorced', ' Married-spouse-absent', ' Separated', ' Married-AF-spouse', ' Widowed']
enter marital-status >Married-civ-spouse
enter occupation >
Not valid, choose one of [' Adm-clerical', ' Exec-managerial', ' Handlers-cleaners', ' Prof-specialty', ' Other-service', ' Sales', ' Craft-repair', ' Transport-moving', ' Farming-fishing', ' Machine-op-inspct', ' Tech-support', ' ?', ' Protective-serv', ' Armed-Forces', ' Priv-house-serv']
enter occupation >Adm-clerical
enter relationship >Wife
enter race >
Not valid, choose one of [' White', ' Black', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo', ' Other']
enter race >Amer-Indian-Eskimo
enter sex >
Not valid, choose one of [' Male', ' Female']
enter sex >Female
enter capital-gain >103223
enter capital-loss >23
enter hours-per-week >40
enter native-country >
Not valid, choose one of [' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico', ' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada', ' Germany', ' Iran', ' Philippines', ' Italy', ' Poland', ' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos', ' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic', ' El-Salvador', ' France', ' Guatemala', ' China', ' Japan', ' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland', ' Trinadad&Tobago', ' Greece', ' Nicaragua', ' Vietnam', ' Hong', ' Ireland', ' Hungary', ' Holand-Netherlands']
enter native-country >Canada
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-30971d7e9d4c> in <module>()
     27 # Execute the interface
     28 model = load_model()
---> 29 predict(model)

<ipython-input-47-30971d7e9d4c> in predict(model, meta)
     21 
     22     # Create prediction and label
---> 23     yhat = model.predict(pd.DataFrame([data]))
     24     return yencode.inverse_transform(yhat)
     25 

/usr/local/lib/python3.5/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
     35             self.get_attribute(obj)
     36         # lambda, but not partial, allows help() to work with update_wrapper
---> 37         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
     38         # update the docstring of the returned function
     39         update_wrapper(out, self.fn)

/usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in predict(self, X)
    201         Xt = X
    202         for name, transform in self.steps[:-1]:
--> 203             Xt = transform.transform(Xt)
    204         return self.steps[-1][-1].predict(Xt)
    205 

/usr/local/lib/python3.5/site-packages/sklearn/preprocessing/data.py in transform(self, X)
   1867         """
   1868         return _transform_selected(X, self._transform,
-> 1869                                    self.categorical_features, copy=True)

/usr/local/lib/python3.5/site-packages/sklearn/preprocessing/data.py in _transform_selected(X, transform, selected, copy)
   1644         return transform(X)
   1645     else:
-> 1646         X_sel = transform(X[:, ind[sel]])
   1647         X_not_sel = X[:, ind[not_sel]]
   1648 

/usr/local/lib/python3.5/site-packages/sklearn/preprocessing/data.py in _transform(self, X)
   1839             if self.handle_unknown == 'error':
   1840                 raise ValueError("unknown categorical feature present %s "
-> 1841                                  "during transform." % X.ravel()[~mask])
   1842 
   1843         column_indices = (X + indices[:-1]).ravel()[mask]

ValueError: unknown categorical feature present [103223  32133     40] during transform.

The hardest part about operationalizing the model is collecting user input. As the traceback shows, my quick command line interface still needs work: the values gathered at the prompt don't line up with what the fitted one-hot encoder expects, so there is more input wrangling to do. Obviously in a bigger application this could be handled with forms, automatic data gathering, and other advanced techniques. For now, hopefully this is enough to highlight how you might use the model in practice to make predictions on unknown data.

Conclusion

This walkthrough was an end-to-end look at how I performed a classification analysis of a dataset that I downloaded from the Internet. I tried to stay true to my exact workflow so that you could get a sense for how I had to go about doing things with little to no advanced knowledge. As a result, there are definitely some things I might change if I was going to do this over.

One place where I struggled was in deciding whether I should write the wrangled data back to disk and load it again, or keep performing feature extraction on the raw data. I kept going back and forth, particularly because of silly things like the spaces in front of the values. This could be fixed by loading the data as follows:

pd.read_csv('adult.data', sep="\s*,", names=names)

This uses a regular expression for the separator, which would automatically strip the surrounding whitespace. However, I'd already gone too far to make these changes!

I also had problems with the ordering of the label encoding and the imputation. Given another chance, I think I would definitely wrangle and clean both datasets and save them back to disk. Even just little things like the "." at the end of the class names in the test set were annoyances that could have been easily dealt with.
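
If I were to do that wrangle-and-save pass, it might look roughly like this (the cleaned file names are hypothetical):

def clean(src, dst, skiprows=0):
    df = pd.read_csv(src, names=names, skiprows=skiprows, skipinitialspace=True)
    df['income'] = df['income'].str.rstrip('.')   # normalize the test set labels
    df.to_csv(dst, index=False, header=False)     # keep the original headerless format

clean('data/adult.data', 'data/adult.clean.csv')
clean('data/adult.test', 'data/adult.test.clean.csv', skiprows=1)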

Now that you've had a chance to look at my walkthrough, I hope you'll try a few on your own and send your workflows and analyses to us so that we can post them as well!