Title: Preprocessing Categorical Features
Slug: preprocessing_categorical_features
Summary: Preprocessing Categorical Features
Date: 2016-11-01 12:00
Category: Machine Learning
Tags: Preprocessing Structured Data
Authors: Chris Albon

Many machine learning methods (e.g. logistic regression, SVM with a linear kernel) require categorical variables to be converted into dummy variables (also called one-hot encoding). For example, a single feature Fruit with the categories Apples, Oranges, and Bananas would be converted into three binary features, Apples, Oranges, and Bananas, one for each category in the categorical feature.
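As a minimal sketch of the Fruit example above, the snippet below builds a small, hypothetical DataFrame (the rows are made up for illustration) and expands its single Fruit column into one indicator column per category. The explicit `dtype=int` is an assumption to get 0/1 values regardless of pandas version, since recent releases may return booleans by default.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature with three categories
fruit = pd.DataFrame({"Fruit": ["Apples", "Oranges", "Bananas", "Apples"]})

# One indicator column per category; each row has exactly one 1
fruit_dummies = pd.get_dummies(fruit["Fruit"], dtype=int)

print(fruit_dummies)
```

Each row contains a single 1 in the column matching its original category, and the columns come out in sorted order.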

There are two common ways to preprocess categorical features: using pandas or scikit-learn.

Preliminaries



In [1]:

    
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
import pandas as pd

Create Data



In [2]:

    
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
        'age': [42, 52, 36, 24, 73], 
        'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city'])
df









    Out[2]:






  
    
      
  first_name last_name  age           city
0      Jason    Miller   42  San Francisco
1      Molly  Jacobson   52      Baltimore
2       Tina       Ali   36          Miami
3       Jake    Milner   24        Douglas
4        Amy     Cooze   73         Boston

Convert Nominal Categorical Feature Into Dummy Variables Using Pandas



In [3]:

    
# Create dummy variables for every unique category in df.city
pd.get_dummies(df["city"])









    Out[3]:






  
    
      
   Baltimore  Boston  Douglas  Miami  San Francisco
0        0.0     0.0      0.0    0.0            1.0
1        1.0     0.0      0.0    0.0            0.0
2        0.0     0.0      0.0    1.0            0.0
3        0.0     0.0      1.0    0.0            0.0
4        0.0     1.0      0.0    0.0            0.0
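In practice the dummy columns usually need to sit alongside the other features. The sketch below is one possible way to do that, not the only one: it recreates the example data, prefixes the new columns with `city_` (the prefix is an illustrative choice), and swaps them in for the original city column. `dtype=int` is passed explicitly because the default output type may differ between pandas versions.

```python
import pandas as pd

# Recreate the example data so the snippet runs on its own
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'city'])

# Build the dummy columns, prefixed so their origin stays obvious
city_dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Replace the original city column with its dummy columns
df_encoded = pd.concat([df.drop(columns="city"), city_dummies], axis=1)

print(df_encoded.columns.tolist())
```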

Convert Nominal Categorical Data Into Dummy (OneHot) Features Using scikit-learn



In [4]:

    
# Convert string category names to integers
integerized_data = preprocessing.LabelEncoder().fit_transform(df["city"])

# View data
integerized_data









    Out[4]:





array([4, 0, 3, 2, 1])
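To see which integer stands for which city, the fitted encoder can be inspected. The following sketch keeps a reference to the encoder (the original cell discards it) and uses its `classes_` attribute and `inverse_transform` method; the variable names are illustrative.

```python
import pandas as pd
from sklearn import preprocessing

city = pd.Series(['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston'])

# Keep the fitted encoder so its mapping can be inspected afterwards
encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(city)

# classes_ holds the sorted categories; a category's position is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original strings
recovered = encoder.inverse_transform(codes)
print(list(recovered))
```

The integer codes follow alphabetical order only and carry no meaningful ranking, which is why they are expanded into OneHot columns in the next step rather than fed to a model directly.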



In [5]:

    
# Convert integer categorical representations to OneHot encodings
preprocessing.OneHotEncoder().fit_transform(integerized_data.reshape(-1,1)).toarray()









    Out[5]:





array([[ 0.,  0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.]])

Note that pd.get_dummies() and the scikit-learn approach produce the same output matrix, because both order the categories alphabetically.
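The equivalence can be checked directly. The sketch below assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns without a separate LabelEncoder step, and compares its result with pd.get_dummies() on the same city values. Passing `dtype=float` to pandas is an assumption made so both results are floating point.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

city = pd.DataFrame({'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']})

# scikit-learn: encode the string column directly (expects 2-D input)
encoder = preprocessing.OneHotEncoder()
onehot = encoder.fit_transform(city[['city']]).toarray()

# pandas: dummy variables for the same column, as floats
dummies = pd.get_dummies(city['city'], dtype=float)

# Same column order and same values
print(list(encoder.categories_[0]) == list(dummies.columns))
print(np.array_equal(onehot, dummies.to_numpy()))
```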
