In this notebook, I'll demonstrate different ways of mapping or encoding categorical data.
In [1]:
# create a pandas dataframe with categorical variables to work with
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df
Out[1]:
In [2]:
# map the ordinal 'size' feature to integers that preserve the order M < L < XL
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}
df['size'] = df['size'].map(size_mapping)
df
Out[2]:
In [3]:
# transform integers back to string values using a reverse-mapping dictionary
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
Out[3]:
In [4]:
# the same mapping written as a function and applied with 'apply'
def size_to_numeric(x):
    if x == 'XL':
        return 3
    if x == 'L':
        return 2
    if x == 'M':
        return 1

# df['size'] already holds integers after the mapping above, so map it back
# to the string labels before applying the function
df['size_num'] = df['size'].map(inv_size_mapping).apply(size_to_numeric)
df
Out[4]:
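As an aside, pandas can express the same ordinal relationship with an ordered Categorical; a minimal sketch, noting that the resulting codes are zero-based (0, 1, 2) rather than the 1, 2, 3 chosen in size_mapping above:
import pandas as pd
# declare the order M < L < XL and read off the integer codes
sizes = pd.Categorical(['M', 'L', 'XL'], categories=['M', 'L', 'XL'], ordered=True)
sizes.codes   # array([0, 1, 2], dtype=int8)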
Often, machine learning algorithms require that categorical variables be converted into dummy variables (also called one-hot encoding). For example, a single feature Fruit would be converted into three features, Apples, Oranges, and Bananas, one for each category in the categorical feature.
There are two common ways to preprocess categorical features: using pandas or using scikit-learn.
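To make the Fruit example concrete, here is a minimal sketch; the Fruit column and its values are just the hypothetical feature from the paragraph above:
import pandas as pd
fruit = pd.DataFrame({'Fruit': ['Apples', 'Oranges', 'Bananas', 'Apples']})
pd.get_dummies(fruit['Fruit'])   # one binary column per category: Apples, Bananas, Oranges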
In [5]:
# using pandas 'get_dummies': only the string column 'color' is converted;
# the numeric 'price' and the already-mapped 'size' pass through unchanged
pd.get_dummies(df[['price','color', 'size']])
Out[5]:
In [6]:
# get_dummies on a single column returns just the dummy columns for that column
pd.get_dummies(df['color'])
Out[6]:
In [7]:
# join the color dummies back onto the remaining feature columns
pd.get_dummies(df['color']).join(df[['size', 'price']])
Out[7]:
In [8]:
# using scikit-learn LabelEncoder and OneHotEncoder
from sklearn.preprocessing import LabelEncoder
color_le = LabelEncoder()
df['color'] = color_le.fit_transform(df['color'])
df
Out[8]:
In [9]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# OneHotEncoder expects a 2D array, hence the reshape; toarray() converts the sparse result
color = ohe.fit_transform(df['color'].values.reshape(-1, 1)).toarray()
df_color = pd.DataFrame(color, columns = ['blue', 'green', 'red'])
df_color
Out[9]:
In [10]:
df[['size', 'price']].join(df_color)
Out[10]:
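As a side note, in scikit-learn 0.20 and later OneHotEncoder can consume string columns directly, so the LabelEncoder step above is mainly needed for older releases. A minimal sketch on the original string colors (the colors DataFrame below is constructed just for illustration):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

colors = pd.DataFrame({'color': ['green', 'red', 'blue']})   # original string values
ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors[['color']]).toarray()     # sparse matrix -> dense array
pd.DataFrame(encoded, columns=ohe.categories_[0])            # columns: blue, green, red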
In [11]:
import numpy as np
# build a mapping of class label strings to integers
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)
df
Out[11]:
In [12]:
# restore the original string labels (they were mapped to integers above),
# then encode them with scikit-learn's LabelEncoder
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)

from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df['classlabel'] = class_le.fit_transform(df['classlabel'].values)
df
Out[12]:
In [13]:
# map the integer codes back to the original class label strings
class_le.inverse_transform(df.classlabel)
Out[13]:
In [14]:
# recreate the original dataframe to demonstrate patsy's design-matrix encoding
import patsy
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
In [15]:
# Convert df['color'] into a categorical variable, setting one category as the baseline
patsy.dmatrix('color', df, return_type='dataframe')
Out[15]:
In [16]:
# Convert df['color'] into a categorical variable without setting one category as baseline
patsy.dmatrix('color-1', df, return_type='dataframe')
Out[16]:
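Note the difference between the two design matrices. The first keeps an Intercept column and drops one color level as the baseline (treatment coding), which avoids the perfect collinearity of an intercept plus a dummy for every level. The '- 1' formula removes the intercept and keeps a column for each of the three colors, matching the get_dummies output above.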