Handling categorical data

In this notebook, I'll demonstrate different ways of mapping or encoding categorical data.

In [1]:
# create a pandas dataframe with categorical variables to work with
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']

   color  size  price  classlabel
0  green     M   10.1      class1
1    red     L   13.5      class2
2   blue    XL   15.3      class1

1. Mapping ordinal features

  • Create a mapping dictionary first and then map the categorical string values into integers.

In [2]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)

   color  size  price  classlabel
0  green     1   10.1      class1
1    red     2   13.5      class2
2   blue     3   15.3      class1
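
As an alternative sketch (not part of the original notebook), pandas can record the ordering itself through an ordered Categorical. The category order below is an assumption chosen to mirror size_mapping, and the resulting codes start at 0 rather than 1.

```python
import pandas as pd

sizes = pd.Series(['M', 'L', 'XL'])

# declare the order explicitly, smallest first (assumed to mirror size_mapping)
size_cat = pd.Categorical(sizes, categories=['M', 'L', 'XL'], ordered=True)

# integer codes follow the declared order: M -> 0, L -> 1, XL -> 2
size_codes = pd.Series(size_cat.codes, index=sizes.index)
```

The dictionary approach stays the more explicit choice when the integers themselves need specific values.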

In [3]:
# transform integers back to string values using a reverse-mapping dictionary
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object

  • Create a function that converts strings into numbers.

In [4]:
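def size_to_numeric(x):
    # return the ordinal rank of a size string; unknown values fall through to None
    if x == 'XL':
        return 3
    if x == 'L':
        return 2
    if x == 'M':
        return 1

# Note: 'size' was already mapped to integers in In [2], so no string comparison
# matches and every row gets None; run this on the original string column instead.
df['size_num'] = df['size'].apply(size_to_numeric)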

   color  size  price  classlabel  size_num
0  green     1   10.1      class1      None
1    red     2   13.5      class2      None
2   blue     3   15.3      class1      None
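
The None values above arise because size already holds integers when the function runs. A minimal sketch on a fresh string column (the Series `sizes` is a hypothetical stand-in for the unmapped data) suggests that the function and the dictionary agree once they see the original strings:

```python
import pandas as pd

def size_to_numeric(x):
    # same rules as the function above; unknown sizes fall through to None
    if x == 'XL':
        return 3
    if x == 'L':
        return 2
    if x == 'M':
        return 1

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# hypothetical stand-in for the original, unmapped 'size' column
sizes = pd.Series(['M', 'L', 'XL'], name='size')

via_function = sizes.apply(size_to_numeric)
via_mapping = sizes.map(size_mapping)
```

Both results hold the same integers, so the choice between the two is mostly a matter of taste; the dictionary is shorter and easy to invert.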

2. Convert nominal categorical feature into dummy variables

Machine learning algorithms often require categorical variables to be converted into dummy variables (also called one-hot encoding). For example, a single feature Fruit with the categories Apples, Oranges, and Bananas would be converted into three binary features, one for each category.

There are two common ways to preprocess categorical features: using pandas or scikit-learn.
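
To make the Fruit example concrete, here is a small sketch with made-up data (the frame `fruit` is hypothetical). It passes dtype=int so the output is 0/1 regardless of the default dtype of the installed pandas version.

```python
import pandas as pd

# hypothetical data for the Fruit example
fruit = pd.DataFrame({'Fruit': ['Apples', 'Oranges', 'Bananas', 'Apples']})

# one column per category, sorted alphabetically; each row has exactly one 1
dummies = pd.get_dummies(fruit['Fruit'], dtype=int)
```

Every row sums to 1 because each observation belongs to exactly one category.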

In [5]:
# using pandas 'get_dummies'
pd.get_dummies(df[['price','color', 'size']])

   price  size  color_blue  color_green  color_red
0   10.1     1         0.0          1.0        0.0
1   13.5     2         0.0          0.0        1.0
2   15.3     3         1.0          0.0        0.0

In [6]:
# using pandas 'get_dummies' on the 'color' column alone
pd.get_dummies(df['color'])

   blue  green  red
0   0.0    1.0  0.0
1   0.0    0.0  1.0
2   1.0    0.0  0.0

In [7]:
pd.get_dummies(df['color']).join(df[['size', 'price']])

   blue  green  red  size  price
0   0.0    1.0  0.0     1   10.1
1   0.0    0.0  1.0     2   13.5
2   1.0    0.0  0.0     3   15.3

In [8]:
# using scikit-learn LabelEncoder and OneHotEncoder
from sklearn.preprocessing import LabelEncoder
color_le = LabelEncoder()
df['color'] = color_le.fit_transform(df['color'])

   color  size  price  classlabel  size_num
0      1     1   10.1      class1      None
1      2     2   13.5      class2      None
2      0     3   15.3      class1      None

In [9]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# a pandas Series has no reshape method, so reshape the underlying NumPy array
color = ohe.fit_transform(df['color'].values.reshape(-1, 1)).toarray()
df_color = pd.DataFrame(color, columns=['blue', 'green', 'red'])

   blue  green  red
0   0.0    1.0  0.0
1   0.0    0.0  1.0
2   1.0    0.0  0.0

In [10]:
df[['size', 'price']].join(df_color)

   size  price  blue  green  red
0     1   10.1   0.0    1.0  0.0
1     2   13.5   0.0    0.0  1.0
2     3   15.3   1.0    0.0  0.0
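
In recent scikit-learn releases, OneHotEncoder accepts string columns directly, so the LabelEncoder step may be unnecessary. The sketch below assumes such a release and a fresh frame `df_raw` (a hypothetical copy of the original data); it reads the column names from the fitted encoder's categories_ attribute instead of hard-coding them.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical fresh copy of the relevant columns
df_raw = pd.DataFrame({'color': ['green', 'red', 'blue'],
                       'price': [10.1, 13.5, 15.3]})

ohe = OneHotEncoder()
# double brackets keep a 2-D frame, which is the shape the encoder expects
encoded = ohe.fit_transform(df_raw[['color']]).toarray()

# categories_ holds the sorted category labels for each input column
df_color = pd.DataFrame(encoded, columns=list(ohe.categories_[0]))
result = df_raw[['price']].join(df_color)
```

Taking the names from categories_ avoids a silent mismatch if the data ever contains a color the hard-coded list does not.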

3. Encoding class labels

  • Create a mapping dictionary by enumerating unique categories. Note that class labels are not ordinal; they are nominal.

In [11]:
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)

   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0

  • Use LabelEncoder in scikit-learn to convert class labels into integers.

In [13]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df['classlabel'] = class_le.fit_transform(df['classlabel'].values)

   color  size  price  classlabel
0  green     M   10.1           0
1    red     L   13.5           1
2   blue    XL   15.3           0

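In [17]:
# presumably the call that produced the output below: map integer labels back to strings
class_le.inverse_transform(df['classlabel'])

array(['class1', 'class2', 'class1'], dtype=object)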
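
The two approaches in this section can be compared side by side. This is a minimal sketch on a standalone Series (`labels` is a hypothetical stand-in for the classlabel column); it relies on np.unique and LabelEncoder both sorting the labels, which is why they should assign the same integers.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical stand-in for df['classlabel']
labels = pd.Series(['class1', 'class2', 'class1'])

# dictionary approach: np.unique returns the labels sorted
class_mapping = {label: idx for idx, label in enumerate(np.unique(labels))}
inv_class_mapping = {idx: label for label, idx in class_mapping.items()}
encoded_dict = labels.map(class_mapping)

# LabelEncoder approach: classes_ is sorted the same way
class_le = LabelEncoder()
encoded_le = class_le.fit_transform(labels.values)

# both encodings can be reversed to recover the original strings
restored_dict = encoded_dict.map(inv_class_mapping)
restored_le = class_le.inverse_transform(encoded_le)
```

The integers carry no order here; they only give each class a distinct identifier.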

4. Convert categorical variable with Patsy

In [12]:
import patsy
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']

In [13]:
# Convert df['color'] into a categorical variable, setting one category as the baseline
patsy.dmatrix('color', df, return_type='dataframe')

   Intercept  color[T.green]  color[T.red]
0        1.0             1.0           0.0
1        1.0             0.0           1.0
2        1.0             0.0           0.0

In [14]:
# Convert df['color'] into a categorical variable without setting one category as baseline
patsy.dmatrix('color-1', df, return_type='dataframe')

   color[blue]  color[green]  color[red]
0          0.0           1.0         0.0
1          0.0           0.0         1.0
2          1.0           0.0         0.0
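
For comparison, the same two codings can be sketched in pandas (an analogy, not Patsy's API). With drop_first=True, get_dummies drops the first category as the baseline, much like the 'color' formula without its Intercept column; the default keeps every level, like 'color-1'. The Series `colors` is a hypothetical stand-in for df['color'].

```python
import pandas as pd

# hypothetical stand-in for df['color']
colors = pd.Series(['green', 'red', 'blue'], name='color')

# baseline coding: 'blue' (first alphabetically) is dropped and implied by all zeros
baseline = pd.get_dummies(colors, prefix='color', drop_first=True, dtype=float)

# full coding: one column per level, no baseline
full = pd.get_dummies(colors, prefix='color', dtype=float)
```

Baseline coding is the usual choice for models with an intercept, since keeping every level alongside an intercept makes the columns linearly dependent.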