In this notebook, I'll demonstrate different ways of mapping or encoding categorical data.
In [1]:
# create a pandas dataframe with categorical variables to work with
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df
Out[1]:
In [2]:
# map the ordinal 'size' feature to integers that preserve the order M < L < XL
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}
df['size'] = df['size'].map(size_mapping)
df
Out[2]:
In [3]:
# transform integers back to string values using a reverse-mapping dictionary
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
Out[3]:
In [4]:
# the same mapping written as a function and applied with 'apply'
def size_to_numeric(x):
    if x == 'XL':
        return 3
    if x == 'L':
        return 2
    if x == 'M':
        return 1

# df['size'] already holds integers after the mapping above, so map it back
# to the string labels before applying the function
df['size_num'] = df['size'].map(inv_size_mapping).apply(size_to_numeric)
df
Out[4]:
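As an aside, pandas can express the same ordinal relationship with an ordered Categorical; a minimal sketch, noting that the resulting codes are zero-based (0, 1, 2) rather than the 1, 2, 3 chosen in size_mapping above:
import pandas as pd
# declare the order M < L < XL and read off the integer codes
sizes = pd.Categorical(['M', 'L', 'XL'], categories=['M', 'L', 'XL'], ordered=True)
sizes.codes   # array([0, 1, 2], dtype=int8)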
Often, machine learning algorithms require that categorical variables be converted into dummy variables (also called one-hot encoding). For example, a single feature Fruit would be converted into three features, Apples, Oranges, and Bananas, one for each category in the categorical feature.
There are two common ways to preprocess categorical features: using pandas or using scikit-learn.
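To make the Fruit example concrete, here is a minimal sketch; the Fruit column and its values are just the hypothetical feature from the paragraph above:
import pandas as pd
fruit = pd.DataFrame({'Fruit': ['Apples', 'Oranges', 'Bananas', 'Apples']})
pd.get_dummies(fruit['Fruit'])   # one binary column per category: Apples, Bananas, Oranges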
In [5]:
# using pandas 'get_dummies': only the string column 'color' is converted;
# the numeric 'price' and the already-mapped 'size' pass through unchanged
pd.get_dummies(df[['price','color', 'size']])
Out[5]:
In [6]:
# get_dummies on a single column returns just the dummy columns for that column
pd.get_dummies(df['color'])
Out[6]:
In [7]:
# join the color dummies back onto the remaining feature columns
pd.get_dummies(df['color']).join(df[['size', 'price']])
Out[7]:
In [8]:
# using scikit-learn LabelEncoder and OneHotEncoder
from sklearn.preprocessing import LabelEncoder
color_le = LabelEncoder()
df['color'] = color_le.fit_transform(df['color'])
df
Out[8]:
In [9]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# OneHotEncoder expects a 2D array, hence the reshape; toarray() converts the sparse result
color = ohe.fit_transform(df['color'].values.reshape(-1, 1)).toarray()
df_color = pd.DataFrame(color, columns = ['blue', 'green', 'red'])
df_color
Out[9]:
In [10]:
df[['size', 'price']].join(df_color)
Out[10]:
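As a side note, in scikit-learn 0.20 and later OneHotEncoder can consume string columns directly, so the LabelEncoder step above is mainly needed for older releases. A minimal sketch on the original string colors (the colors DataFrame below is constructed just for illustration):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

colors = pd.DataFrame({'color': ['green', 'red', 'blue']})   # original string values
ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors[['color']]).toarray()     # sparse matrix -> dense array
pd.DataFrame(encoded, columns=ohe.categories_[0])            # columns: blue, green, red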
In [11]:
import numpy as np
# build a mapping of class label strings to integers
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)
df
Out[11]:
In [12]:
# restore the original string labels (they were mapped to integers above),
# then encode them with scikit-learn's LabelEncoder
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)

from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df['classlabel'] = class_le.fit_transform(df['classlabel'].values)
df
Out[12]:
In [13]:
# map the integer codes back to the original class label strings
class_le.inverse_transform(df.classlabel)
Out[13]:
In [14]:
# recreate the original dataframe to demonstrate patsy's design-matrix encoding
import patsy
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
In [15]:
# Convert df['color'] into a categorical variable, setting one category as the baseline
patsy.dmatrix('color', df, return_type='dataframe')
Out[15]:
In [16]:
# Convert df['color'] into a categorical variable without setting one category as baseline
patsy.dmatrix('color-1', df, return_type='dataframe')
Out[16]:
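Note the difference between the two design matrices. The first keeps an Intercept column and drops one color level as the baseline (treatment coding), which avoids the perfect collinearity of an intercept plus a dummy for every level. The '- 1' formula removes the intercept and keeps a column for each of the three colors, matching the get_dummies output above.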