Dealing With Categorical Features

Often features are not given as continuous values but rather as categorical classes. For example, a variable might take values from ["male", "female"], ["Europe", "US", "Asia"], or ["Disease A", "Disease B", "Disease C"]. Such features can be efficiently coded as integers; for instance, ["male", "US", "Disease B"] could be expressed as [0, 1, 1].

Unfortunately, such an integer representation cannot be used directly with scikit-learn estimators, because these expect continuous input and would therefore interpret the categories as being ordered, which would be inappropriate for the examples above.

There are two ways in which we can handle categorical data:

  1. Convert the categories to integer labels (label encoding; a sketch follows this list)
  2. Convert the labels to binary indicator variables (one-hot encoding)
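
A minimal sketch of the first approach, using scikit-learn's LabelEncoder (note that LabelEncoder is designed for target labels; for input features the integer codes are usually only an intermediate step):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['red', 'blue', 'green', 'green'])
print(codes)                         # [2 0 1 1] -- classes are sorted alphabetically
print(le.classes_)                   # ['blue' 'green' 'red']
print(le.inverse_transform(codes))   # recover the original strings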

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, Binarizer

One-hot encoding

One approach is to use a "one-of-K" or "one-hot" encoding, which is implemented in OneHotEncoder. This estimator transforms a categorical feature with m possible values into m binary features, with only one active.

With one-hot encoding, a categorical feature becomes an array whose size is the number of possible values for that feature.

In older versions of scikit-learn (before 0.20), OneHotEncoder could not process string values directly, so nominal string features first had to be mapped to integers. Recent versions accept string columns directly, as in the OneHotEncoder example further down.
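
For completeness, a minimal sketch of such a manual mapping with pandas (the dictionary below is purely illustrative):

colors = pd.Series(['red', 'blue', 'green', 'green'])
color_map = {'blue': 0, 'green': 1, 'red': 2}   # illustrative, hand-chosen codes
print(colors.map(color_map).values)             # [2 0 1 1]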

pandas.get_dummies by default only converts columns with object or category dtype into a one-hot representation; numeric columns are left alone unless they are explicitly listed via the columns argument, as sketched below.
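
For instance (demo is a throwaway frame used only for this illustration):

demo = pd.DataFrame({'color': ['red', 'blue'], 'score': [1, 2]})
print(pd.get_dummies(demo))                     # encodes only the string column
print(pd.get_dummies(demo, columns=['score']))  # forces encoding of 'score'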


In [3]:
## Create a simple data frame as an example
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green'], 'score': [1, 1, 2, 3]})
df.head()


Out[3]:
   color  score
0    red      1
1   blue      1
2  green      2
3  green      3

In [4]:
## use pandas to do one-hot encoding
pd.get_dummies(df, prefix=['color'])


Out[4]:
   score  color_blue  color_green  color_red
0      1           0            0          1
1      1           1            0          0
2      2           0            1          0
3      3           0            1          0

In [5]:
ohe = OneHotEncoder()
## OneHotEncoder expects a 2D array, hence the reshape
column = df['color'].values.reshape(-1, 1)
ohe.fit(column)
## categories_ holds the detected category values for each feature
labels = ohe.categories_[0].tolist()
## transform returns a sparse matrix by default; densify it for the DataFrame
X = ohe.transform(column).toarray()
df2 = pd.DataFrame({value: X[:, i].astype(int) for (i, value) in enumerate(labels)})
df2.head()


Out[5]:
   blue  green  red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      1    0

In [19]:
## append the one-hot columns to the original frame
df_new = pd.concat([df, df2], axis=1)
df_new.head()


Out[19]:
   color  score  blue  green  red
0    red      1     0      0    1
1   blue      1     1      0    0
2  green      2     0      1    0
3  green      3     0      1    0

In [86]:
## It is worth noting that there is also a feature binarizer in sklearn
scores = df['score'].values
## Binarizer expects a 2D array, hence the reshape
scores = scores.reshape(1, scores.size)
## values strictly greater than the threshold become 1, the rest 0
bnz = Binarizer(threshold=1)
bnz.fit(scores)   # fit is a no-op here; Binarizer is stateless
transformed_scores = bnz.transform(scores)
print(scores)
print(transformed_scores)


[[1 1 2 3]]
[[0 0 1 1]]

In [89]:
## An example of how to handle missing values with scikit-learn.
## Imputer was deprecated in 0.20 and removed in 0.22; SimpleImputer replaces it.
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
## column means learned from the fit data: (1+7)/2 = 4 and (2+3+6)/3 = 3.67
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(X)
print(imp.transform(X))


[[nan, 2], [6, nan], [7, 6]]
[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
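
The same interface supports other strategies. A sketch using strategy='most_frequent', which also works on string columns and may be handy for Part 2 of the assignment below (the data here is made up for illustration):

from sklearn.impute import SimpleImputer

## most_frequent replaces missing entries with the column's mode,
## which works for categorical (string) data as well
imp_mode = SimpleImputer(strategy='most_frequent')
X_cat = [['red', 1], ['red', np.nan], [np.nan, 3], ['blue', 3]]
print(imp_mode.fit_transform(X_cat))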

Assignment

Part 1: Load data/titanic.csv and do some exploratory data analysis (EDA). Make at least two plots.

Part 2: Handle missing values by replacing them with the mean (for numeric columns) or the most frequent value (for categorical columns).

Part 3: Run a logistic regression with each of the following preprocessing methods:

  • one-hot encoding using pandas get_dummies
  • one-hot encoding done entirely within a sklearn Pipeline (a possible skeleton follows)

Part 4: Draw an ROC curve for the training data and for the test data.
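
A possible skeleton for the Pipeline version of Part 3 (the column names below are assumptions -- check them against the actual titanic.csv schema):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ['sex', 'embarked']   # hypothetical column names
numeric = ['age', 'fare']           # hypothetical column names

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), numeric),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
     categorical),
])

model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
## model.fit(X_train, y_train) once the data has been split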

