Often features are not given as continuous values, but rather as categorical classes. For example, variables may be defined as ["male", "female"], ["Europe", "US", "Asia"], or ["Disease A", "Disease B", "Disease C"]. Such features can be efficiently coded as integers; for instance, ["male", "US", "Disease B"] could be expressed as [0, 1, 1].
Unfortunately, an integer representation cannot be used directly with estimators in scikit-learn, because these expect continuous input and would therefore interpret the categories as being ordered, which, for the above examples, would be inappropriate.
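As a quick illustration of this integer coding, here is a minimal sketch using scikit-learn's LabelEncoder; note that LabelEncoder assigns codes in sorted order of the values, so the exact integers may differ from the [0, 1, 1] above.

from sklearn.preprocessing import LabelEncoder

## Encode each categorical variable separately with LabelEncoder
for name, values in [('sex', ['male', 'female']),
                     ('region', ['Europe', 'US', 'Asia']),
                     ('disease', ['Disease A', 'Disease B', 'Disease C'])]:
    le = LabelEncoder()
    print(name, le.fit_transform(values), le.classes_)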
There are two ways in which we can handle such categorical data: scikit-learn's OneHotEncoder and pandas' get_dummies. Both are demonstrated below.
In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, Binarizer
One approach is to use a "one-of-K" or "one-hot" encoding, which is implemented in OneHotEncoder. This estimator transforms a categorical feature with m possible values into m binary features, with only one active at a time. In other words, with one-hot encoding a categorical feature becomes an array whose size is the number of possible choices for that feature.
In older versions of scikit-learn (before 0.20), OneHotEncoder could not process string values directly, so nominal string features first had to be mapped to integers; recent versions handle string categories directly, as the example further below shows.
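For those older versions, a common workaround was to chain LabelEncoder and OneHotEncoder; a minimal sketch, using the same color values as the example frame built below:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

## Step 1: map strings to integer codes (sorted order: blue=0, green=1, red=2)
colors = np.array(['red', 'blue', 'green', 'green'])
codes = LabelEncoder().fit_transform(colors)
## Step 2: one-hot encode the integer column (a 2-D input is required)
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
print(codes)
print(onehot)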
pandas.get_dummies by default converts only string/categorical columns into a one-hot representation, unless specific columns are requested via the columns argument.
In [3]:
## Create a simple data frame as an example
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green'], 'score': [1, 1, 2, 3]})
df.head()
Out[3]:
   color  score
0    red      1
1   blue      1
2  green      2
3  green      3
In [4]:
## use pandas to do one hot encoding
pd.get_dummies(df, prefix=['color'])
Out[4]:
   score  color_blue  color_green  color_red
0      1           0            0          1
1      1           1            0          0
2      2           0            1          0
3      3           0            1          0
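To see the columns argument mentioned above in action, naming a numeric column forces it to be dummified as well; a quick sketch:

## Explicitly listing columns also dummifies the numeric 'score' column
pd.get_dummies(df, columns=['color', 'score'])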
In [5]:
## use scikit-learn's OneHotEncoder (>= 0.20 accepts strings directly)
ohe = OneHotEncoder()
column = df['color'].values.reshape(-1, 1)  # the encoder expects a 2-D array
ohe.fit(column)
labels = ohe.categories_[0].tolist()        # category names, in sorted order
X = ohe.transform(column).toarray()         # dense 0/1 matrix, one column per category
df2 = pd.DataFrame({value: X[:, i].astype(int) for (i, value) in enumerate(labels)})
df2.head()
Out[5]:
   blue  green  red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      1    0
In [19]:
## append the one-hot columns to the original frame
df_new = pd.concat([df, df2], axis=1)
df_new.head()
Out[19]:
   color  score  blue  green  red
0    red      1     0      0    1
1   blue      1     1      0    0
2  green      2     0      1    0
3  green      3     0      1    0
In [86]:
## It is worth noting that there is also a feature binarizer in scikit-learn
scores = df['score'].values
scores = scores.reshape(1, -1)      # Binarizer expects a 2-D array
bnz = Binarizer(threshold=1)        # values > 1 become 1, the rest 0
bnz.fit(scores)                     # stateless, but fit keeps the API uniform
transformed_scores = bnz.transform(scores)
print(scores)
print(transformed_scores)
In [89]:
## An example of how to handle missing values with scikit-learn
## (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(X)
print(imp.transform(X))
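SimpleImputer also supports strategy='most_frequent', which works for string/categorical features as well and is one option for Part 2 of the exercise below; a minimal sketch:

import numpy as np
from sklearn.impute import SimpleImputer

## Mode imputation: fills missing entries with the most frequent value
imp_mode = SimpleImputer(strategy='most_frequent')
print(imp_mode.fit_transform([['red'], [np.nan], ['red'], ['blue']]))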
Part 1: Load data/titanic.csv and do some EDA. Make at least two plots.
Part 2: Handle missing values by replacing missing ones with the mean or the most frequent value.
Part 3: Run a logistic regression with each of the following preprocessing methods: get_dummies and OneHotEncoder.
Part 4: Draw an ROC curve for the train data and for the test data. A starter sketch for the whole exercise follows.
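A minimal starter sketch for the exercise, assuming the standard Kaggle Titanic columns (Survived, Pclass, Sex, Age, Fare, Embarked); this is one possible workflow using get_dummies, not a reference solution, so adjust the column names to whatever data/titanic.csv actually contains.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

## Assumed columns; adjust to the actual contents of data/titanic.csv
df = pd.read_csv('data/titanic.csv')
X = df[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']].copy()
y = df['Survived']

## Part 2: mean for numeric columns, most frequent for categorical ones
X['Age'] = X['Age'].fillna(X['Age'].mean())
X['Fare'] = X['Fare'].fillna(X['Fare'].mean())
X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])

## Part 3: one-hot encode the categorical columns via get_dummies
X = pd.get_dummies(X, columns=['Sex', 'Embarked'])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(solver='liblinear').fit(X_train, y_train)

## Part 4: ROC curves for train and test data
for name, X_, y_ in [('train', X_train, y_train), ('test', X_test, y_test)]:
    scores = clf.predict_proba(X_)[:, 1]
    fpr, tpr, _ = roc_curve(y_, scores)
    plt.plot(fpr, tpr, label='%s (AUC = %.2f)' % (name, auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()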