Often features are not given as continuous values but as categorical classes. For example, variables may be defined over `["male", "female"]`, `["Europe", "US", "Asia"]`, or `["Disease A", "Disease B", "Disease C"]`. Such features can be efficiently coded as integers; for instance, `["male", "US", "Disease B"]` could be expressed as `[0, 1, 1]`.

Unfortunately, an integer representation cannot be used directly with estimators in scikit-learn, because these expect *continuous* input and would therefore interpret the categories as ordered, which would be inappropriate for the examples above.

There are two ways in which we can handle categorical data:

- Convert the categorical data to integer labels (a sketch follows this list)
- Convert the labels to binary variables (one-hot encoding)
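
As a quick illustration of the first approach, here is a minimal sketch using scikit-learn's `LabelEncoder` (imported in the next cell); the disease values are made-up example data:

```
## A minimal label-encoding sketch; the input values are illustrative only
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels = le.fit_transform(["Disease B", "Disease A", "Disease C", "Disease A"])
print(labels)                         # [1 0 2 0]
print(le.classes_)                    # ['Disease A' 'Disease B' 'Disease C']
print(le.inverse_transform([2, 1]))   # ['Disease C' 'Disease B']
```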


In [2]:

```
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, Binarizer
```

One approach is to use a "one-of-K" or "one-hot" encoding, which is implemented in `OneHotEncoder`. This estimator transforms a categorical feature with `m` possible values into `m` binary features, with only one active at a time.

With one-hot encoding, a categorical feature becomes an array whose size is the number of possible choices for that feature.

In older versions of scikit-learn (before 0.20), `OneHotEncoder` could not process string values directly, so nominal string features first had to be mapped to integers (for example with `LabelEncoder`); newer versions handle strings natively, as the example below shows.

`pandas.get_dummies` by default only converts string columns into a one-hot representation, unless specific columns are requested via its `columns` argument.

In [3]:

```
## Create a simple data frame as an example
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green'],
                   'score': [1, 1, 2, 3]})
df.head()
```

Out[3]:

```
   color  score
0    red      1
1   blue      1
2  green      2
3  green      3
```

In [4]:

```
## Use pandas to do one-hot encoding
pd.get_dummies(df, prefix=['color'])
```

Out[4]:

```
   score  color_blue  color_green  color_red
0      1           0           0          1
1      1           1           0          0
2      2           0           1          0
3      3           0           1          0
```
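
To see the default noted above in action: `get_dummies` left the numeric `score` column untouched, but passing `columns` explicitly makes it encode numeric columns as well. A small sketch (my own addition, reusing the `df` defined above):

```
## Listing columns explicitly also one-hot encodes the numeric 'score' column
pd.get_dummies(df, columns=['color', 'score'])
```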

In [5]:

```
## One-hot encode the 'color' column with sklearn's OneHotEncoder
ohe = OneHotEncoder()
column = df['color'].values.reshape(-1, 1)  # the encoder expects a 2-D array
ohe.fit(column)
labels = ohe.categories_[0].tolist()        # category names, in sorted order
X = ohe.transform(column).toarray()         # dense 0/1 matrix
df2 = pd.DataFrame({value: X[:, i].astype(int) for (i, value) in enumerate(labels)})
df2.head()
```

Out[5]:

```
   blue  green  red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      1    0
```
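
One practical detail: by default `OneHotEncoder` raises an error when asked to transform a category it did not see during `fit`. With `handle_unknown='ignore'`, an unseen category is encoded as an all-zero row instead. A short sketch reusing `column` from above; the 'purple' value is a made-up example:

```
## Unseen categories become all-zero rows instead of raising an error
ohe2 = OneHotEncoder(handle_unknown='ignore')
ohe2.fit(column)
print(ohe2.transform([['purple']]).toarray())   # [[0. 0. 0.]]
```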

In [19]:

```
## Append the one-hot columns to the original frame
df_new = pd.concat([df, df2], axis=1)
df_new.head()
```

Out[19]:

```
   color  score  blue  green  red
0    red      1     0      0    1
1   blue      1     1      0    0
2  green      2     0      1    0
3  green      3     0      1    0
```

In [86]:

```
## It is worth noting that there is also a feature binarizer in sklearn:
## values strictly greater than the threshold map to 1, the rest to 0
scores = df['score'].values
scores = scores.reshape(1, scores.size)
bnz = Binarizer(threshold=1)
bnz.fit(scores)   # fit is a no-op for Binarizer; kept for API consistency
transformed_scores = bnz.transform(scores)
print(scores)
print(transformed_scores)
```

```
[[1 1 2 3]]
[[0 0 1 1]]
```
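
For a single pandas column, the same thresholding can also be written directly, which is sometimes more readable; a one-line equivalent (my own sketch):

```
## Equivalent pandas one-liner: 1 where score > 1, else 0
print((df['score'] > 1).astype(int).values)   # [0 0 1 1]
```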

```
In [89]:
```## An example of how to handle missing values from Sklearn
import numpy as np
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(X)
print(imp.transform(X))

```
[[nan, 2], [6, nan], [7, 6]]
[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
```
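
On current scikit-learn the same example is written with `SimpleImputer`; a minimal sketch of the modern equivalent (assuming scikit-learn >= 0.22):

```
## Modern replacement for the removed Imputer class
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
print(imp.transform([[np.nan, 2], [6, np.nan], [7, 6]]))
```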

Part 1: Load `data/titanic.csv` and do some EDA. Make at least two plots.

Part 2: Handle missing values by replacing them with the mean or the most frequent value.

Part 3: Run a logistic regression with each of the following preprocessing methods (a starting-point sketch follows Part 4):

- one-hot encoding using pandas' `get_dummies`
- one-hot encoding done entirely within a sklearn `Pipeline`

Part 4: Draw an ROC curve for the train data and for the test data.
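
As a starting point for the second variant of Part 3, here is a minimal hedged sketch of a pipeline that one-hot encodes and fits a logistic regression. The column names (`Sex`, `Pclass`, `Survived`) are assumptions about what `data/titanic.csv` contains; adjust them to the actual file:

```
## A hypothetical starting point: one-hot encoding inside a sklearn Pipeline.
## Column names ('Sex', 'Pclass', 'Survived') are assumed, not verified.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

titanic = pd.read_csv('data/titanic.csv')
X = titanic[['Sex', 'Pclass']]
y = titanic['Survived']

pipe = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('logreg', LogisticRegression(solver='liblinear')),
])
pipe.fit(X, y)
print(pipe.score(X, y))   # training accuracy
```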
