In [36]:
import pandas as pd
import numpy as np
In [27]:
data = pd.DataFrame(data=[[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]], columns=['feature 1', 'feature 2', 'feature 3'])
data
Out[27]:
Also create a DataFrame of strings to make this a little more intuitive.
In [195]:
gender = 'male', 'female'
country = 'France', 'UK', 'Germany'
color = 'blue', 'red', 'green', 'purple'
In [196]:
df = data.copy()
for i, category in enumerate([gender, country, color]):
    df.iloc[:, i] = data.iloc[:, i].apply(lambda j: category[j])
df.columns = ['gender', 'country', 'color']
df
Out[196]:
In [197]:
from sklearn.preprocessing import LabelEncoder
In [198]:
le = LabelEncoder()
le.fit(gender + country + color)
print(le.classes_)
values_t = le.transform(df.values.ravel()).reshape(df.shape)
values_t
Out[198]:
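Since the Out cell is not reproduced here, note that `classes_` holds the sorted union of all labels seen during `fit` (a self-contained sketch; uppercase letters sort before lowercase in ASCII order):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['male', 'female', 'France', 'UK', 'Germany',
        'blue', 'red', 'green', 'purple'])
# classes_ is sorted: country names (capitalized) come first
print(le.classes_)
```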
In [199]:
df_t = pd.DataFrame(data=values_t, columns=[c + '(int)' for c in df.columns])
df_t
Out[199]:
In [201]:
labenc_lst = []
df_t2 = df.copy()
for category in df.columns:
    le2 = LabelEncoder()
    df_t2[category] = le2.fit_transform(df[category])
    labenc_lst.append(le2)
df_t2
Out[201]:
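As a sanity check (a self-contained sketch that rebuilds the same `df` as above; the variable names `encoders` and `recovered` are illustrative), each per-column `LabelEncoder` can map the integers back to the original strings with `inverse_transform`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'gender': ['male', 'female', 'male', 'female'],
                   'country': ['France', 'UK', 'Germany', 'France'],
                   'color': ['purple', 'blue', 'red', 'green']})

encoders = {}
df_int = df.copy()
for col in df.columns:
    le = LabelEncoder()
    df_int[col] = le.fit_transform(df[col])  # classes are sorted alphabetically
    encoders[col] = le

# Round trip: integer codes map back to the original strings
recovered = encoders['color'].inverse_transform(df_int['color'])
```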
Note that the Label Encoder is not appropriate for regressions and similar techniques that compute distances between samples. For example, with the per-column encoding above the distance between 'red' and 'blue' is 3, whereas the distance between 'purple' and 'red' is 1. This would have an 'unphysical' effect on regression models. To avoid this, use the One Hot Encoder. The drawback of the One Hot Encoder is that it increases the number of features.
Some algorithms, such as decision trees (e.g. random forests), do not rely on pairwise distances, so they can be used in combination with the Label Encoder. See http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning for more discussion.
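The cells below read from an encoder `enc` whose creation is not shown. The attributes they use (`n_values_`, `feature_indices_`, `active_features_`) belong to the legacy scikit-learn `OneHotEncoder` API (removed in version 0.22); a sketch of the missing fitting step with the modern API, which exposes `categories_` instead, would be:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]],
                    columns=['feature 1', 'feature 2', 'feature 3'])

# Legacy (pre-0.22) sklearn would be: enc = OneHotEncoder(); enc.fit(data)
# followed by enc.n_values_, enc.feature_indices_, enc.active_features_.
enc = OneHotEncoder()                         # output is sparse by default
enc.fit(data)

# One output column per distinct value in each input column: 2 + 3 + 4 = 9
n_values = [len(c) for c in enc.categories_]
data_1hot = enc.transform(data).toarray()     # dense (4, 9) array of 0/1
```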
In [162]:
enc.n_values_
Out[162]:
In [163]:
enc.feature_indices_
Out[163]:
In [164]:
mapping = {data.columns[i]: list(range(enc.feature_indices_[i], enc.feature_indices_[i+1]))
for i in range(data.shape[1])}
mapping
Out[164]:
So our feature 1 will be transformed into two columns of booleans (0 or 1), our feature 2 into 3 columns, and our feature 3 into 4 columns. The new columns are listed in the active_features_ attribute of our encoder.
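For comparison (not from the notebook), `pandas.get_dummies` performs the same expansion directly on the string DataFrame, which makes the 2 + 3 + 4 = 9 new boolean columns and their names explicit:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['male', 'female', 'male', 'female'],
                   'country': ['France', 'UK', 'Germany', 'France'],
                   'color': ['purple', 'blue', 'red', 'green']})

# One indicator column per (column, value) pair, e.g. 'color_blue'
dummies = pd.get_dummies(df)
```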
In [165]:
enc.active_features_
Out[165]:
In [166]:
def make_dataframe(sample, columns, **kwargs):
    return pd.DataFrame(data=sample, columns=columns, **kwargs)
original_features = 'feature 1', 'feature 2', 'feature 3'
new_features = ['category ' + str(i) for i in enc.active_features_]
x1 = make_dataframe([[0, 0, 0]], original_features)
x1
Out[166]:
In [167]:
x1_t = enc.transform(x1)
make_dataframe(x1_t, new_features)
Out[167]:
In [168]:
make_dataframe(x1_t, new_features, dtype='bool')
Out[168]:
In [169]:
x2 = make_dataframe([[1,1,1]], original_features)
x2
Out[169]:
In [170]:
x2_t = make_dataframe(enc.transform(x2), new_features)
x2_t
Out[170]:
In [171]:
data_t = make_dataframe(enc.transform(data), new_features, dtype=bool)
In [172]:
import matplotlib.pyplot as plt
plt.spy(data_t)
Out[172]: