14 - Data Preparation

by Alejandro Correa Bahnsen, Iván Torroledo, and Jesus Solano

version 1.5, February 2019

Part of the class Practical Machine Learning

This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham

Handling missing values

scikit-learn models require all input values to be numeric and meaningful, so missing values (NaN) are not allowed and must be dropped or imputed before fitting.


In [2]:
import pandas as pd
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/titanic.csv.zip'
titanic = pd.read_csv(url, index_col=0)
titanic.head()


Out[2]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [2]:
# check for missing values
titanic.isnull().sum()


Out[2]:
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

One possible strategy is to drop missing values:


In [3]:
# drop rows with any missing values
titanic.dropna().shape


Out[3]:
(183, 11)

In [4]:
# drop rows where Age is missing
titanic[titanic.Age.notnull()].shape


Out[4]:
(714, 11)

Sometimes a better strategy is to impute missing values:


In [5]:
# mean Age
titanic.Age.mean()


Out[5]:
29.69911764705882

In [6]:
# median Age
titanic.Age.median()


Out[6]:
28.0

In [7]:
titanic.loc[titanic.Age.isnull()]


Out[7]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NaN 1 0 PC 17569 146.5208 B78 C
33 1 3 Glynn, Miss. Mary Agatha female NaN 0 0 335677 7.7500 NaN Q
37 1 3 Mamee, Mr. Hanna male NaN 0 0 2677 7.2292 NaN C
43 0 3 Kraeff, Mr. Theodor male NaN 0 0 349253 7.8958 NaN C
46 0 3 Rogers, Mr. William John male NaN 0 0 S.C./A.4. 23567 8.0500 NaN S
47 0 3 Lennon, Mr. Denis male NaN 1 0 370371 15.5000 NaN Q
48 1 3 O'Driscoll, Miss. Bridget female NaN 0 0 14311 7.7500 NaN Q
49 0 3 Samaan, Mr. Youssef male NaN 2 0 2662 21.6792 NaN C
56 1 1 Woolner, Mr. Hugh male NaN 0 0 19947 35.5000 C52 S
65 0 1 Stewart, Mr. Albert A male NaN 0 0 PC 17605 27.7208 NaN C
66 1 3 Moubarek, Master. Gerios male NaN 1 1 2661 15.2458 NaN C
77 0 3 Staneff, Mr. Ivan male NaN 0 0 349208 7.8958 NaN S
78 0 3 Moutal, Mr. Rahamin Haim male NaN 0 0 374746 8.0500 NaN S
83 1 3 McDermott, Miss. Brigdet Delia female NaN 0 0 330932 7.7875 NaN Q
88 0 3 Slocovski, Mr. Selman Francis male NaN 0 0 SOTON/OQ 392086 8.0500 NaN S
96 0 3 Shorney, Mr. Charles Joseph male NaN 0 0 374910 8.0500 NaN S
102 0 3 Petroff, Mr. Pastcho ("Pentcho") male NaN 0 0 349215 7.8958 NaN S
108 1 3 Moss, Mr. Albert Johan male NaN 0 0 312991 7.7750 NaN S
110 1 3 Moran, Miss. Bertha female NaN 1 0 371110 24.1500 NaN Q
122 0 3 Moore, Mr. Leonard Charles male NaN 0 0 A4. 54510 8.0500 NaN S
127 0 3 McMahon, Mr. Martin male NaN 0 0 370372 7.7500 NaN Q
129 1 3 Peter, Miss. Anna female NaN 1 1 2668 22.3583 F E69 C
141 0 3 Boulos, Mrs. Joseph (Sultana) female NaN 0 2 2678 15.2458 NaN C
155 0 3 Olsen, Mr. Ole Martin male NaN 0 0 Fa 265302 7.3125 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
719 0 3 McEvoy, Mr. Michael male NaN 0 0 36568 15.5000 NaN Q
728 1 3 Mannion, Miss. Margareth female NaN 0 0 36866 7.7375 NaN Q
733 0 2 Knight, Mr. Robert J male NaN 0 0 239855 0.0000 NaN S
739 0 3 Ivanoff, Mr. Kanio male NaN 0 0 349201 7.8958 NaN S
740 0 3 Nankoff, Mr. Minko male NaN 0 0 349218 7.8958 NaN S
741 1 1 Hawksford, Mr. Walter James male NaN 0 0 16988 30.0000 D45 S
761 0 3 Garfirth, Mr. John male NaN 0 0 358585 14.5000 NaN S
767 0 1 Brewe, Dr. Arthur Jackson male NaN 0 0 112379 39.6000 NaN C
769 0 3 Moran, Mr. Daniel J male NaN 1 0 371110 24.1500 NaN Q
774 0 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 NaN C
777 0 3 Tobin, Mr. Roger male NaN 0 0 383121 7.7500 F38 Q
779 0 3 Kilgannon, Mr. Thomas J male NaN 0 0 36865 7.7375 NaN Q
784 0 3 Johnston, Mr. Andrew G male NaN 1 2 W./C. 6607 23.4500 NaN S
791 0 3 Keane, Mr. Andrew "Andy" male NaN 0 0 12460 7.7500 NaN Q
793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.5500 NaN S
794 0 1 Hoyt, Mr. William Fisher male NaN 0 0 PC 17600 30.6958 NaN C
816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0000 B102 S
826 0 3 Flynn, Mr. John male NaN 0 0 368323 6.9500 NaN Q
827 0 3 Lam, Mr. Len male NaN 0 0 1601 56.4958 NaN S
829 1 3 McCormack, Mr. Thomas Joseph male NaN 0 0 367228 7.7500 NaN Q
833 0 3 Saad, Mr. Amin male NaN 0 0 2671 7.2292 NaN C
838 0 3 Sirota, Mr. Maurice male NaN 0 0 392092 8.0500 NaN S
840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C
847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.5500 NaN S
850 1 1 Goldenberg, Mrs. Samuel L (Edwiga Grabowska) female NaN 1 0 17453 89.1042 C92 C
860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

177 rows × 11 columns

In [ ]:
# most frequent Age
titanic.Age.mode()


In [8]:
# fill missing values for Age with the median age
titanic.Age.fillna(titanic.Age.median(), inplace=True)

Another strategy would be to build a KNN model just to impute missing values. How would we do that?
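
One way, sketched below, is to train a KNeighborsRegressor on the rows where Age is known and predict Age for the rows where it is missing. The choice of predictors and k=5 are only illustrative assumptions, and this would replace the median fill above (the sketch therefore reloads a fresh copy of the data).

In [ ]:
# minimal sketch: KNN-based imputation of Age (run instead of the median fill above)
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv(url, index_col=0)                  # fresh copy that still has missing Ages
impute_cols = ['Pclass', 'SibSp', 'Parch', 'Fare']  # assumed numeric predictors
known, unknown = df[df.Age.notnull()], df[df.Age.isnull()]

knn = KNeighborsRegressor(n_neighbors=5).fit(known[impute_cols], known.Age)
df.loc[df.Age.isnull(), 'Age'] = knn.predict(unknown[impute_cols])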

If values are missing from a categorical feature, we could treat the missing values as another category. Why might that make sense?
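
For instance, Embarked has two missing values; a minimal sketch of that idea follows, where the label 'Missing' is just a placeholder name:

In [ ]:
# treat missing Embarked values as their own category before dummy encoding
embarked_filled = titanic.Embarked.fillna('Missing')
pd.get_dummies(embarked_filled, prefix='Embarked').head()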

How do we choose between all of these strategies?

Handling categorical features

How do we include a categorical feature in our model?

  • Ordered categories: transform them to sensible numeric values (example: small=1, medium=2, large=3); see the sketch after this list
  • Unordered categories: use dummy encoding (0/1)
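
Below is a minimal sketch of the ordered case, using a hypothetical 'size' feature that is not in the Titanic data:

In [ ]:
# map ordered categories to numbers that preserve their order
sizes = pd.Series(['small', 'large', 'medium', 'small'])
sizes.map({'small': 1, 'medium': 2, 'large': 3})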

In [10]:
titanic.head(10)


Out[10]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
6 0 3 Moran, Mr. James male 28.0 0 0 330877 8.4583 NaN Q
7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

In [11]:
# encode Sex_Female feature
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})

In [12]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
# drop the first dummy column (Embarked_C) so it serves as the baseline category
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [13]:
titanic.head(1)


Out[13]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Sex_Female Embarked_Q Embarked_S
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.25 NaN S 0 0 1
  • How do we interpret the encoding for Embarked?
  • Why didn't we just encode Embarked using a single feature (C=0, Q=1, S=2)?
  • Does it matter which category we choose to define as the baseline?
  • Why do we only need two dummy variables for Embarked?

In [14]:
# define X and y
feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived

# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))


0.7937219730941704
C:\Users\albah\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Advanced Categorical Encoding

Mushroom Database

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf


In [87]:
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/agaricus-lepiota.zip'
data = pd.read_csv(url, index_col=None)
data = data.drop(['capcolor', 'stalkcolorabovering', 'odor', 'gillsize', 'sporeprintcolor', 'stalksurfaceabovering',
                  'ringtype', 'stalkroot', 'bruises'], axis=1)
data = data.sample(frac=1, random_state=42)  # shuffle the rows
data.head()


Out[87]:
class capshape capsurface gillattachment gillspacing gillcolor stalkshape stalksurfacebelowring stalkcolorbelowring veiltype veilcolor ringnumber population habitat
1971 e f f f w h t f w p w o s g
6654 p f s f c b t s p p w o v l
5606 p x y f c b t s p p w o v l
3332 e f y f c n t s p p w o y d
6988 p f s f c b t s p p w o v l

In [88]:
data.columns


Out[88]:
Index(['class', 'capshape', 'capsurface', 'gillattachment', 'gillspacing',
       'gillcolor', 'stalkshape', 'stalksurfacebelowring',
       'stalkcolorbelowring', 'veiltype', 'veilcolor', 'ringnumber',
       'population', 'habitat'],
      dtype='object')

Attribute Information: (classes: edible=e, poisonous=p)

 1. cap-shape:                bell=b,conical=c,convex=x,flat=f,
                              knobbed=k,sunken=s
 2. cap-surface:              fibrous=f,grooves=g,scaly=y,smooth=s
 3. cap-color:                brown=n,buff=b,cinnamon=c,gray=g,green=r,
                              pink=p,purple=u,red=e,white=w,yellow=y
 4. bruises?:                 bruises=t,no=f
 5. odor:                     almond=a,anise=l,creosote=c,fishy=y,foul=f,
                              musty=m,none=n,pungent=p,spicy=s
 6. gill-attachment:          attached=a,descending=d,free=f,notched=n
 7. gill-spacing:             close=c,crowded=w,distant=d
 8. gill-size:                broad=b,narrow=n
 9. gill-color:               black=k,brown=n,buff=b,chocolate=h,gray=g,
                              green=r,orange=o,pink=p,purple=u,red=e,
                              white=w,yellow=y
10. stalk-shape:              enlarging=e,tapering=t
11. stalk-root:               bulbous=b,club=c,cup=u,equal=e,
                              rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                              pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                              pink=p,red=e,white=w,yellow=y
16. veil-type:                partial=p,universal=u
17. veil-color:               brown=n,orange=o,white=w,yellow=y
18. ring-number:              none=n,one=o,two=t
19. ring-type:                cobwebby=c,evanescent=e,flaring=f,large=l,
                              none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color:        black=k,brown=n,buff=b,chocolate=h,green=r,
                              orange=o,purple=u,white=w,yellow=y
21. population:               abundant=a,clustered=c,numerous=n,
                              scattered=s,several=v,solitary=y
22. habitat:                  grasses=g,leaves=l,meadows=m,paths=p,
                              urban=u,waste=w,woods=d

Dummies


In [89]:
# binary target: 1.0 for poisonous, 0.0 for edible
y = (data['class'] == 'p') * 1.0

In [90]:
y.mean(), y.shape


Out[90]:
(0.48202855736090594, (8124,))

In [91]:
X = data.drop(['class'], axis=1)

In [92]:
X = pd.get_dummies(X)
X.head()


Out[92]:
capshape_b capshape_c capshape_f capshape_k capshape_s capshape_x capsurface_f capsurface_g capsurface_s capsurface_y ... population_s population_v population_y habitat_d habitat_g habitat_l habitat_m habitat_p habitat_u habitat_w
1971 0 0 1 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 0 0
6654 0 0 1 0 0 0 0 0 1 0 ... 0 1 0 0 0 1 0 0 0 0
5606 0 0 0 0 0 1 0 0 0 1 ... 0 1 0 0 0 1 0 0 0 0
3332 0 0 1 0 0 0 0 0 0 1 ... 0 0 1 1 0 0 0 0 0 0
6988 0 0 1 0 0 0 0 0 1 0 ... 0 1 0 0 0 1 0 0 0 0

5 rows × 62 columns


In [93]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [94]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X, y, cv=10, scoring='accuracy')).describe()


Out[94]:
count    10.000000
mean      0.996678
std       0.002094
min       0.993850
25%       0.995386
50%       0.996310
75%       0.998459
max       1.000000
dtype: float64

PCA


In [107]:
from sklearn.decomposition import PCA

In [108]:
X_ = PCA(n_components=8).fit_transform(X)

In [110]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()


Out[110]:
count    10.000000
mean      0.995815
std       0.002186
min       0.992611
25%       0.994463
50%       0.996308
75%       0.996310
max       0.998770
dtype: float64

The number of principal components that can be estimated is limited by min(num_observations, num_columns), so PCA works best when there are many more observations than columns, as is the case here.
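
One common way to choose n_components is to look at the cumulative explained variance ratio; a minimal sketch, where the 95% threshold is only an illustrative choice:

In [ ]:
import numpy as np

# fit PCA with all components and find the smallest k that explains >= 95% of the variance
pca_full = PCA().fit(X)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_keep = int(np.argmax(cum_var >= 0.95)) + 1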

Other encoders


In [97]:
!pip install category_encoders


Collecting category_encoders
  Using cached https://files.pythonhosted.org/packages/f7/d3/82a4b85a87ece114f6d0139d643580c726efa45fa4db3b81aed38c0156c5/category_encoders-1.3.0-py2.py3-none-any.whl
Requirement already satisfied: scipy>=0.17.0 in c:\users\albah\anaconda3\lib\site-packages (from category_encoders) (1.1.0)
Requirement already satisfied: scikit-learn>=0.17.1 in c:\users\albah\anaconda3\lib\site-packages (from category_encoders) (0.20.1)
Requirement already satisfied: pandas>=0.20.1 in c:\users\albah\anaconda3\lib\site-packages (from category_encoders) (0.23.4)
Requirement already satisfied: patsy>=0.4.1 in c:\users\albah\anaconda3\lib\site-packages (from category_encoders) (0.5.1)
Requirement already satisfied: statsmodels>=0.6.1 in c:\users\albah\anaconda3\lib\site-packages (from category_encoders) (0.9.0)
Requirement already satisfied: numpy>=1.11.1 in c:\users\albah\anaconda3\lib\site-packages (from category_encoders) (1.15.4)
Requirement already satisfied: python-dateutil>=2.5.0 in c:\users\albah\anaconda3\lib\site-packages (from pandas>=0.20.1->category_encoders) (2.7.5)
Requirement already satisfied: pytz>=2011k in c:\users\albah\anaconda3\lib\site-packages (from pandas>=0.20.1->category_encoders) (2018.7)
Requirement already satisfied: six in c:\users\albah\anaconda3\lib\site-packages (from patsy>=0.4.1->category_encoders) (1.12.0)
Installing collected packages: category-encoders
Successfully installed category-encoders-1.3.0

In [98]:
import category_encoders as ce

Binary

Binary encoding for categorical variables: similar to one-hot encoding, but each category is first mapped to an ordinal index and that index is stored as a binary bitstring, so a feature with k categories needs only about log2(k) columns instead of k.
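
As a tiny illustration before applying it to the mushroom data, the sketch below encodes a single hypothetical 'color' column (not part of this dataset):

In [ ]:
# binary-encode one toy column to see the bitstring columns it produces
toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'yellow']})
ce.BinaryEncoder(cols=['color']).fit_transform(toy)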


In [126]:
X_ = ce.BinaryEncoder().fit_transform(data.drop(['class'], axis=1))

In [127]:
X_.head()


Out[127]:
capshape_0 capshape_1 capshape_2 capshape_3 capsurface_0 capsurface_1 capsurface_2 gillattachment_0 gillattachment_1 gillspacing_0 ... ringnumber_1 ringnumber_2 population_0 population_1 population_2 population_3 habitat_0 habitat_1 habitat_2 habitat_3
1971 0 0 0 1 0 0 1 0 1 0 ... 0 1 0 0 0 1 0 0 0 1
6654 0 0 0 1 0 1 0 0 1 1 ... 0 1 0 0 1 0 0 0 1 0
5606 0 0 1 0 0 1 1 0 1 1 ... 0 1 0 0 1 0 0 0 1 0
3332 0 0 0 1 0 1 1 0 1 1 ... 0 1 0 0 1 1 0 0 1 1
6988 0 0 0 1 0 1 0 0 1 1 ... 0 1 0 0 1 0 0 0 1 0

5 rows × 41 columns


In [128]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()


Out[128]:
count    10.000000
mean      0.997047
std       0.001852
min       0.993850
25%       0.996307
50%       0.996922
75%       0.998460
max       1.000000
dtype: float64

Feature Hashing

Feature Hashing for Large Scale Multitask Learning

https://alex.smola.org/papers/2009/Weinbergeretal09.pdf

Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case — multitask learning with hundreds of thousands of tasks.
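
The core trick is that each feature/value pair is mapped to one of a fixed number of columns with a hash function, so the output width does not grow with the number of categories. A minimal sketch of that mapping (the hash function and column count below are illustrative assumptions, not what HashingEncoder uses internally):

In [ ]:
import hashlib

def hashed_column(feature, value, n_components=8):
    # map a (feature, value) pair to one of n_components columns
    digest = hashlib.md5(f'{feature}={value}'.encode()).hexdigest()
    return int(digest, 16) % n_components

hashed_column('habitat', 'g')  # some index in [0, 8)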


In [111]:
X_ = ce.HashingEncoder(n_components=8).fit_transform(data.drop(['class'], axis=1))

In [112]:
X_.head()


Out[112]:
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7
1971 3 1 1 0 2 1 1 4
6654 1 0 3 2 3 0 1 3
5606 1 0 3 2 2 1 2 2
3332 1 1 2 1 2 3 1 2
6988 1 0 3 2 3 0 1 3

In [105]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()


Out[105]:
count    10.000000
mean      0.922451
std       0.008532
min       0.907749
25%       0.919127
50%       0.922365
75%       0.927691
max       0.937269
dtype: float64

Helmert Coding

Helmert coding compares each level of a categorical variable to the mean of the subsequent levels. For a four-level variable (Carey's example uses race), the first contrast compares the mean of the dependent variable at level 1 with the mean over levels 2, 3, and 4; the second contrast compares level 2 with the mean over levels 3 and 4; and the third contrast compares level 3 with level 4.

Gregory Carey (2003). Coding Categorical Variables

http://psych.colorado.edu/~carey/courses/psyc5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
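
To see the contrast pattern concretely, the sketch below Helmert-encodes a single hypothetical four-level column:

In [ ]:
# Helmert-encode a four-level toy column to inspect the contrast matrix
toy = pd.DataFrame({'level': ['a', 'b', 'c', 'd']})
ce.HelmertEncoder(cols=['level']).fit_transform(toy)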


In [113]:
X_ = ce.HelmertEncoder().fit_transform(data.drop(['class'], axis=1))

In [114]:
X_.head()


Out[114]:
intercept capshape_0 capshape_1 capshape_2 capshape_3 capshape_4 capsurface_0 capsurface_1 capsurface_2 gillattachment_0 ... population_1 population_2 population_3 population_4 habitat_0 habitat_1 habitat_2 habitat_3 habitat_4 habitat_5
1971 1 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 ... -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
6654 1 -1.0 -1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 -1.0 ... -1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 -1.0 -1.0 -1.0
5606 1 1.0 -1.0 -1.0 -1.0 -1.0 0.0 2.0 -1.0 -1.0 ... -1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 -1.0 -1.0 -1.0
3332 1 -1.0 -1.0 -1.0 -1.0 -1.0 0.0 2.0 -1.0 -1.0 ... 2.0 -1.0 -1.0 -1.0 0.0 2.0 -1.0 -1.0 -1.0 -1.0
6988 1 -1.0 -1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 -1.0 ... -1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 -1.0 -1.0 -1.0

5 rows × 50 columns


In [115]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()


Out[115]:
count    10.000000
mean      0.996677
std       0.002324
min       0.992611
25%       0.995387
50%       0.996922
75%       0.998460
max       1.000000
dtype: float64

Ordinal


In [122]:
X_ = ce.OrdinalEncoder().fit_transform(data.drop(['class'], axis=1))

In [123]:
X_.head()


Out[123]:
capshape capsurface gillattachment gillspacing gillcolor stalkshape stalksurfacebelowring stalkcolorbelowring veiltype veilcolor ringnumber population habitat
1971 1 1 1 1 1 1 1 1 1 1 1 1 1
6654 1 2 1 2 2 1 2 2 1 1 1 2 2
5606 2 3 1 2 2 1 2 2 1 1 1 2 2
3332 1 3 1 2 3 1 2 2 1 1 1 3 3
6988 1 2 1 2 2 1 2 2 1 1 1 2 2

In [124]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()


Out[124]:
count    10.000000
mean      0.997047
std       0.001852
min       0.993850
25%       0.996307
50%       0.996922
75%       0.998460
max       1.000000
dtype: float64

Polynomial Coding

Polynomial contrast coding for the encoding of categorical features

Orthogonal polynomial coding is a form of trend analysis: it looks for the linear, quadratic, and cubic trends in the categorical variable. This type of coding should be used only with an ordinal variable whose levels are equally spaced; examples of such a variable might be income or education.
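
A minimal sketch on a hypothetical four-level ordinal column (the income bands are made up for illustration):

In [ ]:
# polynomial-encode an ordinal toy column to see its linear, quadratic and cubic contrasts
toy = pd.DataFrame({'income_band': ['low', 'mid', 'high', 'top']})
ce.PolynomialEncoder(cols=['income_band']).fit_transform(toy)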


In [119]:
X_ = ce.PolynomialEncoder().fit_transform(data.drop(['class'], axis=1))

In [120]:
X_.head()


Out[120]:
intercept capshape_0 capshape_1 capshape_2 capshape_3 capshape_4 capsurface_0 capsurface_1 capsurface_2 gillattachment_0 ... population_1 population_2 population_3 population_4 habitat_0 habitat_1 habitat_2 habitat_3 habitat_4 habitat_5
1971 1 -0.597614 0.545545 -0.372678 0.188982 -0.062994 -0.670820 0.5 -0.223607 -0.707107 ... 0.545545 -0.372678 0.188982 -0.062994 -0.566947 5.455447e-01 -0.408248 0.241747 -0.109109 0.032898
6654 1 -0.597614 0.545545 -0.372678 0.188982 -0.062994 -0.223607 -0.5 0.670820 -0.707107 ... -0.109109 0.521749 -0.566947 0.314970 -0.377964 9.521795e-17 0.408248 -0.564076 0.436436 -0.197386
5606 1 -0.358569 -0.109109 0.521749 -0.566947 0.314970 0.223607 -0.5 -0.670820 -0.707107 ... -0.109109 0.521749 -0.566947 0.314970 -0.377964 9.521795e-17 0.408248 -0.564076 0.436436 -0.197386
3332 1 -0.597614 0.545545 -0.372678 0.188982 -0.062994 0.223607 -0.5 -0.670820 -0.707107 ... -0.436436 0.298142 0.377964 -0.629941 -0.188982 -3.273268e-01 0.408248 0.080582 -0.545545 0.493464
6988 1 -0.597614 0.545545 -0.372678 0.188982 -0.062994 -0.223607 -0.5 0.670820 -0.707107 ... -0.109109 0.521749 -0.566947 0.314970 -0.377964 9.521795e-17 0.408248 -0.564076 0.436436 -0.197386

5 rows × 50 columns


In [121]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()


Out[121]:
count    10.000000
mean      0.996677
std       0.002324
min       0.992611
25%       0.995387
50%       0.996922
75%       0.998460
max       1.000000
dtype: float64
