by Alejandro Correa Bahnsen, Iván Torroledo, and Jesus Solano
version 1.5, February 2019
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham.
In [2]:
import pandas as pd
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/titanic.csv.zip'
titanic = pd.read_csv(url, index_col=0)
titanic.head()
Out[2]:
In [2]:
# check for missing values
titanic.isnull().sum()
Out[2]:
One possible strategy is to drop missing values:
In [3]:
# drop rows with any missing values
titanic.dropna().shape
Out[3]:
In [4]:
# drop rows where Age is missing
titanic[titanic.Age.notnull()].shape
Out[4]:
Sometimes a better strategy is to impute missing values:
In [5]:
# mean Age
titanic.Age.mean()
Out[5]:
In [6]:
# median Age
titanic.Age.median()
Out[6]:
In [7]:
titanic.loc[titanic.Age.isnull()]
Out[7]:
In [8]:
# fill missing values for Age with the median age
titanic['Age'] = titanic.Age.fillna(titanic.Age.median())
Another strategy would be to build a KNN model just to impute missing values. How would we do that?
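A hedged sketch of that idea, assuming a recent scikit-learn that provides KNNImputer (an illustration, not part of the original notebook): fill each missing Age with the mean Age of the k most similar rows, measured on the other numeric columns. Because Age was already filled with the median above, the sketch re-reads the raw data.
In [ ]:
# sketch (assumption): KNN-based imputation with scikit-learn's KNNImputer
# (requires scikit-learn >= 0.22); each missing Age becomes the mean Age of
# the 5 nearest rows, measured on the other numeric columns
from sklearn.impute import KNNImputer

titanic_raw = pd.read_csv(url, index_col=0)                  # re-read, Age was already filled above
numeric_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']   # hypothetical choice of features
imputer = KNNImputer(n_neighbors=5)
titanic_raw[numeric_cols] = imputer.fit_transform(titanic_raw[numeric_cols])
titanic_raw.Age.isnull().sum()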
If values are missing from a categorical feature, we could treat the missing values as another category. Why might that make sense?
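A minimal sketch of that approach (again an illustration, not part of the original notebook): fill the missing Embarked values with an explicit 'Missing' label before creating dummy variables, so the model can learn from the missingness itself.
In [ ]:
# sketch (assumption): treat missing Embarked values as their own category
embarked_filled = titanic.Embarked.fillna('Missing')
pd.get_dummies(embarked_filled, prefix='Embarked').head()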
How do we choose between all of these strategies?
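One reasonable (hedged) answer: treat the choice as a modeling decision and compare strategies empirically, for example by cross-validating the same model under each imputation strategy while keeping everything else fixed. A rough sketch, using a hypothetical set of features:
In [ ]:
# sketch (assumption): compare imputation strategies by cross-validated accuracy
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

titanic_raw = pd.read_csv(url, index_col=0)
y_cv = titanic_raw.Survived
for label, fill in [('median', titanic_raw.Age.median()), ('mean', titanic_raw.Age.mean())]:
    X_cv = titanic_raw[['Pclass', 'Parch']].assign(Age=titanic_raw.Age.fillna(fill))
    print(label, cross_val_score(LogisticRegression(solver='liblinear'), X_cv, y_cv, cv=5).mean())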
In [10]:
titanic.head(10)
Out[10]:
In [11]:
# encode Sex_Female feature
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
In [12]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
# drop the first dummy column so the remaining dummies are not redundant (the dropped level becomes the baseline)
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)
In [13]:
titanic.head(1)
Out[13]:
In [14]:
# define X and y
feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
# make predictions for testing set
y_pred_class = logreg.predict(X_test)
# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
In [87]:
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/agaricus-lepiota.zip'
data = pd.read_csv(url, index_col=None)
data = data.drop(['capcolor', 'stalkcolorabovering', 'odor', 'gillsize', 'sporeprintcolor', 'stalksurfaceabovering',
'ringtype', 'stalkroot', 'bruises'], axis=1)
data = data.sample(frac=1, random_state=42)
data.head()
Out[87]:
In [88]:
data.columns
Out[88]:
Attribute Information: (classes: edible=e, poisonous=p)
1. cap-shape: bell=b,conical=c,convex=x,flat=f,
knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,
pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,
musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,
green=r,orange=o,pink=p,purple=u,red=e,
white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,
rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,
none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,
orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n,
scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,
urban=u,waste=w,woods=d
In [89]:
y = (data['class'] == 'p') * 1.0
In [90]:
y.mean(), y.shape
Out[90]:
In [91]:
X = data.drop(['class'], axis=1)
In [92]:
X = pd.get_dummies(X)
X.head()
Out[92]:
In [93]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
In [94]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X, y, cv=10, scoring='accuracy')).describe()
Out[94]:
In [107]:
from sklearn.decomposition import PCA
In [108]:
X_ = PCA(n_components=8).fit_transform(X)
In [110]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[110]:
Note that PCA recovers at most min(num_observations, num_columns) components, so it is most useful for dimensionality reduction when num_columns < num_observations.
In [97]:
!pip install category_encoders
In [98]:
import category_encoders as ce
In [126]:
X_ = ce.BinaryEncoder().fit_transform(data.drop(['class'], axis=1))
In [127]:
X_.head()
Out[127]:
In [128]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[128]:
Feature Hashing for Large Scale Multitask Learning
https://alex.smola.org/papers/2009/Weinbergeretal09.pdf
Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case — multitask learning with hundreds of thousands of tasks.
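A toy illustration of the mechanism (an assumption for exposition, not necessarily how the encoder is implemented internally): each category string is hashed, and the hash value modulo the number of output columns decides which column that category lands in; collisions simply share a column.
In [ ]:
# toy illustration (assumption): the hashing trick maps a category to a column index
import hashlib

def hash_column(value, n_components=8):
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % n_components

hash_column('convex'), hash_column('flat'), hash_column('bell')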
In [111]:
X_ = ce.HashingEncoder(n_components=8).fit_transform(data.drop(['class'], axis=1))
In [112]:
X_.head()
Out[112]:
In [105]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[105]:
Helmert coding compares each level of a categorical variable to the mean of the subsequent levels. Hence, the first contrast compares the mean of the dependent variable for level 1 of race with the mean of all of the subsequent levels of race (levels 2, 3, and 4), the second contrast compares the mean of the dependent variable for level 2 of race with the mean of all of the subsequent levels of race (levels 3 and 4), and the third contrast compares the mean of the dependent variable for level 3 of race with the mean of all of the subsequent levels of race (level 4).
Gregory Carey (2003). Coding Categorical Variables
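A hedged toy example of what this produces: encoding a single four-level column with HelmertEncoder yields one contrast column per comparison described above (category_encoders may also add an intercept column).
In [ ]:
# toy example (assumption): Helmert contrasts for a single 4-level column
toy = pd.DataFrame({'level': ['a', 'b', 'c', 'd']})
ce.HelmertEncoder(cols=['level']).fit_transform(toy)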
In [113]:
X_ = ce.HelmertEncoder().fit_transform(data.drop(['class'], axis=1))
In [114]:
X_.head()
Out[114]:
In [115]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[115]:
In [122]:
X_ = ce.OrdinalEncoder().fit_transform(data.drop(['class'], axis=1))
In [123]:
X_.head()
Out[123]:
In [124]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[124]:
Polynomial contrast coding for the encoding of categorical features
Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable. This type of coding system should be used only with an ordinal variable in which the levels are equally spaced. Examples of such a variable might be income or education.
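As a hedged toy example, encoding one equally spaced four-level ordinal column with PolynomialEncoder produces linear, quadratic, and cubic contrast columns:
In [ ]:
# toy example (assumption): polynomial contrasts for one 4-level ordinal column
toy = pd.DataFrame({'grade': ['low', 'medium', 'high', 'very_high']})
ce.PolynomialEncoder(cols=['grade']).fit_transform(toy)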
In [119]:
X_ = ce.PolynomialEncoder().fit_transform(data.drop(['class'], axis=1))
In [120]:
X_.head()
Out[120]:
In [121]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[121]:
In [ ]: