Building Good Training Sets – Data Preprocessing

It is important to examine and preprocess a dataset before feeding it to a learning algorithm. In this notebook, I will go through some essential data preprocessing techniques including:

• Removing and imputing missing values from the dataset
• Getting categorical data into shape for machine learning algorithms
• Selecting relevant features for the model construction

Dealing with missing data



In [26]:

    
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,,'''

data = pd.read_csv(StringIO(csv_data))



In [3]:

    
## checking for missing data
df.isnull().sum()









    Out[3]:





A    0
B    0
C    2
D    1
dtype: int64



In [18]:

    
# Another example of a dataframe with missing data
# creating dataframe from dictionary; key is the colume name
import numpy as np

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
        'age': [42, np.nan, 36, 24, 73],
        'sex': ['m', np.nan, 'f', 'm', 'f'],
        'preTestScore': [4, np.nan, np.nan, 2, 3],
        'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df









    Out[18]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      25.0
    
    
      2
      Tina
      Ali
      36.0
      f
      3.0
      NaN
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0

Eliminating samples or features with missing values



In [10]:

    
# default: drop all rows containing NAN
df.dropna()









    Out[10]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0



In [11]:

    
# Only drop rows where all cells in that row is NA
df.dropna(how='all')









    Out[11]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      2
      Tina
      Ali
      36.0
      f
      NaN
      NaN
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0



In [12]:

    
# Drop all columns if they contain missing values (seldom used)
df.dropna(axis=1)



In [14]:

    
# Drop rows that contain less than five observations, mostly useful for time series
df.dropna(thresh=5)









    Out[14]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0



In [20]:

    
# only drop rows where NaN appear in specific columns (here: 'C')
df.dropna(subset=['preTestScore', 'postTestScore'])









    Out[20]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0

Filling/imputing missing values



In [21]:

    
# Fill in missing data with zeros
df.fillna(0)









    Out[21]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      1
      0
      0
      0.0
      0
      0.0
      25.0
    
    
      2
      Tina
      Ali
      36.0
      f
      3.0
      0.0
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0



In [22]:

    
# Fill in missing in preTestScore with the mean value of preTestScore
df['preTestScore'].fillna(df['preTestScore'].mean(), inplace=True)
df









    Out[22]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      1
      NaN
      NaN
      NaN
      NaN
      3.0
      25.0
    
    
      2
      Tina
      Ali
      36.0
      f
      3.0
      NaN
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0



In [33]:

    
# Fill in missing in postTestScore with each sex's mean value of postTestScore
df['postTestScore'].fillna(df.groupby('sex')['postTestScore'].transform('mean'), inplace=True)
df









    Out[33]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      1
      NaN
      NaN
      NaN
      NaN
      3.0
      25.0
    
    
      2
      Tina
      Ali
      36.0
      f
      3.0
      70.0
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0



In [35]:

    
# Select the rows of df where age is not NaN and sex is not NaN

df[df['age'].notnull() & df['sex'].notnull()]









    Out[35]:






  
    
      
      first_name
      last_name
      age
      sex
      preTestScore
      postTestScore
    
  
  
    
      0
      Jason
      Miller
      42.0
      m
      4.0
      25.0
    
    
      2
      Tina
      Ali
      36.0
      f
      3.0
      70.0
    
    
      3
      Jake
      Milner
      24.0
      m
      2.0
      62.0
    
    
      4
      Amy
      Cooze
      73.0
      f
      3.0
      70.0



In [40]:

    
# Fill in missing data with row mean
fill_value = pd.DataFrame({col: data.mean(axis=1) for col in data.columns})
data.fillna(fill_value, inplace=True)
data



In [31]:

    
# Another method using the Imputer class from scikit-learn, only work with numerical dataframe
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean',axis=0)
imr = imr.fit_transform(data)
imr









    Out[31]:





array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,   3.,   8.],
       [ 10.,  11.,   3.,   6.]])

	first_name	last_name	age	sex	preTestScore	postTestScore
0	Jason	Miller	42.0	m	4.0	25.0
1	NaN	NaN	NaN	NaN	NaN	25.0
2	Tina	Ali	36.0	f	3.0	NaN
3	Jake	Milner	24.0	m	2.0	62.0
4	Amy	Cooze	73.0	f	3.0	70.0