Building Good Training Sets – Data Preprocessing

It is important to examine and preprocess a dataset before feeding it to a learning algorithm. In this notebook, I will go through some essential data preprocessing techniques including:

• Removing and imputing missing values from the dataset
• Getting categorical data into shape for machine learning algorithms
• Selecting relevant features for model construction

Dealing with missing data


In [26]:
import pandas as pd
from io import StringIO

# create an example DataFrame with missing values from a CSV-formatted string
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,,'''

data = pd.read_csv(StringIO(csv_data))

In [3]:
# check the number of missing values in each column
data.isnull().sum()


Out[3]:
A    0
B    0
C    2
D    1
dtype: int64
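
The same isnull() mask can also be summed across columns to count missing values per row, and the underlying NumPy array that scikit-learn estimators ultimately operate on is available via the values attribute. A small sketch, not part of the original run:

data.isnull().sum(axis=1)   # number of missing values in each row
data.values                 # underlying NumPy array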

In [18]:
# Another example of a DataFrame with missing data,
# created from a dictionary; each key is a column name
import numpy as np

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
        'age': [42, np.nan, 36, 24, 73],
        'sex': ['m', np.nan, 'f', 'm', 'f'],
        'preTestScore': [4, np.nan, np.nan, 2, 3],
        'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df


Out[18]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
1 NaN NaN NaN NaN NaN NaN
2 Tina Ali 36.0 f NaN NaN
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0
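
Before deciding whether to drop or impute, it is worth repeating the per-column count of missing values on this frame as well; a small sketch whose output is not shown here:

df.isnull().sum()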

Eliminating samples or features with missing values


In [10]:
# default (how='any'): drop all rows that contain any NaN
df.dropna()


Out[10]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0

In [11]:
# Only drop rows where all cells in the row are NaN
df.dropna(how='all')


Out[11]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
2 Tina Ali 36.0 f NaN NaN
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0

In [12]:
# Drop columns that contain any missing values (seldom used)
df.dropna(axis=1)


Out[12]:
0
1
2
3
4
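
Dropping every column that has any missing value is rarely what we want here, since each column contains at least one NaN. A less aggressive variant (a sketch, not part of the original run) keeps only the columns with a minimum number of non-NaN values:

# keep only columns with at least four non-NaN values
# (this drops 'preTestScore' and 'postTestScore', which have three each)
df.dropna(axis=1, thresh=4)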

In [14]:
# Drop rows with fewer than five non-NaN values (mostly useful for time series data)
df.dropna(thresh=5)


Out[14]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0
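
To see why rows 1 and 2 were dropped, the per-row count of non-NaN values can be inspected directly; a small sketch, not part of the original run:

# non-NaN values per row; rows with fewer than five are removed by thresh=5
df.notnull().sum(axis=1)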

In [20]:
# Only drop rows where NaN appears in specific columns (here: 'preTestScore' and 'postTestScore')
df.dropna(subset=['preTestScore', 'postTestScore'])


Out[20]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0

Filling/imputing missing values


In [21]:
# Fill in missing data with zeros
df.fillna(0)


Out[21]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
1 0 0 0.0 0 0.0 0.0
2 Tina Ali 36.0 f 0.0 0.0
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0
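
fillna also accepts a dictionary mapping column names to fill values, which allows a different replacement per column. A sketch, not part of the original run (the 'unknown' placeholder is purely illustrative):

df.fillna({'first_name': 'unknown', 'last_name': 'unknown',
           'preTestScore': 0, 'postTestScore': 0})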

In [22]:
# Fill in missing values in preTestScore with the column's mean
df['preTestScore'] = df['preTestScore'].fillna(df['preTestScore'].mean())
df


Out[22]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
1 NaN NaN NaN NaN 3.0 NaN
2 Tina Ali 36.0 f 3.0 NaN
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0
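
The same pattern works with any summary statistic; for skewed score distributions the median is often the more robust choice. A sketch of an alternative to the cell above (it would replace that cell rather than follow it, since the NaNs are already filled):

df['preTestScore'] = df['preTestScore'].fillna(df['preTestScore'].median())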

In [33]:
# Fill in missing values in postTestScore with each sex's mean postTestScore
df['postTestScore'] = df['postTestScore'].fillna(df.groupby('sex')['postTestScore'].transform('mean'))
df


Out[33]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
1 NaN NaN NaN NaN 3.0 NaN
2 Tina Ali 36.0 f 3.0 70.0
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0
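
To see which values the transform actually used, the per-group means can be inspected directly. Note that row 1 keeps its NaN because its sex is itself missing, so there is no group mean to fall back on. A small sketch, not part of the original run:

# mean postTestScore per sex, i.e. the fill values used by transform('mean')
df.groupby('sex')['postTestScore'].mean()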

In [35]:
# Select the rows of df where age is not NaN and sex is not NaN

df[df['age'].notnull() & df['sex'].notnull()]


Out[35]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
2 Tina Ali 36.0 f 3.0 70.0
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0
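
The same selection can be written with dropna and a subset, which reads closer to the earlier examples; a sketch, not part of the original run:

# equivalent to the boolean mask above
df.dropna(subset=['age', 'sex'])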

In [40]:
# Fill in missing values with the mean of the corresponding row
# (without inplace=True this returns a new frame, so `data` keeps its NaNs
# for the scikit-learn example below)
fill_value = pd.DataFrame({col: data.mean(axis=1) for col in data.columns})
data.fillna(fill_value)


Out[40]:
A B C D
0 1.0 2.0 3.000000 4.0
1 5.0 6.0 6.333333 8.0
2 10.0 11.0 10.500000 10.5
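
An equivalent formulation applies the fill row by row, which avoids building the helper frame; a sketch, assuming data still contains its original NaNs:

# fill each row's NaNs with that row's mean
data.apply(lambda row: row.fillna(row.mean()), axis=1)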

In [31]:
# Another method: the SimpleImputer class from scikit-learn
# (the successor of the deprecated Imputer class); it only works with numerical data
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(data.values)
imputed_data


Out[31]:
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,   3.,   8.],
       [ 10.,  11.,   3.,   6.]])
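
Other strategies ('median', 'most_frequent', 'constant') plug in the same way, and the imputed array can be wrapped back into a DataFrame. A sketch, not part of the original run (imr_median and imputed_df are just illustrative names):

imr_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_df = pd.DataFrame(imr_median.fit_transform(data), columns=data.columns)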