It is important to examine and preprocess a dataset before feeding it to a learning algorithm. In this notebook, I will go through some essential data preprocessing techniques including:
• Removing and imputing missing values from the dataset
• Getting categorical data into shape for machine learning algorithms
• Selecting relevant features for model construction
In [26]:
import pandas as pd
from io import StringIO
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,,'''
data = pd.read_csv(StringIO(csv_data))
In [3]:
# Count the missing values in each column
data.isnull().sum()
Out[3]:
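Beyond the per-column count, it can help to look at the share of missing values and at the rows that are affected. The following is a minimal sketch that rebuilds the same small CSV frame so it runs on its own; the names `missing_share` and `rows_with_gaps` are illustrative and not part of the original notebook.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,,'''
data = pd.read_csv(StringIO(csv_data))

# Number of missing values per column
missing_count = data.isnull().sum()

# Fraction of missing values per column (mean of the boolean mask)
missing_share = data.isnull().mean()

# Rows that contain at least one missing value
rows_with_gaps = data[data.isnull().any(axis=1)]

print(missing_count)
print(missing_share)
print(rows_with_gaps)
```

Looking at the fraction rather than the raw count makes it easier to decide whether a column is worth imputing or should simply be dropped.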
In [18]:
# Another example of a DataFrame with missing data
# Create the DataFrame from a dictionary; the keys are the column names
import numpy as np
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
'age': [42, np.nan, 36, 24, 73],
'sex': ['m', np.nan, 'f', 'm', 'f'],
'preTestScore': [4, np.nan, np.nan, 2, 3],
'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df
Out[18]:
In [10]:
# Default: drop every row that contains at least one NaN
df.dropna()
Out[10]:
In [11]:
# Only drop rows in which every cell is NaN
df.dropna(how='all')
Out[11]:
In [12]:
# Drop every column that contains at least one missing value (seldom used)
df.dropna(axis=1)
Out[12]:
In [14]:
# Keep only rows with at least five non-NaN values; mostly useful for time series
df.dropna(thresh=5)
Out[14]:
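The `thresh` argument is easy to misread: it is the minimum number of non-missing values a row needs in order to be kept. Here is a small sketch on the same test-score data, rebuilt so it runs on its own; the names `non_missing`, `kept_5` and `kept_3` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
    'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
    'age': [42, np.nan, 36, 24, 73],
    'sex': ['m', np.nan, 'f', 'm', 'f'],
    'preTestScore': [4, np.nan, np.nan, 2, 3],
    'postTestScore': [25, np.nan, np.nan, 62, 70],
})

# Number of non-missing values in each row
non_missing = df.notnull().sum(axis=1)

# Rows need at least 5 non-missing values: Tina (4 values) is dropped
kept_5 = df.dropna(thresh=5)

# Rows need at least 3 non-missing values: Tina stays, the empty row goes
kept_3 = df.dropna(thresh=3)

print(non_missing)
print(kept_5.index.tolist())
print(kept_3.index.tolist())
```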
In [20]:
# Only drop rows where NaN appears in specific columns (here: 'preTestScore' and 'postTestScore')
df.dropna(subset=['preTestScore', 'postTestScore'])
Out[20]:
In [21]:
# Fill in missing data with zeros
df.fillna(0)
Out[21]:
In [22]:
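# Fill missing preTestScore values with the mean of preTestScore.
# Assign the result back: fillna(inplace=True) on a selected column may not update df in recent pandas versions.
df['preTestScore'] = df['preTestScore'].fillna(df['preTestScore'].mean())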
df
Out[22]:
In [33]:
# Fill missing postTestScore values with the mean postTestScore of the same sex.
# Rows whose sex is NaN belong to no group and stay NaN.
df['postTestScore'] = df['postTestScore'].fillna(df.groupby('sex')['postTestScore'].transform('mean'))
df
Out[33]:
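Group-wise imputation leaves a gap wherever the grouping key itself is missing. The sketch below shows this on the two relevant columns and then applies one possible fallback, the overall mean of the observed scores. The fallback is my assumption for illustration and not something the original notebook does; `overall_mean`, `group_filled` and `fully_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sex': ['m', np.nan, 'f', 'm', 'f'],
    'postTestScore': [25, np.nan, np.nan, 62, 70],
})

# Mean of the observed scores, computed before anything is filled in
overall_mean = df['postTestScore'].mean()

# Per-group mean aligned to the original rows; NaN where 'sex' is missing
group_mean = df.groupby('sex')['postTestScore'].transform('mean')

# Step 1: fill from the group mean (row 1 stays NaN, its group is unknown)
group_filled = df['postTestScore'].fillna(group_mean)

# Step 2 (assumed fallback): fill what is left with the overall mean
fully_filled = group_filled.fillna(overall_mean)

print(group_filled)
print(fully_filled)
```

Whether a fallback like this is appropriate depends on the data; dropping the rows that have no group is an equally reasonable choice.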
In [35]:
# Select the rows of df where age is not NaN and sex is not NaN
df[df['age'].notnull() & df['sex'].notnull()]
Out[35]:
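The boolean-mask selection above should give the same rows as `dropna` restricted to those two columns. This is a small self-contained check on a reduced copy of the data; the names `by_mask` and `by_dropna` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
    'age': [42, np.nan, 36, 24, 73],
    'sex': ['m', np.nan, 'f', 'm', 'f'],
    'preTestScore': [4, np.nan, np.nan, 2, 3],
})

# Boolean mask: True where both 'age' and 'sex' are present
mask = df['age'].notnull() & df['sex'].notnull()
by_mask = df[mask]

# The same selection expressed with dropna on a column subset
by_dropna = df.dropna(subset=['age', 'sex'])

print(by_mask.equals(by_dropna))
print(by_mask)
```

The mask form is handy when the condition has to be combined with other filters, while `dropna(subset=...)` is shorter when missing values are the only criterion.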
In [40]:
# Fill each missing value with the mean of its row (build a frame of row means with the same columns as data)
fill_value = pd.DataFrame({col: data.mean(axis=1) for col in data.columns})
data.fillna(fill_value, inplace=True)
data
Out[40]:
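A slightly shorter way to get the same row-mean fill is to apply `fillna` column by column and let pandas align the row means on the index. This is a sketch under the assumption that every row has at least one observed value; `row_mean` and `filled` are illustrative names.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,,'''
data = pd.read_csv(StringIO(csv_data))

# Mean of each row; NaN cells are skipped by default
row_mean = data.mean(axis=1)

# Each column is a Series indexed by row label, so fillna(row_mean)
# replaces a missing cell with the mean of the row it sits in
filled = data.apply(lambda col: col.fillna(row_mean))

print(filled)
```

Unlike the in-place version, this returns a new frame and leaves `data` untouched.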
In [31]:
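# Another method: the SimpleImputer class from scikit-learn (works only on numerical data).
# The older sklearn.preprocessing.Imputer was removed in scikit-learn 0.22.
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean')  # impute with column means
imputed_data = imr.fit_transform(data)  # returns a NumPy array, not a DataFrame
imputed_data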
Out[31]:
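For comparison, the same column-mean imputation can be written in pandas alone, which keeps the column names and the index. This is a minimal sketch on a fresh copy of the CSV data; `col_filled` is an illustrative name.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,,'''
data = pd.read_csv(StringIO(csv_data))

# data.mean() gives one mean per column; fillna matches them by column name
col_filled = data.fillna(data.mean())

print(col_filled)
```

The scikit-learn imputer is still the better fit when the means learned on a training set have to be reused on new data, since it stores them after fitting.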