02-impute

This notebook shows how to remove duplicate rows and handle missing values.

Some initial imports:


In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt

Load the dataset that will be used


In [ ]:
data = pd.read_csv('../../data/all_data.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)

Dealing with the issues found

Time to deal with the issues identified previously.

1) Duplicated data

Drop the duplicated rows (rows whose values are identical in every column); see the YPUQAPSOYJ row above. Let us use drop_duplicates to help us with that, keeping only the first of the duplicated rows.


In [ ]:
mask_duplicated = data.duplicated(keep='first')
mask_duplicated.head(10)

In [ ]:
data = data.drop_duplicates(keep='first')
print('Our dataset has now %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(10)

You could also treat as duplicates rows with the same index and the same age only, via data.drop_duplicates(subset=['age'], keep='first'); in our case it would lead to the same result. Note that it is generally not a recommended programming practice to use the argument inplace=True (e.g., data.drop_duplicates(subset=['age'], keep='first', inplace=True)), as it may lead to unexpected results.
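As a minimal sketch of the subset-based variant (on a hypothetical toy frame, not our dataset), note how judging duplicates on a single column can drop rows that differ elsewhere:

```python
import pandas as pd

# Hypothetical toy frame: rows 'A' and 'B' share the same age but differ in height.
df = pd.DataFrame({'age': [25, 25, 40],
                   'height': [170, 180, 165]},
                  index=['A', 'B', 'C'])

# Full-row duplicates only: nothing is dropped, since no two rows are identical.
full = df.drop_duplicates(keep='first')

# Duplicates judged on 'age' alone: row 'B' is dropped, 'A' is kept.
by_age = df.drop_duplicates(subset=['age'], keep='first')
print(by_age)
```

This is why subset should be chosen carefully: it changes what "duplicate" means.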

2) Missing Values

This is one of the biggest data problems we face. There are several ways to deal with missing values, e.g.:

  • drop rows which contain missing values.
  • fill missing values with zero.
  • fill missing values with the mean of the column in which the missing value is located.
  • (more advanced) use decision trees to predict the missing values.
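The notebook demonstrates the first three options below. As a hedged sketch of the last, more advanced option, a model trained on the complete rows can predict the missing values of a column from the other features (hypothetical toy data, not our dataset):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy frame with one missing age.
df = pd.DataFrame({'height': [160, 170, 180, 175, 165],
                   'age': [30, 35, np.nan, 38, 31]})

known = df[df['age'].notnull()]
unknown = df[df['age'].isnull()]

# Train a tree on the complete rows, then predict ages for the missing ones.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(known[['height']], known['age'])
df.loc[df['age'].isnull(), 'age'] = tree.predict(unknown[['height']])
print(df)
```

The prediction stays within the range of the observed ages, since a regression tree predicts leaf means.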

In [ ]:
missing_data = data.isnull()
print('Number of missing values (NaN) per column/feature:')
print(missing_data.sum())
print('And we currently have %d rows.' % data.shape[0])

The amount of missing values is not large enough to justify dropping an entire column/feature. If it were, the call would be data.drop('age', axis=1). The missing_data variable is our mask for the missing values:


In [ ]:
missing_data.head(8)
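The column-drop fallback mentioned above can be sketched on a hypothetical toy frame; note that drop returns a new frame and leaves the original untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; dropping a whole feature is a last resort.
df = pd.DataFrame({'age': [25, np.nan, 40],
                   'height': [170, 180, 165]})

dropped = df.drop('age', axis=1)   # equivalently: df.drop(columns=['age'])
print(dropped.columns.tolist())
```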

Drop rows with missing values

This can be done with dropna(), for instance:


In [ ]:
data_aux = data.dropna(how='any')
print('Dataset now with %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
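dropna has a few knobs worth knowing beyond how='any'. A small sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, row 1 misses age, row 2 is all NaN.
df = pd.DataFrame({'age': [25, np.nan, np.nan],
                   'height': [170, 180, np.nan]})

any_na = df.dropna(how='any')          # drop rows with at least one NaN
all_na = df.dropna(how='all')          # drop rows where every value is NaN
enough = df.dropna(thresh=1)           # keep rows with at least 1 non-NaN value
only_age = df.dropna(subset=['age'])   # only consider NaNs in the 'age' column
print(len(any_na), len(all_na), len(enough), len(only_age))
```

how='all' and thresh are useful when a few gaps per row are acceptable but empty rows are not.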

Fill missing values with a specific value (e.g., 0)

This can be done with fillna(), for instance:


In [ ]:
data_aux = data.fillna(value=0)
print('Dataset has %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))

So, what happened with our dataset? Let's take a look where we had missing values before:


In [ ]:
data_aux[missing_data['age']]

In [ ]:
data_aux[missing_data['height']]

In [ ]:
data_aux[missing_data['gender']]

Looks like what we did was not the most appropriate. For instance, we created a new category in the gender column:


In [ ]:
data_aux['gender'].value_counts()

Fill missing values with mean/mode/median:

This is one of the most common approaches.
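A quick note on choosing between mean and median: the mean is sensitive to outliers, while the median is robust to them. A hypothetical example:

```python
import pandas as pd

# Hypothetical ages with one outlier; the outlier drags the mean upward,
# while the median stays representative of the typical values.
ages = pd.Series([22, 24, 25, 27, 90])
print(ages.mean())    # 37.6 — pulled up by the outlier
print(ages.median())  # 25.0
```

That is one reason we use the median for age below.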

height - filling missing values with the mean


In [ ]:
data['height'] = data['height'].replace(np.nan, data['height'].mean())
data[missing_data['height']]

age - filling missing values with median


In [ ]:
data.loc[missing_data['age'], 'age'] = data['age'].median()
data[missing_data['age']]

gender - filling missing values with mode

Remember that we had a small problem with this feature's data (the word MALE instead of male)? Typos are very common and can remain hidden. That is why it is so important to take a look at the data.
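Beyond fixing one known typo by hand (as we do below), a more general way to harmonize string labels is to strip whitespace and lowercase them; a sketch on hypothetical messy values:

```python
import pandas as pd

# Hypothetical messy labels: case differences and stray whitespace
# can silently split one category into several.
gender = pd.Series(['male', 'MALE', ' female', 'female'])

clean = gender.str.strip().str.lower()
print(clean.value_counts())
```

After cleaning, only the two intended categories remain.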


In [ ]:
data['gender'].value_counts(dropna=False)

Let's replace MALE with male to harmonize our feature.


In [ ]:
mask = data['gender'] == 'MALE'
data.loc[mask, 'gender'] = 'male'
# validate we don't have MALE:
data['gender'].value_counts(dropna=False)

Now we don't have the MALE entry anymore. Let us fill the missing values with the mode:


In [ ]:
the_mode = data['gender'].mode()
# note that mode() returns a Series (there can be more than one mode),
# which is why we index with [0] below
the_mode

In [ ]:
data['gender'] = data['gender'].replace(np.nan, data['gender'].mode()[0])
data[missing_data['gender']]

Final check

Always a good idea...


In [ ]:
data.isnull().sum()

One could also make use of scikit-learn's imputer (sklearn.impute.SimpleImputer in current versions, formerly sklearn.preprocessing.Imputer) to fill with the mean, median, or most frequent value ;)
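A minimal sketch of that approach on a hypothetical toy frame, assuming a scikit-learn version that provides sklearn.impute.SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric frame with gaps.
df = pd.DataFrame({'age': [25.0, np.nan, 40.0],
                   'height': [170.0, 180.0, np.nan]})

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

The imputer learns one statistic per column in fit and applies it in transform, which makes it easy to reuse the same fill values on new data.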