Some initial imports:
In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
In [ ]:
data = pd.read_csv('../../data/all_data.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)
Time to deal with the issues previously found.
First, let us drop the duplicated rows (rows where all column values are the same); check the YPUQAPSOYJ row above. The drop_duplicates method will help us with that by keeping only the first of the duplicated rows.
In [ ]:
mask_duplicated = data.duplicated(keep='first')
mask_duplicated.head(10)
In [ ]:
data = data.drop_duplicates(keep='first')
print('Our dataset now has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(10)
You could also consider as duplicates rows with the same index and the same age only, by setting data.drop_duplicates(subset=['age'], keep='first'), but in our case it would lead to the same result. Note that in general it is not a recommended programming practice to use the argument inplace=True (e.g., data.drop_duplicates(subset=['age'], keep='first', inplace=True)), as it may lead to unexpected results.
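As a minimal sketch of the safer pattern, assigning the result instead of mutating in place (the age column is the one from our dataset):
In [ ]:
# sketch: treat rows sharing the same 'age' value as duplicates,
# keeping only the first occurrence; assign the result rather than using inplace=True
data_dedup = data.drop_duplicates(subset=['age'], keep='first')
data_dedup.shape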
Missing values were one of the major, if not the biggest, data problems we found. There are several ways to deal with them. First, let us count how many missing values we have per column/feature:
In [ ]:
missing_data = data.isnull()
print('Number of missing values (NaN) per column/feature:')
print(missing_data.sum())
print('And we currently have %d rows.' % data.shape[0])
That amount of missing values is not terrible to the point of fully dropping a column/feature. Nevertheless, if we wanted to do that, the action would be data.drop('age', axis=1), as sketched below.
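A minimal sketch of that action, on an auxiliary variable so our dataset stays intact:
In [ ]:
# sketch: drop the 'age' column entirely; axis=1 selects columns (axis=0 would drop rows)
data_no_age = data.drop('age', axis=1)
data_no_age.columns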
The missing_data variable is our mask for the missing values:
In [ ]:
missing_data.head(8)
Dropping every row that contains at least one missing value can be done with dropna(), for instance:
In [ ]:
data_aux = data.dropna(how='any')
print('The dataset now has %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
Filling all the missing values with a constant (here, 0) can be done with fillna(), for instance:
In [ ]:
data_aux = data.fillna(value=0)
print('Dataset has %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
So, what happened with our dataset? Let's take a look at where we had missing values before:
In [ ]:
data_aux[missing_data['age']]
In [ ]:
data_aux[missing_data['height']]
In [ ]:
data_aux[missing_data['gender']]
Looks like what we did was not the most appropriate. For instance, we created a new category (0) in the gender column:
In [ ]:
data_aux['gender'].value_counts()
A better strategy is to fill each feature with a suitable statistic. Let us replace the missing height values with the mean of the column:
In [ ]:
data['height'] = data['height'].replace(np.nan, data['height'].mean())
data[missing_data['height']]
Similarly, let us fill the missing age values with the median:
In [ ]:
data.loc[missing_data['age'], 'age'] = data['age'].median()
data[missing_data['age']]
And for the gender column, let us look at the value counts, including the missing values:
In [ ]:
data['gender'].value_counts(dropna=False)
Let's replace MALE with male to harmonize our feature.
In [ ]:
mask = data['gender'] == 'MALE'
data.loc[mask, 'gender'] = 'male'
# validate we don't have MALE:
data['gender'].value_counts(dropna=False)
Now we don't have the MALE entry anymore. Let us fill the missing values with the mode:
In [ ]:
the_mode = data['gender'].mode()
# note that mode() returns a Series, hence the [0] indexing below
the_mode
In [ ]:
data['gender'] = data['gender'].replace(np.nan, data['gender'].mode()[0])
data[missing_data['gender']]
In [ ]:
data.isnull().sum()
One could also make use of sklearn.preprocessing.Imputer (superseded by sklearn.impute.SimpleImputer in recent scikit-learn versions) to fill with the mean, median, or most frequent value ;)
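A minimal sketch with SimpleImputer, assuming a recent scikit-learn version; strategy='most_frequent' mirrors the mode fill we did by hand (here on an auxiliary copy):
In [ ]:
from sklearn.impute import SimpleImputer

# sketch: fill missing values with the most frequent value per column;
# SimpleImputer expects 2D input, hence the double brackets
imputer = SimpleImputer(strategy='most_frequent')
data_aux = data.copy()
data_aux[['gender']] = imputer.fit_transform(data_aux[['gender']])
data_aux['gender'].value_counts(dropna=False)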