Some initial imports:
In [ ]:
    
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
    
In [ ]:
    
data = pd.read_csv('../../data/all_data.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)
    
Time to deal with the issues we found previously.
First, let's drop the duplicated rows (rows whose column values are all the same; check the YPUQAPSOYJ row above). The drop_duplicates() method can help us with that by keeping only the first occurrence of each duplicated row.
In [ ]:
    
# True for every row that exactly repeats an earlier row
mask_duplicated = data.duplicated(keep='first')
mask_duplicated.head(10)
    
In [ ]:
    
data = data.drop_duplicates(keep='first')
print('Our dataset has now %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(10)
    
You could also consider as duplicates the rows that share the same age only, via data.drop_duplicates(subset=['age'], keep='first') (note that drop_duplicates() looks at column values only, never at the index); in our case it happens to lead to the same result. Also note that it is generally not a recommended programming practice to use the argument inplace=True (e.g., data.drop_duplicates(subset=['age'], keep='first', inplace=True)), as it may lead to unexpected results.
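A minimal sketch of that subset-based variant, assigned to a new variable (data_by_age, a name chosen here for illustration) so our dataset is untouched:
In [ ]:
    
# sketch: treat rows as duplicated when only their 'age' matches
data_by_age = data.drop_duplicates(subset=['age'], keep='first')
print('Deduplicating on age alone keeps %d rows.' % data_by_age.shape[0])
    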
Next up: missing values. They are one of the major, if not the biggest, data problems we are faced with. There are several ways to deal with them, e.g.:
In [ ]:
    
missing_data = data.isnull()
print('Number of missing values (NaN) per column/feature:')
print(missing_data.sum())
print('And we currently have %d rows.' % data.shape[0])
    
That is not bad enough to justify dropping a whole column/feature because of its missing values. Nevertheless, if we wanted to, the action would be data.drop('age', axis=1) (see the sketch after the next cell). The missing_data variable is our mask for the missing values:
In [ ]:
    
missing_data.head(8)
    
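Here is the column-dropping sketch mentioned above, again on a new variable (data_no_age, a name chosen for illustration), since we actually want to keep the age:
In [ ]:
    
# sketch only: drop the whole 'age' column (axis=1 selects columns);
# drop() returns a new DataFrame, so 'data' itself is untouched
data_no_age = data.drop('age', axis=1)
print('Without age we would be left with %d columns.' % data_no_age.shape[1])
    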
One option is to simply drop every row that contains a missing value, which can be done with dropna(), for instance:
In [ ]:
    
data_aux = data.dropna(how='any')
print('Dataset now with %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
    
Another option is to fill the missing values with some constant, which can be done with fillna(), for instance:
In [ ]:
    
data_aux = data.fillna(value=0)
print('Dataset has %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
    
So, what happened to our dataset? Let's take a look at the rows where we had missing values before:
In [ ]:
    
data_aux[missing_data['age']]
    
In [ ]:
    
data_aux[missing_data['height']]
    
In [ ]:
    
data_aux[missing_data['gender']]
    
It looks like what we did was not the most appropriate: for instance, we created a new category (the value 0) in the gender column:
In [ ]:
    
data_aux['gender'].value_counts()
    
A better strategy is to fill each column with a suitable statistic. For the height, let's use the mean; for the age, the median:
In [ ]:
    
# fill the missing heights with the mean of the column
data['height'] = data['height'].replace(np.nan, data['height'].mean())
data[missing_data['height']]
    
In [ ]:
    
# fill the missing ages with the median of the column
data.loc[missing_data['age'], 'age'] = data['age'].median()
data[missing_data['age']]
    
Finally, the gender column. Let's first check which values it holds:
In [ ]:
    
# dropna=False makes value_counts() also count the NaN entries
data['gender'].value_counts(dropna=False)
    
Let's replace MALE with male to harmonize this feature.
In [ ]:
    
mask = data['gender'] == 'MALE'
data.loc[mask, 'gender'] = 'male'
# validate we don't have MALE:
data['gender'].value_counts(dropna=False)
    
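As a more general alternative, a quick sketch using str.lower() (assuming the non-missing entries are all strings): it harmonizes every mixed-case variant in one call, and it propagates NaN, so the missing values stay missing:
In [ ]:
    
# sketch: lower-case all non-missing gender entries in one go
data['gender'].str.lower().value_counts(dropna=False)
    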
Now we don't have the MALE entry anymore. Let us fill the missing values with the mode:
In [ ]:
    
the_mode = data['gender'].mode()
# note that mode() returns a Series (there can be ties), hence the [0] below
the_mode
    
In [ ]:
    
# fill the missing genders with the most frequent value (the mode)
data['gender'] = data['gender'].replace(np.nan, data['gender'].mode()[0])
data[missing_data['gender']]
    
Let's confirm that we have no missing values left:
In [ ]:
    
data.isnull().sum()
    
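As a recap, the three column-wise fills above could also be written as a single fillna() call with a per-column dictionary; a sketch for reference (at this point it is a no-op, since everything is already filled):
In [ ]:
    
# sketch: one fillna() call with a per-column fill value
fill_values = {'height': data['height'].mean(),
               'age': data['age'].median(),
               'gender': data['gender'].mode()[0]}
data.fillna(value=fill_values).isnull().sum()
    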
One could also make use of scikit-learn's imputer (sklearn.preprocessing.Imputer in older releases, sklearn.impute.SimpleImputer nowadays) to fill with the mean, median, or most frequent value ;)
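A minimal sketch with the current API (assuming scikit-learn >= 0.20); our columns are already filled at this point, so it is shown for reference only:
In [ ]:
    
from sklearn.impute import SimpleImputer

# sketch: fill the numeric columns with their median
imputer = SimpleImputer(strategy='median')
data[['age', 'height']] = imputer.fit_transform(data[['age', 'height']])
data.isnull().sum()
    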