Diagnosing the data issues:


In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
from codefiles.class2 import get_data

The data you'll be exploring:


In [ ]:
data = get_data()

In [ ]:
data.head(10)

Duplicated data:

We seem to have a problem with some duplicated data:


In [ ]:
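print('Shape before dropping duplicates: {}'.format(data.shape))
# drop_duplicates removes rows that are identical across every column
data = data.drop_duplicates()
print('Shape after dropping duplicates: {}'.format(data.shape))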

In [ ]:
data.head(10)
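
Before dropping rows, it can help to look at what is actually duplicated. The sketch below uses a small made-up frame (`toy` and its values are hypothetical; only the column names come from the data above) to show how `duplicated(keep=False)` flags every member of a duplicated group:

```python
import pandas as pd

# Hypothetical toy frame; the real `data` comes from get_data().
toy = pd.DataFrame({
    'age': [23, 35, 23, 41],
    'height': [170.0, 182.0, 170.0, 165.0],
    'gender': ['female', 'male', 'female', 'male'],
})

# keep=False marks every row in a duplicated group, not only the repeats.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence of each group by default.
deduped = toy.drop_duplicates()
print(deduped.shape)
```

Inspecting `dupes` first makes it easier to judge whether the repeated rows are true duplicates or distinct records that happen to share the same values.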

Missing data:


In [ ]:
heights = data['height']
ages = data['age']
gender = data['gender']

How much missing data do we have for heights?

In [ ]:
missing_height = heights.isnull()

In [ ]:
missing_height.head()

In [ ]:
missing_height.sum()

In [ ]:
data[missing_height]

How about age?

In [ ]:
missing_ages = ages.isnull()

In [ ]:
data[missing_ages]

And gender?

In [ ]:
gender.value_counts(dropna=False)

In [ ]:
missing_gender = gender.isnull()
data[missing_gender]
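
The column-by-column checks above can also be done in one pass. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical), assuming missing values are stored as NaN or None so that `isnull()` detects them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in every column.
toy = pd.DataFrame({
    'age': [23, np.nan, 31, 41],
    'height': [170.0, 182.0, np.nan, np.nan],
    'gender': ['female', 'male', None, 'male'],
})

# Number of missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fraction of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)

# Rows with at least one missing value in any column.
rows_with_gaps = toy[toy.isnull().any(axis=1)]
print(rows_with_gaps)
```

The per-column counts show where the gaps are concentrated, while `rows_with_gaps` shows which records are affected.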

But wait, we have another problem: the same gender appears as both male and MALE:


In [ ]:
gender.value_counts(dropna=False).plot(kind='bar', rot=0)
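
One possible way to reconcile the two spellings is to lower-case the labels before counting. The sketch below uses a made-up series (`gender_toy` is hypothetical) and assumes case is the only difference between the variants:

```python
import pandas as pd

# Hypothetical toy series mixing 'male' and 'MALE', plus one missing value.
gender_toy = pd.Series(['male', 'MALE', 'female', 'male', None])

# .str.lower() normalizes case and leaves missing values as missing.
normalized = gender_toy.str.lower()

# dropna=False keeps the missing entries visible in the counts.
counts = normalized.value_counts(dropna=False)
print(counts)
```

After normalizing, `male` and `MALE` fall into a single category and the missing entries are still counted separately.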

In [ ]: