In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
from codefiles.class2 import get_data
In [ ]:
data = get_data()
In [ ]:
data.head(10)
We seem to have a problem with some duplicated data:
In [ ]:
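print('Shape before dropping duplicates: {}'.format(data.shape))
# drop_duplicates() keeps the first occurrence of each fully repeated row
data = data.drop_duplicates()
print('Shape after dropping duplicates: {}'.format(data.shape))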
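Before dropping anything, it can help to look at which rows are actually repeated. Here is a minimal sketch on a made-up frame (the `toy` data below is hypothetical and only stands in for what `get_data()` returns):

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'height': [170.0, 170.0, 182.0, 165.0],
    'age': [30, 30, 41, 25],
    'gender': ['male', 'male', 'female', 'female'],
})

# keep=False flags every member of a repeated group, not only the later copies.
dup_mask = toy.duplicated(keep=False)
duplicate_rows = toy[dup_mask]

# By default drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()

print(duplicate_rows)
print(deduplicated.shape)
```

The same `duplicated(keep=False)` mask should work on `data`, assuming it is an ordinary DataFrame.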
In [ ]:
data.head(10)
In [ ]:
heights = data['height']
ages = data['age']
gender = data['gender']
In [ ]:
missing_height = heights.isnull()
In [ ]:
missing_height.head()
In [ ]:
missing_height.sum()
In [ ]:
data[missing_height]
In [ ]:
missing_ages = ages.isnull()
In [ ]:
data[missing_ages]
In [ ]:
# dropna=False keeps missing values as their own category in the counts
gender.value_counts(dropna=False)
In [ ]:
missing_gender = gender.isnull()
data[missing_gender]
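Instead of checking one column at a time, we could also summarize missing values for the whole frame in one go. A small sketch, again on a hypothetical `toy` frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps; the values are invented.
toy = pd.DataFrame({
    'height': [170.0, np.nan, 182.0, 165.0],
    'age': [30.0, 41.0, np.nan, np.nan],
    'gender': ['male', 'female', None, 'female'],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()

# Rows that have at least one missing entry in any column.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```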
But wait, we have another problem. We seem to have both male and MALE:
In [ ]:
gender.value_counts(dropna=False).plot(kind='bar', rot=0)
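One possible way to merge the two spellings is to lowercase the labels before counting. This is only a sketch on a hypothetical Series (the values are made up), not necessarily the fix used later in this notebook:

```python
import pandas as pd

# Hypothetical gender labels with inconsistent casing and one missing value.
gender_toy = pd.Series(['male', 'MALE', 'female', None, 'male', 'female'])

# .str.lower() lowercases each string and leaves missing values as missing.
normalized = gender_toy.str.lower()

# dropna=False keeps the missing value visible in the counts.
counts = normalized.value_counts(dropna=False)
print(counts)
```

After normalizing, male and MALE fall into a single category, while the missing value still shows up as its own entry.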
In [ ]: