Diagnosing the data issues:



In [ ]:

    
import pandas as pd 
import numpy as np 
%matplotlib inline
from matplotlib import pyplot as plt 
from codefiles.class2 import get_data, print_analysis, plot_standard_deviations



In [ ]:

    
data = get_data()
heights = data['height']
ages = data['age']
gender = data['gender']

Outliers:

What is the distribution of the heights?



In [ ]:

    
heights.hist(bins=40, figsize=(16,4))
plt.xlabel('Height')
plt.ylabel('Count')

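If the heights are roughly normally distributed, about 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. Values beyond 2 or 3 standard deviations are therefore rare and worth a closer look.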
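
The standard-deviation bands used in this section can be sanity-checked on a synthetic normal sample. The sketch below is only an illustration: the sample, its mean of 170 and spread of 10, and the helper `share_within` are assumptions, not part of the notebook's data or of `codefiles.class2`.

```python
import numpy as np

# Synthetic, normally distributed sample (illustrative values, not the notebook's data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=10, size=100_000)

def share_within(values, k):
    """Fraction of values within k standard deviations of the mean."""
    return np.mean(np.abs(values - values.mean()) <= k * values.std())

# Share of the sample inside 1, 2 and 3 standard deviations
coverage = {k: share_within(sample, k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} std: {share:.3f}")
```

Real data will only match these shares to the extent that it is actually close to normal.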



In [ ]:

    
plot_standard_deviations(heights, 'Heights')



In [ ]:

    
print_analysis(heights)

Who is outside of 2 standard deviations?

Under:



In [ ]:

    
heights[heights < heights.mean() - 2*heights.std()]

Over:



In [ ]:

    
heights[heights > heights.mean() + 2*heights.std()]

And outside 3 standard deviations?

Under:



In [ ]:

    
heights[heights < heights.mean() - 3*heights.std()]

Over:



In [ ]:

    
heights[heights > heights.mean() + 3*heights.std()]
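
The four checks above repeat the same pattern, so they could be folded into one small helper. This is a minimal sketch: `outside_std` and the toy values are hypothetical and only stand in for the real `heights` series.

```python
import pandas as pd

def outside_std(series, k):
    """Return (under, over): values more than k standard deviations from the mean."""
    mean, std = series.mean(), series.std()
    under = series[series < mean - k * std]
    over = series[series > mean + k * std]
    return under, over

# Hypothetical toy heights in cm; the notebook's real data comes from get_data()
toy_heights = pd.Series([160, 165, 170, 172, 175, 168, 171, 169, 174, 166, 250])

under, over = outside_std(toy_heights, 2)
print(under)
print(over)
```

With the real data the call would be `outside_std(heights, 2)` or `outside_std(heights, 3)`.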

How about the ages?



In [ ]:

    
ages.hist(bins=40, figsize=(16,4))
plt.xlabel('Age')
plt.ylabel('Count')



In [ ]:

    
plot_standard_deviations(ages, 'ages')



In [ ]:

    
print_analysis(ages)

Using the standard deviation assumes the data is roughly normally distributed. The mean and the standard deviation are also pulled by the very outliers we are trying to find.

Let's try some other approaches...
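
To see why the standard deviation can mislead here, the sketch below compares summary statistics with and without a single extreme value. The ages, including the 999 entry, are made up for illustration and are not taken from the notebook's data.

```python
import pandas as pd

# Hypothetical ages with one data-entry error (999); not the notebook's data
toy_ages = pd.Series([23, 31, 35, 38, 42, 45, 47, 52, 58, 64, 999])
clean = toy_ages[toy_ages < 999]

# The mean and standard deviation shift a lot; the median barely moves
for label, s in [("with error", toy_ages), ("without error", clean)]:
    print(f"{label}: mean={s.mean():.1f} std={s.std():.1f} median={s.median():.1f}")
```

One bad value is enough to inflate the mean and the standard deviation, which widens the bands and can hide the outlier itself.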

What's the biggest outlier we have?



In [ ]:

    
ages.max()

What if we used percentiles?



In [ ]:

    
extreme_value = .999



In [ ]:

    
ages.quantile(extreme_value)



In [ ]:

    
under_extreme_value = ages < ages.quantile(extreme_value)

Well this looks a lot more usable:



In [ ]:

    
ages[under_extreme_value].hist(bins=40, figsize=(16,4))
plt.xlabel('Age')
plt.ylabel('Count')



In [ ]:

    
plot_standard_deviations(ages[under_extreme_value], 'ages')



In [ ]:

    
print_analysis(ages[under_extreme_value])
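
The percentile filter used above can also be tried on synthetic data. This is a sketch under assumptions: the generated ages and the two injected errors (999 and 1500) are hypothetical, and only `Series.quantile` and boolean indexing from pandas are used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical ages: 2,000 plausible values (18 to 89) plus two data-entry errors
toy_ages = pd.Series(np.concatenate([rng.integers(18, 90, size=2000), [999, 1500]]))

extreme_value = .999
cutoff = toy_ages.quantile(extreme_value)

# Keep only the values strictly below the 99.9th percentile
under_extreme_value = toy_ages < cutoff
trimmed = toy_ages[under_extreme_value]

print(cutoff, len(toy_ages), len(trimmed), trimmed.max())
```

Because the comparison is strict (`<`), any legitimate value equal to the cutoff is dropped as well, so it is worth checking how many rows the filter removes.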

TODO:

analyze the ages further