In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
In [2]:
data = pd.read_csv('all_data.csv')
In [3]:
data.head(10)
Out[3]:
We seem to have a problem with some duplicated data. We can find the duplicates using pandas' duplicated method:
In [4]:
duplicated_data = data.duplicated()
In [5]:
duplicated_data.head()
Out[5]:
So this is actually a boolean mask. We can now ask for the data where the mask applies:
In [6]:
data[duplicated_data]
Out[6]:
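Finding duplicates is half the job; removing them is the other half. A minimal sketch using drop_duplicates on a toy frame (the values are made up, and the column names just mirror the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for all_data.csv
toy = pd.DataFrame({'height': [170, 170, 182], 'age': [30, 30, 25]})

# drop_duplicates removes repeated rows, keeping the first occurrence
deduped = toy.drop_duplicates()
print(len(deduped))  # 2 rows remain
```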
In [7]:
heights = data['height']
ages = data['age']
gender = data['gender']
Make a mask of the rows where the value is missing, using isnull:
In [8]:
missing_height = heights.isnull()
In [9]:
missing_height.head()
Out[9]:
In Python, False evaluates to 0 and True to 1, so we can count the number of missing values by doing:
In [10]:
missing_height.sum()
Out[10]:
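The same trick works on any boolean mask, and taking the mean of the mask even gives the fraction missing. A quick sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Hypothetical heights, two of them missing
toy = pd.Series([165.0, np.nan, 180.0, np.nan])
mask = toy.isnull()

print(mask.sum())   # counts the True values: 2 missing
print(mask.mean())  # fraction missing: 0.5
```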
As before, we can use that mask on our original dataset, and see it:
In [11]:
data[missing_height]
Out[11]:
In [12]:
missing_ages = ages.isnull()
In [13]:
data[missing_ages]
Out[13]:
Here we're going to do something clever: we're going to get the value_counts, but change the dropna (drop nulls) parameter to False, so that the missing values are kept.
This would not be very useful with numerical data, but since we know that gender is categorical, we might as well:
In [14]:
gender.value_counts(dropna=False)
Out[14]:
In [15]:
missing_gender = data['gender'].isnull()
data[missing_gender]
Out[15]:
But wait, we have another problem. We seem to have male and MALE:
In [16]:
gender.value_counts(dropna=False).plot(kind='bar', rot=0)
Out[16]:
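One way to collapse the male/MALE split (not shown in this notebook, so treat it as a sketch on hypothetical values) is to lower-case the column before counting:

```python
import pandas as pd

# Hypothetical sample of the gender column
g = pd.Series(['male', 'MALE', 'female', 'Female'])

# Lower-casing collapses the spelling variants into one category each
cleaned = g.str.lower()
print(cleaned.value_counts())
```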
Note: pyplot is used here to add axis labels; without it the plot ends up unlabeled, like the one above:
In [17]:
heights.hist(bins=40, figsize=(16,4))
plt.xlabel('Height')
plt.ylabel('Count')
Out[17]:
This was useful: we can see that there are some really tall people and some quite short ones. The distribution also looks close to normal.
Let's make a quick function to analyse how the data spreads around the mean...
In [18]:
def print_analysis(series):
    for nr in range(1, 4):
        upper_limit = series.mean() + (nr * series.std())
        lower_limit = series.mean() - (nr * series.std())
        over_range = series > upper_limit
        percent_over_range = over_range.sum() / len(series) * 100
        under_range = series < lower_limit
        percent_under_range = under_range.sum() / len(series) * 100
        in_range = (series < upper_limit) & (series > lower_limit)
        percent_in_range = in_range.sum() / len(series) * 100
        print('\nFor the range of %0.0f standard deviations:' % nr)
        print('  Lower limit: %0.0f' % lower_limit)
        print('  Percent under range: %0.1f%%' % percent_under_range)
        print('  Upper limit: %0.0f' % upper_limit)
        print('  Percent over range: %0.1f%%' % percent_over_range)
        print('  Percent within range: %0.1f%%' % percent_in_range)
In [19]:
heights.hist(bins=20)
plt.xlabel('Height')
plt.ylabel('Count')
Out[19]:
In [20]:
print_analysis(heights)
In [21]:
heights[heights < heights.mean() - 2*heights.std()]
Out[21]:
Over (2 standard deviations):
In [22]:
heights[heights > heights.mean() + 2*heights.std()]
Out[22]:
Note: the 131cm and 208cm actually seem quite plausible.
Under (3 standard deviations):
In [23]:
heights[heights < heights.mean() - 3*heights.std()]
Out[23]:
In [24]:
heights[heights > heights.mean() + 3*heights.std()]
Out[24]:
In [25]:
ages.hist(bins=40, figsize=(16,4))
plt.xlabel('Age')
plt.ylabel('Count')
Out[25]:
In [26]:
print_analysis(ages)
Well, this is quite useless. The reason is that using standard deviations assumes that the distribution is normal, and we can clearly see it isn't.
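A quick simulation illustrates the point: on a heavily skewed sample (purely synthetic here, drawn from an exponential distribution), the mean + 3*std fence flags a much larger share of points than the roughly 0.1% it would flag on normal data, while a percentile cutoff adapts to the shape:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, heavily skewed data; the scale is arbitrary
skewed = pd.Series(rng.exponential(scale=30, size=10_000))

# Fraction of points above the mean + 3*std fence
upper = skewed.mean() + 3 * skewed.std()
frac_over = (skewed > upper).mean()
print(frac_over)  # far more than the ~0.1% expected under normality

# A percentile cutoff follows the distribution's actual shape instead
print(skewed.quantile(0.99))
```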
What's the biggest outlier we have?
In [27]:
ages.max()
Out[27]:
What if we used percentiles?
In [28]:
extreme_value = .99
ages.dropna().quantile(extreme_value)
Out[28]:
Note: we had to use .dropna() there, as otherwise pandas would raise a runtime error (try it!)
In [29]:
# under_extreme_value = ages, where ages is smaller than the extreme value:
under_extreme_value = ages[ages < ages.dropna().quantile(extreme_value)]
In [30]:
under_extreme_value.hist(bins=40, figsize=(16,4))
plt.xlabel('Age')
plt.ylabel('Count')
Out[30]:
Now that looks a lot better. We can clearly see that almost everyone is an adult, except for the point on the extreme left.
In [31]:
non_babies = under_extreme_value[under_extreme_value > 10]
In [32]:
non_babies.hist(bins=40, figsize=(16,4))
plt.xlabel('Age')
plt.ylabel('Count')
Out[32]:
Starting to look better. What about the point on the far right?
In [33]:
non_babies.max()
Out[33]:
Seems legit.
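Putting the cleaning steps from this notebook together, an end-to-end sketch might look like the following. The frame, the values, and the age threshold are all made up for illustration; only the column names mirror the real dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for all_data.csv
df = pd.DataFrame({
    'height': [170.0, 170.0, 300.0, 182.0, np.nan],
    'age':    [30.0,  30.0,  25.0,  1.0,   40.0],
    'gender': ['male', 'male', 'MALE', 'female', 'Female'],
})

clean = (
    df.drop_duplicates()                                  # repeated rows
      .assign(gender=lambda d: d['gender'].str.lower())   # male vs MALE
)
clean = clean[clean['age'] > 10]                          # drop the babies
print(len(clean))
```

From here one could also trim the extreme heights with a quantile cutoff, as done for the ages above.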