Some initial imports:
In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
In [ ]:
data = pd.read_csv('../data/data_with_problems.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)
Let us drop the missing and duplicated values since they don't matter for now (already covered in other notebooks):
In [ ]:
data = data.drop_duplicates()
data = data.dropna()
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
Define the range of plausible ages:
In [ ]:
min_age = 0
max_age = 117 # oldest person currently alive
Create the mask:
In [ ]:
mask_age = (data['age'] >= min_age) & (data['age'] <= max_age)
mask_age.head(10)
Check if some outliers were caught:
In [ ]:
data[~mask_age]
Yes, two were found! The mask_age variable marks the rows we want to keep, i.e., the rows that fall within the bounds above. So, let's drop the two rows shown above:
In [ ]:
data = data[mask_age]
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
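As a side note, the same inclusive range check can be sketched with `Series.between`, which is inclusive on both ends by default. The toy DataFrame below is hypothetical (its values are invented for illustration and are not taken from the dataset); only the bounds `min_age` and `max_age` come from the cells above.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({'age': [25, -3, 40, 224, 117]})

min_age = 0
max_age = 117

# Series.between is inclusive on both ends by default, so it matches
# (age >= min_age) & (age <= max_age).
mask_between = toy['age'].between(min_age, max_age)
mask_manual = (toy['age'] >= min_age) & (toy['age'] <= max_age)

kept = toy[mask_between]      # rows inside the bounds
dropped = toy[~mask_between]  # rows outside the bounds
print(kept)
print(dropped)
```

Both masks are boolean Series aligned with the DataFrame index, so `~mask` selects exactly the rows that the mask rejects.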
Instead of keeping a continuous range of values, you can discretize them into classes/bins. pandas' qcut discretizes a variable into quantile-based buckets, i.e., buckets that hold roughly the same number of observations.
In [ ]:
data['height'].hist(bins=100)
plt.title('Height population distribution')
plt.xlabel('cm')
plt.ylabel('freq')
plt.show()
Discretize!
In [ ]:
height_bins = pd.qcut(data['height'],
                      5,
                      labels=['very short', 'short', 'average', 'tall', 'very tall'],
                      retbins=True)
In [ ]:
height_bins[0].head(10)
The limits of the defined classes/bins are:
In [ ]:
height_bins[1]
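To make the return value of `qcut` with `retbins=True` more concrete, here is a minimal sketch on hypothetical data (50 invented, evenly spaced heights, not the notebook's dataset). It shows that the result is a pair: the per-row categories and the array of bin edges.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm, invented for illustration: 150, 151, ..., 199.
heights = pd.Series(np.arange(150, 200, dtype=float))

labels = ['very short', 'short', 'average', 'tall', 'very tall']

# With retbins=True, qcut returns (categories, bin_edges).
categories, edges = pd.qcut(heights, 5, labels=labels, retbins=True)

# Quantile-based bins hold (roughly) the same number of rows each.
counts = categories.value_counts()
print(counts)
print(edges)  # 6 edges delimit 5 bins
```

Because the bins are defined by quantiles, an extreme value does not get a bin of its own: it simply lands in the last bucket and stretches that bucket's upper edge.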
We could replace the height values with the five new categories. However, a person measuring 252 cm looks like an outlier, so it is better to check this value against a two-standard-deviation bound or a percentile (e.g., the 99th).
Let's define the height bounds as two standard deviations on either side of the mean.
In [ ]:
# Calculate the mean and standard deviation
std_height = data['height'].std()
mean_height = data['height'].mean()
# Bounds: two standard deviations on either side of the mean
min_height = mean_height - 2 * std_height
max_height = mean_height + 2 * std_height
# The mask!
mask_height = (data['height'] > min_height) & (data['height'] < max_height)
print('Height bounds:')
print('> Minimum accepted height: %3.1f' % min_height)
print('> Maximum accepted height: %3.1f' % max_height)
Which ones are out of the bounds?
In [ ]:
data.loc[~mask_height]
Let's delete these rows (mask_height contains the rows we want to keep):
In [ ]:
data = data[mask_height]
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
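The percentile alternative mentioned earlier could be sketched as follows. This is only an illustration under stated assumptions: the heights are hypothetical (invented values plus one extreme of 252 cm), and the choice of the 1st and 99th percentiles as cut-offs is an assumption, not something fixed by this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm (invented): 99 plausible values plus one extreme.
heights = pd.Series(list(np.linspace(150, 199, 99)) + [252.0])

# Assumed cut-offs: keep values between the 1st and 99th percentiles.
lower = heights.quantile(0.01)
upper = heights.quantile(0.99)

mask_pct = heights.between(lower, upper)  # rows we want to keep
outliers = heights[~mask_pct]             # rows outside the percentile bounds
print('Bounds: %3.1f to %3.1f' % (lower, upper))
print(outliers)
```

Note that percentile bounds always flag a fixed share of rows in each tail, whether or not those rows are implausible, so in this sketch the smallest value is flagged along with the 252 cm one.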