01 - Example - Eliminating Outliers

This notebook shows how to eliminate the outliers diagnosed in the previous Learning Unit.

By: Hugo Lopes
Learning Unit 08

Some initial imports:


In [ ]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt

Load the dataset that will be used


In [ ]:
data = pd.read_csv('../data/data_with_problems.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)

Let's first drop the duplicated rows and missing values, since they are not the focus here (they were already covered in other notebooks):


In [ ]:
data = data.drop_duplicates()
data = data.dropna()
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
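Before dropping rows, it can help to quantify how many are affected. A minimal sketch on a small toy DataFrame (hypothetical values, standing in for the notebook's dataset):

```python
import pandas as pd
import numpy as np

# Toy DataFrame with one duplicated row and two rows containing NaNs
df = pd.DataFrame({'age': [25, 25, np.nan, 40],
                   'height': [170, 170, 180, np.nan]})

n_duplicates = df.duplicated().sum()      # rows identical to an earlier row
n_missing = df.isna().any(axis=1).sum()   # rows with at least one NaN
print(n_duplicates, n_missing)            # 1 2

clean = df.drop_duplicates().dropna()
print(clean.shape)                        # (1, 2)
```

Counting first makes it easy to sanity-check how much data the cleaning step will cost you.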

Dealing with outliers

Time to deal with the issues previously found.

1) Delete observations - use feature bounds

The easiest approach is to delete the observations, which works when you know the valid bounds of your features. Let's use age, since we know its limits. Set the limits:


In [ ]:
min_age = 0
max_age = 117 # oldest person currently alive

Create the mask:


In [ ]:
mask_age = (data['age'] >= min_age) & (data['age'] <= max_age)
mask_age.head(10)

Check whether any outliers were caught:


In [ ]:
data[~mask_age]

Yes, two were found! The mask_age variable marks the rows we want to keep, i.e., the rows that meet the bounds above. So, let's drop those 2 rows:


In [ ]:
data = data[mask_age]
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
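The same bounds mask can be written more compactly with `Series.between`, which is inclusive on both ends by default (matching the `>=` and `<=` above). A sketch on hypothetical ages:

```python
import pandas as pd

# Hypothetical ages, including two out-of-range values
ages = pd.Series([29, -3, 54, 130, 41])

# between(left, right) is equivalent to (ages >= 0) & (ages <= 117)
mask_age = ages.between(0, 117)
print(ages[mask_age].tolist())  # [29, 54, 41]
```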

2) Create classes/bins

Instead of keeping a continuous range of values, you can discretize it into classes/bins. Make use of pandas' qcut, which discretizes a variable into equal-sized buckets:


In [ ]:
data['height'].hist(bins=100)
plt.title('Height population distribution')
plt.xlabel('cm')
plt.ylabel('freq')
plt.show()

Discretize!


In [ ]:
height_bins = pd.qcut(data['height'], 
                      5, 
                      labels=['very short', 'short', 'average', 'tall', 'very tall'], 
                      retbins=True)

In [ ]:
height_bins[0].head(10)

The limits of the defined classes/bins are:


In [ ]:
height_bins[1]

We could replace the height values with the new five categories. Nevertheless, it looks like a person measuring 252 cm is actually an outlier, and the best approach would be to evaluate this value against two standard deviations from the mean or a percentile (e.g., the 99th).
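The percentile option mentioned above can be sketched with `Series.quantile`. Here the data is hypothetical (seeded random heights with one injected outlier), not the notebook's dataset:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
heights = pd.Series(rng.normal(170, 10, 500))
heights.iloc[0] = 252  # inject an extreme value, like the 252 cm above

upper = heights.quantile(0.99)          # 99th percentile as upper bound
filtered = heights[heights <= upper]    # keep everything at or below it
print(heights.size - filtered.size)     # rows removed (about 1% of 500)
```

Unlike fixed feature bounds, a percentile cut always removes roughly the same fraction of rows, so choose the percentile with the expected outlier rate in mind.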

Let's define the height bounds according to two standard deviations from the mean.

3) Delete observations - use the standard deviation


In [ ]:
# Calculate the mean and standard deviation
std_height = data['height'].std()
mean_height = data['height'].mean()
# The mask!
mask_height = (data['height'] > mean_height-2*std_height) & (data['height'] < mean_height+2*std_height)
print('Height bounds:')
print('> Minimum accepted height: %3.1f' % (mean_height-2*std_height))
print('> Maximum accepted height: %3.1f' % (mean_height+2*std_height))

Which ones are out of the bounds?


In [ ]:
data.loc[~mask_height]

Let's delete these rows (mask_height contains the rows we want to keep):


In [ ]:
data = data[mask_height]
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
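Besides the two-standard-deviation rule used above, another common bound (not used in this notebook) is the interquartile-range (IQR) rule, which is less sensitive to the outliers themselves. A sketch on hypothetical heights:

```python
import pandas as pd

# Hypothetical heights with one obvious outlier
heights = pd.Series([160, 165, 170, 172, 175, 180, 252])

q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # classic 1.5 * IQR fences

mask = heights.between(lower, upper)
print(heights[~mask].tolist())  # [252]
```

Because quartiles barely move when an extreme value is added, the IQR fences stay stable even when the mean and standard deviation are dragged toward the outlier.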

Done! Our initial dataset of 200 rows (173 after dropping duplicates and missing values) ended up with 166 rows after this data handling.