Title: Handling Outliers
Slug: handling_outliers
Summary: How to handle outliers for machine learning in Python.
Date: 2016-09-06 12:00
Category: Machine Learning
Tags: Preprocessing Structured Data
Authors: Chris Albon
In [1]:
# Load library
import pandas as pd
In [2]:
# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]
houses
Out[2]:
In [3]:
# Keep only observations below a threshold (drops the 116-bathroom outlier)
houses[houses['Bathrooms'] < 20]
Out[3]:
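The cutoff of 20 above is hard-coded. As a rough sketch of one alternative, the threshold can be derived from the data with the interquartile range (IQR). The 1.5 multiplier is a common convention, not something this post prescribes, and the names `lower`, `upper`, and `iqr_filtered` are illustrative only.

```python
import pandas as pd

# Same toy data as in the cells above
houses = pd.DataFrame({
    'Price': [534433, 392333, 293222, 4322032],
    'Bathrooms': [2, 3.5, 2, 116],
    'Square_Feet': [1500, 2500, 1500, 48000],
})

# Interquartile range of the feature
q1 = houses['Bathrooms'].quantile(0.25)
q3 = houses['Bathrooms'].quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the multiplier is an assumption)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only observations that fall inside the fences
iqr_filtered = houses[houses['Bathrooms'].between(lower, upper)]
iqr_filtered
```

With only four observations the quartiles are very coarse, so treat this as an illustration of the idea rather than a reliable rule for data this small.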
In [4]:
# Load library
import numpy as np
# Flag outliers: 1 if Bathrooms is 20 or more, otherwise 0
houses['Outlier'] = np.where(houses['Bathrooms'] < 20, 0, 1)
# Show data
houses
Out[4]:
In [5]:
# Log-transform the feature to dampen the outlier's influence
houses['Log_Of_Square_Feet'] = np.log(houses['Square_Feet'])
# Show data
houses
Out[5]:
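To see why the log transform helps, it can be useful to compare how far the largest value sits from the median before and after the transform. This is a minimal sketch on the same square footage values; `raw_ratio` and `log_ratio` are illustrative names, and the max-to-median ratio is just one informal way to gauge the effect.

```python
import numpy as np
import pandas as pd

# Square footage values from the DataFrame above
square_feet = pd.Series([1500, 2500, 1500, 48000], name='Square_Feet')

# Natural log of each value
log_square_feet = np.log(square_feet)

# How many times larger the maximum is than the median
raw_ratio = square_feet.max() / square_feet.median()
log_ratio = log_square_feet.max() / log_square_feet.median()

print(raw_ratio, log_ratio)
```

On the raw scale the largest house is 24 times the median, while on the log scale the same ratio is well under 2, so the outlier pulls far less on anything fit to the transformed feature.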