Title: Handling Outliers
Slug: handling_outliers
Summary: How to handling outliers for machine learning in Python.
Date: 2016-09-06 12:00
Category: Machine Learning
Tags: Preprocessing Structured Data
Authors: Chris Albon

Preliminaries



In [1]:

    
# Load library
import pandas as pd

Create Data



In [2]:

    
# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

houses









    Out[2]:







  
    
      
      Price
      Bathrooms
      Square_Feet
    
  
  
    
      0
      534433
      2.0
      1500
    
    
      1
      392333
      3.5
      2500
    
    
      2
      293222
      2.0
      1500
    
    
      3
      4322032
      116.0
      48000

Option 1: Drop



In [3]:

    
# Drop observations greater than some value
houses[houses['Bathrooms'] < 20]









    Out[3]:







  
    
      
      Price
      Bathrooms
      Square_Feet
    
  
  
    
      0
      534433
      2.0
      1500
    
    
      1
      392333
      3.5
      2500
    
    
      2
      293222
      2.0
      1500

Option 2: Mark



In [4]:

    
# Load library
import numpy as np

# Create feature based on boolean condition
houses['Outlier'] = np.where(houses['Bathrooms'] < 20, 0, 1)

# Show data
houses









    Out[4]:







  
    
      
      Price
      Bathrooms
      Square_Feet
      Outlier
    
  
  
    
      0
      534433
      2.0
      1500
      0
    
    
      1
      392333
      3.5
      2500
      0
    
    
      2
      293222
      2.0
      1500
      0
    
    
      3
      4322032
      116.0
      48000
      1

Option 3: Rescale



In [5]:

    
# Log feature
houses['Log_Of_Square_Feet'] = [np.log(x) for x in houses['Square_Feet']]

# Show data
houses









    Out[5]:







  
    
      
      Price
      Bathrooms
      Square_Feet
      Outlier
      Log_Of_Square_Feet
    
  
  
    
      0
      534433
      2.0
      1500
      0
      7.313220
    
    
      1
      392333
      3.5
      2500
      0
      7.824046
    
    
      2
      293222
      2.0
      1500
      0
      7.313220
    
    
      3
      4322032
      116.0
      48000
      1
      10.778956

	Price	Bathrooms	Square_Feet
0	534433	2.0	1500
1	392333	3.5	2500
2	293222	2.0	1500
3	4322032	116.0	48000

	Price	Bathrooms	Square_Feet	Outlier	Log_Of_Square_Feet
0	534433	2.0	1500	0	7.313220
1	392333	3.5	2500	0	7.824046
2	293222	2.0	1500	0	7.313220
3	4322032	116.0	48000	1	10.778956