Lesson 7 - Outlier detection

Task: This project has two parts.

1) In the first part, you will run a regression, and identify and remove the 10% of points that have the largest residual errors. Then you’ll remove those outliers from the dataset and refit the regression, just like the strategy that Sebastian suggested in the lesson videos.
2) In the second part, you will get acquainted with some of the outliers in the Enron finance data, and learn if/how to remove them.



In [1]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib



In [4]:

    
import sys
sys.path.append("../outliers/")

filePath = '/Users/omojumiller/mycode/MachineLearningNanoDegree/IntroToArtificialIntelligence/outliers/'
import random
import numpy
import pickle
import matplotlib.pyplot as pyplt
import seaborn as sns


from outlier_cleaner import outlierCleaner



In [5]:

    
### load up some practice data with outliers in it
ages = pickle.load( open(filePath+'practice_outliers_ages.pkl', "r") )
net_worths = pickle.load( open(filePath+"practice_outliers_net_worths.pkl", "r") )



In [6]:

    
### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features

ages = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

from sklearn.cross_validation import train_test_split
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, 
                                                                    net_worths, test_size=0.1, random_state=42)



In [7]:

    
### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like

from sklearn import linear_model
reg = linear_model.LinearRegression()

reg.fit(ages_train, net_worths_train)
print "slope of regression is %.2f" % reg.coef_
print "intercepts of regression is %.2f" % reg.intercept_









    



slope of regression is 5.08
intercepts of regression is 25.21



In [8]:

    
print "\n ********stats on dataset********\n"
print "r-squared score on testing data: ", reg.score(ages_test, net_worths_test)
print "r-squared score on training data: ", reg.score(ages_train, net_worths_train)









    



 ********stats on dataset********

r-squared score on testing data:  0.878262470366
r-squared score on training data:  0.489872596175



In [9]:

    
pyplt.clf()
pyplt.scatter(ages_train, net_worths_train, color="b", label="train data")
pyplt.scatter(ages_test, net_worths_test, color="r", label="test data")
pyplt.plot(ages_test, reg.predict(ages_test), color="black")
pyplt.legend(loc='upper center', shadow=True, fontsize='x-large')
pyplt.xlabel("ages", fontsize=14)
pyplt.ylabel("net worths", fontsize=14)
pyplt.show()



In [ ]: