Fire up GraphLab Create



In [2]:

    
import graphlab

Load some house value vs. crime rate data

The dataset is from Philadelphia, PA and gives the average house sale price in a number of neighborhoods. The attributes we have for each neighborhood include the crime rate ('CrimeRate'), miles from Center City ('MilesPhila'), town name ('Name'), and county name ('County').



In [3]:

    
sales = graphlab.SFrame.read_csv('Philadelphia_Crime_Rate_noNA.csv/')









    



PROGRESS: Finished parsing file /home/anil/MachineLearning_Mastering/Philadelphia_Crime_Rate_noNA.csv
PROGRESS: Parsing completed. Parsed 99 lines in 0.307255 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,float,float,float,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 99 lines. Lines per second: 499.43
PROGRESS: Finished parsing file /home/anil/MachineLearning_Mastering/Philadelphia_Crime_Rate_noNA.csv
PROGRESS: Parsing completed. Parsed 99 lines in 0.200195 secs.



In [4]:

    
sales









    Out[4]:





    
    HousePrice  HsPrc ($10,000)  CrimeRate  MilesPhila  PopChg  Name        County
    140463      14.0463          29.7       10.0        -1.0    Abington    Montgome
    113033      11.3033          24.1       18.0        4.0     Ambler      Montgome
    124186      12.4186          19.5       25.0        8.0     Aston       Delaware
    110490      11.049           49.4       25.0        2.7     Bensalem    Bucks
    79124       7.9124           54.1       19.0        3.9     Bristol B.  Bucks
    92634       9.2634           48.6       20.0        0.6     Bristol T.  Bucks
    89246       8.9246           30.8       15.0        -2.6    Brookhaven  Delaware
    195145      19.5145          10.8       20.0        -3.5    Bryn Athyn  Montgome
    297342      29.7342          20.2       14.0        0.6     Bryn Mawr   Montgome
    264298      26.4298          20.4       26.0        6.0     Buckingham  Bucks
    

[99 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exploring the data

The house price in a town is negatively correlated with the crime rate of that town: low-crime towns tend to have higher house prices, and vice versa.



In [5]:

    
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="CrimeRate", y="HousePrice")

Fit the regression model using crime as the feature



In [6]:

    
# validation_set=None: fit on all rows rather than holding out a validation split
crime_model = graphlab.linear_regression.create(sales,
                                                target='HousePrice',
                                                features=['CrimeRate'],
                                                validation_set=None,
                                                verbose=False)
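
For intuition about what a simple (one-feature) regression estimates, here is a minimal NumPy sketch of the closed-form least squares solution. It uses only the ten rows printed in the head of the SFrame above, not the full 99-row dataset, so its coefficients will differ from those of `crime_model`. The helper name `fit_simple_regression` is hypothetical, and GraphLab Create's own solver and default settings may differ from this textbook formula.

```python
import numpy as np

# The ten rows shown in the head of the SFrame above (not the full dataset).
crime_rate = np.array([29.7, 24.1, 19.5, 49.4, 54.1, 48.6, 30.8, 10.8, 20.2, 20.4])
house_price = np.array([140463, 113033, 124186, 110490, 79124,
                        92634, 89246, 195145, 297342, 264298], dtype=float)

def fit_simple_regression(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

intercept, slope = fit_simple_regression(crime_rate, house_price)
print("intercept: %.2f, slope: %.2f" % (intercept, slope))
```

The slope is the covariance of x and y divided by the variance of x, and the intercept forces the line through the point of means.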

Let's see what our fit looks like



In [9]:

    
import matplotlib.pyplot as plt
%matplotlib inline



In [10]:

    
plt.plot(sales['CrimeRate'], sales['HousePrice'], '.',
         sales['CrimeRate'], crime_model.predict(sales), '-')









    Out[10]:





[<matplotlib.lines.Line2D at 0x5586410>,
 <matplotlib.lines.Line2D at 0x5586890>]

Above: blue dots are original data, green line is the fit from the simple regression.

Remove Center City and redo the analysis

Center City is the one observation with an extremely high crime rate, yet its house prices are not very low. This point does not follow the trend of the rest of the data. The question is how much including Center City influences the fit to the other datapoints. Let's remove this datapoint and see what happens.



In [11]:

    
sales_noCC = sales[sales['MilesPhila'] != 0.0]



In [12]:

    
sales_noCC.show(view="Scatter Plot", x="CrimeRate", y="HousePrice")

Refit our simple regression model on this modified dataset:



In [14]:

    
crime_model_noCC = graphlab.linear_regression.create(sales_noCC,
                                                     target='HousePrice',
                                                     features=['CrimeRate'],
                                                     validation_set=None,
                                                     verbose=False)

Look at the fit:



In [15]:

    
# Plot predictions from the refit model (crime_model_noCC), not the original crime_model
plt.plot(sales_noCC['CrimeRate'], sales_noCC['HousePrice'], '.',
         sales_noCC['CrimeRate'], crime_model_noCC.predict(sales_noCC), '-')









    Out[15]:





[<matplotlib.lines.Line2D at 0x578d190>,
 <matplotlib.lines.Line2D at 0x578d610>]

Compare coefficients for full-data fit versus no-Center-City fit

Visually, the fit seems different, but let's quantify this by examining the estimated coefficients of our original fit and that of the modified dataset with Center City removed.



In [16]:

    
crime_model.get('coefficients')









    Out[16]:





    
    name         index  value
    (intercept)  None   176626.046881
    CrimeRate    None   -576.804949058
    

[2 rows x 3 columns]



In [17]:

    
crime_model_noCC.get('coefficients')









    Out[17]:





    
    name         index  value
    (intercept)  None   225204.604303
    CrimeRate    None   -2287.69717443
    

[2 rows x 3 columns]

Above: We see that for the "no Center City" version, the predicted decrease in house price is about 2,288 per unit increase in crime rate. In contrast, for the original dataset, the drop is only about 577 per unit increase in crime rate. The slope is roughly four times steeper once Center City is removed.
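
To make the difference between the two fits concrete, the sketch below plugs the coefficients printed in Out[16] and Out[17] into the line equation and compares predictions. The crime rates used are hypothetical values chosen only for illustration, and the `predict` helper is an assumption of this sketch, not part of GraphLab Create.

```python
# Coefficients copied from Out[16] (all data) and Out[17] (no Center City).
full_fit = {"intercept": 176626.046881, "slope": -576.804949058}
nocc_fit = {"intercept": 225204.604303, "slope": -2287.69717443}

def predict(fit, crime_rate):
    """Predicted house price on the fitted line."""
    return fit["intercept"] + fit["slope"] * crime_rate

# Hypothetical crime rates, chosen only for illustration.
for rate in (10.0, 30.0, 50.0):
    print("CrimeRate %5.1f: full data %9.0f, no Center City %9.0f"
          % (rate, predict(full_fit, rate), predict(nocc_fit, rate)))

# Crime rate at which the two fitted lines give the same prediction.
crossing = ((nocc_fit["intercept"] - full_fit["intercept"])
            / (full_fit["slope"] - nocc_fit["slope"]))
print("lines cross near CrimeRate = %.1f" % crossing)
```

Below the crossing point the "no Center City" line predicts higher prices than the full-data line, and above it the predictions fall off much faster.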

High leverage points:

Center City is said to be a "high leverage" point because it sits at an extreme x value where there are no other observations. Recalling the closed-form solution for simple regression, such a point has the potential to dramatically change the least squares line: the center of mass of x is heavily influenced by this one point, and the least squares line will try to pass close to that outlying (in x) point. If a high leverage point follows the trend of the other data, it might not have much effect. If it departs from that trend, it can strongly influence the resulting fit.
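
One common way to put a number on leverage in simple regression is the diagonal of the hat matrix, h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The sketch below computes it with NumPy. The x values are made up for illustration (nine typical points plus one extreme one) and are not taken from the dataset; the function name `leverage` is likewise hypothetical.

```python
import numpy as np

def leverage(x):
    """Leverage h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return 1.0 / len(x) + centered ** 2 / np.sum(centered ** 2)

# Made-up x values: nine typical points plus one extreme point at the end.
x = np.array([10.0, 15.0, 20.0, 22.0, 25.0, 30.0, 35.0, 40.0, 50.0, 350.0])
h = leverage(x)
print(np.round(h, 3))
print("largest leverage is at index", int(np.argmax(h)))
```

Leverage depends only on x, not on y, which is why a high leverage point is only a candidate for changing the fit. In simple regression the leverages always sum to 2, so one point with a value close to 1 leaves very little for the rest.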

Influential observations:

An influential observation is one where the removal of the point significantly changes the fit. As discussed above, high leverage points are good candidates for being influential observations, but need not be. Other observations that are not leverage points can also be influential observations (e.g., strongly outlying in y even if x is a typical value).
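
The definition above suggests a direct check: refit with each point left out and see how far the slope moves. The following sketch does this with NumPy on made-up data, in which five points follow an exact linear trend and a sixth sits at an extreme x without following it. The data and the helper names are assumptions for illustration only.

```python
import numpy as np

def slope(x, y):
    """Least squares slope of y on x (with an intercept)."""
    xc = x - x.mean()
    return np.sum(xc * (y - y.mean())) / np.sum(xc ** 2)

def slope_change_when_removed(x, y):
    """For each point i, the slope without i minus the slope with all points."""
    full = slope(x, y)
    keep = ~np.eye(len(x), dtype=bool)  # row i is a mask that drops point i
    return np.array([slope(x[m], y[m]) - full for m in keep])

# Made-up data: y falls by 2 per unit of x, except the last point,
# which sits at an extreme x and does not follow that trend.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 300.0])
y = np.array([180.0, 160.0, 140.0, 120.0, 100.0, 95.0])

changes = slope_change_when_removed(x, y)
print(np.round(changes, 3))
print("most influential point is at index", int(np.argmax(np.abs(changes))))
```

Removing any of the first five points barely moves the slope, while removing the last one restores the slope of -2 that the other points share.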

Remove high-value outlier neighborhoods and redo analysis

Based on the discussion above, a question is whether the outlying high-value towns are strongly influencing the fit. Let's remove them and see what happens.



In [18]:

    
sales_nohighend = sales_noCC[sales_noCC['HousePrice'] < 350000]
crime_model_nohighend = graphlab.linear_regression.create(sales_nohighend,
                                                          target='HousePrice',
                                                          features=['CrimeRate'],
                                                          validation_set=None,
                                                          verbose=False)

Do the coefficients change much?



In [19]:

    
crime_model_noCC.get('coefficients')









    Out[19]:





    
    name         index  value
    (intercept)  None   225204.604303
    CrimeRate    None   -2287.69717443
    

[2 rows x 3 columns]



In [20]:

    
crime_model_nohighend.get('coefficients')









    Out[20]:





    
    name         index  value
    (intercept)  None   199073.589615
    CrimeRate    None   -1837.71280989
    

[2 rows x 3 columns]

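Above: We see that removing the outlying high-value neighborhoods has some effect on the fit: the slope moves from about -2,288 to about -1,838, a change of roughly 20%. That is not nearly as much as the effect of our high-leverage Center City datapoint, whose removal made the slope roughly four times steeper.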
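
The comparison can be checked with a few lines of arithmetic on the slopes printed in Out[16], Out[19], and Out[20]. This is only a sketch; `relative_change` is a hypothetical helper that expresses each change as a fraction of the slope before the removal.

```python
# Slopes copied from Out[16], Out[19], and Out[20] above.
slope_full = -576.804949058
slope_noCC = -2287.69717443
slope_nohighend = -1837.71280989

def relative_change(before, after):
    """Change in slope as a fraction of the magnitude of the earlier slope."""
    return (after - before) / abs(before)

change_cc = relative_change(slope_full, slope_noCC)
change_highend = relative_change(slope_noCC, slope_nohighend)
print("removing Center City:    %+.1f%%" % (100 * change_cc))
print("removing high-end towns: %+.1f%%" % (100 * change_highend))
```

A negative value means the slope became steeper (more negative), and a positive value means it became shallower.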


