In [1]:
import graphlab

Work with Philadelphia crime rate data

The dataset contains house prices in the Philadelphia area, along with crime rates for various neighborhoods. It leads to some interesting observations, which the steps that follow explore.

Load data and do initial analysis


In [2]:
crime_rate_data =  graphlab.SFrame.read_csv('Philadelphia_Crime_Rate_noNA.csv')


Finished parsing file /home/akshay/Workspace/Courses/pyDataAnalysis/ml-regression/week1/Philadelphia_Crime_Rate_noNA.csv
Parsing completed. Parsed 99 lines in 0.011855 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,float,float,float,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [3]:
crime_rate_data


Out[3]:
HousePrice  HsPrc ($10,000)  CrimeRate  MilesPhila  PopChg  Name        County
140463      14.0463          29.7       10.0        -1.0    Abington    Montgome
113033      11.3033          24.1       18.0        4.0     Ambler      Montgome
124186      12.4186          19.5       25.0        8.0     Aston       Delaware
110490      11.049           49.4       25.0        2.7     Bensalem    Bucks
79124       7.9124           54.1       19.0        3.9     Bristol B.  Bucks
92634       9.2634           48.6       20.0        0.6     Bristol T.  Bucks
89246       8.9246           30.8       15.0        -2.6    Brookhaven  Delaware
195145      19.5145          10.8       20.0        -3.5    Bryn Athyn  Montgome
297342      29.7342          20.2       14.0        0.6     Bryn Mawr   Montgome
264298      26.4298          20.4       26.0        6.0     Buckingham  Bucks
[99 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [4]:
graphlab.canvas.set_target('ipynb')

In [5]:
crime_rate_data.show(view='Scatter Plot', x = "CrimeRate", y = "HousePrice")


Fit the regression model using crime rate as the feature


In [6]:
crime_model = graphlab.linear_regression.create(crime_rate_data, 
                                               target = 'HousePrice',
                                               features = ['CrimeRate'],
                                               validation_set = None, 
                                               verbose = False)
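
As a rough illustration of what a one-feature linear regression computes, the sketch below uses the closed-form least-squares solution for the intercept and slope. It is written with NumPy rather than GraphLab Create, whose internal solver may differ, and the function name and toy values are hypothetical; they are not taken from the Philadelphia file.

```python
import numpy as np


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = covariance of x and y divided by the variance of x.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical toy data, chosen to lie exactly on a line.
toy_crime_rate = [10.0, 20.0, 30.0, 40.0]
toy_house_price = [200000.0, 180000.0, 160000.0, 140000.0]

intercept, slope = fit_simple_linear_regression(toy_crime_rate, toy_house_price)
print("intercept = {:.1f}, slope = {:.1f}".format(intercept, slope))
```

Because the toy points are exactly collinear, the fit recovers the line that generated them; real data such as the crime-rate table will leave nonzero residuals.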

In [8]:
import matplotlib.pyplot as plt



In [9]:
%matplotlib inline

Look at the fit of the (initial) model


In [10]:
plt.plot(crime_rate_data['CrimeRate'], crime_rate_data['HousePrice'],
        '.', crime_rate_data['CrimeRate'], 
         crime_model.predict(crime_rate_data), '-')


Out[10]:
[<matplotlib.lines.Line2D at 0x7f3a58692390>,
 <matplotlib.lines.Line2D at 0x7f3a58692490>]

The plot shows one outlier: a point with a very high crime rate whose house price is nonetheless high, so it does not follow the overall trend. This point is the center of the city (the Center City data point).

Remove the Center City value and redo the analysis

Center City is a single observation with an extremely high crime rate and high house prices, which makes it an outlier in some sense. We can remove it and refit the model.
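
In regression diagnostics, an observation whose feature value sits far from the other feature values is said to have high leverage. The sketch below computes the standard leverage values for a one-feature regression with an intercept, using NumPy. The function name and the toy crime rates are hypothetical and are not taken from the dataset.

```python
import numpy as np


def leverage(x):
    """Leverage of each observation in a one-feature regression with intercept.

    h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2
    """
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return 1.0 / x.size + centered ** 2 / np.sum(centered ** 2)


# Hypothetical crime rates: five typical values and one extreme value.
toy_crime_rate = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 200.0])

h = leverage(toy_crime_rate)
most_extreme = int(np.argmax(h))
print("leverage values:", np.round(h, 3))
print("index of highest-leverage point:", most_extreme)
```

The leverage values always sum to the number of fitted parameters (two here), so a single point holding most of that total has a large say in where the line goes.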


In [11]:
crime_rate_data_noCC = crime_rate_data[crime_rate_data['MilesPhila'] != 0.0]

In [12]:
crime_rate_data_noCC.show(view='Scatter Plot', x = "CrimeRate", y = "HousePrice")


Notice how this scatter plot differs from the previous one now that the outlier (Center City) has been removed.


In [13]:
crime_model_withNoCC = graphlab.linear_regression.create(crime_rate_data_noCC,
                                                        target = 'HousePrice',
                                                        features = ['CrimeRate'],
                                                        validation_set = None,
                                                        verbose = False)

Look at the fit of the model with outlier removed


In [14]:
plt.plot(crime_rate_data_noCC['CrimeRate'], crime_rate_data_noCC['HousePrice'], '.', 
         crime_rate_data_noCC['CrimeRate'], crime_model_withNoCC.predict(crime_rate_data_noCC), '-')


Out[14]:
[<matplotlib.lines.Line2D at 0x7f3a585985d0>,
 <matplotlib.lines.Line2D at 0x7f3a585986d0>]

Compare coefficients for the full-data fit vs. the fit with Center City removed


In [15]:
crime_model.get('coefficients')


Out[15]:
name         index  value           stderr
(intercept)  None   176626.046881   11245.5882194
CrimeRate    None   -576.804949058  226.90225951
[2 rows x 4 columns]

In [16]:
crime_model_withNoCC.get('coefficients')


Out[16]:
name         index  value           stderr
(intercept)  None   225204.604303   16404.0247514
CrimeRate    None   -2287.69717443  491.537478123
[2 rows x 4 columns]
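
To make the gap between the two fits concrete, the sketch below takes the coefficients printed in Out[15] and Out[16] and computes the change in predicted house price that each model implies for a 10-unit increase in CrimeRate. The dictionaries and the helper function are hypothetical conveniences, and the 10-unit step is an arbitrary choice for illustration.

```python
# Coefficients copied from Out[15] (full data) and Out[16] (Center City removed).
full_fit = {"intercept": 176626.046881, "slope": -576.804949058}
no_cc_fit = {"intercept": 225204.604303, "slope": -2287.69717443}


def price_change(fit, delta_crime_rate):
    """Change in predicted HousePrice for a given change in CrimeRate."""
    return fit["slope"] * delta_crime_rate


change_full = price_change(full_fit, 10.0)
change_no_cc = price_change(no_cc_fit, 10.0)
slope_ratio = no_cc_fit["slope"] / full_fit["slope"]

print("full data:           {:.2f}".format(change_full))
print("Center City removed: {:.2f}".format(change_no_cc))
print("slope ratio:         {:.2f}".format(slope_ratio))
```

With Center City removed, the slope is roughly four times as steep, so the same rise in crime rate corresponds to a predicted price drop about four times larger.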

Remove high-value outlier neighborhoods and redo analysis


In [17]:
crime_rate_data_noHighEnd = crime_rate_data_noCC[crime_rate_data_noCC['HousePrice'] < 350000]

In [18]:
crime_model_noHighEnd = graphlab.linear_regression.create(crime_rate_data_noHighEnd, 
                                                         target = 'HousePrice', 
                                                         features = ['CrimeRate'], 
                                                         validation_set = None,
                                                         verbose = False)

How much do the coefficients change?


In [19]:
crime_model_withNoCC.get('coefficients')


Out[19]:
name         index  value           stderr
(intercept)  None   225204.604303   16404.0247514
CrimeRate    None   -2287.69717443  491.537478123
[2 rows x 4 columns]

In [20]:
crime_model_noHighEnd.get('coefficients')


Out[20]:
name         index  value           stderr
(intercept)  None   199073.589615   11932.5101105
CrimeRate    None   -1837.71280989  351.519609333
[2 rows x 4 columns]

Removing the high-value outlier neighborhoods has some effect on the fit, but not nearly as much as removing the high-leverage Center City data point. Hence, high-leverage points can be much stronger candidates for influential observations, whereas outliers are not necessarily influential.
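
One simple way to quantify this comparison is the relative change in the CrimeRate slope at each removal step. The sketch below uses the slopes printed in Out[15], Out[16], and Out[20]. The helper function is hypothetical, and relative slope change is only one of several possible influence measures.

```python
# CrimeRate slopes copied from Out[15], Out[16] and Out[20].
slope_full = -576.804949058         # all 99 rows
slope_no_cc = -2287.69717443        # Center City removed
slope_no_high_end = -1837.71280989  # high-value neighborhoods also removed


def relative_change(before, after):
    """Signed change from `before` to `after`, as a fraction of |before|."""
    return (after - before) / abs(before)


change_removing_center_city = relative_change(slope_full, slope_no_cc)
change_removing_high_end = relative_change(slope_no_cc, slope_no_high_end)

print("removing Center City: {:+.1%}".format(change_removing_center_city))
print("removing high end:    {:+.1%}".format(change_removing_high_end))
```

Dropping the single Center City row shifts the slope by nearly three times its original magnitude, while dropping the high-value neighborhoods shifts it by roughly a fifth.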

