Regression Week 3: Assessing Fit (polynomial regression)

In this notebook you will compare different regression models in order to assess which model fits best. We will be using polynomial regression as a means to examine this topic. In particular you will:

  • Write a function that takes an SArray and a degree and returns an SFrame whose columns are the SArray raised to each power up to that degree, e.g. for degree = 3, column 1 is the SArray, column 2 is the SArray squared, and column 3 is the SArray cubed
  • Use matplotlib to visualize polynomial regressions
  • Use matplotlib to visualize the same polynomial degree on different subsets of the data
  • Use a validation set to select a polynomial degree
  • Assess the final fit using test data

We will continue to use the House data from previous notebooks.

Fire up GraphLab Create


In [1]:
import graphlab

Next we're going to write a polynomial function that takes an SArray and a maximal degree and returns an SFrame with columns containing the SArray to all the powers up to the maximal degree.

The easiest way to apply a power to an SArray is to use .apply() with a lambda. For example, to take the example array and compute the third power we can do as follows (note: running this cell the first time may take longer than expected since it loads GraphLab):


In [2]:
tmp = graphlab.SArray([1., 2., 3.])
tmp_cubed = tmp.apply(lambda x: x**3)
print tmp
print tmp_cubed


[1.0, 2.0, 3.0]
[1.0, 8.0, 27.0]

We can create an empty SFrame using graphlab.SFrame() and then add columns to it with ex_sframe['column_name'] = value. For example, we create an empty SFrame and set the column 'power_1' to the first power of tmp (i.e. tmp itself).


In [3]:
ex_sframe = graphlab.SFrame()
ex_sframe['power_1'] = tmp
print ex_sframe


+---------+
| power_1 |
+---------+
|   1.0   |
|   2.0   |
|   3.0   |
+---------+
[3 rows x 1 columns]

Polynomial_sframe function

Using the hints above, complete the following function to create an SFrame consisting of the powers of an SArray up to a specific degree:


In [16]:
def polynomial_sframe(feature, degree):
    # assume that degree >= 1
    # initialize an empty SFrame:
    poly_sframe = graphlab.SFrame()

    # loop over each power from 1 up to and including degree:
    for power in range(1, degree+1):
        # assign column 'power_<k>' the k-th power of the feature
        poly_sframe['power_' + str(power)] = feature.apply(lambda x: x**power)

    return poly_sframe

To test your function, consider the small tmp variable and what you would expect the outcome of the following call to be:


In [17]:
print polynomial_sframe(tmp, 3)


+---------+---------+---------+
| power_1 | power_2 | power_3 |
+---------+---------+---------+
|   1.0   |   1.0   |   1.0   |
|   2.0   |   4.0   |   8.0   |
|   3.0   |   9.0   |   27.0  |
+---------+---------+---------+
[3 rows x 3 columns]

Visualizing polynomial regression

Let's use matplotlib to visualize what a polynomial regression looks like on some real data.


In [20]:
sales = graphlab.SFrame('../Data/kc_house_data.gl/')
sales.head()


Out[20]:
+------------+---------------------------+-----------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price     | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+-----------+----------+-----------+-------------+----------+--------+------------+
| 7129300520 | 2014-10-13 00:00:00+00:00 | 221900.0  | 3.0      | 1.0       | 1180.0      | 5650     | 1      | 0          |
| 6414100192 | 2014-12-09 00:00:00+00:00 | 538000.0  | 3.0      | 2.25      | 2570.0      | 7242     | 2      | 0          |
| 5631500400 | 2015-02-25 00:00:00+00:00 | 180000.0  | 2.0      | 1.0       | 770.0       | 10000    | 1      | 0          |
| 2487200875 | 2014-12-09 00:00:00+00:00 | 604000.0  | 4.0      | 3.0       | 1960.0      | 5000     | 1      | 0          |
| 1954400510 | 2015-02-18 00:00:00+00:00 | 510000.0  | 3.0      | 2.0       | 1680.0      | 8080     | 1      | 0          |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000.0 | 4.0      | 4.5       | 5420.0      | 101930   | 1      | 0          |
| 1321400060 | 2014-06-27 00:00:00+00:00 | 257500.0  | 3.0      | 2.25      | 1715.0      | 6819     | 2      | 0          |
| 2008000270 | 2015-01-15 00:00:00+00:00 | 291850.0  | 3.0      | 1.5       | 1060.0      | 9711     | 1      | 0          |
| 2414600126 | 2015-04-15 00:00:00+00:00 | 229500.0  | 3.0      | 1.0       | 1780.0      | 7470     | 1      | 0          |
| 3793500160 | 2015-03-12 00:00:00+00:00 | 323000.0  | 3.0      | 2.5       | 1890.0      | 6560     | 2      | 0          |
+------------+---------------------------+-----------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 3         | 7     | 1180       | 0             | 1955     | 0            | 98178   | 47.51123398 |
| 0    | 3         | 7     | 2170       | 400           | 1951     | 1991         | 98125   | 47.72102274 |
| 0    | 3         | 6     | 770        | 0             | 1933     | 0            | 98028   | 47.73792661 |
| 0    | 5         | 7     | 1050       | 910           | 1965     | 0            | 98136   | 47.52082    |
| 0    | 3         | 8     | 1680       | 0             | 1987     | 0            | 98074   | 47.61681228 |
| 0    | 3         | 11    | 3890       | 1530          | 2001     | 0            | 98053   | 47.65611835 |
| 0    | 3         | 7     | 1715       | 0             | 1995     | 0            | 98003   | 47.30972002 |
| 0    | 3         | 7     | 1060       | 0             | 1963     | 0            | 98198   | 47.40949984 |
| 0    | 3         | 7     | 1050       | 730           | 1960     | 0            | 98146   | 47.51229381 |
| 0    | 3         | 7     | 1890       | 0             | 2003     | 0            | 98038   | 47.36840673 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.25677536 | 1340.0        | 5650.0     |
| -122.3188624  | 1690.0        | 7639.0     |
| -122.23319601 | 2720.0        | 8062.0     |
| -122.39318505 | 1360.0        | 5000.0     |
| -122.04490059 | 1800.0        | 7503.0     |
| -122.00528655 | 4760.0        | 101930.0   |
| -122.32704857 | 2238.0        | 6819.0     |
| -122.31457273 | 1650.0        | 9711.0     |
| -122.33659507 | 1780.0        | 8113.0     |
| -122.0308176  | 2390.0        | 7570.0     |
+---------------+---------------+------------+
[10 rows x 21 columns]

As in Week 3, we will use the sqft_living variable. For plotting purposes (connecting the dots), you'll need to sort by the values of sqft_living. For houses with identical square footage, we break the tie by their prices.


In [21]:
sales = sales.sort(['sqft_living', 'price'])

Let's start with a degree 1 polynomial using 'sqft_living' (i.e. a line) to predict 'price' and plot what it looks like.


In [25]:
poly1_data = polynomial_sframe(sales['sqft_living'], 1)
print poly1_data
poly1_data['price'] = sales['price'] # add price to the data since it's the target
print poly1_data


+---------+
| power_1 |
+---------+
|  290.0  |
|  370.0  |
|  380.0  |
|  384.0  |
|  390.0  |
|  390.0  |
|  410.0  |
|  420.0  |
|  420.0  |
|  430.0  |
+---------+
[21613 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
+---------+----------+
| power_1 |  price   |
+---------+----------+
|  290.0  | 142000.0 |
|  370.0  | 276000.0 |
|  380.0  | 245000.0 |
|  384.0  | 265000.0 |
|  390.0  | 228000.0 |
|  390.0  | 245000.0 |
|  410.0  | 325000.0 |
|  420.0  | 229050.0 |
|  420.0  | 280000.0 |
|  430.0  | 80000.0  |
+---------+----------+
[21613 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

NOTE: for all the models in this notebook use validation_set = None to ensure that all results are consistent across users.


In [26]:
model1 = graphlab.linear_regression.create(poly1_data, target = 'price', features = ['power_1'], validation_set = None)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 21613
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 1
PROGRESS: Number of coefficients    : 2
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 1.014469     | 4362074.696077     | 261440.790724 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

In [27]:
#let's take a look at the weights before we plot
model1.get("coefficients")


Out[27]:
+-------------+-------+----------------+
|     name    | index |     value      |
+-------------+-------+----------------+
| (intercept) |  None | -43579.0852515 |
|   power_1   |  None | 280.622770886  |
+-------------+-------+----------------+
[2 rows x 3 columns]

In [28]:
import matplotlib.pyplot as plt
%matplotlib inline

In [29]:
plt.plot(poly1_data['power_1'],poly1_data['price'],'.',
        poly1_data['power_1'], model1.predict(poly1_data),'-')


Out[29]:
[<matplotlib.lines.Line2D at 0x7fb6ef4c5210>,
 <matplotlib.lines.Line2D at 0x7fb6ef45f450>]

Let's unpack that plt.plot() command. The first pair of SArrays we passed is the 1st power of sqft and the actual price; we ask for these to be plotted as dots '.'. The next pair we pass is the 1st power of sqft and the predicted values from the linear model, which we ask to be plotted as a line '-'.

We can see, not surprisingly, that the predicted values all fall on a line, specifically the one with slope 280 and intercept -43579.
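As a quick sanity check, we can read those two coefficients back out of the model and evaluate the line by hand; this is just a sketch, and the 2000 sqft query point is our own choice:

coefs = model1.get("coefficients")
intercept = coefs[coefs['name'] == '(intercept)']['value'][0]
slope = coefs[coefs['name'] == 'power_1']['value'][0]
# predicted price of a 2000 sqft house, straight from the line equation
print intercept + slope * 2000.0

What if we wanted to plot a second-degree polynomial?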


In [31]:
poly2_data = polynomial_sframe(sales['sqft_living'], 2)
print poly2_data
my_features = poly2_data.column_names() # get the names of the features
print my_features
poly2_data['price'] = sales['price'] # add price to the data since it's the target
print poly2_data
model2 = graphlab.linear_regression.create(poly2_data, target = 'price', features = my_features, validation_set = None)


+---------+----------+
| power_1 | power_2  |
+---------+----------+
|  290.0  | 84100.0  |
|  370.0  | 136900.0 |
|  380.0  | 144400.0 |
|  384.0  | 147456.0 |
|  390.0  | 152100.0 |
|  390.0  | 152100.0 |
|  410.0  | 168100.0 |
|  420.0  | 176400.0 |
|  420.0  | 176400.0 |
|  430.0  | 184900.0 |
+---------+----------+
[21613 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
['power_1', 'power_2']
+---------+----------+----------+
| power_1 | power_2  |  price   |
+---------+----------+----------+
|  290.0  | 84100.0  | 142000.0 |
|  370.0  | 136900.0 | 276000.0 |
|  380.0  | 144400.0 | 245000.0 |
|  384.0  | 147456.0 | 265000.0 |
|  390.0  | 152100.0 | 228000.0 |
|  390.0  | 152100.0 | 245000.0 |
|  410.0  | 168100.0 | 325000.0 |
|  420.0  | 176400.0 | 229050.0 |
|  420.0  | 176400.0 | 280000.0 |
|  430.0  | 184900.0 | 80000.0  |
+---------+----------+----------+
[21613 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 21613
PROGRESS: Number of features          : 2
PROGRESS: Number of unpacked features : 2
PROGRESS: Number of coefficients    : 3
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.022914     | 5913020.984255     | 250948.368758 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

In [32]:
model2.get("coefficients")


Out[32]:
+-------------+-------+-----------------+
|     name    | index |      value      |
+-------------+-------+-----------------+
| (intercept) |  None |  199222.496445  |
|   power_1   |  None |  67.9940640677  |
|   power_2   |  None | 0.0385812312789 |
+-------------+-------+-----------------+
[3 rows x 3 columns]

In [33]:
plt.plot(poly2_data['power_1'],poly2_data['price'],'.',
        poly2_data['power_1'], model2.predict(poly2_data),'-')


Out[33]:
[<matplotlib.lines.Line2D at 0x7fb6f31c2a10>,
 <matplotlib.lines.Line2D at 0x7fb6ef4d9d50>]

The resulting model looks like half a parabola. Try on your own to see what the cubic looks like:


In [34]:
poly3_data = polynomial_sframe(sales['sqft_living'], 3)
print poly3_data
my_features3 = poly3_data.column_names() # get the names of the features
print my_features3
poly3_data['price'] = sales['price'] # add price to the data since it's the target
print poly3_data
model3 = graphlab.linear_regression.create(poly3_data, target = 'price', features = my_features3, validation_set = None)


+---------+----------+------------+
| power_1 | power_2  |  power_3   |
+---------+----------+------------+
|  290.0  | 84100.0  | 24389000.0 |
|  370.0  | 136900.0 | 50653000.0 |
|  380.0  | 144400.0 | 54872000.0 |
|  384.0  | 147456.0 | 56623104.0 |
|  390.0  | 152100.0 | 59319000.0 |
|  390.0  | 152100.0 | 59319000.0 |
|  410.0  | 168100.0 | 68921000.0 |
|  420.0  | 176400.0 | 74088000.0 |
|  420.0  | 176400.0 | 74088000.0 |
|  430.0  | 184900.0 | 79507000.0 |
+---------+----------+------------+
[21613 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
['power_1', 'power_2', 'power_3']
+---------+----------+------------+----------+
| power_1 | power_2  |  power_3   |  price   |
+---------+----------+------------+----------+
|  290.0  | 84100.0  | 24389000.0 | 142000.0 |
|  370.0  | 136900.0 | 50653000.0 | 276000.0 |
|  380.0  | 144400.0 | 54872000.0 | 245000.0 |
|  384.0  | 147456.0 | 56623104.0 | 265000.0 |
|  390.0  | 152100.0 | 59319000.0 | 228000.0 |
|  390.0  | 152100.0 | 59319000.0 | 245000.0 |
|  410.0  | 168100.0 | 68921000.0 | 325000.0 |
|  420.0  | 176400.0 | 74088000.0 | 229050.0 |
|  420.0  | 176400.0 | 74088000.0 | 280000.0 |
|  430.0  | 184900.0 | 79507000.0 | 80000.0  |
+---------+----------+------------+----------+
[21613 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 21613
PROGRESS: Number of features          : 3
PROGRESS: Number of unpacked features : 3
PROGRESS: Number of coefficients    : 4
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.023075     | 3261066.736007     | 249261.286346 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

In [35]:
model3.get("coefficients")


Out[35]:
+-------------+-------+-------------------+
|     name    | index |       value       |
+-------------+-------+-------------------+
| (intercept) |  None |   336788.117952   |
|   power_1   |  None |   -90.1476236119  |
|   power_2   |  None |   0.087036715081  |
|   power_3   |  None | -3.8398521196e-06 |
+-------------+-------+-------------------+
[4 rows x 3 columns]

In [36]:
plt.plot(poly3_data['power_1'],poly3_data['price'],'.',
        poly3_data['power_1'], model3.predict(poly3_data),'-')


Out[36]:
[<matplotlib.lines.Line2D at 0x7fb6eca2ffd0>,
 <matplotlib.lines.Line2D at 0x7fb6eca3c210>]

Now try a 15th degree polynomial:


In [37]:
poly15_data = polynomial_sframe(sales['sqft_living'], 15)
print poly15_data
my_features15 = poly15_data.column_names() # get the names of the features
print my_features15
poly15_data['price'] = sales['price'] # add price to the data since it's the target
print poly15_data
model15 = graphlab.linear_regression.create(poly15_data, target = 'price', features = my_features15, validation_set = None)


+---------+----------+------------+---------------+-------------------+
| power_1 | power_2  |  power_3   |    power_4    |      power_5      |
+---------+----------+------------+---------------+-------------------+
|  290.0  | 84100.0  | 24389000.0 |  7072810000.0 |   2.0511149e+12   |
|  370.0  | 136900.0 | 50653000.0 | 18741610000.0 |   6.9343957e+12   |
|  380.0  | 144400.0 | 54872000.0 | 20851360000.0 |   7.9235168e+12   |
|  384.0  | 147456.0 | 56623104.0 | 21743271936.0 | 8.34941642342e+12 |
|  390.0  | 152100.0 | 59319000.0 | 23134410000.0 |   9.0224199e+12   |
|  390.0  | 152100.0 | 59319000.0 | 23134410000.0 |   9.0224199e+12   |
|  410.0  | 168100.0 | 68921000.0 | 28257610000.0 |   1.15856201e+13  |
|  420.0  | 176400.0 | 74088000.0 | 31116960000.0 |   1.30691232e+13  |
|  420.0  | 176400.0 | 74088000.0 | 31116960000.0 |   1.30691232e+13  |
|  430.0  | 184900.0 | 79507000.0 | 34188010000.0 |   1.47008443e+13  |
+---------+----------+------------+---------------+-------------------+
+-------------------+-------------------+-------------------+-------------------+
|      power_6      |      power_7      |      power_8      |      power_9      |
+-------------------+-------------------+-------------------+-------------------+
|   5.94823321e+14  |  1.7249876309e+17 | 5.00246412961e+19 | 1.45071459759e+22 |
|  2.565726409e+15  |  9.4931877133e+17 | 3.51247945392e+20 | 1.29961739795e+23 |
|  3.010936384e+15  | 1.14415582592e+18 |  4.3477921385e+20 | 1.65216101263e+23 |
| 3.20617590659e+15 | 1.23117154813e+18 | 4.72769874483e+20 | 1.81543631801e+23 |
|  3.518743761e+15  | 1.37231006679e+18 | 5.35200926048e+20 | 2.08728361159e+23 |
|  3.518743761e+15  | 1.37231006679e+18 | 5.35200926048e+20 | 2.08728361159e+23 |
|  4.750104241e+15  | 1.94754273881e+18 | 7.98492522912e+20 | 3.27381934394e+23 |
|  5.489031744e+15  | 2.30539333248e+18 | 9.68265199642e+20 | 4.06671383849e+23 |
|  5.489031744e+15  | 2.30539333248e+18 | 9.68265199642e+20 | 4.06671383849e+23 |
|  6.321363049e+15  | 2.71818611107e+18 | 1.16882002776e+21 | 5.02592611937e+23 |
+-------------------+-------------------+-------------------+-------------------+
+-------------------+-------------------+-------------------+-------------------+
|      power_10     |      power_11     |      power_12     |      power_13     |
+-------------------+-------------------+-------------------+-------------------+
|  4.207072333e+24  | 1.22005097657e+27 | 3.53814783205e+29 |  1.0260628713e+32 |
| 4.80858437242e+25 | 1.77917621779e+28 | 6.58295200584e+30 | 2.43569224216e+33 |
| 6.27821184799e+25 | 2.38572050224e+28 | 9.06573790849e+30 | 3.44498040523e+33 |
| 6.97127546117e+25 | 2.67696977709e+28 |  1.0279563944e+31 | 3.94735255451e+33 |
| 8.14040608519e+25 | 3.17475837322e+28 | 1.23815576556e+31 | 4.82880748567e+33 |
| 8.14040608519e+25 | 3.17475837322e+28 | 1.23815576556e+31 | 4.82880748567e+33 |
| 1.34226593102e+26 | 5.50329031716e+28 | 2.25634903004e+31 | 9.25103102315e+33 |
| 1.70801981217e+26 |  7.1736832111e+28 | 3.01294694866e+31 | 1.26543771844e+34 |
| 1.70801981217e+26 |  7.1736832111e+28 | 3.01294694866e+31 | 1.26543771844e+34 |
| 2.16114823133e+26 | 9.29293739471e+28 | 3.99596307973e+31 | 1.71826412428e+34 |
+-------------------+-------------------+-------------------+-------------------+
+-------------------+-------------------+
|      power_14     |      power_15     |
+-------------------+-------------------+
| 2.97558232676e+34 |  8.6291887476e+36 |
|  9.012061296e+35  | 3.33446267952e+38 |
| 1.30909255399e+36 | 4.97455170515e+38 |
| 1.51578338093e+36 | 5.82060818277e+38 |
| 1.88323491941e+36 | 7.34461618571e+38 |
| 1.88323491941e+36 | 7.34461618571e+38 |
| 3.79292271949e+36 | 1.55509831499e+39 |
| 5.31483841744e+36 | 2.23223213533e+39 |
| 5.31483841744e+36 | 2.23223213533e+39 |
| 7.38853573441e+36 |  3.1770703658e+39 |
+-------------------+-------------------+
[21613 rows x 15 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13', 'power_14', 'power_15']
+---------+----------+------------+---------------+-------------------+
| power_1 | power_2  |  power_3   |    power_4    |      power_5      |
+---------+----------+------------+---------------+-------------------+
|  290.0  | 84100.0  | 24389000.0 |  7072810000.0 |   2.0511149e+12   |
|  370.0  | 136900.0 | 50653000.0 | 18741610000.0 |   6.9343957e+12   |
|  380.0  | 144400.0 | 54872000.0 | 20851360000.0 |   7.9235168e+12   |
|  384.0  | 147456.0 | 56623104.0 | 21743271936.0 | 8.34941642342e+12 |
|  390.0  | 152100.0 | 59319000.0 | 23134410000.0 |   9.0224199e+12   |
|  390.0  | 152100.0 | 59319000.0 | 23134410000.0 |   9.0224199e+12   |
|  410.0  | 168100.0 | 68921000.0 | 28257610000.0 |   1.15856201e+13  |
|  420.0  | 176400.0 | 74088000.0 | 31116960000.0 |   1.30691232e+13  |
|  420.0  | 176400.0 | 74088000.0 | 31116960000.0 |   1.30691232e+13  |
|  430.0  | 184900.0 | 79507000.0 | 34188010000.0 |   1.47008443e+13  |
+---------+----------+------------+---------------+-------------------+
+-------------------+-------------------+-------------------+-------------------+
|      power_6      |      power_7      |      power_8      |      power_9      |
+-------------------+-------------------+-------------------+-------------------+
|   5.94823321e+14  |  1.7249876309e+17 | 5.00246412961e+19 | 1.45071459759e+22 |
|  2.565726409e+15  |  9.4931877133e+17 | 3.51247945392e+20 | 1.29961739795e+23 |
|  3.010936384e+15  | 1.14415582592e+18 |  4.3477921385e+20 | 1.65216101263e+23 |
| 3.20617590659e+15 | 1.23117154813e+18 | 4.72769874483e+20 | 1.81543631801e+23 |
|  3.518743761e+15  | 1.37231006679e+18 | 5.35200926048e+20 | 2.08728361159e+23 |
|  3.518743761e+15  | 1.37231006679e+18 | 5.35200926048e+20 | 2.08728361159e+23 |
|  4.750104241e+15  | 1.94754273881e+18 | 7.98492522912e+20 | 3.27381934394e+23 |
|  5.489031744e+15  | 2.30539333248e+18 | 9.68265199642e+20 | 4.06671383849e+23 |
|  5.489031744e+15  | 2.30539333248e+18 | 9.68265199642e+20 | 4.06671383849e+23 |
|  6.321363049e+15  | 2.71818611107e+18 | 1.16882002776e+21 | 5.02592611937e+23 |
+-------------------+-------------------+-------------------+-------------------+
+-------------------+-------------------+-------------------+-------------------+
|      power_10     |      power_11     |      power_12     |      power_13     |
+-------------------+-------------------+-------------------+-------------------+
|  4.207072333e+24  | 1.22005097657e+27 | 3.53814783205e+29 |  1.0260628713e+32 |
| 4.80858437242e+25 | 1.77917621779e+28 | 6.58295200584e+30 | 2.43569224216e+33 |
| 6.27821184799e+25 | 2.38572050224e+28 | 9.06573790849e+30 | 3.44498040523e+33 |
| 6.97127546117e+25 | 2.67696977709e+28 |  1.0279563944e+31 | 3.94735255451e+33 |
| 8.14040608519e+25 | 3.17475837322e+28 | 1.23815576556e+31 | 4.82880748567e+33 |
| 8.14040608519e+25 | 3.17475837322e+28 | 1.23815576556e+31 | 4.82880748567e+33 |
| 1.34226593102e+26 | 5.50329031716e+28 | 2.25634903004e+31 | 9.25103102315e+33 |
| 1.70801981217e+26 |  7.1736832111e+28 | 3.01294694866e+31 | 1.26543771844e+34 |
| 1.70801981217e+26 |  7.1736832111e+28 | 3.01294694866e+31 | 1.26543771844e+34 |
| 2.16114823133e+26 | 9.29293739471e+28 | 3.99596307973e+31 | 1.71826412428e+34 |
+-------------------+-------------------+-------------------+-------------------+
+-------------------+-------------------+----------+
|      power_14     |      power_15     |  price   |
+-------------------+-------------------+----------+
| 2.97558232676e+34 |  8.6291887476e+36 | 142000.0 |
|  9.012061296e+35  | 3.33446267952e+38 | 276000.0 |
| 1.30909255399e+36 | 4.97455170515e+38 | 245000.0 |
| 1.51578338093e+36 | 5.82060818277e+38 | 265000.0 |
| 1.88323491941e+36 | 7.34461618571e+38 | 228000.0 |
| 1.88323491941e+36 | 7.34461618571e+38 | 245000.0 |
| 3.79292271949e+36 | 1.55509831499e+39 | 325000.0 |
| 5.31483841744e+36 | 2.23223213533e+39 | 229050.0 |
| 5.31483841744e+36 | 2.23223213533e+39 | 280000.0 |
| 7.38853573441e+36 |  3.1770703658e+39 | 80000.0  |
+-------------------+-------------------+----------+
[21613 rows x 16 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 21613
PROGRESS: Number of features          : 15
PROGRESS: Number of unpacked features : 15
PROGRESS: Number of coefficients    : 16
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.058605     | 2662308.584338     | 245690.511190 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

In [38]:
model15.get("coefficients")


Out[38]:
+-------------+-------+--------------------+
|     name    | index |       value        |
+-------------+-------+--------------------+
| (intercept) |  None |   73619.7521118    |
|   power_1   |  None |   410.287462537    |
|   power_2   |  None |   -0.23045071443   |
|   power_3   |  None | 7.58840542459e-05  |
|   power_4   |  None | -5.65701802671e-09 |
|   power_5   |  None | -4.57028130588e-13 |
|   power_6   |  None | 2.66360206482e-17  |
|   power_7   |  None | 3.38584769245e-21  |
|   power_8   |  None | 1.14723104028e-25  |
|   power_9   |  None | -4.65293584963e-30 |
+-------------+-------+--------------------+
[16 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [39]:
plt.plot(poly15_data['power_1'],poly15_data['price'],'.',
        poly15_data['power_1'], model15.predict(poly15_data),'-')


Out[39]:
[<matplotlib.lines.Line2D at 0x7fb6eca51f90>,
 <matplotlib.lines.Line2D at 0x7fb6ec9830d0>]

What do you think of the 15th degree polynomial? Do you think this is appropriate? If we were to change the data, do you think you'd get pretty much the same curve? Let's take a look.

Changing the data and re-learning

We're going to split the sales data into four subsets of roughly equal size. Then you will estimate a 15th degree polynomial model on all four subsets of the data. Print the coefficients (you should use .print_rows(num_rows = 16) to view all of them) and plot the resulting fit (as we did above). The quiz will ask you some questions about these results.

To split the sales data into four subsets, we perform the following steps:

  • First split sales into 2 subsets with .random_split(0.5, seed=0).
  • Next split the resulting subsets into 2 more subsets each. Use .random_split(0.5, seed=0).

We set seed=0 in these steps so that different users get consistent results. You should end up with 4 subsets (set_1, set_2, set_3, set_4) of approximately equal size.


In [41]:
set_a, set_b = sales.random_split(0.5,seed=0)
set_1, set_2 = set_a.random_split(0.5,seed=0)
set_3, set_4 = set_b.random_split(0.5,seed=0)

In [42]:
print len(set_1), len(set_2), len(set_3), len(set_4)


5404 5398 5409 5402

Fit a 15th degree polynomial on set_1, set_2, set_3, and set_4 using sqft_living to predict prices. Print the coefficients and make a plot of the resulting model.
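Since the same recipe (build polynomial features, remember the column names, attach the target, fit) is repeated four times, a small helper function could reduce the duplication. The sketch below is our own, not part of the assignment, and fit_poly is a hypothetical name:

def fit_poly(data, degree):
    # build the polynomial features and remember their names
    poly = polynomial_sframe(data['sqft_living'], degree)
    features = poly.column_names()
    # attach the target column
    poly['price'] = data['price']
    # fit with validation_set=None for reproducibility, as above
    model = graphlab.linear_regression.create(poly, target='price',
                                              features=features,
                                              validation_set=None, verbose=False)
    return model, poly

The cell below spells these steps out explicitly for each subset instead.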


In [44]:
set_1_15_data = polynomial_sframe(set_1['sqft_living'], 15)
set_2_15_data = polynomial_sframe(set_2['sqft_living'], 15)
set_3_15_data = polynomial_sframe(set_3['sqft_living'], 15)
set_4_15_data = polynomial_sframe(set_4['sqft_living'], 15)
#
my_features_x_15 = set_1_15_data.column_names() # get the names of the features (same for all four subsets)
#
set_1_15_data['price'] = set_1['price'] # add price to the data since it's the target
set_2_15_data['price'] = set_2['price'] # add price to the data since it's the target
set_3_15_data['price'] = set_3['price'] # add price to the data since it's the target
set_4_15_data['price'] = set_4['price'] # add price to the data since it's the target
#
model_1_15 = graphlab.linear_regression.create(set_1_15_data, target='price', features=my_features_x_15, validation_set=None)
model_2_15 = graphlab.linear_regression.create(set_2_15_data, target='price', features=my_features_x_15, validation_set=None)
model_3_15 = graphlab.linear_regression.create(set_3_15_data, target='price', features=my_features_x_15, validation_set=None)
model_4_15 = graphlab.linear_regression.create(set_4_15_data, target='price', features=my_features_x_15, validation_set=None)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 5404
PROGRESS: Number of features          : 15
PROGRESS: Number of unpacked features : 15
PROGRESS: Number of coefficients    : 16
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.022166     | 2195218.932305     | 248858.822200 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 5398
PROGRESS: Number of features          : 15
PROGRESS: Number of unpacked features : 15
PROGRESS: Number of coefficients    : 16
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.019653     | 2069212.978546     | 234840.067186 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 5409
PROGRESS: Number of features          : 15
PROGRESS: Number of unpacked features : 15
PROGRESS: Number of coefficients    : 16
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.022215     | 2269769.506520     | 251460.072754 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 5402
PROGRESS: Number of features          : 15
PROGRESS: Number of unpacked features : 15
PROGRESS: Number of coefficients    : 16
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.023585     | 2314893.173831     | 244563.136754 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

In [45]:
model_1_15.get("coefficients").print_rows(num_rows = 16)
model_2_15.get("coefficients").print_rows(num_rows = 16)
model_3_15.get("coefficients").print_rows(num_rows = 16)
model_4_15.get("coefficients").print_rows(num_rows = 16)


+-------------+-------+--------------------+
|     name    | index |       value        |
+-------------+-------+--------------------+
| (intercept) |  None |   223312.750252    |
|   power_1   |  None |    118.08612758    |
|   power_2   |  None |  -0.0473482011285  |
|   power_3   |  None | 3.25310342447e-05  |
|   power_4   |  None | -3.32372152522e-09 |
|   power_5   |  None | -9.75830458062e-14 |
|   power_6   |  None | 1.15440303427e-17  |
|   power_7   |  None | 1.05145869414e-21  |
|   power_8   |  None | 3.46049616451e-26  |
|   power_9   |  None | -1.09654454041e-30 |
|   power_10  |  None | -2.42031812105e-34 |
|   power_11  |  None | -1.99601206809e-38 |
|   power_12  |  None | -1.07709903875e-42 |
|   power_13  |  None | -2.72862817696e-47 |
|   power_14  |  None | 2.44782693411e-51  |
|   power_15  |  None | 5.01975232672e-55  |
+-------------+-------+--------------------+
[16 rows x 3 columns]

+-------------+-------+--------------------+
|     name    | index |       value        |
+-------------+-------+--------------------+
| (intercept) |  None |   89836.5077384    |
|   power_1   |  None |    319.80694675    |
|   power_2   |  None |   -0.10331539703   |
|   power_3   |  None | 1.06682476033e-05  |
|   power_4   |  None | 5.75577097726e-09  |
|   power_5   |  None | -2.5466346458e-13  |
|   power_6   |  None | -1.09641345095e-16 |
|   power_7   |  None | -6.36458441478e-21 |
|   power_8   |  None | 5.52560416694e-25  |
|   power_9   |  None | 1.35082039014e-28  |
|   power_10  |  None | 1.18408188227e-32  |
|   power_11  |  None | 1.98348000372e-37  |
|   power_12  |  None | -9.92533589916e-41 |
|   power_13  |  None | -1.60834847097e-44 |
|   power_14  |  None | -9.12006024374e-49 |
|   power_15  |  None | 1.68636658352e-52  |
+-------------+-------+--------------------+
[16 rows x 3 columns]

+-------------+-------+--------------------+
|     name    | index |       value        |
+-------------+-------+--------------------+
| (intercept) |  None |    87317.97956     |
|   power_1   |  None |   356.304911035    |
|   power_2   |  None |  -0.164817442802   |
|   power_3   |  None | 4.40424992681e-05  |
|   power_4   |  None | 6.48234876146e-10  |
|   power_5   |  None | -6.75253226523e-13 |
|   power_6   |  None | -3.36842592782e-17 |
|   power_7   |  None | 3.60999704445e-21  |
|   power_8   |  None | 6.46999725427e-25  |
|   power_9   |  None | 4.23639388795e-29  |
|   power_10  |  None | -3.62149425127e-34 |
|   power_11  |  None | -4.2711952729e-37  |
|   power_12  |  None | -5.61445971742e-41 |
|   power_13  |  None | -3.87452772843e-45 |
|   power_14  |  None | 4.69430359221e-50  |
|   power_15  |  None | 6.39045885998e-53  |
+-------------+-------+--------------------+
[16 rows x 3 columns]

+-------------+-------+--------------------+
|     name    | index |       value        |
+-------------+-------+--------------------+
| (intercept) |  None |   259020.879451    |
|   power_1   |  None |   -31.727716201    |
|   power_2   |  None |   0.109702769613   |
|   power_3   |  None | -1.58383847321e-05 |
|   power_4   |  None | -4.47660623765e-09 |
|   power_5   |  None |  1.1397657346e-12  |
|   power_6   |  None | 1.97669120585e-16  |
|   power_7   |  None | -6.15783678733e-21 |
|   power_8   |  None | -4.88012304066e-24 |
|   power_9   |  None | -6.62186781443e-28 |
|   power_10  |  None | -2.70631583103e-32 |
|   power_11  |  None | 6.72370411494e-36  |
|   power_12  |  None | 1.74115646278e-39  |
|   power_13  |  None | 2.09188375693e-43  |
|   power_14  |  None |  4.7801556656e-48  |
|   power_15  |  None | -4.74535333128e-51 |
+-------------+-------+--------------------+
[16 rows x 3 columns]


In [47]:
plt.plot(set_1_15_data['power_1'],set_1_15_data['price'],'.',set_1_15_data['power_1'],model_1_15.predict(set_1_15_data),'-')


Out[47]:
[<matplotlib.lines.Line2D at 0x7fb6ec8c5490>,
 <matplotlib.lines.Line2D at 0x7fb6ec74b890>]

In [48]:
plt.plot(set_2_15_data['power_1'],set_2_15_data['price'],'.',set_2_15_data['power_1'],model_2_15.predict(set_2_15_data),'-')


Out[48]:
[<matplotlib.lines.Line2D at 0x7fb6ec691210>,
 <matplotlib.lines.Line2D at 0x7fb6ec691450>]

In [49]:
plt.plot(set_3_15_data['power_1'],set_3_15_data['price'],'.',set_3_15_data['power_1'],model_3_15.predict(set_3_15_data),'-')


Out[49]:
[<matplotlib.lines.Line2D at 0x7fb6ec6bced0>,
 <matplotlib.lines.Line2D at 0x7fb6ec5bb650>]

In [50]:
plt.plot(set_4_15_data['power_1'],set_4_15_data['price'],'.',set_4_15_data['power_1'],model_4_15.predict(set_4_15_data),'-')


Out[50]:
[<matplotlib.lines.Line2D at 0x7fb6ec6afed0>,
 <matplotlib.lines.Line2D at 0x7fb6ec4f8e90>]

Some questions you will be asked on your quiz:

Quiz Question: Is the sign (positive or negative) for power_15 the same in all four models?

No. The power_15 coefficient is positive in models 1, 2, and 3, but negative in model 4.
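Rather than scanning the four tables, the signs can be read off programmatically; a sketch using the models fitted above:

for i, m in enumerate([model_1_15, model_2_15, model_3_15, model_4_15], 1):
    coefs = m.get("coefficients")
    # filter the coefficient SFrame down to the power_15 row
    p15 = coefs[coefs['name'] == 'power_15']['value'][0]
    print 'model', i, 'power_15:', p15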

Quiz Question: (True/False) the plotted fitted lines look the same in all four plots

False. The four fitted curves differ noticeably, especially at the largest square footages.

Selecting a Polynomial Degree

Whenever we have a "magic" parameter like the degree of the polynomial, there is one well-known way to select it: a validation set. (We will explore another approach in Week 4.)

We split the sales dataset 3-way into training set, test set, and validation set as follows:

  • Split our sales data into 2 sets: training_and_validation and testing. Use random_split(0.9, seed=1).
  • Further split the training_and_validation data into two sets: training and validation. Use random_split(0.5, seed=1).

Again, we set seed=1 to obtain consistent results for different users.


In [51]:
training_and_validation, testing = sales.random_split(0.9,seed=1)
training, validation = training_and_validation.random_split(0.5,seed=1)
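As with the four subsets earlier, it is worth sanity-checking the sizes of the three splits (the exact counts depend on the random split):

print len(training), len(validation), len(testing)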

Next you should write a loop that does the following:

  • For degree in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] (to get this in python type range(1, 15+1))
    • Build an SFrame of polynomial data of train_data['sqft_living'] at the current degree
    • hint: my_features = poly_data.column_names() gives you a list e.g. ['power_1', 'power_2', 'power_3'] which you might find useful for graphlab.linear_regression.create( features = my_features)
    • Add train_data['price'] to the polynomial SFrame
    • Learn a polynomial regression model to sqft vs price with that degree on TRAIN data
    • Compute the RSS on VALIDATION data for that degree (here you will want to use .predict()); you will need to build a polynomial SFrame from the validation data as well.
  • Report which degree had the lowest RSS on validation data (remember python indexes from 0)

(Note: you can turn off the printout of linear_regression.create() with verbose = False.)


In [53]:
def get_residual_sum_of_squares(model, data, outcome):
    # First get the predictions
    predictions = model.predict(data)
    # Then compute the residuals/errors
    residuals = predictions - outcome
    # Then square the residuals and add them up (vectorized SArray arithmetic)
    RSS = (residuals ** 2).sum()
    return RSS

In [78]:
def minimize_rss( training_data, validation_data, degrees ):
    degree_rss = {}
    
    for degree in range(1,degrees+1):
        poly_degree = polynomial_sframe(training_data['sqft_living'], degree)
        poly_features = poly_degree.column_names()
        poly_degree['price'] = training_data['price']
        poly_model = graphlab.linear_regression.create(
            poly_degree, target='price', features=poly_features, validation_set=None, verbose=False)
        poly_validation_data = polynomial_sframe(validation_data['sqft_living'], degree)
        degree_rss[degree] = get_residual_sum_of_squares(poly_model, poly_validation_data, validation_data['price'])
        print (degree,degree_rss[degree])

    # pick the degree with the smallest validation RSS
    best_degree = min(degree_rss, key=degree_rss.get)
    return (best_degree, degree_rss[best_degree])

In [79]:
print minimize_rss( training, validation, 15 )


(1, 676709775198045.5)
(2, 607090530698013.1)
(3, 616714574532764.2)
(4, 609129230654381.8)
(5, 599177138583690.0)
(6, 589182477809982.5)
(7, 591717038418558.4)
(8, 601558237776881.1)
(9, 612563853985468.0)
(10, 621744288938723.9)
(11, 627012012706873.9)
(12, 627757914769243.2)
(13, 624738503264746.9)
(14, 619369705901546.6)
(15, 613089202403779.5)
(6, 589182477809982.5)

Quiz Question: Which degree (1, 2, …, 15) had the lowest RSS on Validation data?

6

Now that you have chosen the degree of your polynomial using validation data, compute the RSS of this model on TEST data. Report the RSS on your quiz.


In [80]:
print minimize_rss( training, testing, 15 )


(1, 129030470094700.17)
(2, 125190117212859.94)
(3, 125939457007407.2)
(4, 127147719016435.19)
(5, 127037730210194.83)
(6, 125529337848240.75)
(7, 124608773610067.72)
(8, 124458964534457.19)
(9, 124525299776489.55)
(10, 124602928485231.69)
(11, 124641733743997.84)
(12, 124646108317075.89)
(13, 124632209805122.16)
(14, 124612578389292.56)
(15, 124593903712840.27)
(8, 124458964534457.19)

Quiz Question: What is the RSS on TEST data for the model with the degree selected from Validation data? (Make sure you got the correct degree from the previous question.)

125529337848240.75 (the degree-6 row in the output above). Note that the loop's own minimum, degree 8, comes from re-selecting the degree on the test set, which is not what the question asks; the degree must come from the validation step.
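Equivalently, instead of looping over every degree a second time, we can fit just the degree-6 model and score it on the test set directly; a sketch reusing the helpers defined above:

poly6_train = polynomial_sframe(training['sqft_living'], 6)
features6 = poly6_train.column_names()
poly6_train['price'] = training['price']
model6 = graphlab.linear_regression.create(poly6_train, target='price',
                                           features=features6,
                                           validation_set=None, verbose=False)
poly6_test = polynomial_sframe(testing['sqft_living'], 6)
print get_residual_sum_of_squares(model6, poly6_test, testing['price'])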
