Regression Week 3: Assessing Fit (polynomial regression)

In this notebook you will compare different regression models in order to assess which model fits best. We will be using polynomial regression as a means to examine this topic. In particular you will:

  • Write a function that takes an SArray and a degree and returns an SFrame in which each column is the SArray raised to a power, up to the given degree. For example, with degree = 3, column 1 is the SArray, column 2 is the SArray squared, and column 3 is the SArray cubed
  • Use matplotlib to visualize polynomial regressions
  • Use matplotlib to visualize the same polynomial degree on different subsets of the data
  • Use a validation set to select a polynomial degree
  • Assess the final fit using test data

We will continue to use the House data from previous notebooks.

Fire up GraphLab Create


In [1]:
import graphlab

Next we're going to write a polynomial function that takes an SArray and a maximal degree and returns an SFrame with columns containing the SArray to all the powers up to the maximal degree.

The easiest way to apply a power to an SArray is to use .apply() with a lambda function. For example, to compute the third power of an example array we can do the following. (Note: running this cell the first time may take longer than expected, since it loads graphlab.)


In [2]:
tmp = graphlab.SArray([1., 2., 3.])
tmp_cubed = tmp.apply(lambda x: x**3)
print tmp
print tmp_cubed


[1.0, 2.0, 3.0]
[1.0, 8.0, 27.0]

We can create an empty SFrame using graphlab.SFrame() and then add columns to it with ex_sframe['column_name'] = value. For example, we create an empty SFrame and set the column 'power_1' to the first power of tmp (i.e. tmp itself).


In [3]:
ex_sframe = graphlab.SFrame()
ex_sframe['power_1'] = tmp
print ex_sframe


+---------+
| power_1 |
+---------+
|   1.0   |
|   2.0   |
|   3.0   |
+---------+
[3 rows x 1 columns]

Polynomial_sframe function

Using the hints above complete the following function to create an SFrame consisting of the powers of an SArray up to a specific degree:


In [4]:
def polynomial_sframe(feature, degree):
    # assume that degree >= 1
    # initialize the SFrame:
    poly_sframe = graphlab.SFrame()
    # and set poly_sframe['power_1'] equal to the passed feature
    poly_sframe['power_1'] = feature

    # first check if degree > 1
    if degree > 1:
        # then loop over the remaining degrees:
        # range usually starts at 0 and stops at the endpoint-1. We want it to start at 2 and stop at degree
        for power in range(2, degree+1): 
            # first we'll give the column a name:
            name = 'power_' + str(power)
            # then assign poly_sframe[name] to the appropriate power of feature
            poly_sframe[name] = feature.apply(lambda x : x**power) 
            
    return poly_sframe

To test your function, consider the smaller tmp variable and what you would expect as the outcome of the following call:


In [5]:
print polynomial_sframe(tmp, 3)


+---------+---------+---------+
| power_1 | power_2 | power_3 |
+---------+---------+---------+
|   1.0   |   1.0   |   1.0   |
|   2.0   |   4.0   |   8.0   |
|   3.0   |   9.0   |   27.0  |
+---------+---------+---------+
[3 rows x 3 columns]
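
For readers without GraphLab at hand, the same idea can be sketched with plain Python lists. This is only an illustration: polynomial_columns is a hypothetical helper, not part of GraphLab, and it returns a dict of lists rather than an SFrame.

```python
def polynomial_columns(feature, degree):
    """Return {'power_1': [...], ..., 'power_<degree>': [...]} for a list of numbers."""
    # assume degree >= 1, as in polynomial_sframe above
    columns = {}
    for power in range(1, degree + 1):
        # each column holds every value of the feature raised to this power
        columns['power_' + str(power)] = [x ** power for x in feature]
    return columns


example = polynomial_columns([1., 2., 3.], 3)
print(example['power_3'])
```

The column names follow the same 'power_k' convention as polynomial_sframe, so the result can be compared column by column with the table above.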

Visualizing polynomial regression

Let's use matplotlib to visualize what a polynomial regression looks like on some real data.


In [7]:
sales = graphlab.SFrame('kc_house_data.gl/')

In [8]:
sales.head()


Out[8]:
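+------------+---------------------------+-----------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price     | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+-----------+----------+-----------+-------------+----------+--------+------------+
| 7129300520 | 2014-10-13 00:00:00+00:00 | 221900.0  | 3.0      | 1.0       | 1180.0      | 5650     | 1      | 0          |
| 6414100192 | 2014-12-09 00:00:00+00:00 | 538000.0  | 3.0      | 2.25      | 2570.0      | 7242     | 2      | 0          |
| 5631500400 | 2015-02-25 00:00:00+00:00 | 180000.0  | 2.0      | 1.0       | 770.0       | 10000    | 1      | 0          |
| 2487200875 | 2014-12-09 00:00:00+00:00 | 604000.0  | 4.0      | 3.0       | 1960.0      | 5000     | 1      | 0          |
| 1954400510 | 2015-02-18 00:00:00+00:00 | 510000.0  | 3.0      | 2.0       | 1680.0      | 8080     | 1      | 0          |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000.0 | 4.0      | 4.5       | 5420.0      | 101930   | 1      | 0          |
| 1321400060 | 2014-06-27 00:00:00+00:00 | 257500.0  | 3.0      | 2.25      | 1715.0      | 6819     | 2      | 0          |
| 2008000270 | 2015-01-15 00:00:00+00:00 | 291850.0  | 3.0      | 1.5       | 1060.0      | 9711     | 1      | 0          |
| 2414600126 | 2015-04-15 00:00:00+00:00 | 229500.0  | 3.0      | 1.0       | 1780.0      | 7470     | 1      | 0          |
| 3793500160 | 2015-03-12 00:00:00+00:00 | 323000.0  | 3.0      | 2.5       | 1890.0      | 6560     | 2      | 0          |
+------------+---------------------------+-----------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 3         | 7     | 1180       | 0             | 1955     | 0            | 98178   | 47.51123398 |
| 0    | 3         | 7     | 2170       | 400           | 1951     | 1991         | 98125   | 47.72102274 |
| 0    | 3         | 6     | 770        | 0             | 1933     | 0            | 98028   | 47.73792661 |
| 0    | 5         | 7     | 1050       | 910           | 1965     | 0            | 98136   | 47.52082    |
| 0    | 3         | 8     | 1680       | 0             | 1987     | 0            | 98074   | 47.61681228 |
| 0    | 3         | 11    | 3890       | 1530          | 2001     | 0            | 98053   | 47.65611835 |
| 0    | 3         | 7     | 1715       | 0             | 1995     | 0            | 98003   | 47.30972002 |
| 0    | 3         | 7     | 1060       | 0             | 1963     | 0            | 98198   | 47.40949984 |
| 0    | 3         | 7     | 1050       | 730           | 1960     | 0            | 98146   | 47.51229381 |
| 0    | 3         | 7     | 1890       | 0             | 2003     | 0            | 98038   | 47.36840673 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.25677536 | 1340.0        | 5650.0     |
| -122.3188624  | 1690.0        | 7639.0     |
| -122.23319601 | 2720.0        | 8062.0     |
| -122.39318505 | 1360.0        | 5000.0     |
| -122.04490059 | 1800.0        | 7503.0     |
| -122.00528655 | 4760.0        | 101930.0   |
| -122.32704857 | 2238.0        | 6819.0     |
| -122.31457273 | 1650.0        | 9711.0     |
| -122.33659507 | 1780.0        | 8113.0     |
| -122.0308176  | 2390.0        | 7570.0     |
+---------------+---------------+------------+
[10 rows x 21 columns]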

As in previous notebooks, we will use the sqft_living variable. For plotting purposes (connecting the dots), you'll need to sort by the values of sqft_living. For houses with identical square footage, we break the tie by their prices.


In [9]:
sales = sales.sort(['sqft_living', 'price'])
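
The two-key sort can be illustrated without GraphLab. The sketch below uses Python's built-in sorted() on a few toy rows (the list of dicts is a stand-in for the SFrame, not how GraphLab stores data); sorting on a tuple key orders by the first element and falls back to the second on ties.

```python
# toy (sqft_living, price) rows; a list of dicts stands in for the SFrame
houses = [
    {'sqft_living': 390.0, 'price': 245000.0},
    {'sqft_living': 290.0, 'price': 142000.0},
    {'sqft_living': 390.0, 'price': 228000.0},
]

# sort by square footage first; equal square footage is ordered by price
houses_sorted = sorted(houses, key=lambda row: (row['sqft_living'], row['price']))

for row in houses_sorted:
    print(row)
```

Without the second key, the two 390 sqft houses could appear in either order, which would make the plotted line depend on the original row order.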

Let's start with a degree 1 polynomial using 'sqft_living' (i.e. a line) to predict 'price' and plot what it looks like.


In [10]:
poly1_data = polynomial_sframe(sales['sqft_living'], 1)
poly1_data['price'] = sales['price'] # add price to the data since it's the target

In [11]:
poly1_data


Out[11]:
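+---------+----------+
| power_1 | price    |
+---------+----------+
| 290.0   | 142000.0 |
| 370.0   | 276000.0 |
| 380.0   | 245000.0 |
| 384.0   | 265000.0 |
| 390.0   | 228000.0 |
| 390.0   | 245000.0 |
| 410.0   | 325000.0 |
| 420.0   | 229050.0 |
| 420.0   | 280000.0 |
| 430.0   | 80000.0  |
+---------+----------+
[21613 rows x 2 columns]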
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

NOTE: for all the models in this notebook use validation_set = None to ensure that all results are consistent across users.


In [12]:
model1 = graphlab.linear_regression.create(poly1_data, target = 'price', features = ['power_1'], validation_set = None)


Linear regression:
--------------------------------------------------------
Number of examples          : 21613
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 1.028805     | 4362074.696077     | 261440.790724 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.
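
A degree-1 fit with a single feature is ordinary least squares on one variable, which has a well-known closed form. The sketch below shows that formula on made-up points; fit_line is a hypothetical helper, and this is not how GraphLab fits the model (the log above reports Newton's method, and GraphLab's default settings may differ from plain least squares).

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single feature."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov_xy / var_x
    # the fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope


# made-up points that lie exactly on y = 10 + 2x
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [12.0, 14.0, 16.0, 18.0]
intercept, slope = fit_line(toy_x, toy_y)
print(intercept)
print(slope)
```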


In [13]:
#let's take a look at the weights before we plot
model1.get("coefficients")


Out[13]:
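+-------------+-------+----------------+---------------+
| name        | index | value          | stderr        |
+-------------+-------+----------------+---------------+
| (intercept) | None  | -43579.0852515 | 4402.68969743 |
| power_1     | None  | 280.622770886  | 1.93639855513 |
+-------------+-------+----------------+---------------+
[2 rows x 4 columns]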
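
To read these coefficients, it can help to plug them into the line they describe. The sketch below copies the two values from the output above; predict_price is a hypothetical helper that mirrors what model1.predict() computes for this one-feature model.

```python
# coefficients copied from the model1 output above
intercept = -43579.0852515
slope = 280.622770886


def predict_price(sqft_living):
    """Price predicted by the degree-1 model for a given living area."""
    return intercept + slope * sqft_living


print(predict_price(2000.0))
# each extra 100 sqft changes the prediction by 100 * slope
print(predict_price(2100.0) - predict_price(2000.0))
```

The negative intercept is the predicted price at 0 sqft, which lies outside the range of the data and has no practical meaning on its own.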

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline

In [15]:
plt.plot(poly1_data['power_1'],poly1_data['price'],'.',
        poly1_data['power_1'], model1.predict(poly1_data),'-')


Out[15]:
[<matplotlib.lines.Line2D at 0x12420e910>,
 <matplotlib.lines.Line2D at 0x1201895d0>]

Let's unpack that plt.plot() command. The first pair of SArrays we passed are the 1st power of sqft and the actual price; we ask for these to be plotted as dots ('.'). The next pair is the 1st power of sqft and the predicted values from the linear model; we ask for these to be plotted as a line ('-').

We can see, not surprisingly, that the predicted values all fall on a line, specifically the one with a slope of about 280.6 and an intercept of about -43579. What if we wanted to plot a second degree polynomial?


In [16]:
poly2_data = polynomial_sframe(sales['sqft_living'], 2)
my_features = poly2_data.column_names() # get the name of the features
poly2_data['price'] = sales['price'] # add price to the data since it's the target
model2 = graphlab.linear_regression.create(poly2_data, target = 'price', features = my_features, validation_set = None)


Linear regression:
--------------------------------------------------------
Number of examples          : 21613
Number of features          : 2
Number of unpacked features : 2
Number of coefficients    : 3
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.044153     | 5913020.984255     | 250948.368758 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.


In [17]:
model2.get("coefficients")


Out[17]:
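+-------------+-------+-----------------+-------------------+
| name        | index | value           | stderr            |
+-------------+-------+-----------------+-------------------+
| (intercept) | None  | 199222.496445   | 7058.00483552     |
| power_1     | None  | 67.9940640677   | 5.28787201316     |
| power_2     | None  | 0.0385812312789 | 0.000898246547032 |
+-------------+-------+-----------------+-------------------+
[3 rows x 4 columns]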
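
The fitted quadratic is a parabola, and its coefficients tell us where it turns around. As a small worked check (the variable names b, c and vertex_sqft are introduced here for illustration), the turning point of a + b*x + c*x**2 is at x = -b / (2c):

```python
# coefficients copied from the model2 output above
b = 67.9940640677       # power_1
c = 0.0385812312789     # power_2

# a parabola a + b*x + c*x**2 turns around at x = -b / (2c)
vertex_sqft = -b / (2 * c)
print(vertex_sqft)

# c > 0 and the vertex is at negative square footage, so over real houses
# (sqft_living > 0) the fitted curve is increasing throughout
print(c > 0 and vertex_sqft < 0)
```

Because the turning point falls at a negative square footage, only the rising side of the parabola overlaps the data.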

In [18]:
plt.plot(poly2_data['power_1'],poly2_data['price'],'.',
        poly2_data['power_1'], model2.predict(poly2_data),'-')


Out[18]:
[<matplotlib.lines.Line2D at 0x11de2f690>,
 <matplotlib.lines.Line2D at 0x11de2f750>]

The resulting model looks like half a parabola. Try on your own to see what the cubic looks like:


In [19]:
poly3_data = polynomial_sframe(sales['sqft_living'], 3)
my_features = poly3_data.column_names() # get the name of the features
poly3_data['price'] = sales['price'] # add price to the data since it's the target
model3 = graphlab.linear_regression.create(poly3_data, target = 'price', features = my_features, validation_set = None)


Linear regression:
--------------------------------------------------------
Number of examples          : 21613
Number of features          : 3
Number of unpacked features : 3
Number of coefficients    : 4
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.055091     | 3261066.736007     | 249261.286346 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.


In [20]:
plt.plot(poly3_data['power_1'],poly3_data['price'],'.',
        poly3_data['power_1'], model3.predict(poly3_data),'-')


Out[20]:
[<matplotlib.lines.Line2D at 0x102d93250>,
 <matplotlib.lines.Line2D at 0x11de529d0>]

Now try a 15th degree polynomial:


In [21]:
poly15_data = polynomial_sframe(sales['sqft_living'], 15)
my_features = poly15_data.column_names() # get the name of the features
poly15_data['price'] = sales['price'] # add price to the data since it's the target
model15 = graphlab.linear_regression.create(poly15_data, target = 'price', features = my_features, validation_set = None)


Linear regression:
--------------------------------------------------------
Number of examples          : 21613
Number of features          : 15
Number of unpacked features : 15
Number of coefficients    : 16
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.025119     | 2662308.584344     | 245690.511190 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.


In [22]:
plt.plot(poly15_data['power_1'],poly15_data['price'],'.',
        poly15_data['power_1'], model15.predict(poly15_data),'-')


Out[22]:
[<matplotlib.lines.Line2D at 0x11de99210>,
 <matplotlib.lines.Line2D at 0x1205ccf90>]

What do you think of the 15th degree polynomial? Do you think it is appropriate? If we were to change the data, do you think you would get pretty much the same curve? Let's take a look.

Changing the data and re-learning

We're going to split the sales data into four subsets of roughly equal size. Then you will estimate a 15th degree polynomial model on all four subsets of the data. Print the coefficients (you should use .print_rows(num_rows = 16) to view all of them) and plot the resulting fit (as we did above). The quiz will ask you some questions about these results.

To split the sales data into four subsets, we perform the following steps:

  • First split sales into 2 subsets with .random_split(0.5, seed=0).
  • Next split the resulting subsets into 2 more subsets each. Use .random_split(0.5, seed=0).

We set seed=0 in these steps so that different users get consistent results. You should end up with 4 subsets (set_1, set_2, set_3, set_4) of approximately equal size.


In [27]:
tmp_1, tmp_2 = sales.random_split(0.5, seed=0)
set_1, set_2 = tmp_1.random_split(0.5, seed=0)
set_3, set_4 = tmp_2.random_split(0.5, seed=0)
print "size of set_1 = " + str(len(set_1)) + " ; set_2 = " + str(len(set_2)) + " ; set_3 = " + str(len(set_3)) + " ; set_4 = " + str(len(set_4))
print "size of sales/4 = " + str(len(sales) / 4)
if (len(set_1) + len(set_2) + len(set_3) + len(set_4)) == len(sales):
    print "assertion passed"
else:
    print "check the code"


size of set_1 = 5404 ; set_2 = 5398 ; set_3 = 5409 ; set_4 = 5402
size of sales/4 = 5403
assertion passed
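
The subset sizes above are close to, but not exactly, a quarter of the data. One plausible way to get that behaviour is to send each row to one side or the other independently at random. The sketch below does this with the standard library; bernoulli_split is a hypothetical helper and is not claimed to be GraphLab's actual algorithm, so its sizes will not match the numbers printed above.

```python
import random


def bernoulli_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed gives the same split on every run
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(21613))  # stand-in for the 21613 houses
half_1, half_2 = bernoulli_split(rows, 0.5, seed=0)
part_1, part_2 = bernoulli_split(half_1, 0.5, seed=0)
part_3, part_4 = bernoulli_split(half_2, 0.5, seed=0)
sizes = [len(part_1), len(part_2), len(part_3), len(part_4)]
print(sizes)
print(sum(sizes) == len(rows))
```

Every row lands in exactly one subset, so the sizes always add up to the total even though no single subset is exactly one quarter.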

Fit a 15th degree polynomial on set_1, set_2, set_3, and set_4 using sqft_living to predict prices. Print the coefficients and make a plot of the resulting model.


In [28]:
set_1_15_data = polynomial_sframe(set_1['sqft_living'], 15)
my_features = set_1_15_data.column_names() # get the name of the features
set_1_15_data['price'] = set_1['price'] # add price to the data since it's the target
model_set_1_15 = graphlab.linear_regression.create(
    set_1_15_data, 
    target = 'price', 
    features = my_features, 
    validation_set = None
)
plt.plot(set_1_15_data['power_1'],set_1_15_data['price'],'.',
        set_1_15_data['power_1'], model_set_1_15.predict(set_1_15_data),'-')
print "set_1"
model_set_1_15.get("coefficients").print_rows(16)


Linear regression:
--------------------------------------------------------
Number of examples          : 5404
Number of features          : 15
Number of unpacked features : 15
Number of coefficients    : 16
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.032798     | 2195218.932304     | 248858.822200 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

set_1
+-------------+-------+--------------------+-------------------+
|     name    | index |       value        |       stderr      |
+-------------+-------+--------------------+-------------------+
| (intercept) |  None |    223312.75025    |   720102.105813   |
|   power_1   |  None |   118.086127586    |   2991.25004237   |
|   power_2   |  None |  -0.0473482011336  |   5.14643044195   |
|   power_3   |  None | 3.25310342468e-05  |  0.00486477316272 |
|   power_4   |  None | -3.3237215256e-09  | 2.83839733901e-06 |
|   power_5   |  None | -9.75830457822e-14 | 1.09654745207e-09 |
|   power_6   |  None | 1.15440303425e-17  | 2.97349100848e-13 |
|   power_7   |  None | 1.05145869431e-21  | 5.95020515282e-17 |
|   power_8   |  None | 3.46049616301e-26  | 8.94450727081e-21 |
|   power_9   |  None | -1.09654454076e-30 |  1.0057647634e-24 |
|   power_10  |  None | -2.42031812181e-34 | 8.73926154652e-29 |
|   power_11  |  None | -1.99601206791e-38 | 4.36576035029e-33 |
|   power_12  |  None | -1.0770990379e-42  |        nan        |
|   power_13  |  None | -2.7286281761e-47  |        nan        |
|   power_14  |  None | 2.44782693234e-51  |  7.7054896902e-46 |
|   power_15  |  None |  5.019752326e-55   | 2.76020885311e-50 |
+-------------+-------+--------------------+-------------------+
[16 rows x 4 columns]


In [29]:
set_2_15_data = polynomial_sframe(set_2['sqft_living'], 15)
my_features = set_2_15_data.column_names() # get the name of the features
set_2_15_data['price'] = set_2['price'] # add price to the data since it's the target
model_set_2_15 = graphlab.linear_regression.create(
    set_2_15_data, 
    target = 'price', 
    features = my_features, 
    validation_set = None
)
plt.plot(set_2_15_data['power_1'],set_2_15_data['price'],'.',
        set_2_15_data['power_1'], model_set_2_15.predict(set_2_15_data),'-')
print "set_2"
model_set_2_15.get("coefficients").print_rows(16)


Linear regression:
--------------------------------------------------------
Number of examples          : 5398
Number of features          : 15
Number of unpacked features : 15
Number of coefficients    : 16
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.052297     | 2069212.978547     | 234840.067186 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

set_2
+-------------+-------+--------------------+-------------------+
|     name    | index |       value        |       stderr      |
+-------------+-------+--------------------+-------------------+
| (intercept) |  None |   89836.5077348    |   1425584.05986   |
|   power_1   |  None |    319.80694676    |   7890.23598628   |
|   power_2   |  None |  -0.103315397038   |   18.4602850363   |
|   power_3   |  None | 1.06682476058e-05  |  0.0241806306245  |
|   power_4   |  None | 5.75577097729e-09  | 1.98048020398e-05 |
|   power_5   |  None | -2.5466346474e-13  | 1.06980818362e-08 |
|   power_6   |  None | -1.09641345066e-16 | 3.89653322973e-12 |
|   power_7   |  None | -6.36458441707e-21 | 9.52793175493e-16 |
|   power_8   |  None | 5.52560416968e-25  | 1.49691897587e-19 |
|   power_9   |  None |  1.3508203898e-28  | 1.24414181467e-23 |
|   power_10  |  None | 1.18408188241e-32  |        nan        |
|   power_11  |  None | 1.98348000462e-37  | 1.49946194435e-31 |
|   power_12  |  None | -9.9253359052e-41  | 3.69876070678e-35 |
|   power_13  |  None | -1.60834847033e-44 | 4.03887349367e-39 |
|   power_14  |  None | -9.12006024135e-49 | 2.27937785326e-43 |
|   power_15  |  None | 1.68636658315e-52  | 5.29130378172e-48 |
+-------------+-------+--------------------+-------------------+
[16 rows x 4 columns]


In [30]:
set_3_15_data = polynomial_sframe(set_3['sqft_living'], 15)
my_features = set_3_15_data.column_names() # get the name of the features
set_3_15_data['price'] = set_3['price'] # add price to the data since it's the target
model_set_3_15 = graphlab.linear_regression.create(
    set_3_15_data, 
    target = 'price', 
    features = my_features, 
    validation_set = None
)
plt.plot(set_3_15_data['power_1'],set_3_15_data['price'],'.',
        set_3_15_data['power_1'], model_set_3_15.predict(set_3_15_data),'-')
print "set_3"
model_set_3_15.get("coefficients").print_rows(16)


Linear regression:
--------------------------------------------------------
Number of examples          : 5409
Number of features          : 15
Number of unpacked features : 15
Number of coefficients    : 16
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.033905     | 2269769.506523     | 251460.072754 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

set_3
+-------------+-------+--------------------+-------------------+
|     name    | index |       value        |       stderr      |
+-------------+-------+--------------------+-------------------+
| (intercept) |  None |    87317.97956     |   1265937.15642   |
|   power_1   |  None |   356.304911031    |   6169.20706961   |
|   power_2   |  None |  -0.164817442795   |   12.6758534148   |
|   power_3   |  None | 4.40424992635e-05  |   0.014531934999  |
|   power_4   |  None | 6.48234877396e-10  | 1.03509477362e-05 |
|   power_5   |  None | -6.75253226641e-13 | 4.80452542019e-09 |
|   power_6   |  None | -3.36842592784e-17 |  1.4688465776e-12 |
|   power_7   |  None | 3.60999704377e-21  | 2.86648472244e-16 |
|   power_8   |  None | 6.46999725636e-25  | 3.17165326664e-20 |
|   power_9   |  None | 4.23639388651e-29  |  2.0816121888e-24 |
|   power_10  |  None | -3.62149423631e-34 | 4.62448511879e-28 |
|   power_11  |  None | -4.27119527371e-37 | 5.18893736143e-32 |
|   power_12  |  None | -5.61445971691e-41 | 3.72608286766e-36 |
|   power_13  |  None | -3.87452772941e-45 | 3.71645954028e-40 |
|   power_14  |  None | 4.69430357729e-50  | 2.15979936194e-44 |
|   power_15  |  None | 6.39045886165e-53  | 4.75282916159e-49 |
+-------------+-------+--------------------+-------------------+
[16 rows x 4 columns]


In [31]:
set_4_15_data = polynomial_sframe(set_4['sqft_living'], 15)
my_features = set_4_15_data.column_names() # get the name of the features
set_4_15_data['price'] = set_4['price'] # add price to the data since it's the target
model_set_4_15 = graphlab.linear_regression.create(
    set_4_15_data, 
    target = 'price', 
    features = my_features, 
    validation_set = None
)
plt.plot(set_4_15_data['power_1'],set_4_15_data['price'],'.',
        set_4_15_data['power_1'], model_set_4_15.predict(set_4_15_data),'-')
print "set_4"
model_set_4_15.get("coefficients").print_rows(16)


Linear regression:
--------------------------------------------------------
Number of examples          : 5402
Number of features          : 15
Number of unpacked features : 15
Number of coefficients    : 16
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.038725     | 2314893.173824     | 244563.136754 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

set_4
+-------------+-------+--------------------+-------------------+
|     name    | index |       value        |       stderr      |
+-------------+-------+--------------------+-------------------+
| (intercept) |  None |   259020.879447    |   5299947.80822   |
|   power_1   |  None |   -31.7277161932   |   37868.2949634   |
|   power_2   |  None |   0.109702769609   |   109.794778973   |
|   power_3   |  None | -1.58383847314e-05 |   0.173877488964  |
|   power_4   |  None | -4.47660623787e-09 |  0.00016956349828 |
|   power_5   |  None | 1.13976573478e-12  | 1.08145862046e-07 |
|   power_6   |  None | 1.97669120543e-16  | 4.64434081715e-11 |
|   power_7   |  None | -6.15783678607e-21 |  1.350569008e-14  |
|   power_8   |  None | -4.88012304096e-24 | 2.60220331641e-18 |
|   power_9   |  None | -6.6218678116e-28  | 3.08677179769e-22 |
|   power_10  |  None | -2.70631583575e-32 | 1.78880431884e-26 |
|   power_11  |  None | 6.72370411717e-36  | 1.03141539371e-30 |
|   power_12  |  None | 1.74115646286e-39  | 1.50370925495e-34 |
|   power_13  |  None |  2.0918837573e-43  | 1.23862262482e-38 |
|   power_14  |  None | 4.78015565447e-48  | 1.00446289765e-42 |
|   power_15  |  None | -4.74535333059e-51 | 3.24762830799e-47 |
+-------------+-------+--------------------+-------------------+
[16 rows x 4 columns]

Some questions you will be asked on your quiz:

Quiz Question: Is the sign (positive or negative) for power_15 the same in all four models?

Quiz Question: (True/False) the plotted fitted lines look the same in all four plots
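
Comparing signs across sixteen coefficients in four tables by eye is error-prone, so a small check can help. The sketch below copies the power_15 values from the four coefficient tables above into a plain dict (the dict and variable names are introduced here for illustration) and tests whether they all share a sign.

```python
# power_15 coefficients copied from the four tables above
power_15 = {
    'set_1': 5.019752326e-55,
    'set_2': 1.68636658315e-52,
    'set_3': 6.39045886165e-53,
    'set_4': -4.74535333059e-51,
}

# True where the coefficient is positive, False where it is negative
signs = dict((name, value > 0) for name, value in power_15.items())
all_same_sign = len(set(signs.values())) == 1
print(signs)
print(all_same_sign)
```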

Selecting a Polynomial Degree

Whenever we have a "magic" parameter like the degree of the polynomial, there is one well-known way to select it: a validation set. (We will explore another approach in week 4.)

We split the sales dataset 3-way into training set, test set, and validation set as follows:

  • Split our sales data into 2 sets: training_and_validation and testing. Use random_split(0.9, seed=1).
  • Further split training_and_validation into two sets: training and validation. Use random_split(0.5, seed=1).

Again, we set seed=1 to obtain consistent results for different users.


In [32]:
training_and_validation, testing = sales.random_split(0.9, seed=1)
training, validation = training_and_validation.random_split(0.5, seed=1)
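
It is worth working out what share of the data each set receives from these two splits. The arithmetic below is a sketch of the expected proportions only; the actual sizes depend on the random split and will differ slightly.

```python
total_rows = 21613          # size of sales, from the training logs above
test_fraction = 1 - 0.9     # second output of random_split(0.9, ...)
train_fraction = 0.9 * 0.5  # first half of the remaining 90%
validation_fraction = 0.9 * 0.5

for name, fraction in [('training', train_fraction),
                       ('validation', validation_fraction),
                       ('testing', test_fraction)]:
    # expected sizes only; the actual random split will differ slightly
    print(name + ': about ' + str(int(round(total_rows * fraction))) + ' rows')
```

So training and validation each get roughly 45% of the houses and testing gets roughly 10%, which matters when comparing raw RSS values between sets of different sizes.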

Next you should write a loop that does the following:

  • For degree in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] (to get this in Python, type range(1, 15+1)):
    • Build an SFrame of polynomial data of training['sqft_living'] at the current degree
    • Hint: my_features = poly_data.column_names() gives you a list, e.g. ['power_1', 'power_2', 'power_3'], which you might find useful for graphlab.linear_regression.create(features = my_features)
    • Add training['price'] to the polynomial SFrame
    • Learn a polynomial regression model of price on sqft_living with that degree on TRAIN data
    • Compute the RSS on VALIDATION data for that degree (here you will want to use .predict()); you will need to make a polynomial SFrame using the validation data
  • Report which degree had the lowest RSS on validation data (remember that Python indexes from 0)

(Note you can turn off the print out of linear_regression.create() with verbose = False)
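
The RSS (residual sum of squares) used in the steps above is the sum, over all rows, of the squared difference between the prediction and the actual price. A minimal sketch on plain lists, with a hypothetical helper name and made-up numbers:

```python
def residual_sum_of_squares(predictions, targets):
    """Sum over all rows of (prediction - target) squared."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


# made-up numbers: errors are 0, -2 and 1, so RSS = 0 + 4 + 1
toy_rss = residual_sum_of_squares([10.0, 20.0, 31.0], [10.0, 22.0, 30.0])
print(toy_rss)
```

With SArrays the same quantity can be computed with elementwise subtraction and multiplication followed by .sum().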


In [35]:
trained_models_history = []
validation_rss_history = []
for i in xrange(1, 16):
    
    #obtain the model data
    this_model_data = polynomial_sframe(training['sqft_living'], i)
    my_features = this_model_data.column_names() # get the name of the features
    this_model_data['price'] = training['price'] # add price to the data since it's the target
    
    # learn the model for this degree on train data
    this_model = graphlab.linear_regression.create(
        this_model_data, 
        target = 'price', 
        features = my_features, 
        validation_set = None,
        verbose=False
    )
    trained_models_history.append(this_model)
    
    # find rss for the validation data
    this_model_validation_data = polynomial_sframe(validation['sqft_living'], i)
    this_model_prediction = this_model.predict(this_model_validation_data)
    this_model_error = this_model_prediction - validation['price']
    this_model_error_squared = this_model_error * this_model_error
    this_model_rss = this_model_error_squared.sum()
    print "Model " + str(i) + " validation rss = " + str(this_model_rss)
    validation_rss_history.append(this_model_rss)
    
validation_rss_history


Model 1 validation rss = 6.76709775198e+14
Model 2 validation rss = 6.07090530698e+14
Model 3 validation rss = 6.16714574533e+14
Model 4 validation rss = 6.09129230654e+14
Model 5 validation rss = 5.99177138584e+14
Model 6 validation rss = 5.8918247781e+14
Model 7 validation rss = 5.91717038418e+14
Model 8 validation rss = 6.01558237779e+14
Model 9 validation rss = 6.12563853988e+14
Model 10 validation rss = 6.21744288938e+14
Model 11 validation rss = 6.27012012708e+14
Model 12 validation rss = 6.27757914769e+14
Model 13 validation rss = 6.24738503271e+14
Model 14 validation rss = 6.19369705907e+14
Model 15 validation rss = 6.13089202416e+14
Out[35]:
[676709775198047.8,
 607090530698013.5,
 616714574532759.8,
 609129230654382.8,
 599177138583690.1,
 589182477809819.4,
 591717038418121.8,
 601558237778861.0,
 612563853988421.8,
 621744288938063.0,
 627012012707766.9,
 627757914768532.8,
 624738503271447.0,
 619369705907230.6,
 613089202416115.2]

Quiz Question: Which degree (1, 2, …, 15) had the lowest RSS on Validation data?


In [37]:
best_model = 6
print best_model


6
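
Rather than reading the minimum off the list by eye, the best degree can be recovered programmatically. The sketch below copies the values from Out[35] into a plain list (validation_rss and best_degree are names introduced here) and converts the list position of the minimum back into a degree.

```python
# validation RSS per degree, copied from Out[35] above (index 0 is degree 1)
validation_rss = [
    676709775198047.8, 607090530698013.5, 616714574532759.8,
    609129230654382.8, 599177138583690.1, 589182477809819.4,
    591717038418121.8, 601558237778861.0, 612563853988421.8,
    621744288938063.0, 627012012707766.9, 627757914768532.8,
    624738503271447.0, 619369705907230.6, 613089202416115.2,
]

# list positions start at 0 but degrees start at 1, hence the + 1
best_degree = validation_rss.index(min(validation_rss)) + 1
print(best_degree)
```

The same off-by-one adjustment appears later in the notebook, where the model for the chosen degree is looked up at position best_model - 1.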

Now that you have chosen the degree of your polynomial using validation data, compute the RSS of this model on TEST data. Report the RSS on your quiz.


In [38]:
# find rss for the testing data
best_model_testing_data = polynomial_sframe(testing['sqft_living'], best_model)
best_model_test_prediction = trained_models_history[best_model - 1].predict(best_model_testing_data)
best_model_test_error = best_model_test_prediction - testing['price']
best_model_test_error_squared = best_model_test_error * best_model_test_error
best_model_test_rss = best_model_test_error_squared.sum()

Quiz Question: what is the RSS on TEST data for the model with the degree selected from Validation data?


In [39]:
print "Testing rss on best model = " + str(best_model_test_rss)


Testing rss on best model = 1.25529337848e+14
