In [ ]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model as lm
from IPython.display import display, Latex, Markdown
In [ ]:
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('hw6.ok')
This assignment is purposefully left nearly open-ended. The Ames data in your possession comes from a larger data set. Your goal is to provide a linear model that accurately predicts the prices of the held-out homes, measured by root mean square error. That is, the score you will see on the Kaggle leaderboard is calculated as follows:
$$\text{score} = \sqrt{\dfrac{\sum_{\text{houses in public test set}}(\text{actual price for house} - \text{predicted price for house})^2}{\text{\# of houses}}}$$
Perfect prediction of house prices would have a score of 0, so you want your score to be as low as possible!
Kaggle Submission Site: https://inclass.kaggle.com/c/ds100-2017-hw6
Max number of submissions per day: 2
Max number of final submissions: 1
The Ames data set consists of 2930 records taken from the Ames Assessor's Office. The data set has 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers), for a total of 82 features. An explanation of each variable can be found in the included README.txt file. The information was used in computing assessed values for individual residential properties sold in Ames, Iowa from 2006 to 2010. Since the data is publicly available, we have injected noise into all the sale prices to remove the temptation to do "oracle learning."
The data are split into training and test sets with 2000 and 930 observations, respectively. The actual sale price is withheld from you in the test set. In addition, the test data are further split into public and private test sets. When you upload a test set prediction onto Kaggle for validation, the score you receive will be calculated using the public test set. The private test set will be used in the final evaluation of this homework assignment.
In [ ]:
raw_data = pd.read_csv("ames_train.csv")
In [ ]:
small_data = (
raw_data[["SalePrice", "Gr_Liv_Area", "Lot_Area", "Bedroom_AbvGr"]]
.rename(columns = {
"SalePrice": "price",
"Gr_Liv_Area": "sqft",
"Lot_Area": "lotsize",
"Bedroom_AbvGr": "bedrooms"
})
)
small_data.iloc[1:5]
Grading will be based on a number of set criteria, enumerated below:
Task | Description
---|---
EDA | You create exploratory plots for at least 3 (basic) features to motivate your work. The minimal 3 should cover each of the 3 variable types: categorical, discrete, and continuous.
Transformations | Your final model includes transformations of the data.
Diagnostics | You include diagnostic checks, with commentary, for your model.
RMSE | Your model beats the RMSE threshold of $30,000. This should be attainable with a well-thought-out model.
Model | Your modeling pipeline is encapsulated in a pipeline object called `final_pipeline`.
Written Questions | Your submission includes answers to the written questions at the bottom of this notebook.
This assignment requires a Kaggle submission in addition to the usual okpy one. To submit to Kaggle, you should create a csv file with 930 rows (one for each house in the test data) and 2 columns:

- `PID`: the house identification number
- `SalePrice`: your estimate for the sale price of the house

An example Kaggle submission file has been included with this assignment.
While we want you to be creative with your models, we want to make it fair to students who are seeing these techniques for the first time. As such, you are only allowed to train linear models and their regularized forms (e.g. ridge and lasso). This means no random forest, CART, neural nets, etc. However, you are free to feature engineer to your heart's content. Remember that domain knowledge is the third component of data science...
That being said, you may want to explore the sklearn API for more information on Lasso, Ridge, and ElasticNet.
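For reference, all of these estimators live in sklearn.linear_model (imported above as lm). Here is a minimal sketch of how they are constructed; the alpha and l1_ratio values below are placeholders for you to tune, not recommendations:
In [ ]:
# Regularized linear models (alpha / l1_ratio here are placeholder values)
ridge = lm.Ridge(alpha=1.0)                    # L2 penalty
lasso = lm.Lasso(alpha=0.1)                    # L1 penalty
enet = lm.ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2 penalties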
Make plots to explore the data. You may create as many plots as you wish and you will choose three of them to be graded.
The 3 plots you submit should cover each of the three variable types (categorical, discrete, continuous).
Insert this comment at the top of the three cells you want to submit for EDA:
# EDA_SUBMIT
We will use this tag to grade your 3 submitted plots.
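For instance, a categorical EDA plot might look something like the sketch below. The column name "Neighborhood" is an assumption on our part; check README.txt for the exact column names in your copy of the data.
In [ ]:
# Sketch of one possible categorical EDA plot. The column name
# "Neighborhood" is an assumption -- verify it against README.txt.
fig, ax = plt.subplots(figsize=(12, 4))
sns.boxplot(x="Neighborhood", y="SalePrice", data=raw_data, ax=ax)
ax.set_title("Sale price by neighborhood")
plt.setp(ax.get_xticklabels(), rotation=90)
plt.show()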
In [ ]:
# EDA_SUBMIT
In [ ]:
# EDA_SUBMIT
In [ ]:
# EDA_SUBMIT
You have already encountered one of sklearn's `Transformer` classes: the `DictVectorizer` from lab 10. A transformer is an object that cleans, reduces, expands, or generates features.

A transformer's `fit` method learns parameters. For the `DictVectorizer`, the parameters are the allowed values for a categorical variable. The `transform` method takes the learned parameters and transforms any new input data. For `DictVectorizer`, this means taking a vector of categorical values and transforming it into a matrix where each row has at most one non-zero value (a row may be all zeros if it contains a previously unseen category). `fit_transform` simply learns from and transforms the input data all in one go.
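As a quick refresher, here is the `DictVectorizer` round trip in miniature (a toy sketch, not part of the assignment):
In [ ]:
# Toy DictVectorizer example: fit learns the categories, transform
# one-hot encodes them (columns sort as style=colonial, style=ranch).
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
dv.fit_transform([{'style': 'ranch'}, {'style': 'colonial'}])
# -> array([[0., 1.],
#           [1., 0.]])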
Since we might want to perform different transformations on different columns, we've provided you with a `ColumnSelector` class. You may want to use our code as a template for your own custom transformers.
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin
class ColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer that extracts column(s) of a data frame

    Example Usage
    >> data = pd.DataFrame({'a': [1, 2, 3, 4],
                            'b': [5, 6, 7, 8],
                            'c': [9, 10, 11, 12]})
    >> cs = ColumnSelector(cols=['a', 'b'])
    >> np.all(cs.transform(data) == data[['a', 'b']].values)
    True

    Parameters
    ----------
    cols : list of strings, required
        The name(s) corresponding to the desired columns of a DataFrame.
    """
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        """
        Returns itself; there is nothing to be fit
        """
        return self

    def transform(self, X, y=None):
        """
        Returns the desired columns as a matrix
        """
        # X[self.cols].values replaces the deprecated X.as_matrix(self.cols)
        return X[self.cols].values
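As a quick sanity check, the selector pulls the requested columns out of small_data as a plain matrix:
In [ ]:
# Sanity check: extract two columns of small_data as an n x 2 matrix
cs = ColumnSelector(cols=['sqft', 'lotsize'])
cs.fit_transform(small_data)[:3]  # first three rows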
We have also seen another transformation in class: the polynomial transformation. In practice, you would use sklearn's nice `PolynomialFeatures`. To give you experience implementing your own transformer class, write a bivariate (exactly 2 input features) `BiPolyTrans` transformer class that, given two features $W$ and $Z$ of a matrix $X$, calculates all powers and cross-products up to a given `degree`. That is, for every record (row) $x_i = \begin{bmatrix} w_i & z_i \end{bmatrix}$,

$$\phi_{degree}(x_i) = \begin{bmatrix} 1 & w_i & z_i & w_i^2 & w_iz_i & z_i^2 & \dots & w_i^{degree} & w_i^{degree-1}z_i & \dots & w_iz_i^{degree-1} & z_i^{degree} \end{bmatrix}$$

If you are worried about efficiency, you may want to make use of Python's `itertools`. Namely, `chain` and `combinations_with_replacement` should be helpful.
In [ ]:
from itertools import chain, combinations_with_replacement
class BiPolyTrans(BaseEstimator, TransformerMixin):
    """
    Transforms the data from an n x 2 matrix to a matrix with
    polynomial features up to the specified degree.

    Example Usage
    data = np.array([[1, 2], [3, 4]])
    d2polytrans = BiPolyTrans(2)
    d2polytrans.fit_transform(data) == np.array([[1, 1, 2, 1, 2, 4],
                                                 [1, 3, 4, 9, 12, 16]])

    Parameters
    ----------
    degree : integer, required
        largest polynomial degree to calculate with the two features
    """
    def __init__(self, degree):
        self.degree = ...

    def fit(self, X, y=None):
        """
        Calculates the number of input and output features
        """
        self.n_input_features = ...
        self.n_output_features = ...
        return self

    def transform(self, X, y=None):
        """
        Transforms the data into polynomial features

        Input
        -----
        X : an n x 2 matrix, required.

        Output
        ------
        A higher-dimensional matrix with polynomial features
        up to the specified degree
        """
        n_records = ...
        output = np.empty((..., ...), dtype=X.dtype)
        ...
        return output
In [ ]:
_ = ok.grade('qtransform')
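If you get stuck, here is one possible shape for a solution (a sketch only, written under a different class name so it doesn't clobber your answer above; your fill-in of the blanks may differ):
In [ ]:
# One possible way to fill in the blanks above (a sketch, not the only
# correct answer). Exponent tuples are generated degree by degree so the
# column order matches the fit_transform example in the docstring.
class BiPolyTransSketch(BaseEstimator, TransformerMixin):
    def __init__(self, degree):
        self.degree = degree

    def fit(self, X, y=None):
        self.n_input_features = X.shape[1]
        # number of monomials in 2 variables up to the given degree
        self.n_output_features = (self.degree + 1) * (self.degree + 2) // 2
        return self

    def transform(self, X, y=None):
        n_records = X.shape[0]
        # e.g. degree 2 -> (), (0,), (1,), (0, 0), (0, 1), (1, 1)
        combos = list(chain.from_iterable(
            combinations_with_replacement(range(2), d)
            for d in range(self.degree + 1)))
        output = np.empty((n_records, self.n_output_features), dtype=X.dtype)
        for j, combo in enumerate(combos):
            # product of the selected columns; the empty product is 1
            output[:, j] = X[:, list(combo)].prod(axis=1)
        return output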
In [ ]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
At this point, we will formalize our data cleaning, extraction, transformation, and training all into an abstraction called a Pipeline. In a nutshell, a pipeline is the recipe for going from a clean but untransformed data set to a trained model. For more information, see sklearn's docs. In the example below, we extract polynomial features from each home's square footage and then fit a linear model.
In [ ]:
ex_pipeline1 = Pipeline([
('selector', ColumnSelector(['sqft'])),
('poly_feats', PolynomialFeatures(3, include_bias=False)),
('lm', lm.LinearRegression(fit_intercept=False))
])
ex_pipeline1.fit(small_data, small_data[['price']])
print("Training RMSE:",
(np.mean((ex_pipeline1.predict(small_data)
- small_data[['price']])**2)**(.5)).item())
As we've learned, training error definitely isn't everything! In addition to our training error, we want to be able to calculate validation error using cross validation. Luckily, sklearn makes this quite easy for us. Below, we calculate validation error for our initial pipeline, using 10-fold cross-validation.
In [ ]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
def score_func(y, y_pred, **kwargs):
return np.mean((y-y_pred)**2)**0.5
scorer = make_scorer(score_func)
cv_scores = cross_val_score(ex_pipeline1, small_data, small_data[['price']], cv=10, scoring=scorer)
print("Validation RMSE:", np.mean(cv_scores))
Of course, we wouldn't want to use just one predictor given how rich our dataset is. To append more features, we can use `FeatureUnion`, which combines several transformers into a MEGATRANSFORMER that outputs a concatenation of its constituents' outputs. Note: `FeatureUnion` does NOT check whether the transformations create linearly independent features. In the example below, we combine the polynomial lift of square footage and lot size.
In [ ]:
ex_pipeline2 = Pipeline([
('union', FeatureUnion(n_jobs=1, transformer_list=[
('poly_lotsize', Pipeline([
('selector', ColumnSelector(['lotsize'])),
('poly_feats', PolynomialFeatures(3, include_bias=False))
])),
('poly_sqft', Pipeline([
('selector', ColumnSelector(['sqft'])),
('poly_feats', PolynomialFeatures(3, include_bias=False)),
]))
])),
('lm', lm.LinearRegression(fit_intercept=False))
])
ex_pipeline2.fit(small_data, small_data[['price']])
print("Training RMSE:",
(np.mean((ex_pipeline2.predict(small_data)
- small_data[['price']])**2)**(.5)).item())
cv_scores = cross_val_score(ex_pipeline2, small_data, small_data[['price']], cv=10, scoring=scorer)
print("Validation RMSE:", np.mean(cv_scores))
Your final model should be presented as a data pipeline. We should be able to train the pipeline on a new (clean) data set without any issues.
In [ ]:
final_pipeline = ...
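Purely as an illustration of the expected shape (the features and estimator below are placeholders, not a recommended model):
In [ ]:
# Illustrative shape only -- swap in your own transformers and estimator.
example_final_pipeline = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('poly_sqft', Pipeline([
            ('selector', ColumnSelector(['sqft'])),
            ('poly_feats', PolynomialFeatures(2, include_bias=False)),
        ])),
        ('poly_lotsize', Pipeline([
            ('selector', ColumnSelector(['lotsize'])),
            ('poly_feats', PolynomialFeatures(2, include_bias=False)),
        ])),
    ])),
    ('ridge', lm.Ridge(alpha=1.0)),
])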
In [ ]:
from datetime import datetime
test_data = pd.read_csv("ames_test.csv")
submission_df = pd.DataFrame(
{
"PID": test_data["PID"],
"SalePrice": final_pipeline.predict(test_data).reshape(-1,)
}
)
timestamp = datetime.now().isoformat().split(".")[0].replace(":", "-")
submission_df.to_csv("submission_{}.csv".format(timestamp), index=False)
Your final Kaggle submission should achieve a test-set RMSE of 30,000 or lower. Write your best test-set RMSE (as shown on Kaggle) here:
In [ ]:
my_test_RMSE = 53240
_ = ok.grade('qkaggle')
Make some plots to investigate how well your models fit the data. Pick an intermediate (not final) model for your diagnostics submission. Provide commentary about patterns you notice and how you addressed them. Include this comment on top of the cells you would like us to grade.
# DIAGNOSTIC_SUBMIT
In [ ]:
# DIAGNOSTIC_SUBMIT
# Code for plot
# Commentary
diagnostic_commentary = r"""
Put your commentary about diagnostics here, replacing this text.
"""
display(Markdown(diagnostic_commentary))
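If you are unsure where to start, a residuals-vs-fitted plot is a classic first diagnostic. Here is a sketch using ex_pipeline1; your intermediate model will differ:
In [ ]:
# Sketch of a residuals-vs-fitted diagnostic for ex_pipeline1.
fitted_vals = ex_pipeline1.predict(small_data).reshape(-1)
residuals = small_data['price'].values - fitted_vals
plt.scatter(fitted_vals, residuals, alpha=0.3)
plt.axhline(0, color='red')
plt.xlabel('Fitted price')
plt.ylabel('Residual')
plt.title('Residuals vs. fitted values')
plt.show()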
FYI: Recall from lecture that stability is a measure of how robust your modeling procedure is to perturbations of the data. While the formal definition is a little technical, the concept is intuitive: if you create pseudoreplicates of the data, the coefficients of your model shouldn't change too much, since large swings would mean that your model is too sensitive to small changes in the training data.

Below, we use our pipeline to do a five-fold stability check. This method is really a heuristic (as easily noted by the arbitrary choice of 5 folds). To get a better assessment of your model, you could carry out a bootstrap analysis. For this particular model, it would seem that the coefficients are not changing too crazily relative to the magnitude of their impact on home prices.
In [ ]:
from sklearn.model_selection import KFold

fivefold = KFold(n_splits=5, shuffle=True)

def calc_coefs(X, y, modeler):
    model = modeler
    model.fit(X, y)
    # steps[1][1] is the linear model sitting at the end of the pipeline
    return model.steps[1][1].coef_[0]

# Refit on each fold's subset and stack the coefficient vectors row-wise.
# (np.vstack needs a sequence, so we use a list comprehension.)
np.vstack([calc_coefs(small_data.iloc[fold, :],
                      small_data.iloc[fold, :][['price']],
                      ex_pipeline2)
           for (fold, _) in fivefold.split(small_data)])
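For the bootstrap analysis mentioned above, a sketch might look like the following; the choice of 20 resamples is just as arbitrary as the 5 folds:
In [ ]:
# Bootstrap stability sketch: refit on row-resampled pseudoreplicates and
# stack the coefficient vectors (20 replicates is an arbitrary choice).
def bootstrap_coefs(data, modeler, n_boot=20):
    coefs = []
    for _ in range(n_boot):
        boot = data.sample(frac=1, replace=True)
        coefs.append(calc_coefs(boot, boot[['price']], modeler))
    return np.vstack(coefs)

bootstrap_coefs(small_data, ex_pipeline2)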
Let $y$ be an $n \times 1$ response vector and $X$ be an $n \times p$ full rank design matrix with a column of 1s. We use the least squares procedure to fit $y$ on $X$: $$\hat y = X\hat\theta$$ where $\hat\theta = (X^TX)^{-1}X^Ty$. The residuals are given by $e = y - \hat y$.
Show that $\sum_{i=1}^n e_i = 0$
In [ ]:
model = lm.LinearRegression(fit_intercept = True)
model.fit(small_data[["sqft"]], small_data[["price"]])
print("Intercept:", model.intercept_[0])
print("Slope:", model.coef_[0,0])
Your friend wants to know if the stochastic model $Y = X\theta + \epsilon$, where $\epsilon$ is a mean-0 vector independent of the columns of the design matrix $X$, is plausible. One assumption is that `sqft` must be independent of the noise term $\epsilon$. To test for this, she writes the following:
In [ ]:
def test_independent(variable, error):
    # Inputs
    #   variable: n x 1 numpy array with the variable of interest
    #   error: n x 1 numpy array of estimates of the error term epsilon,
    #          given by y - y_fitted
    # Outputs
    #   boolean, True if the variable passes the test for independence
    return sum(variable * error)[0]

n = small_data.shape[0]
sqft = small_data[["sqft"]].values.reshape(n, 1)
errors = (small_data[["price"]] - model.predict(sqft)).values.reshape(n, 1)
test_independent(sqft, errors)
She concludes that since this value is very small, `sqft` and the noise are most likely independent of each other. Is this a reasonable conclusion? Why or why not?
Write your answer here, replacing this text.
Centering takes every data point and subtracts the overall mean from it. We can write the transformation function $\phi$ as:
$$\begin{align}\phi(X) &= \left[\begin{array}{c|c|c|c} X_1 - \bar{X}_1 & X_2 - \bar{X}_2 & \dots & X_d - \bar{X}_d \end{array}\right] \\ \phi(y) &= y - \bar{y} \end{align}$$where $\bar{X}_j$ is the arithmetic mean of the $j^{th}$ column of $X$ and $\bar{y}$ is the average of the responses. Show that if a bias/intercept term is included in a regression after centering, then it will always be 0. This, of course, means that adding a column of 1s to your design matrix after centering your data might be a little silly.
Hint: You will want to use what we've proved in Question 1a.
Congratulations, you're done with this homework! Run the next cell to submit the assignment to OkPy so that the staff will know to grade it. You can submit as many times as you want, and you can choose which submission you want us to grade by going to https://okpy.org/cal/data100/sp17/. After you've done that, make sure you've pushed your changes to Github as well!
In [ ]:
_ = ok.submit()