In this exercise we will be working with the Boston housing dataset, which gives information on housing prices for the suburbs of Boston. See the link for a description of the available parameters.
Below we do the first step for you: loading the dataset and splitting it into training and test sets. You will need to implement the following:
1) Train a linear regression model on the training data. You can use either the Python code or the sklearn implementation.
2) Test that the regression model works by computing the regression error on the test set.
3) (Optional) As the number of records in this dataset is fairly small, it might be a good idea to use a cross-validation/jackknifing method instead of a fixed train/test split. You can give it a try with the tools in sklearn.model_selection (for example cross_val_score); the older sklearn.cross_validation module is no longer available in recent scikit-learn releases.
Consider plotting and printing the data along the way to get a feel for what you are looking at.
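As a rough illustration of steps 1 and 2, here is a minimal sketch using the sklearn implementation. To keep it runnable on its own it uses synthetic stand-in data from make_regression rather than the Boston data, and the names X_demo, y_demo, X_tr, X_te, y_tr and y_te are made up for this example. Mean squared error, its square root and R^2 are only some of the possible choices for the "regression error".

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the Boston data).
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Same 70-30 split as used in the exercise.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Step 1: fit an ordinary least squares model on the training portion.
model = LinearRegression()
model.fit(X_tr, y_tr)

# Step 2: compute the error on the held-out test portion.
y_pred = model.predict(X_te)
test_mse = mean_squared_error(y_te, y_pred)  # mean squared error
test_rmse = np.sqrt(test_mse)                # same units as the target
test_r2 = model.score(X_te, y_te)            # coefficient of determination R^2

print(f"test MSE:  {test_mse:.2f}")
print(f"test RMSE: {test_rmse:.2f}")
print(f"test R^2:  {test_r2:.3f}")
```

With the real data you would pass X_train, y_train, X_test and y_test from the cell below in place of the synthetic arrays.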
In [25]:
%matplotlib inline
import pandas as pd #used for reading/writing data
import numpy as np #numeric library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library
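#note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older release
from sklearn.datasets import load_boston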
boston = load_boston()
In [26]:
#import the data into a pandas DataFrame, as in our example
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target)
#split the data into train and test sets, with a 70-30 split
from sklearn.model_selection import train_test_split #creates the train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X.describe()
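For the optional step 3, one possible approach is k-fold cross-validation with cross_val_score from sklearn.model_selection. The sketch below again runs on synthetic stand-in data so that it is self-contained; the choice of 5 folds, the shuffling seed and the variable names are assumptions made for illustration, not part of the exercise.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration, not the Boston data).
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Every record is used for testing exactly once across the 5 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# sklearn scorers follow a "higher is better" convention, so the MSE comes back negated.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=cv, scoring="neg_mean_squared_error")
fold_mse = -neg_mse

print("MSE per fold:", fold_mse.round(2))
print(f"mean MSE: {fold_mse.mean():.2f} (std {fold_mse.std():.2f})")
```

The spread across folds gives a sense of how much a single fixed train/test split could mislead on a dataset of this size. If you want something closer to jackknifing, sklearn.model_selection.LeaveOneOut can be passed as cv instead of KFold.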