Linear regression exercise

In this exercise we will be working with the Boston housing dataset, which gives information on housing prices for the suburbs of Boston. See the link for a description of the available parameters.

Below we do the first step for you, load the dataset and establish a training and test sets. You will need to implement the following:

1) Train a linear regression model for the training data. You can either use the python code or the sklearn implementation.

2) Test that the regression model works by computing the regression error on the test set

3) (Optional) As the number of records in this dataset is not very extensive, it might be a good idea to use a cross-validation/jackknifing method instead of defining a fixed train/test set. You can give it a try with the function sklearn.cross_validation

Consider doing some plotting and printing out of the data along the way to get a feeling of what you are looking at in here.



In [25]:

    
%matplotlib inline

import pandas as pd #used for reading/writing data 
import numpy as np #numeric library library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library

from sklearn.datasets import load_boston
boston = load_boston()



In [26]:

    
#import data into a pandas dataFrame to be similar from our example
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target)

#split the data into train and test sets, with a 70-30 split
from sklearn.model_selection import train_test_split #creation of train.test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X.describe()



In [ ]:

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.593761	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674032	12.653063
std	8.596783	23.322453	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	91.294864	7.141062
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377500	6.950000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440000	11.360000
75%	3.647423	12.500000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	396.225000	16.955000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000