Linear regression exercise

In this exercise we will be working with the Boston housing dataset, which gives information on housing prices for the suburbs of Boston. See the link for a description of the available parameters.

Below we do the first step for you, load the dataset and establish a training and test sets. You will need to implement the following:

1) Train a linear regression model for the training data. You can either use the python code or the sklearn implementation.

2) Test that the regression model works by computing the regression error on the test set

3) (Optional) As the number of records in this dataset is not very extensive, it might be a good idea to use a cross-validation/jackknifing method instead of defining a fixed train/test set. You can give it a try with the function sklearn.cross_validation

Consider doing some plotting and printing out of the data along the way to get a feeling of what you are looking at in here.


In [25]:
%matplotlib inline

import pandas as pd #used for reading/writing data 
import numpy as np #numeric library library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library

from sklearn.datasets import load_boston
boston = load_boston()

In [26]:
#import data into a pandas dataFrame to be similar from our example
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target)

#split the data into train and test sets, with a 70-30 split
from sklearn.model_selection import train_test_split #creation of train.test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X.describe()


Out[26]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000

In [ ]: