This notebook demonstrates using linear and polynomial regression to model the relationship between feature variables and a response variable.
We will use the Auto MPG dataset to practice using regression to predict the fuel economy (MPG) of a car from its features.
For the mathematical explanation and theory, see the lecture notes: http://george1328.github.io/lecture_notes/Regression_Regularization.pdf
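As a quick refresher (details are in the lecture notes above), linear regression fits the response as a weighted sum of the features:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

Polynomial regression uses the same linear fit, but on features expanded with powers and interaction terms; for two features at degree 2:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2$$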
In [43]:
import pandas as pd
%pylab inline
In [44]:
# Column/feature labels are not available in the dataset, so we create a list of features using auto-mpg.names
features = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
In [45]:
# Import the data directly into pandas from the URL; specify header=None since the dataset has no column labels
import urllib
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# the file is in fixed-width format, so we use read_fwf instead of read_csv
df = pd.read_fwf(urllib.urlopen(url), header=None)
df.columns = features
In [46]:
# Alternatively, we can download the data
# The bang (!) prefix in IPython Notebooks runs command-line statements directly from the notebook
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
In [47]:
# Since this dataset has no column headings, we need to explicitly state names=features
df = pd.read_fwf("auto-mpg.data", names=features)
In [48]:
# Use head, describe, info and unique to get a sense of the data
df.describe()
Out[48]:
In [49]:
df.head()
Out[49]:
In [50]:
df.info()
In [51]:
# 'car name' and 'horsepower' are the only non-numeric fields. A car's name is unlikely to influence its MPG, so let's inspect horsepower.
df.horsepower.unique()
Out[51]:
In [52]:
(df.horsepower == '?').sum()
Out[52]:
In [53]:
# Let's convert horsepower to a numeric field so we can use it in our analysis;
# errors='coerce' turns the '?' entries into NaN
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
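As a quick sanity check, the six '?' strings should now show up as missing values:

# count of NaNs introduced by the coercion above (should match the '?' count)
df['horsepower'].isnull().sum()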
In [54]:
# We will drop the 6 records that are missing horsepower. We could estimate (impute) these missing values,
# but with only 6 affected rows, dropping them is simpler and avoids introducing estimates of questionable accuracy
df = df.dropna()
In [55]:
df.info()
In [56]:
import seaborn as sns
In [57]:
sns.boxplot(df.mpg, df.cylinders)
# This is interesting: 4-cylinder vehicles have better mileage on average than 3-cylinder vehicles
Out[57]:
In [58]:
three_cyl = df[df.cylinders == 3]
print three_cyl['car name']
## Aha! Tiny Mazda roadsters...
In [59]:
sns.violinplot(df.mpg, df['model year'])
# Fancy seaborn graphing
Out[59]:
In [60]:
sns.barplot(df.mpg, df.horsepower)
Out[60]:
In [61]:
sns.barplot(df.mpg, df.weight)
Out[61]:
In [62]:
sns.boxplot(df.mpg, df.origin)
# Although the values of origin are not given, we can guess that 1=USA, 2=Europe and 3=Japan... Maybe...
Out[62]:
In [63]:
sns.boxplot(df.mpg, df.acceleration)
# Little cars have pretty good acceleration AND good mileage, so acceleration alone is not a strong predictor of MPG
Out[63]:
In [64]:
sns.kdeplot(df.mpg, df.cylinders)
# Showing different plot options in seaborn :-)
Out[64]:
An alternative way to examine these relationships is to print the correlation matrix
In [65]:
df.corr()
Out[65]:
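Seaborn can also draw the correlation matrix as a heatmap, which can be easier to scan than the raw table (a minimal sketch, assuming a seaborn version that includes heatmap):

# color encodes the strength and sign of each pairwise correlation
sns.heatmap(df.corr())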
In [66]:
# Create numpy arrays X and y with the predictor and response variables
X = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y = df['mpg'].values
In [67]:
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
In [68]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
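Note that train_test_split shuffles the rows randomly, so the exact scores below will vary from run to run; passing a random_state (the seed value below is arbitrary) makes the split reproducible:

# reproducible variant of the split above:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)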
In [69]:
model = LinearRegression()
model.fit(X_train, y_train)
Out[69]:
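The fitted parameters are exposed as attributes on the model, and a quick look at them can help sanity-check the fit:

# one coefficient per feature, in the order ['weight', 'model year', 'horsepower', 'origin', 'displacement']
print model.coef_
print model.intercept_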
In [70]:
predictions = model.predict(X_test)
In [71]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])
In [72]:
model.score(X_test, y_test)
Out[72]:
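For regressors, score returns R squared. An error metric in the units of the response (MPG) can be more intuitive; for example, root mean squared error via sklearn.metrics:

from sklearn.metrics import mean_squared_error
import numpy as np
# RMSE: the typical size of a prediction error, in MPG
print np.sqrt(mean_squared_error(y_test, predictions))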
In [73]:
sns.regplot(predictions, y_test)
Out[73]:
In [74]:
from sklearn.preprocessing import PolynomialFeatures
In [75]:
quad_model = PolynomialFeatures(degree=2)
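To see what the degree-2 transform actually produces, here is a tiny illustration on a single two-feature row (the values 2 and 3 are arbitrary):

# [a, b] expands to [1, a, b, a^2, a*b, b^2]
print PolynomialFeatures(degree=2).fit_transform([[2, 3]])
# [[ 1.  2.  3.  4.  6.  9.]]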
In [76]:
quad_X_train = quad_model.fit_transform(X_train)
# use transform (not fit_transform) on the test set so it is expanded exactly as the training set was
quad_X_test = quad_model.transform(X_test)
In [77]:
model.fit(quad_X_train, y_train)
Out[77]:
In [78]:
predictions = model.predict(quad_X_test)
In [79]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])
In [80]:
model.score(quad_X_test, y_test)
Out[80]:
In [81]:
sns.regplot(predictions, y_test)
Out[81]:
In [82]:
from sklearn.cross_validation import cross_val_score
In [83]:
scores = cross_val_score(model, quad_X_train, y_train, cv=10)
In [84]:
# R squared is the proportion of variance in the response variable that is explained by the model.
# A score of 1.0 indicates perfect prediction; a model that always predicted the mean MPG would score 0.
scores
Out[84]:
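Since cross_val_score returns one R squared value per fold, a common summary is the mean score and its spread across the 10 folds:

# average R squared across folds, with standard deviation as a stability check
print scores.mean(), scores.std()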