Written by: Neeraj Asthana (under Professor Robert Brunner)
University of Illinois at Urbana-Champaign
Summer 2016
Dataset found on UCI Machine Learning repository at: http://archive.ics.uci.edu/ml/datasets/Auto+MPG
This dataset is used to predict the mpg (miles per gallon) of a car, a continuous target, using several predictors.
A description of the dataset can be found at: http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
Predictors:
In [35]:
#Libraries and Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
In [36]:
#Names of all of the columns
names = [
'mpg'
, 'cylinders'
, 'displacement'
, 'horsepower'
, 'weight'
, 'acceleration'
, 'model_year'
, 'origin'
, 'car_name'
]
#Import dataset
data = pd.read_csv('auto-mpg.data', sep = r'\s+', header = None, names = names)
data.head()
Out[36]:
In [37]:
data.shape
Out[37]:
In [38]:
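#Drop NAs (labelled as ? in this dataset) -> dropped 6 rows
data_clean = data.replace('?', np.nan).dropna().copy()
#horsepower was read as text because of the ? entries, so convert it to float
data_clean['horsepower'] = data_clean['horsepower'].astype(float)
data_clean.shape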
Out[38]:
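As an alternative to replacing the ? markers after loading, pandas can treat them as missing while reading the file. The sketch below is not part of the original notebook: it reads a small inline sample (a stand-in for auto-mpg.data, in the same whitespace-separated layout) and uses illustrative names such as `sample`, `df` and `df_clean`.

```python
import io
import pandas as pd

# Small inline sample standing in for auto-mpg.data (illustration only)
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    '25.0 4 98.00 ?     2046. 19.0 71 1 "ford pinto"\n'
    '24.0 4 113.0 95.00 2372. 15.0 70 3 "toyota corona mark ii"\n'
)
names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
         'acceleration', 'model_year', 'origin', 'car_name']

# na_values='?' makes pandas parse ? as NaN, so horsepower is read as a float column
df = pd.read_csv(sample, sep=r'\s+', header=None, names=names, na_values='?')

# Drop the rows that contain a missing value
df_clean = df.dropna()
print(df_clean.shape)
print(df_clean['horsepower'].dtype)
```

With this approach no separate type conversion is needed, because the column never contains text.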
In [39]:
#Select Predictor columns
X = data_clean[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', "origin"]]
#Select target column
y = data_clean['mpg']
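The origin column is a discrete code rather than a measured quantity, so one possible transformation (not applied in this notebook) is to expand it into indicator columns before fitting. A minimal sketch on made-up rows, assuming the codes 1, 2 and 3; the `demo` and `encoded` names are illustrative:

```python
import pandas as pd

# Made-up rows for illustration; origin holds discrete codes
demo = pd.DataFrame({'weight': [3504.0, 2372.0, 1835.0], 'origin': [1, 3, 2]})

# One indicator column per origin code; drop_first removes the redundant first level
encoded = pd.get_dummies(demo, columns=['origin'], drop_first=True)
print(list(encoded.columns))
```

Whether this helps the fit on the real data is not tested here; it only shows the mechanics of the transformation.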
In [56]:
#Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)
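The split above is random, so the coefficients and errors in the following cells change from run to run. If reproducible numbers are wanted, a seed can be passed through `random_state`. The sketch below uses stand-in arrays and an arbitrary seed of 0, both assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 rows, 2 predictors (illustration only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# random_state=0 is an arbitrary seed; any fixed integer works
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The same seed yields identical train/test partitions
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```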
In [57]:
#Train a simple linear regression model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
#Print coefficients
list(zip(names[1:8], regr.coef_))
Out[57]:
In [58]:
#Mean Squared error and R-squared on the training set
preds = regr.predict(X_train)
mse = np.mean((preds - y_train) ** 2)
rsq = regr.score(X_train, y_train)
print("Mean Squared Error: %.4f \n R-squared: %.4f" % (mse,rsq))
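For reference, the two metrics can be written out by hand: the mean squared error is the average of the squared residuals, and R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below checks these formulas against scikit-learn on synthetic data; the data and the `_demo` names are assumptions made for illustration, not part of the notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression data (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 3)
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(50)

model = LinearRegression().fit(X_demo, y_demo)
preds_demo = model.predict(X_demo)

# Mean squared error: average squared residual
mse_manual = np.mean((preds_demo - y_demo) ** 2)

# R-squared: 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - preds_demo) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
rsq_manual = 1 - ss_res / ss_tot

# Both should agree with the library implementations
print(np.isclose(mse_manual, mean_squared_error(y_demo, preds_demo)))
print(np.isclose(rsq_manual, model.score(X_demo, y_demo)))
```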
In [59]:
#Test model on held out test set
#Mean Squared error on the testing set
preds_ = regr.predict(X_test)
mse_ = np.mean((preds_ - y_test) ** 2)
rsq_ = regr.score(X_test, y_test)
print("Mean Squared Error: %.4f \n R-squared: %.4f" % (mse_,rsq_))
In [60]:
%matplotlib inline
#Predicted values vs. residuals plot -> demonstrates an issue with this fit (high bias)
plt.scatter(regr.predict(X_train), regr.predict(X_train)-y_train)
plt.plot([-5,40],[0,0], color = "red")
#Place testing data on the plot as well
plt.scatter(regr.predict(X_test), regr.predict(X_test)-y_test, color = "yellow")
#Label the axes
plt.xlabel("Predicted mpg")
plt.ylabel("Residual (predicted - actual)")
Out[60]:
Read in file
Handle missing values (ex. ?, NA, etc.)
Select columns for the regression tasks
Transform columns or variables
Split data into training and testing sets
Train model using the training data
Perform diagnostics on the model
Test model on held out testing set
Visualizations
Repeat for a new model
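The last step, repeating the workflow for a new model, can be sketched by wrapping the train and test steps in a helper and looping over candidate models. Everything below is an assumption made for illustration: the `fit_and_evaluate` helper is hypothetical, the data is synthetic, and Ridge regression is just one example of a second model, not one used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

def fit_and_evaluate(model, X_train, X_test, y_train, y_test):
    """Fit on the training set; return (train MSE, test MSE, test R-squared)."""
    model.fit(X_train, y_train)
    train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    return train_mse, test_mse, model.score(X_test, y_test)

# Synthetic data standing in for the cleaned predictors and target
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 4)
y_demo = X_demo @ np.array([3.0, -1.0, 2.0, 0.5]) + 0.1 * rng.randn(100)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Run the same steps for each candidate model
results = {}
for label, model in [("linear", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    results[label] = fit_and_evaluate(model, X_tr, X_te, y_tr, y_te)
    print("%s -> train MSE: %.4f, test MSE: %.4f, test R-squared: %.4f"
          % ((label,) + results[label]))
```

Keeping the split fixed across models means the reported errors are comparable between them.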