Auto-MPG dataset

Authors

Written by: Neeraj Asthana (under Professor Robert Brunner)

University of Illinois at Urbana-Champaign

Summer 2016

Acknowledgements

Dataset found on UCI Machine Learning repository at: http://archive.ics.uci.edu/ml/datasets/Auto+MPG

Dataset Information

This data set tries to predict the mpg (miles per gallon) of a car (continuous) using many different predcitors.

A description of the dataset can be found at: http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names

Predictors:

cylinders (multi-valued discrete)
displacement (continuous)
horsepower (continuous)
weight (continuous)
acceleration (continuous)
model year (multi-valued discrete)
origin (multi-valued discrete)
car name (string - unique)

Imports



In [35]:

    
#Libraries and Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.cross_validation import train_test_split

Read Data



In [36]:

    
#Names of all of the columns
names = [
       'mpg'
    ,  'cylinders'
    ,  'displacement'
    ,  'horsepower'
    ,  'weight'
    ,  'acceleration'
    ,  'model_year'
    ,  'origin'
    ,  'car_name'
]

#Import dataset
data = pd.read_csv('auto-mpg.data', sep = '\s+', header = None, names = names)

data.head()









    Out[36]:






  
    
      
      mpg
      cylinders
      displacement
      horsepower
      weight
      acceleration
      model_year
      origin
      car_name
    
  
  
    
      0
      18.0
      8
      307.0
      130.0
      3504.0
      12.0
      70
      1
      chevrolet chevelle malibu
    
    
      1
      15.0
      8
      350.0
      165.0
      3693.0
      11.5
      70
      1
      buick skylark 320
    
    
      2
      18.0
      8
      318.0
      150.0
      3436.0
      11.0
      70
      1
      plymouth satellite
    
    
      3
      16.0
      8
      304.0
      150.0
      3433.0
      12.0
      70
      1
      amc rebel sst
    
    
      4
      17.0
      8
      302.0
      140.0
      3449.0
      10.5
      70
      1
      ford torino



In [37]:

    
data.shape









    Out[37]:





(398, 9)

Clean Data (remove NaNs)



In [38]:

    
#Drop nas (labelled as ? in this dataset) -> dropped 6 rows
data_clean=data.applymap(lambda x: np.nan if x == '?' else x).dropna()

data_clean.shape









    Out[38]:





(392, 9)

Separate predictors from labels



In [39]:

    
#Select Predictor columns
X = data_clean[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', "origin"]]

#Select target column
y = data_clean['mpg']

Split into Training and Testing Sets



In [56]:

    
#Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

Train Simple Linear Regression



In [57]:

    
#Train a simple linear regression model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

#Print coefficients
list(zip(names[1:8], regr.coef_))









    Out[57]:





[('cylinders', -0.44780992748457576),
 ('displacement', 0.020278830009325951),
 ('horsepower', -0.024307678410886283),
 ('weight', -0.0062621625124574583),
 ('acceleration', 0.07582477840381352),
 ('model_year', 0.72938384661765776),
 ('origin', 1.3155182530109999)]

Model Evaluation



In [58]:

    
#Mean Squared error and R-squared on the training set
preds = regr.predict(X_train)
mse = np.mean((preds - y_train) ** 2)
rsq = regr.score(X_train, y_train)

print("Mean Squared Error: %.4f \n R-squared: %.4f" % (mse,rsq))









    



Mean Squared Error: 10.4875 
 R-squared: 0.8233



In [59]:

    
#Test model on held out test set
#Mean Squared error on the testing set
preds_ = regr.predict(X_test)
mse_ = np.mean((preds_ - y_test) ** 2)
rsq_ = regr.score(X_test, y_test)

print("Mean Squared Error: %.4f \n R-squared: %.4f" % (mse_,rsq_))









    



Mean Squared Error: 12.4737 
 R-squared: 0.8071

Diagnostic Plot (errors vs. predicted)



In [60]:

    
%pylab inline
#Predicted vs. errors plot -> demonstrates an issue with this fit (high bias)
plt.scatter(regr.predict(X_train), regr.predict(X_train)-y_train)
plt.plot([-5,40],[0,0], color = "red")

#place testing data on the plot as well
plt.scatter(regr.predict(X_test), regr.predict(X_test)-y_test, color = "yellow")









    



Populating the interactive namespace from numpy and matplotlib






    Out[60]:





<matplotlib.collections.PathCollection at 0x7f9a803fd048>

Data Tasks

Read in file
- Different types of separators (',',' ', '\t', '\s', etc.)
- Specify whether there is a header or not
- Name different columns
Handle missing values (ex. ?, NA, etc.)
- remove these examples?
- set these values to an arbitrary value like 0 or NA
- replace missing values with the mean
Select columns for the regression tasks
- Select columns I want to use as predictors
- Select which column I am looking to target and predict
Transform columns or variables
- create new features from the features we already have (combinations, squaring, cubing, etc.)
- PCA?
- scaling?
Split data into training and testing sets
- Set a percentage or value for a training or testing set sizes
- Also create a validation set?
- Crossvalidation instead?
Train model using the training data
- include regularization? (lambda term)
- specify method (lasso, ridge, simple, SVM, etc.)
Perform diagnostics on the model
- See coefficients
- See metrics like mean squared error, residual sum of errors, r-squared, etc.
Test model on held out testing set
- See metrics like mean squared error, residual sum of errors, etc.
Visualizations
- Visualize dataset as a whole (scatter plot matrix)
- See diagnostic plots (cooks distances, deviances, predicted vs. actual, etc.)
- bias or variance issues?
Repeat for a new model

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	car_name
0	18.0	8	307.0	130.0	3504.0	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693.0	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150.0	3436.0	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150.0	3433.0	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140.0	3449.0	10.5	70	1	ford torino