UTSC Machine Learning Workshop

Cross-validation for feature selection with Linear Regression

From the video series: Introduction to machine learning with scikit-learn

Agenda

  • Put together what we learned, using cross-validation to select features for linear regression models.
  • Practice on a different problem.

Cross-validation example: feature selection

Model Evaluation Metrics for Regression

For classification problems, we have only used classification accuracy as our evaluation metric. What metrics can we use for regression problems?

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Read More

http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
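
These formulas map directly onto NumPy one-liners. Here is a minimal sketch checking them against scikit-learn's built-in metric functions, mean_absolute_error and mean_squared_error (the toy arrays are made up for illustration):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# toy true and predicted values, made up for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

errors = y_true - y_pred
print(np.mean(np.abs(errors)))        # MAE
print(np.mean(errors ** 2))           # MSE
print(np.sqrt(np.mean(errors ** 2)))  # RMSE

# the same MAE and MSE via scikit-learn
print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))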

Goal: Decide whether the Newspaper feature should be included in the linear regression model on the advertising dataset


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score

In [2]:
# read in the advertising dataset
data = pd.read_csv('data/Advertising.csv', index_col=0)

In [3]:
# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]

# select the Sales column as the response (y)
y = data.Sales

In [5]:
# 10-fold cross-validation with all three features
lm = LinearRegression()
MAEscores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_absolute_error')
print(MAEscores)


[-1.41470822 -1.42067103 -1.18520036 -1.39731782 -0.90578551 -0.96357362
 -2.00464419 -1.17610998 -1.18157732 -1.37291164]

The MAE scores come back negative because cross_val_score follows a "higher is better" convention, so error metrics are reported with their sign flipped. MSE is more popular than MAE because MSE "punishes" larger errors. But RMSE is even more popular than MSE because RMSE is interpretable in the "y" units.
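
To see how MSE "punishes" larger errors, compare two sets of errors with the same total size. A quick check (the numbers are made up for illustration):

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true  = [0, 0, 0, 0]
y_small = [1, 1, 1, 1]  # four errors of size 1
y_large = [0, 0, 0, 4]  # one error of size 4, same total error

# MAE is 1.0 in both cases, but MSE jumps from 1.0 to 4.0
print(mean_absolute_error(y_true, y_small), mean_squared_error(y_true, y_small))
print(mean_absolute_error(y_true, y_large), mean_squared_error(y_true, y_large))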


In [6]:
# The MSE scores can be calculated by: 
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)


[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754  -1.74163618
 -8.17338214 -2.11409746 -3.04273109 -2.45281793]

In [7]:
# fix the sign of MSE scores
mse_scores = -scores
print(mse_scores)


[ 3.56038438  3.29767522  2.08943356  2.82474283  1.3027754   1.74163618
  8.17338214  2.11409746  3.04273109  2.45281793]

In [8]:
# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)


[ 1.88689808  1.81595022  1.44548731  1.68069713  1.14139187  1.31971064
  2.85891276  1.45399362  1.7443426   1.56614748]

In [9]:
# calculate the average RMSE
print(rmse_scores.mean())


1.69135317081

In [10]:
# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())


1.67967484191
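
The mean RMSE drops from about 1.691 with all three features to about 1.680 without Newspaper, so cross-validation suggests excluding Newspaper from the model. The same comparison extends to any list of candidate feature subsets; here is a minimal sketch reusing lm, data, y, np, and cross_val_score from the cells above (the two subsets are just the ones considered here):

# compare candidate feature subsets by mean cross-validated RMSE
for cols in [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio']]:
    scores = cross_val_score(lm, data[cols], y, cv=10,
                             scoring='neg_mean_squared_error')
    print(cols, np.sqrt(-scores).mean())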

TASK

Select the best polynomial order for the Girth feature in the trees problem.


In [11]:
# load the trees dataset (note: importing pydataset.data by name
# would shadow the advertising DataFrame called `data` above)
import pydataset
trees = pydataset.data('trees')

In [12]:
# set up the features (X) and the response (y)
feature_cols = ['Girth', 'Height']
X = trees[feature_cols]
y = trees.Volume
# find the cross-validation scores
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)

# next: find the cross-validation scores for higher polynomial features


[-74.24982358  -1.89992369  -0.88798886  -1.86808201 -21.28739709
 -26.02129099 -14.27281116  -8.24887952 -36.23728207 -37.69229324]

In [13]:
# engineer a squared version of Girth as a candidate feature
trees['squared'] = trees['Girth']**2
trees.head()


Out[13]:
   Girth  Height  Volume  squared
1    8.3      70    10.3    68.89
2    8.6      65    10.3    73.96
3    8.8      63    10.2    77.44
4   10.5      72    16.4   110.25
5   10.7      81    18.8   114.49
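
One way to finish the task is to create Girth powers up to each candidate order, cross-validate each model, and compare the mean RMSE. Below is a minimal sketch reusing trees, lm, np, and cross_val_score from above (the column names Girth_2, Girth_3, ... and the range of orders 1 through 4 are illustrative choices):

# compare polynomial orders for Girth by mean cross-validated RMSE
for order in range(1, 5):
    # add Girth**p columns up to the candidate order
    for p in range(2, order + 1):
        trees['Girth_%d' % p] = trees['Girth'] ** p
    cols = ['Girth', 'Height'] + ['Girth_%d' % p for p in range(2, order + 1)]
    scores = cross_val_score(lm, trees[cols], trees.Volume, cv=10,
                             scoring='neg_mean_squared_error')
    print(order, np.sqrt(-scores).mean())

The order with the lowest mean RMSE is the one cross-validation favors.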

Feature engineering and selection within cross-validation iterations

  • Normally, feature engineering and selection occur before cross-validation
  • Instead, perform all feature engineering and selection within each cross-validation iteration (see the sketch below)
  • This gives a more reliable estimate of out-of-sample performance, since it better mimics applying the model to truly out-of-sample data
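
scikit-learn's Pipeline makes this pattern automatic: every step inside the pipeline is re-fit on each fold's training portion only. Here is a minimal sketch reusing the SelectKBest and f_regression imports from the first cell and the advertising data loaded above (k=2 is just an example value):

from sklearn.pipeline import make_pipeline

# SelectKBest is re-fit inside each fold, so the feature selection
# never sees that fold's test data
pipe = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
scores = cross_val_score(pipe, data[['TV', 'Radio', 'Newspaper']], data.Sales,
                         cv=10, scoring='neg_mean_squared_error')
print(np.sqrt(-scores).mean())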
