UTSC Machine Learning Workshop

Cross-validation for feature selection with Linear Regression

From the video series: Introduction to machine learning with scikit-learn

Agenda

  • Put together what we learned, using cross-validation to select features for linear regression models.
  • Practice on a different problem.

Cross-validation example: feature selection

Model Evaluation Metrics for Regression

For classification problems, we have only used classification accuracy as our evaluation metric. What metrics can we use for regression problems?

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Read More

http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
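
These formulas map directly onto NumPy one-liners. Here is a minimal sketch checking them against scikit-learn's built-in metric functions, mean_absolute_error and mean_squared_error (the toy arrays are made up for illustration):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# toy true and predicted values, made up for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

errors = y_true - y_pred
print(np.mean(np.abs(errors)))        # MAE
print(np.mean(errors ** 2))           # MSE
print(np.sqrt(np.mean(errors ** 2)))  # RMSE

# the same MAE and MSE via scikit-learn
print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))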

Goal: Decide whether the Newspaper feature should be included in the linear regression model on the advertising dataset


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score

In [2]:
# read in the advertising dataset
data = pd.read_csv('data/Advertising.csv', index_col=0)

In [3]:
# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]

# select the Sales column as the response (y)
y = data.Sales

In [5]:
# 10-fold cross-validation with all three features
lm = LinearRegression()
MAEscores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_absolute_error')
print(MAEscores)


[-1.41470822 -1.42067103 -1.18520036 -1.39731782 -0.90578551 -0.96357362
 -2.00464419 -1.17610998 -1.18157732 -1.37291164]

The MAE scores come back negative because cross_val_score follows a "higher is better" convention, so error metrics are reported with their sign flipped. MSE is more popular than MAE because MSE "punishes" larger errors. But RMSE is even more popular than MSE because RMSE is interpretable in the "y" units.
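
To see how MSE "punishes" larger errors, compare two sets of errors with the same total size. A quick check (the numbers are made up for illustration):

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true  = [0, 0, 0, 0]
y_small = [1, 1, 1, 1]  # four errors of size 1
y_large = [0, 0, 0, 4]  # one error of size 4, same total error

# MAE is 1.0 in both cases, but MSE jumps from 1.0 to 4.0
print(mean_absolute_error(y_true, y_small), mean_squared_error(y_true, y_small))
print(mean_absolute_error(y_true, y_large), mean_squared_error(y_true, y_large))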


In [6]:
# The MSE scores can be calculated by: 
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)


[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754  -1.74163618
 -8.17338214 -2.11409746 -3.04273109 -2.45281793]

In [7]:
# fix the sign of MSE scores
mse_scores = -scores
print(mse_scores)


[ 3.56038438  3.29767522  2.08943356  2.82474283  1.3027754   1.74163618
  8.17338214  2.11409746  3.04273109  2.45281793]

In [8]:
# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)


[ 1.88689808  1.81595022  1.44548731  1.68069713  1.14139187  1.31971064
  2.85891276  1.45399362  1.7443426   1.56614748]

In [9]:
# calculate the average RMSE
print(rmse_scores.mean())


1.69135317081

In [10]:
# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())


1.67967484191
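
The mean RMSE drops from about 1.691 with all three features to about 1.680 without Newspaper, so cross-validation suggests excluding Newspaper from the model. The same comparison extends to any list of candidate feature subsets; here is a minimal sketch reusing lm, data, y, np, and cross_val_score from the cells above (the two subsets are just the ones considered here):

# compare candidate feature subsets by mean cross-validated RMSE
for cols in [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio']]:
    scores = cross_val_score(lm, data[cols], y, cv=10,
                             scoring='neg_mean_squared_error')
    print(cols, np.sqrt(-scores).mean())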

TASK

Select the best polynomial order for the Girth feature in the trees problem.


In [11]:
# load the trees dataset (note: importing pydataset.data by name
# would shadow the advertising DataFrame called `data` above)
import pydataset
trees = pydataset.data('trees')

In [12]:
# set up the features (X) and the response (y)
feature_cols = ['Girth', 'Height']
X = trees[feature_cols]
y = trees.Volume
# find the cross-validation scores
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)

# next: find the cross-validation scores for higher polynomial features


[-74.24982358  -1.89992369  -0.88798886  -1.86808201 -21.28739709
 -26.02129099 -14.27281116  -8.24887952 -36.23728207 -37.69229324]

In [13]:
# engineer a squared version of Girth as a candidate feature
trees['squared'] = trees['Girth']**2
trees.head()


Out[13]:
   Girth  Height  Volume  squared
1    8.3      70    10.3    68.89
2    8.6      65    10.3    73.96
3    8.8      63    10.2    77.44
4   10.5      72    16.4   110.25
5   10.7      81    18.8   114.49
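
One way to finish the task is to create Girth powers up to each candidate order, cross-validate each model, and compare the mean RMSE. Below is a minimal sketch reusing trees, lm, np, and cross_val_score from above (the column names Girth_2, Girth_3, ... and the range of orders 1 through 4 are illustrative choices):

# compare polynomial orders for Girth by mean cross-validated RMSE
for order in range(1, 5):
    # add Girth**p columns up to the candidate order
    for p in range(2, order + 1):
        trees['Girth_%d' % p] = trees['Girth'] ** p
    cols = ['Girth', 'Height'] + ['Girth_%d' % p for p in range(2, order + 1)]
    scores = cross_val_score(lm, trees[cols], trees.Volume, cv=10,
                             scoring='neg_mean_squared_error')
    print(order, np.sqrt(-scores).mean())

The order with the lowest mean RMSE is the one cross-validation favors.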

Feature engineering and selection within cross-validation iterations

  • Normally, feature engineering and selection occur before cross-validation
  • Instead, perform all feature engineering and selection within each cross-validation iteration (see the sketch below)
  • This gives a more reliable estimate of out-of-sample performance, since it better mimics applying the model to truly out-of-sample data
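
scikit-learn's Pipeline makes this pattern automatic: every step inside the pipeline is re-fit on each fold's training portion only. Here is a minimal sketch reusing the SelectKBest and f_regression imports from the first cell and the advertising data loaded above (k=2 is just an example value):

from sklearn.pipeline import make_pipeline

# SelectKBest is re-fit inside each fold, so the feature selection
# never sees that fold's test data
pipe = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
scores = cross_val_score(pipe, data[['TV', 'Radio', 'Newspaper']], data.Sales,
                         cv=10, scoring='neg_mean_squared_error')
print(np.sqrt(-scores).mean())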
