From the video series: Introduction to machine learning with scikit-learn
For classification problems, we have only used classification accuracy as our evaluation metric. What metrics can we use for regression problems?
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac{1}{n}\sum_{i=1}^n|y_i-\hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
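To make the definitions concrete, here is a quick sanity check of the three metrics on a few made-up true and predicted values (the numbers are for illustration only), using the functions in sklearn.metrics:

import numpy as np
from sklearn import metrics

# hypothetical true and predicted values, for illustration only
true = np.array([100, 50, 30, 20])
pred = np.array([90, 50, 50, 30])

# MAE: mean of the absolute errors
print(metrics.mean_absolute_error(true, pred))           # 10.0

# MSE: mean of the squared errors
print(metrics.mean_squared_error(true, pred))            # 150.0

# RMSE: square root of the MSE
print(np.sqrt(metrics.mean_squared_error(true, pred)))   # ~12.25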
Goal: Decide whether the Newspaper feature should be included in the linear regression model for the advertising dataset
In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
In [2]:
# read in the advertising dataset
data = pd.read_csv('data/Advertising.csv', index_col=0)
In [3]:
# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']
# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]
# select the Sales column as the response (y)
y = data.Sales
In [5]:
# 10-fold cross-validation with all three features
lm = LinearRegression()
# note: scikit-learn negates the error metrics so that higher scores are always better
mae_scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_absolute_error')
print(mae_scores)
MSE is more popular than MAE because MSE "punishes" larger errors. But RMSE is even more popular than MSE because RMSE is interpretable in the units of "y".
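A tiny illustration of the point: two sets of errors with the same MAE can have very different MSEs, because squaring weights the single large miss much more heavily (the numbers below are made up):

import numpy as np

# two hypothetical error vectors with the same mean absolute error
errors_even = np.array([5, 5, 5, 5])      # errors spread evenly
errors_spiky = np.array([0, 0, 0, 20])    # same total error, one large miss

print(np.mean(np.abs(errors_even)), np.mean(np.abs(errors_spiky)))   # MAE: 5.0 and 5.0
print(np.mean(errors_even ** 2), np.mean(errors_spiky ** 2))         # MSE: 25.0 and 100.0
print(np.sqrt(np.mean(errors_spiky ** 2)))                           # RMSE: 10.0, back in the units of y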
In [6]:
# calculate the (negated) MSE scores in the same way
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)
In [7]:
# fix the sign of MSE scores
mse_scores = -scores
print(mse_scores)
In [8]:
# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)
In [9]:
# calculate the average RMSE
print(rmse_scores.mean())
In [10]:
# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())
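The same comparison can be wrapped in a small helper so that each candidate feature set is scored identically (a sketch; the rmse_cv name is ours, and it reuses the data, lm, and y objects defined above):

def rmse_cv(cols):
    """Return the mean 10-fold cross-validated RMSE for a list of feature columns."""
    scores = cross_val_score(lm, data[cols], y, cv=10, scoring='neg_mean_squared_error')
    return np.sqrt(-scores).mean()

for cols in [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio']]:
    print(cols, rmse_cv(cols))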
TASK
Select the best polynomial order for the Girth feature to use in the trees regression problem.
In [11]:
# load the trees dataset; alias the loader so it does not shadow the advertising DataFrame `data`
from pydataset import data as load_dataset
trees = load_dataset('trees')
In [12]:
# set up the features (X) and the response (y)
feature_cols = ['Girth', 'Height']
X = trees[feature_cols]
y = trees.Volume
# find the cross-validated (negated) MSE scores for the linear model
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)
# next: compare the cross-validation scores for higher polynomial orders of Girth (sketched below)
In [13]:
# engineer a squared version of Girth as an additional feature
trees['squared'] = trees['Girth'] ** 2
trees.head()
Out[13]:
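To actually pick the best polynomial order, one approach is to loop over candidate orders, build the corresponding power columns of Girth, and compare the cross-validated RMSE of each model. This is only a sketch, reusing lm and cross_val_score from above; the Girth_1-style column names and the range of orders tried are our choices, not part of the dataset:

# compare polynomial orders of Girth (plus Height) by cross-validated RMSE
for order in range(1, 5):
    cols = ['Height']
    for power in range(1, order + 1):
        col = f'Girth_{power}'
        trees[col] = trees['Girth'] ** power
        cols.append(col)
    scores = cross_val_score(lm, trees[cols], trees.Volume, cv=10, scoring='neg_mean_squared_error')
    print(order, np.sqrt(-scores).mean())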
Feature engineering and selection within cross-validation iterations
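The heading above points at a subtle but important issue: any feature engineering or feature selection step should be refit inside each cross-validation training fold, not on the full dataset, otherwise the scores are optimistically biased. scikit-learn's Pipeline handles this automatically. Below is a minimal sketch on the advertising data, assuming we want degree-2 polynomial features followed by univariate selection; the k=3 choice and the variable names are arbitrary:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression

# re-read the advertising data so this cell does not depend on earlier reassignments
ads = pd.read_csv('data/Advertising.csv', index_col=0)
X_ads = ads[['TV', 'Radio', 'Newspaper']]
y_ads = ads.Sales

# the pipeline refits the engineering and selection steps on each training fold
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # add squared and interaction terms
    SelectKBest(f_regression, k=3),                    # keep the 3 strongest features (arbitrary choice)
    LinearRegression()
)

scores = cross_val_score(pipe, X_ads, y_ads, cv=10, scoring='neg_mean_squared_error')
print(np.sqrt(-scores).mean())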