From the video series: Introduction to machine learning with scikit-learn
Pandas: popular Python library for data exploration, manipulation, and analysis
In [2]:
# conventional way to import pandas
import pandas as pd
In [3]:
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# display the first 5 rows
data.head()
Out[3]:
Primary object types: DataFrame (rows and columns, like a spreadsheet or SQL table) and Series (a single column).
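A quick sketch (not part of the original notebook) illustrating the two types, assuming data has been loaded as above:
# selecting a single column from a DataFrame returns a Series
print(type(data))           # <class 'pandas.core.frame.DataFrame'>
print(type(data['Sales']))  # <class 'pandas.core.series.Series'>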
In [4]:
# display the last 5 rows
data.tail()
Out[4]:
In [5]:
# check the shape of the DataFrame (rows, columns)
data.shape
Out[5]:
(200, 4)
What are the features? TV, Radio, and Newspaper: advertising dollars spent on each medium for a single product in a given market (in thousands of dollars).
What is the response? Sales: sales of the product in that market (in thousands of units).
What else do we know? Because the response variable is continuous, this is a regression problem. There are 200 observations (rows), each representing a single market.
Seaborn: Python library for statistical data visualization built on top of Matplotlib
To install Seaborn, run conda install seaborn from the command line.
In [6]:
# conventional way to import seaborn
import seaborn as sns
# allow plots to appear within the notebook
%matplotlib inline
In [7]:
# visualize the relationship between the features and the response using scatterplots
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')  # in Seaborn versions before 0.9, height was called size
Out[7]:
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$
In this case:
$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$
The $\beta$ values are called the model coefficients. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions!
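As a brief aside, here is a minimal sketch (not from the video) of what the "least squares" criterion means, using NumPy on hypothetical toy data; scikit-learn solves the same minimization internally:
import numpy as np

# toy data: one feature, five observations (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# prepend a column of ones so the intercept (beta_0) is learned too
X_design = np.column_stack([np.ones_like(x), x])

# find the betas that minimize the sum of squared errors
betas, _, _, _ = np.linalg.lstsq(X_design, y_toy, rcond=None)
print(betas)  # [beta_0, beta_1]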
In [8]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# equivalent command to do this in one line
X = data[['TV', 'Radio', 'Newspaper']]
# print the first 5 rows
X.head()
Out[8]:
In [9]:
# check the type and shape of X
print(type(X))
print(X.shape)
In [10]:
# select a Series from the DataFrame
y = data['Sales']
# equivalent command that works if there are no spaces in the column name
y = data.Sales
# print the first 5 values
y.head()
Out[10]:
In [11]:
# check the type and shape of y
print(type(y))
print(y.shape)
In [12]:
# split X and y into training and testing sets
# (the cross_validation module was renamed to model_selection in newer scikit-learn)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [13]:
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
In [14]:
# import model
from sklearn.linear_model import LinearRegression
# instantiate
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
Out[14]:
In [15]:
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
In [16]:
# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))
Out[16]:
How do we interpret the TV coefficient (0.0466)? For a given amount of Radio and Newspaper ad spending, a one-unit increase in TV ad spending is associated with a 0.0466-unit increase in Sales. In other words, an additional $1,000 spent on TV ads is associated with an increase in sales of about 46.6 units.
Important notes: this is a statement of association, not causation. If an increase in TV ad spending were instead associated with a decrease in Sales, the coefficient would be negative.
In [17]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)
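To see what predict is doing under the hood, here is a sketch (not from the video) that reproduces the predictions manually from the learned intercept and coefficients:
import numpy as np
# manual prediction: intercept plus the dot product of features and coefficients
y_pred_manual = linreg.intercept_ + np.dot(X_test.values, linreg.coef_)
# should match linreg.predict(X_test) up to floating-point noise
print(np.allclose(y_pred_manual, linreg.predict(X_test)))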
We need an evaluation metric in order to compare our predictions with the actual values!
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.
Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:
In [18]:
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
In [19]:
# calculate MAE by hand
print((10 + 0 + 20 + 10) / 4)
# calculate MAE using scikit-learn
from sklearn import metrics
print(metrics.mean_absolute_error(true, pred))
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
In [20]:
# calculate MSE by hand
print((10**2 + 0**2 + 20**2 + 10**2) / 4)
# calculate MSE using scikit-learn
print(metrics.mean_squared_error(true, pred))
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
In [21]:
# calculate RMSE by hand
import numpy as np
print(np.sqrt((10**2 + 0**2 + 20**2 + 10**2) / 4))
# calculate RMSE using scikit-learn
print(np.sqrt(metrics.mean_squared_error(true, pred)))
Comparing these metrics: MAE is the easiest to understand, because it is simply the average error. MSE is more popular than MAE, because MSE "punishes" larger errors. RMSE is even more popular than MSE, because RMSE is interpretable in the units of the response variable.
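A quick illustration (hypothetical values, not from the video) of how MSE and RMSE punish larger errors: the two prediction sets below have the same MAE, but the one with a single large miss has double the RMSE:
# same total absolute error, distributed differently
true_vals = [100, 50, 30, 20]
pred_spread = [90, 60, 40, 10]  # four errors of 10
pred_onebig = [60, 50, 30, 20]  # one error of 40
print(metrics.mean_absolute_error(true_vals, pred_spread))          # 10.0
print(metrics.mean_absolute_error(true_vals, pred_onebig))          # 10.0
print(np.sqrt(metrics.mean_squared_error(true_vals, pred_spread)))  # 10.0
print(np.sqrt(metrics.mean_squared_error(true_vals, pred_onebig)))  # 20.0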
In [22]:
# compute the RMSE of our predictions on the Advertising data
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
In [23]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# select a Series from the DataFrame
y = data.Sales
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
# make predictions on the testing set
y_pred = linreg.predict(X_test)
# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, Newspaper is unlikely to be useful for predicting Sales, and it should be removed from the model.
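This comparison generalizes to a small loop; here is a sketch (not from the video) that evaluates the test-set RMSE for a few candidate feature sets:
# compare test-set RMSE across candidate feature sets
candidate_features = [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio'], ['TV']]
for cols in candidate_features:
    X_sub = data[cols]
    X_tr, X_te, y_tr, y_te = train_test_split(X_sub, y, random_state=1)
    model = LinearRegression().fit(X_tr, y_tr)
    rmse = np.sqrt(metrics.mean_squared_error(y_te, model.predict(X_te)))
    print(cols, rmse)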