Data science pipeline: pandas, seaborn, scikit-learn

Agenda

  • How do I use the pandas library to read data into Python?
  • How do I use the seaborn library to visualize data?
  • What is linear regression, and how does it work?
  • How do I train and interpret a linear regression model in scikit-learn?
  • What are some evaluation metrics for regression problems?
  • How do I choose which features to include in my model?

Types of supervised learning

  • Classification: Predict a categorical response
  • Regression: Predict a continuous response

Reading data using pandas

Pandas: popular Python library for data exploration, manipulation, and analysis

  • Anaconda users: pandas is already installed
  • Other users: installation instructions

In [1]:
# conventional way to import pandas
import pandas as pd

In [2]:
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# display the first 5 rows
data.head()


Out[2]:
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9

Primary object types:

  • DataFrame: rows and columns (like a spreadsheet)
  • Series: a single column
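
For example (a quick sketch using the DataFrame loaded above), selecting a single column returns a Series, while selecting a list of columns returns a DataFrame:

# a single column selected with single brackets is a Series
print(type(data['Sales']))          # <class 'pandas.core.series.Series'>

# a list of columns selected with double brackets is a DataFrame
print(type(data[['TV', 'Radio']]))  # <class 'pandas.core.frame.DataFrame'>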

In [3]:
# display the last 5 rows
data.tail()


Out[3]:
TV Radio Newspaper Sales
196 38.2 3.7 13.8 7.6
197 94.2 4.9 8.1 9.7
198 177.0 9.3 6.4 12.8
199 283.6 42.0 66.2 25.5
200 232.1 8.6 8.7 13.4

In [4]:
# check the shape of the DataFrame (rows, columns)
data.shape


Out[4]:
(200, 4)

What are the features?

  • TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
  • Radio: advertising dollars spent on Radio
  • Newspaper: advertising dollars spent on Newspaper

What is the response?

  • Sales: sales of a single product in a given market (in thousands of items)

What else do we know?

  • Because the response variable is continuous, this is a regression problem.
  • There are 200 observations (represented by the rows), and each observation is a single market.

Visualizing data using seaborn

Seaborn: Python library for statistical data visualization built on top of Matplotlib

  • Anaconda users: run conda install seaborn from the command line
  • Other users: installation instructions

In [6]:
# conventional way to import seaborn
import seaborn as sns

# allow plots to appear within the notebook
%matplotlib inline


//anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

In [7]:
# visualize the relationship between the features and the response using scatterplots
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', size=7, aspect=0.7, kind='reg')


Out[7]:
<seaborn.axisgrid.PairGrid at 0x10c670750>

Linear regression

  • Pros: fast, no tuning required, highly interpretable, well-understood
  • Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)

Form of linear regression

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$

  • y is the response
  • β0 is the intercept
  • β1 is the coefficient for x1 (the first feature)
  • βn is the coefficient for xn (the nth feature)

In this case:

$$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$$

The β values are called the model coefficients. These values are "learned" during the model fitting step using the "least squares" criterion (a small NumPy sketch of this criterion appears below). Then, the fitted model can be used to make predictions!

Preparing X and y using pandas

  • scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
  • However, pandas is built on top of NumPy.
  • Thus, X can be a pandas DataFrame and y can be a pandas Series!
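
As an aside, the "least squares" criterion can be sketched directly with NumPy before handing the work to scikit-learn. This is only an illustration (it fits on the full dataset, so the numbers will differ slightly from the train-only fit below): prepend a column of ones for the intercept and solve the resulting least squares problem.

import numpy as np

# design matrix: a column of ones (for the intercept) followed by the three features
X_design = np.column_stack([np.ones(len(data)), data[['TV', 'Radio', 'Newspaper']].values])

# solve min ||X_design * beta - y||^2 for beta
beta = np.linalg.lstsq(X_design, data['Sales'].values)[0]

# beta[0] is the intercept; beta[1:] are the coefficients for TV, Radio, Newspaper
print(beta)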

In [8]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# equivalent command to do this in one line
X = data[['TV', 'Radio', 'Newspaper']]

# print the first 5 rows
X.head()


Out[8]:
TV Radio Newspaper
1 230.1 37.8 69.2
2 44.5 39.3 45.1
3 17.2 45.9 69.3
4 151.5 41.3 58.5
5 180.8 10.8 58.4

In [9]:
# check the type and shape of X
print type(X)
print X.shape


<class 'pandas.core.frame.DataFrame'>
(200, 3)

In [10]:
# select a Series from the DataFrame
y = data['Sales']

# equivalent command that works if there are no spaces in the column name
y = data.Sales

# print the first 5 values
y.head()


Out[10]:
1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: Sales, dtype: float64

In [11]:
# check the type and shape of y
print type(y)
print y.shape


<class 'pandas.core.series.Series'>
(200,)

Splitting X and y into training and testing sets


In [13]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [14]:
# default split is 75% for training and 25% for testing
print X_train.shape
print y_train.shape
print X_test.shape
print y_test.shape


(150, 3)
(150,)
(50, 3)
(50,)
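
The proportion can be changed with the test_size parameter of train_test_split; a quick sketch (separate variable names so it doesn't affect the rest of the notebook):

# illustrative only: hold out 30% of the rows instead of the default 25%
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train2.shape)  # (140, 3)
print(X_test2.shape)   # (60, 3)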

Linear regression in scikit-learn


In [15]:
# import model
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)


Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Interpreting model coefficients


In [16]:
# print the intercept and coefficients
print linreg.intercept_
print linreg.coef_


2.87696662232
[ 0.04656457  0.17915812  0.00345046]

In [17]:
# pair the feature names with the coefficients
zip(feature_cols, linreg.coef_)


Out[17]:
[('TV', 0.046564567874150267),
 ('Radio', 0.17915812245088833),
 ('Newspaper', 0.0034504647111804365)]

y=2.88+0.0466×TV+0.179×Radio+0.00345×Newspaper

How do we interpret the TV coefficient (0.0466)?

  • For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
  • Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.

Important notes:

  • This is a statement of association, not causation.
  • If an increase in TV ad spending was associated with a decrease in sales, β1 would be negative.
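
As a sanity check on this interpretation, here is a small sketch (the spending values are made up) showing that plugging numbers into the fitted equation reproduces what linreg.predict returns:

# hypothetical market: spending of 100, 25, and 25 (thousands of dollars) on TV, Radio, and Newspaper
new_market = pd.DataFrame([[100, 25, 25]], columns=['TV', 'Radio', 'Newspaper'])

# prediction from the fitted model
print(linreg.predict(new_market))

# the same number computed by hand from the intercept and coefficients
print(linreg.intercept_ + (linreg.coef_ * [100, 25, 25]).sum())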

Making predictions


In [18]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)

We need an evaluation metric in order to compare our predictions with the actual values!

Model evaluation metrics for regression

  • Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.
  • Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:

In [19]:
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

Mean Absolute Error (MAE) is the mean of the absolute value of the errors: $$\frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$$


In [20]:
# calculate MAE by hand
print (10 + 0 + 20 + 10)/4.

# calculate MAE using scikit-learn
from sklearn import metrics
print metrics.mean_absolute_error(true, pred)


10.0
10.0

Mean Squared Error (MSE) is the mean of the squared errors: $$\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$$


In [21]:
# calculate MSE by hand
print (10**2 + 0**2 + 20**2 + 10**2)/4.

# calculate MSE using scikit-learn
print metrics.mean_squared_error(true, pred)


150.0
150.0

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors: $$\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2}$$


In [22]:
# calculate RMSE by hand
import numpy as np
print np.sqrt((10**2 + 0**2 + 20**2 + 10**2)/4.)

# calculate RMSE using scikit-learn
print np.sqrt(metrics.mean_squared_error(true, pred))


12.2474487139
12.2474487139

Comparing these metrics:

  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
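
To make "punishes larger errors" concrete, here is a small made-up example: both prediction sets below have the same MAE, but the one with a single large miss has a much larger RMSE.

# same total absolute error, distributed differently
actual = [100, 100, 100, 100]
pred_even = [110, 110, 110, 110]   # four errors of 10
pred_spiky = [100, 100, 100, 140]  # one error of 40

print(metrics.mean_absolute_error(actual, pred_even))           # 10.0
print(metrics.mean_absolute_error(actual, pred_spiky))          # 10.0
print(np.sqrt(metrics.mean_squared_error(actual, pred_even)))   # 10.0
print(np.sqrt(metrics.mean_squared_error(actual, pred_spiky)))  # 20.0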

Computing the RMSE for our Sales predictions


In [25]:
print np.sqrt(metrics.mean_squared_error(y_test, y_pred))


1.40465142303

Feature selection

  • Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions?
  • Let's remove it from the model and check the RMSE!

In [26]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# select a Series from the DataFrame
y = data.Sales

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions
print np.sqrt(metrics.mean_squared_error(y_test, y_pred))


1.38790346994

The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, it is unlikely that this feature is useful for predicting Sales, and it should be removed from the model.
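
The same comparison can be wrapped in a small helper so that any candidate feature list is scored the same way; a sketch (the name train_test_rmse is just illustrative):

def train_test_rmse(feature_cols):
    """Split the data, fit a linear regression, and return the testing RMSE for the given features."""
    X = data[feature_cols]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print(train_test_rmse(['TV', 'Radio', 'Newspaper']))  # about 1.40
print(train_test_rmse(['TV', 'Radio']))               # about 1.39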

