Data science pipeline: pandas, seaborn, scikit-learn

Agenda

  • How do I use the pandas library to read data into Python?
  • How do I use the seaborn library to visualize data?
  • What is linear regression, and how does it work?
  • How do I train and interpret a linear regression model in scikit-learn?
  • What are some evaluation metrics for regression problems?
  • How do I choose which features to include in my model?

Types of supervised learning

  • Classification: Predict a categorical response
  • Regression: Predict a continuous response

Reading data using pandas

Pandas: popular Python library for data exploration, manipulation, and analysis

  • Anaconda users: pandas is already installed
  • Other users: installation instructions

In [1]:
# conventional way to import pandas
import pandas as pd

In [2]:
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# display the first 5 rows
data.head()


Out[2]:
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9

Primary object types:

  • DataFrame: rows and columns (like a spreadsheet)
  • Series: a single column
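
For example (a quick sketch using the DataFrame loaded above), selecting a single column returns a Series, while selecting a list of columns returns a DataFrame:

# a single column selected with single brackets is a Series
print(type(data['Sales']))          # <class 'pandas.core.series.Series'>

# a list of columns selected with double brackets is a DataFrame
print(type(data[['TV', 'Radio']]))  # <class 'pandas.core.frame.DataFrame'>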

In [3]:
# display the last 5 rows
data.tail()


Out[3]:
TV Radio Newspaper Sales
196 38.2 3.7 13.8 7.6
197 94.2 4.9 8.1 9.7
198 177.0 9.3 6.4 12.8
199 283.6 42.0 66.2 25.5
200 232.1 8.6 8.7 13.4

In [4]:
# check the shape of the DataFrame (rows, columns)
data.shape


Out[4]:
(200, 4)

What are the features?

  • TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
  • Radio: advertising dollars spent on Radio
  • Newspaper: advertising dollars spent on Newspaper

What is the response?

  • Sales: sales of a single product in a given market (in thousands of items)

What else do we know?

  • Because the response variable is continuous, this is a regression problem.
  • There are 200 observations (represented by the rows), and each observation is a single market.

Visualizing data using seaborn

Seaborn: Python library for statistical data visualization built on top of Matplotlib

  • Anaconda users: run conda install seaborn from the command line
  • Other users: installation instructions

In [6]:
# conventional way to import seaborn
import seaborn as sns

# allow plots to appear within the notebook
%matplotlib inline


//anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

In [7]:
# visualize the relationship between the features and the response using scatterplots
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', size=7, aspect=0.7, kind='reg')


Out[7]:
<seaborn.axisgrid.PairGrid at 0x10c670750>

Linear regression

  • Pros: fast, no tuning required, highly interpretable, well-understood
  • Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)

Form of linear regression

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$

  • y is the response
  • β0 is the intercept
  • β1 is the coefficient for x1 (the first feature)
  • βn is the coefficient for xn (the nth feature)

In this case:

$$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$$

The β values are called the model coefficients. These values are "learned" during the model fitting step using the "least squares" criterion (a small NumPy sketch of this criterion appears below). Then, the fitted model can be used to make predictions!

Preparing X and y using pandas

  • scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
  • However, pandas is built on top of NumPy.
  • Thus, X can be a pandas DataFrame and y can be a pandas Series!
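
As an aside, the "least squares" criterion can be sketched directly with NumPy before handing the work to scikit-learn. This is only an illustration (it fits on the full dataset, so the numbers will differ slightly from the train-only fit below): prepend a column of ones for the intercept and solve the resulting least squares problem.

import numpy as np

# design matrix: a column of ones (for the intercept) followed by the three features
X_design = np.column_stack([np.ones(len(data)), data[['TV', 'Radio', 'Newspaper']].values])

# solve min ||X_design * beta - y||^2 for beta
beta = np.linalg.lstsq(X_design, data['Sales'].values)[0]

# beta[0] is the intercept; beta[1:] are the coefficients for TV, Radio, Newspaper
print(beta)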

In [8]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# equivalent command to do this in one line
X = data[['TV', 'Radio', 'Newspaper']]

# print the first 5 rows
X.head()


Out[8]:
TV Radio Newspaper
1 230.1 37.8 69.2
2 44.5 39.3 45.1
3 17.2 45.9 69.3
4 151.5 41.3 58.5
5 180.8 10.8 58.4

In [9]:
# check the type and shape of X
print type(X)
print X.shape


<class 'pandas.core.frame.DataFrame'>
(200, 3)

In [10]:
# select a Series from the DataFrame
y = data['Sales']

# equivalent command that works if there are no spaces in the column name
y = data.Sales

# print the first 5 values
y.head()


Out[10]:
1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: Sales, dtype: float64

In [11]:
# check the type and shape of y
print type(y)
print y.shape


<class 'pandas.core.series.Series'>
(200,)

Splitting X and y into training and testing sets


In [13]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [14]:
# default split is 75% for training and 25% for testing
print X_train.shape
print y_train.shape
print X_test.shape
print y_test.shape


(150, 3)
(150,)
(50, 3)
(50,)
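
The proportion can be changed with the test_size parameter of train_test_split; a quick sketch (separate variable names so it doesn't affect the rest of the notebook):

# illustrative only: hold out 30% of the rows instead of the default 25%
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train2.shape)  # (140, 3)
print(X_test2.shape)   # (60, 3)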

Linear regression in scikit-learn


In [15]:
# import model
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)


Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Interpreting model coefficients


In [16]:
# print the intercept and coefficients
print linreg.intercept_
print linreg.coef_


2.87696662232
[ 0.04656457  0.17915812  0.00345046]

In [17]:
# pair the feature names with the coefficients
zip(feature_cols, linreg.coef_)


Out[17]:
[('TV', 0.046564567874150267),
 ('Radio', 0.17915812245088833),
 ('Newspaper', 0.0034504647111804365)]

y=2.88+0.0466×TV+0.179×Radio+0.00345×Newspaper

How do we interpret the TV coefficient (0.0466)?

  • For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
  • Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.

Important notes:

  • This is a statement of association, not causation.
  • If an increase in TV ad spending was associated with a decrease in sales, β1 would be negative.
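
As a sanity check on this interpretation, here is a small sketch (the spending values are made up) showing that plugging numbers into the fitted equation reproduces what linreg.predict returns:

# hypothetical market: spending of 100, 25, and 25 (thousands of dollars) on TV, Radio, and Newspaper
new_market = pd.DataFrame([[100, 25, 25]], columns=['TV', 'Radio', 'Newspaper'])

# prediction from the fitted model
print(linreg.predict(new_market))

# the same number computed by hand from the intercept and coefficients
print(linreg.intercept_ + (linreg.coef_ * [100, 25, 25]).sum())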

Making predictions


In [18]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)

We need an evaluation metric in order to compare our predictions with the actual values!

Model evaluation metrics for regression

  • Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.
  • Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:

In [19]:
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

Mean Absolute Error (MAE) is the mean of the absolute value of the errors: $$\frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$$


In [20]:
# calculate MAE by hand
print (10 + 0 + 20 + 10)/4.

# calculate MAE using scikit-learn
from sklearn import metrics
print metrics.mean_absolute_error(true, pred)


10.0
10.0

Mean Squared Error (MSE) is the mean of the squared errors: $$\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$$


In [21]:
# calculate MSE by hand
print (10**2 + 0**2 + 20**2 + 10**2)/4.

# calculate MSE using scikit-learn
print metrics.mean_squared_error(true, pred)


150.0
150.0

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors: $$\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2}$$


In [22]:
# calculate RMSE by hand
import numpy as np
print np.sqrt((10**2 + 0**2 + 20**2 + 10**2)/4.)

# calculate RMSE using scikit-learn
print np.sqrt(metrics.mean_squared_error(true, pred))


12.2474487139
12.2474487139

Comparing these metrics:

  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
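
To make "punishes larger errors" concrete, here is a small made-up example: both prediction sets below have the same MAE, but the one with a single large miss has a much larger RMSE.

# same total absolute error, distributed differently
actual = [100, 100, 100, 100]
pred_even = [110, 110, 110, 110]   # four errors of 10
pred_spiky = [100, 100, 100, 140]  # one error of 40

print(metrics.mean_absolute_error(actual, pred_even))           # 10.0
print(metrics.mean_absolute_error(actual, pred_spiky))          # 10.0
print(np.sqrt(metrics.mean_squared_error(actual, pred_even)))   # 10.0
print(np.sqrt(metrics.mean_squared_error(actual, pred_spiky)))  # 20.0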

Computing the RMSE for our Sales predictions


In [25]:
print np.sqrt(metrics.mean_squared_error(y_test, y_pred))


1.40465142303

Feature selection

  • Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions?
  • Let's remove it from the model and check the RMSE!

In [26]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# select a Series from the DataFrame
y = data.Sales

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions
print np.sqrt(metrics.mean_squared_error(y_test, y_pred))


1.38790346994

The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, it is unlikely that this feature is useful for predicting Sales, and it should be removed from the model.
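
The same comparison can be wrapped in a small helper so that any candidate feature list is scored the same way; a sketch (the name train_test_rmse is just illustrative):

def train_test_rmse(feature_cols):
    """Split the data, fit a linear regression, and return the testing RMSE for the given features."""
    X = data[feature_cols]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

print(train_test_rmse(['TV', 'Radio', 'Newspaper']))  # about 1.40
print(train_test_rmse(['TV', 'Radio']))               # about 1.39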

