Title: Linear Regression
Slug: linear_regression
Summary: A simple example of linear regression in scikit-learn
Date: 2016-08-19 12:00
Category: Machine Learning
Tags: Linear Regression
Authors: Chris Albon
Sources: scikit-learn, DrawMyData.
The purpose of this tutorial is to give a brief introduction to the logic of statistical model building as used in machine learning. If you want to read more about the theory behind this tutorial, check out An Introduction To Statistical Learning.
Let us get started.
In [1]:
import pandas as pd
from sklearn import linear_model
import random
import numpy as np
%matplotlib inline
With those libraries imported, let us load the dataset (the dataset is available in this site's GitHub repo).
In [2]:
# Load the data
df = pd.read_csv('../data/simulated_data/battledeaths_n300_cor99.csv')
# Shuffle the data's rows (This is only necessary because of the way I created
# the data using DrawMyData. This would not normally be necessary with a real analysis).
df = df.sample(frac=1)
Let us take a look at the first few rows of the data to get a sense of it.
In [3]:
# View the first few rows
df.head()
Out[3]:
Now let us plot the data so we can see its structure.
In [4]:
# Plot the two variables against each other
df.plot(x='friendly_battledeaths', y='enemy_battledeaths', kind='scatter')
Out[4]:
Now for the real work. To judge how good our model is, we need something to test it against. We can accomplish this using a technique called cross-validation. Cross-validation can get much more complicated and powerful, but in this example we are going to do the simplest version of the technique: holding out a test set.
In [5]:
# Create our predictor/independent variable
# and our response/dependent variable
X = df['friendly_battledeaths']
y = df['enemy_battledeaths']
# Create our test data from the first 30 observations
# (scikit-learn expects X to be 2D, so reshape the values into a single column)
X_test = X[0:30].values.reshape(-1, 1)
y_test = y[0:30]
# Create our training data from the remaining observations
X_train = X[30:].values.reshape(-1, 1)
y_train = y[30:]
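As a side note, scikit-learn also ships a helper that does this kind of shuffled train/test split in one step. Here is a minimal sketch, assuming a scikit-learn version that includes the model_selection module; the _alt variable names are just for illustration:
# Split the predictor and response, holding out 30 observations for testing
from sklearn.model_selection import train_test_split
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(
    X.values.reshape(-1, 1), y, test_size=30, random_state=1)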
Let us train the model using our training data.
In [6]:
# Create an ordinary least squares (OLS) regression object
ols = linear_model.LinearRegression()
In [7]:
# Train the model using our training data
model = ols.fit(X_train, y_train)
Here are some basic outputs of the model, notably the coefficient and the R-squared score.
In [8]:
# View the training model's coefficient
model.coef_
Out[8]:
In [9]:
# View the R-squared score on the test data
model.score(X_test, y_test)
Out[9]:
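The fitted intercept is stored on the trained model as well, if you want to see it:
# View the training model's intercept
model.intercept_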
Now that we have used the training data to train a model, called model, we can apply it to the test data's Xs to make predictions of the test data's Ys.
Previously we used X_train and y_train to train a linear regression model, which we stored in a variable called model. The code model.predict(X_test) applies the trained model to the X_test data, which the model has never seen before, to produce predicted values of Y.
This is easy to see by running the code:
In [10]:
# Run the model on X_test and show the first five results
list(model.predict(X_test)[0:5])
Out[10]:
This array of values is the model's best guesses for the values of the test data's Ys. Compare them to the actual test data Y values:
In [11]:
# View the first five test Y values
list(y_test)[0:5]
Out[11]:
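If it helps, you can also line the predictions up against the actual values in a small DataFrame; a quick illustrative sketch:
# Put the first five predictions next to the first five actual test values
pd.DataFrame({'predicted': model.predict(X_test)[0:5],
              'actual': y_test[0:5].values})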
The difference between the model's predicted values and the actual values (the residuals) is how we judge a model's accuracy, because a perfectly accurate model would have residuals of zero.
However, to judge a model, we want a single statistic (number) that we can use as a measure. We want this measure to capture the difference between the predicted values and the actual values across all observations in the data.
The most common statistic used for quantitative Ys is the residual sum of squares:
$$ RSS = \sum_{i=1}^{n}(y_{i}-f(x_{i}))^{2} $$
Don't let the mathematical notation throw you off:
- $f(x_{i})$ is the trained model applied to the test data: model.predict(X_test)
- $y_{i}$ is the actual Y values of the test data: y_test
- the squaring of each difference is **2
- the summation over all observations is .sum()
In the residual sum of squares, for each observation we find the difference between the model's predicted Y and the actual Y, then square that difference to make all the values positive. Then we add all those squared differences together to get a single number. The final result is a statistic representing how far the model's predictions were from the real values.
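To make the mapping from formula to code explicit, here is a purely illustrative loop that accumulates the same quantity one observation at a time (the vectorized version in the next cell is what you would actually use):
# Illustrative only: add up the squared residuals one observation at a time
rss = 0
for actual, predicted in zip(y_test, model.predict(X_test)):
    rss += (actual - predicted) ** 2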
In [12]:
# Apply the model we created using the training data
# to the test data, and calculate the RSS.
((y_test - model.predict(X_test)) **2).sum()
Out[12]:
Note: You can also use the mean squared error (MSE), which is the RSS divided by the number of observations (that is what np.mean does below). But I find it helpful to think in terms of the RSS.
In [13]:
# Calculate the MSE
np.mean((model.predict(X_test) - y_test) **2)
Out[13]:
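For what it is worth, scikit-learn also provides this metric directly, and it should give the same number:
# Calculate the MSE using scikit-learn's built-in metric
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, model.predict(X_test))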