Introduction to Linear Regression


In [ ]:
# only necessary if you're running Python 2.7 or lower
from __future__ import print_function, division
from six.moves import range

# import matplotlib and define our alias
from matplotlib import pyplot as plt

# plot figures within the notebook rather than externally
%matplotlib inline

# numpy
import numpy as np

# scipy 
import scipy

Overview

We are now going to take the skills we learned in Part 2 and apply them to some data analysis using linear regression. Although linear regression appears simple at first glance, it has a surprising amount of depth and is applicable across many domains. Most importantly, it offers an accessible way to get a handle on several big concepts in data analysis (especially model fitting) and serves as an excellent introduction to using Python to do science.

Data

In honor of the World Cup happening this year (right now!), we are going to analyze data from the 2014 World Cup! Everything is taken from https://github.com/openfootball/.


In [ ]:
# load in JSON
import json

In [ ]:
# load in our data
filename = 'worldcup_2014.json'

with open(filename, 'r') as f:
    cup_data = json.load(f)  # world cup JSON data

In [ ]:
# check what's in our dictionary
cup_data

In [ ]:
# check the relevant keys
cup_data.keys()

In [ ]:
# check the round-by-round match data
cup_data['rounds']
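
If the structure isn't obvious from the output, here's a rough sketch of what each entry of cup_data['rounds'] should look like, based on how we access it below (the field values shown are illustrative, not taken from the actual file):


In [ ]:
# illustrative sketch of one entry of cup_data['rounds'] (values made up)
example_round = {
    'name': 'Matchday 1',
    'matches': [
        {'team1': {'name': 'Brazil'},
         'team2': {'name': 'Croatia'},
         'score1': 3, 'score2': 1},
        # ... more matches ...
    ]
}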

Problem

Our starting goal here is going to be pretty simple: try to determine a simple linear relationship between the number of goals scored and the total number of games played. In other words, we want a model like

$$ y = ax + b $$

where $y$ is the number of games played, $x$ is the number of goals scored, and $a$ and $b$ are coefficients of the fit. We are using the number of games played as a proxy for ranking.

First, we need to figure out how many goals were scored during the group stages for each team. We are going to compute the "effective" number of goals by taking the difference between the number of goals scored "for" the team $x_{\rm for}$ and the number scored "against" $x_{\rm against}$:

$$ x = x_{\rm for} - x_{\rm against} $$

Let's create a dictionary for this.


In [ ]:
data = dict()

# read out world cup 2014 data from the dictionary
for matchup in cup_data['rounds']:
    for match in matchup['matches']:
        team1, team2 = match['team1']['name'], match['team2']['name']  # team names
        score1, score2 = match['score1'], match['score2']  # scores
        score = score1 - score2  # effective score (from team1's perspective)
        for team, s in [(team1, score), (team2, -score)]:
            if team not in data:  # initialize any team we haven't seen yet
                data[team] = {'goals': 0, 'games': 0}
            data[team]['goals'] += s
            data[team]['games'] += 1
        # the third-place match shouldn't count toward games played,
        # since it isn't part of the path to the finals
        if matchup['name'] == 'Match for third place':
            data[team1]['games'] -= 1
            data[team2]['games'] -= 1

# Check data.
data

Now that our data is in an accessible format, let's get it into something we can do math with. Copy the data from our dictionary into numpy arrays called x and y. Then plot the results.


In [ ]:
# copy over our data into x and y
x = ...
y = ...

# plot our results
# remember to label your axes!
...

Extra challenge: If you aren't already, plot with data points instead of lines.

Extra challenge: See if you can plot a line that fits the data reasonably well by just "eyeballing it". Try changing the slope and intercept to see which values appear to fit the data well and which don't.
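
If you get stuck, one possible approach looks like the sketch below (the teams variable is our own addition; any consistent way of pairing up the goals and games will do):


In [ ]:
# one possible approach (a sketch)
teams = list(data.keys())
x = np.array([data[team]['goals'] for team in teams])  # effective goals
y = np.array([data[team]['games'] for team in teams])  # games played

plt.plot(x, y, 'o')  # data points rather than lines
plt.xlabel('Effective goals scored')
plt.ylabel('Games played')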

Loss Function

In the absence of known errors, the way to solve this problem is to define an appropriate loss function $L(\boldsymbol{\Theta} \,|\, \mathbf{D})$ that we hope to minimize, where $\boldsymbol{\Theta}$ contains all the parameters of interest (here $a$ and $b$) and $\mathbf{D}$ contains all our data. Defining $\mathbf{x} \equiv \lbrace x_1, \dots, x_n \rbrace$ and $\mathbf{y} \equiv \lbrace y_1, \dots, y_n \rbrace$, let's write our loss function as the sum of the squares of the residuals

$$ L(a, b \,|\, \mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \left( \Delta y_i \right)^2 \equiv \sum_{i=1}^{n} \left( y(x_i) - y_i \right)^2 = \sum_{i=1}^{n} \left( a x_i + b - y_i \right)^2 \quad . $$

Let's define our linear relationship and loss function below.


In [ ]:
# Linear fit (takes input *vector* `theta` -> (a,b))
def linear(theta):
    ...

In [ ]:
# Loss function
def loss(theta):
    ...
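
As a reference, a minimal sketch of both functions might look like this (the _sketch names are ours, to avoid clobbering your own definitions; we assume the global arrays x and y from above):


In [ ]:
# minimal sketch (assumes the global arrays `x` and `y`)
def linear_sketch(theta):
    a, b = theta  # slope, intercept
    return a * x + b  # predicted y at each x

def loss_sketch(theta):
    return np.sum((linear_sketch(theta) - y) ** 2)  # sum of squared residuals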

Test out a few lines and see what the loss function is. Combined with the plot above, see if you can get a reasonable fit to the data.


In [ ]:
# testing out the linear fit and loss function

Finding the Best-Fit Line

The best-fit line ($a$ and $b$) is the one that minimizes our loss function. To compute this, we need to locate the critical points where

$$ \frac{\partial L(\boldsymbol{\Theta})}{\partial \Theta_i} = 0 $$

for all parameters $\Theta_i$ of interest within $\boldsymbol{\Theta}$. In our case, we get:

$$ \frac{\partial L}{\partial a} = \sum_{i=1}^{n} 2 x_i (a x_i + b - y_i) = 0 $$

$$ \frac{\partial L}{\partial b} = \sum_{i=1}^{n} 2 (a x_i + b - y_i) = 0 $$

This gives us two linear equations with two unknowns, which we can solve exactly using linear algebra to get an analytic best-fit solution $\hat{a}$ and $\hat{b}$. You're welcome to try to solve this explicitly now; we'll come back to this later.


In [ ]:
# space if you want to try your hand at solving the linear system explicitly
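
If you want to check your algebra, solving the two equations above gives the standard least-squares estimates. Here's a sketch (analytic_fit is our own helper name; we again assume x and y from above), along with a cross-check against np.polyfit:


In [ ]:
# closed-form least-squares solution (a sketch for checking your work)
def analytic_fit(x, y):
    xbar, ybar = np.mean(x), np.mean(y)
    a_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)  # slope
    b_hat = ybar - a_hat * xbar  # intercept
    return a_hat, b_hat

print(analytic_fit(x, y))
print(np.polyfit(x, y, 1))  # should agree (polyfit returns [slope, intercept])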

For now, we'll minimize our loss function using the minimize function from the scipy.optimize module. Using the documentation, see if you can figure out how to use minimize to get the best-fit parameters theta based on our loss function loss.


In [ ]:
# Minimize the loss function
from scipy.optimize import minimize

results = minimize(...)
theta_best = ...

# Print out results
print(results)
print('Best-Fit:', theta_best)

# Plot comparison between data and results
...

Extra challenge: Spend some time digging around to see if you can understand both how the output is stored (as a dict) and what some of the terms mean. We'll get back to these quantities in more detail later.
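
If the documentation proves tricky, a minimal call might look like the sketch below (the starting guess x0 is arbitrary, and the _sketch names are again ours):


In [ ]:
# a minimal sketch, assuming `loss` is defined above
results_sketch = minimize(loss, x0=np.array([1., 1.]))
theta_best_sketch = results_sketch['x']  # best-fit (a, b)
print(theta_best_sketch)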

Approximate Gaussian Errors

We got a best-fit result. But how reliable is it? How well do we really know the relationship between goals scored and final placement? To figure this out, we need to know what the overall errors on our result are. Let's explore that now, starting with our best-fit result.

A common approximation is to assume our errors are Gaussian, centered on the best fit $\boldsymbol{\Theta}_{\rm best}$ with a covariance matrix

$$ \mathbf{C} = \begin{bmatrix} \sigma_a^2 & \rho \sigma_a \sigma_b \\ \rho \sigma_a \sigma_b & \sigma_b^2 \\ \end{bmatrix} $$

where $\sigma_a$ and $\sigma_b$ are the standard deviations of the slope and intercept, respectively, and $\rho$ is their correlation. This is reported above in the hess_inv entry of the output.

Using the covariance matrix from hess_inv, generate Nmc=10000 random realizations of the slope $a$ and intercept $b$ using numpy.random.multivariate_normal. Then plot the results as a 2-D histogram with Nbins=20 bins.


In [ ]:
Nmc, Nbins = 10000, 20

# Draw random samples around best-fit given errors.
thetas = ...

# Plot draws as 2-D histogram.
# Remember to label your axes!
...

Challenge: Add in a labeled colorbar and change the colormap so that the scale goes from white to gray.

Extra challenge: Using the outputs from hist2d, find the location of the bin with the most counts.

Extra challenge: Plot 100 realizations of the predicted line/values over the data points from above.
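
In case it's helpful, a sketch of the basic sampling step (using our _sketch variables from above):


In [ ]:
# sketch: draw Gaussian samples around the best fit
cov = results_sketch['hess_inv']  # approximate covariance matrix
thetas_sketch = np.random.multivariate_normal(theta_best_sketch, cov, size=Nmc)

plt.hist2d(thetas_sketch[:, 0], thetas_sketch[:, 1], bins=Nbins)
plt.xlabel('Slope $a$')
plt.ylabel('Intercept $b$')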

This is useful, but doesn't give us everything we're looking for. We want to know, e.g., how many goals it takes to get to the finals. That's going to require us to take our $\boldsymbol{\Theta}$ values and transform them into a concrete number of goals needed to get to round 7 (the finals).


In [ ]:
# function to convert from theta to goals for a given round,
# i.e., solve y = a*x + b for x given y, a, b
def goals_needed(theta, matchup):  # we avoid the name "round" since it's a Python built-in
    ...
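
Since each round corresponds to a total number of games played (round 7 means 7 games), one way to sketch this is simply to invert $y = ax + b$:


In [ ]:
# sketch: invert y = a*x + b to get the goals needed for a given round
def goals_needed_sketch(theta, matchup):
    a, b = theta
    return (matchup - b) / a  # x = (y - b) / a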

Using your function, compute the number of goals needed to get to the finals and plot the distribution of the results as a histogram using plt.hist with Nbins=50 bins.


In [ ]:
# compute goals needed
final_preds = ...

# plot histogram of results
# Remember to label your axes!
Nbins = 50
...

Challenge: Change the histogram color and style so that you only plot a black outline with no color fill.

Extra challenge: See how the distribution changes for each round, starting from round 4 (the first elimination round at the end of the group stage).
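
For reference, a sketch of the basic computation (using our _sketch helpers from above):


In [ ]:
# sketch: goals needed to reach the finals (round 7) for each draw
final_preds_sketch = np.array([goals_needed_sketch(t, 7) for t in thetas_sketch])

plt.hist(final_preds_sketch, bins=Nbins)
plt.xlabel('Effective goals needed')
plt.ylabel('Counts')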

Bootstrap Resampling

In some cases, we might not fully trust our data or the errors reported from our fit. This can happen for any number of reasons, but it often happens when the fit appears to be sensitive to a few "special" data points. Instead, we can try to estimate the errors using bootstrap resampling.

Essentially, the idea is to pretend we observed the data again and re-fit our line. We obviously can't do that, so we approximate it by drawing, with replacement, a new set of data points from our original set of data points, where the probability of selecting any particular data point out of $N$ data points is $p = 1/N$.

We can do this numerically in Python using the numpy.random.choice function. Using the example below, create a function that returns a new set of x_resamp and y_resamp values.


In [ ]:
# create example data
N_example = 10  # number of data points
x_example = np.arange(N_example)  # x grid
y_example = np.arange(N_example) * 5  # y grid
print('Original data:')
print(x_example)
print(y_example)

# resample data
idxs = np.random.choice(N_example, size=N_example)  # resample `N_example` indices
x_example_r = x_example[idxs]  # select resampled x values
y_example_r = y_example[idxs]  # select resampled y values
print('\nResampled data:')
print(idxs)
print(x_example_r)
print(y_example_r)

In [ ]:
# define resampling function
def resample_data(x, y):
    ...
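
A minimal version, following the example above:


In [ ]:
# sketch: resample (x, y) pairs with replacement
def resample_data_sketch(x, y):
    idxs = np.random.choice(len(x), size=len(x))  # indices drawn with replacement
    return x[idxs], y[idxs]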

Now, using the code from earlier and a for loop, resample the data and recompute the best-fit line Nmc=10000 times.

Note: you will probably need to redefine your fitting functions from earlier to work with this resampled data.


In [ ]:
# redefine linear fit if necessary
def linear_resamp(theta):
    ...

# redefine loss function if necessary
def loss_resamp(theta):
    ...

In [ ]:
Nmc, Nbins = 10000, 20

# resample data and re-fit line
thetas_resamp = ...
for i in range(Nmc):
    x_resamp, y_resamp = resample_data(x, y)  # resample data
    results = minimize(...)  # minimize loss on the resampled data
    ...  # assign value to thetas_resamp
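
One way the loop might be filled in (a sketch; note that the loss has to be recomputed against each new resampled dataset):


In [ ]:
# sketch of the bootstrap loop
thetas_resamp_sketch = np.zeros((Nmc, 2))
for i in range(Nmc):
    x_r, y_r = resample_data_sketch(x, y)  # resample the data
    loss_i = lambda theta: np.sum((theta[0] * x_r + theta[1] - y_r)**2)
    results_i = minimize(loss_i, x0=np.array([1., 1.]))  # re-fit the line
    thetas_resamp_sketch[i] = results_i['x']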

Once you're done, plot the data as a 2-D histogram as before.


In [ ]:
# Plot draws as 2-D histogram.
# Remember to label your axes!
...

Finally, re-compute the number of goals needed to get to the finals for these new $\boldsymbol{\Theta}$ samples and plot the distribution of the results as a histogram using plt.hist with Nbins=50 bins.


In [ ]:
# compute goals needed
final_preds_resamp = ...

# plot histogram of results
# Remember to label your axes!
Nbins = 50
...

Extra challenge: See if you can directly compare the (1) 2-D histograms, (2) the 1-D histograms for the number of goals needed to go to the finals, and (3) the predicted lines for both cases.

Brute Force

We will cover this material if we have extra time.

Sometimes the best method for determining the errors is also the most straightforward: just check how good your fit is over a bunch of combinations of $a$ and $b$. More formally, over a grid in $a$ and $b$ you compute

$$ w(a, b) = \exp[-L(a,b)/2] $$

where $L(a,b)$ is the loss function from before and $w(a,b)$ is the relative weight.

Based on your results from above, define a grid of Nx=250 points for $a$ around the best fit and a grid of Ny=250 points for $b$ around the best fit. Then, compute the loss and relative weight for every $(a,b)$ combination in the grid.


In [ ]:
# define grids
a_grid = ...
b_grid = ...

# compute losses over grids
...

# compute weights
weights = ...
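
A sketch of how this might look (the grid half-widths below are arbitrary guesses; in practice you'd base them on the errors from hess_inv):


In [ ]:
# sketch: grids around the best fit (half-widths are arbitrary guesses)
Nx, Ny = 250, 250
a_best, b_best = theta_best_sketch
a_grid_sketch = np.linspace(a_best - 1., a_best + 1., Nx)
b_grid_sketch = np.linspace(b_best - 3., b_best + 3., Ny)

# loss at every (a, b) combination
losses_sketch = np.zeros((Nx, Ny))
for i, a in enumerate(a_grid_sketch):
    for j, b in enumerate(b_grid_sketch):
        losses_sketch[i, j] = np.sum((a * x + b - y)**2)

# relative weights (subtracting the minimum loss avoids numerical underflow)
weights_sketch = np.exp(-0.5 * (losses_sketch - losses_sketch.min()))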

Plot the weighted 2-D histogram of the fits and the weighted 1-D histogram of the goals needed to get to the finals. Note that meshgrid might be helpful here.


In [ ]:
Nbins = 20

# Plot draws as 2-D histogram.
# Remember to label your axes!
...
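
One possible sketch using meshgrid and the weights keyword of plt.hist2d:


In [ ]:
# sketch: weighted 2-D histogram over the grid
a_mesh, b_mesh = np.meshgrid(a_grid_sketch, b_grid_sketch, indexing='ij')
plt.hist2d(a_mesh.ravel(), b_mesh.ravel(), bins=Nbins,
           weights=weights_sketch.ravel())
plt.xlabel('Slope $a$')
plt.ylabel('Intercept $b$')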

In [ ]:
# compute goals needed
final_preds_grid = ...
final_preds_weights = ...

# plot histogram of results
# Remember to label your axes!
Nbins = 50
...
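
And for the 1-D distribution, one possible sketch:


In [ ]:
# sketch: weighted histogram of the goals needed to reach the finals
thetas_grid_sketch = np.c_[a_mesh.ravel(), b_mesh.ravel()]
final_preds_grid_sketch = np.array([goals_needed_sketch(t, 7)
                                    for t in thetas_grid_sketch])

plt.hist(final_preds_grid_sketch, bins=Nbins, weights=weights_sketch.ravel())
plt.xlabel('Effective goals needed')
plt.ylabel('Weighted counts')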