Linear Regression Analysis

Written by Jin Cheong & Luke Chang

In this lab we are going to learn how to do simple data analyses using Python. This module covers running simple linear regression models, comparing models, and visualizing fits. This notebook was adapted from material available here.

After the tutorial you will have the chance to apply the methods to a new set of data.


In [ ]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as stats

We will load the data, a comma-delimited text file, using the read_csv() function from pandas. The '../Data' in the path indicates that we move one folder up and then into the Data folder.


In [ ]:
# Load the data
df = pd.read_csv('../Data/salary.csv')

In [ ]:
# Remove rows with missing values
df = df.dropna()
# Keep only rows where gender is coded 0 or 1
df = df[df.gender != 2]

In [ ]:
# Check the size of the data set and count missing values in each column
print('There are %i rows and %i columns in this data set' % df.shape)
df.isnull().sum()

In [ ]:
df[['salary','years']].plot(kind='scatter', x='years', y='salary')

In [ ]:
# One-shot visualization
# Create a new numerical Series called dept_num (for visualization only!)
df['dept_num'] = df.departm.map({'bio':0, 'chem':1, 'geol':2, 'neuro':3, 'stat':4, 'physics':5, 'math':6})
df.tail()

In [ ]:
# Now plot salary against each of the four predictors
fig, axs = plt.subplots(1, 4, sharey=True)
df.plot(kind='scatter', x='gender', y='salary', ax=axs[0], figsize=(16, 4))
df.plot(kind='scatter', x='dept_num', y='salary', ax=axs[1])
df.plot(kind='scatter', x='years', y='salary', ax=axs[2])
df.plot(kind='scatter', x='age', y='salary', ax=axs[3])
# The problem is that it treats department as a continuous variable.

Linear Regression

Now let's move on to running an analysis.

Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:

$y = \beta_0 + \beta_1\cdot x $

What does each term represent?

  • $y$ is the response
  • $x$ is the feature
  • $\beta_0$ is the intercept
  • $\beta_1$ is the coefficient for x

Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To create your model, you must "estimate" the values of these coefficients. And once we've estimated these parameters, we can use the model to predict things!

Estimating Model Coefficients

Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors"):

What elements are present in the diagram?

  • The black dots are the observed values of x and y.
  • The blue line is our least squares line.
  • The red lines are the residuals, which are the distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?

  • $\beta_0$ is the intercept (the value of $y$ when $x$=0)
  • $\beta_1$ is the slope (the change in $y$ divided by change in $x$)

Here is a graphical depiction of those calculations:
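
As a quick numerical sketch (using the salary data loaded above and the years predictor, which we model formally later), the least squares slope and intercept can also be computed directly from their closed-form definitions:

In [ ]:
# Closed-form least squares estimates for salary ~ years (sketch)
x = df['years']
y = df['salary']
beta_1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()  # slope
beta_0 = y.mean() - beta_1 * x.mean()                                         # intercept
print('Intercept (beta_0): %.2f' % beta_0)
print('Slope (beta_1): %.2f' % beta_1)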

To run a regression we will use the statsmodels module that was loaded above. We will first initialize a model object with the regression equation and the data to use. The format for specifying the regression equation is similar to R (e.g., y ~ x).

Here we will estimate the effect of gender on salary. Gender is a "dummy variable", meaning that it takes only the values 0 and 1.

After we have initialized the object instance we can run the fit method (this can be chained so you only need to run one line). After the parameters have been estimated we can query attributes of the model such as the estimated model parameters.


In [ ]:
lm = smf.ols(formula = "salary ~ gender",data=df).fit()

# print the coefficients
print(lm.params)
print(lm.params['gender'])  # just the coefficient for gender

Interpreting Model Coefficients

How do we interpret the estimated coefficients for this model?

  • $\beta_0$, or the intercept, reflects the average salary for the reference condition of the dummy code (i.e., when gender = 0). This means that males make an average of $69,108.50.
  • To get the estimated female salary we add $\beta_1$ (-13,388.83), which gives $55,719.67 (see the quick check below).
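
As a quick check (a sketch using the lm model fit above, assuming gender is coded numerically as in this data set), we can compute both estimates directly from the parameters:

In [ ]:
# Quick check of the dummy-coding arithmetic
male_mean = lm.params['Intercept']                          # average salary when gender = 0
female_mean = lm.params['Intercept'] + lm.params['gender']  # add the gender coefficient
print('Estimated male salary: %.2f' % male_mean)
print('Estimated female salary: %.2f' % female_mean)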

We can also get a summary of all of the model statistics using the summary() method.


In [ ]:
print(lm.summary())

Confidence in our Model

Now let's examine the output. To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression.

Question: Is linear regression a high bias/low variance model, or a low bias/high variance model?

Answer: High bias/low variance. Under repeated sampling, the line will stay roughly in the same place (low variance), but the average of those models won't do a great job capturing the true relationship (high bias). Note that low variance is a useful characteristic when you don't have a lot of training data!

A closely related concept is confidence intervals. Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: If the population from which this sample was drawn was sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.
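
Statsmodels can report these intervals directly from the fitted model:

In [ ]:
# 95% confidence intervals for the model coefficients
print(lm.conf_int())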

Hypothesis Testing and p-values

Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.

(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative hypothesis may indeed be true, except that you just don't have enough data to show that.)

As it relates to model coefficients, here is the conventional hypothesis test:

  • null hypothesis: There is no relationship between gender and salary (and thus $\beta_1$ equals zero)
  • alternative hypothesis: There is a relationship between gender and salary (and thus $\beta_1$ is not equal to zero)

How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Relatedly, the p-value represents the probability of observing an effect at least as large as the one we estimated if the true coefficient were actually zero.
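
The p-values for the estimated coefficients are also available directly from the fitted model:

In [ ]:
# p-values for the estimated coefficients
print(lm.pvalues)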

Overall Model Fit

How well the model fits the data overall can be described using the AIC or BIC, which are information criteria that penalize the number of free parameters in the model. A common way to evaluate the overall fit of a linear model is the $R^2$ value. $R^2$ is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or equivalently the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)

$R^2$ is between 0 and 1, and higher is better because it means that more variance is explained by the model. Here's an example of what R-squared "looks like":

In the model above, we see that $R^2$=0.09, which means that gender alone does not do a great job of predicting salary.
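
The same value is available directly from the fitted model object:

In [ ]:
# R-squared: proportion of variance in salary explained by the model
print(lm.rsquared)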

Multiple Linear Regression

Simple linear regression can easily be extended to include multiple features. This is called multiple linear regression:

$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$

Each $x$ represents a different feature, and each feature has its own coefficient. In this case:

$Salary = \beta_0 + \beta_1 \cdot Gender + \beta_2 \cdot Years$

Let's use Statsmodels to estimate these coefficients:


In [ ]:
lm2 = smf.ols(formula = "salary ~ gender + years",data=df).fit()
print(lm2.summary())

Here we see that years also significantly explains additional variance.

This is reflected in the $R^2$, which has now increased from 0.09 to 0.143. It is important to note that $R^2$ will always increase as you add more features to the model, even if they are unrelated to the response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.

There is an alternative to $R^2$, called adjusted $R^2$, that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
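
Both quantities are available from the fitted model objects, so we can compare them for the two models above:

In [ ]:
# Compare R-squared and adjusted R-squared across the two models
print('Model 1: R^2 = %.3f, adjusted R^2 = %.3f' % (lm.rsquared, lm.rsquared_adj))
print('Model 2: R^2 = %.3f, adjusted R^2 = %.3f' % (lm2.rsquared, lm2.rsquared_adj))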

Next week we will explore another approach known as Cross-validation. Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models.

Here, we will test whether the new model explains significantly more variance, controlling for the extra free parameter, using a nested model comparison.


In [ ]:
print(stats.anova_lm(lm,lm2))

We see that adding years in model 2 explains significantly more variance compared to model 1. Notice that this is similar to the results provided by the t-tests on the individual parameters. Using model comparison to test incrementally adding a single predictor is a feature-selection method known as stepwise regression. However, you can also perform model comparison on models with many different features, as long as one is nested within the other (see the sketch below for another example).
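
As an illustrative sketch, we could extend the same approach and test whether adding age (another column in this data set) explains additional variance beyond model 2:

In [ ]:
# Sketch: a third nested model that also includes age
lm3 = smf.ols(formula = "salary ~ gender + years + age", data=df).fit()
print(stats.anova_lm(lm2, lm3))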

When we only had one predictor we could investigate the relationship using a scatterplot. Now that we have 2 predictors, we need to make a 3 dimensional plot to see the effect of each predictor on salary.


In [ ]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(18,6))
idx = [131,132,133]
pos = {0:(-35,20),1:(-90,0),2:(0,0)}
for i,ii in enumerate(idx):
    ax = fig.add_subplot(ii, projection='3d')

    # generate a mesh
    x_surf = np.arange(0,11)/10.0
    y_surf = np.arange(0,35,3)
    x_surf, y_surf = np.meshgrid(x_surf, y_surf)

    exog = pd.DataFrame({'gender': x_surf.ravel(), 'years': y_surf.ravel()})
    exog['intercept'] = np.ones(exog.shape[0])
    out = lm2.predict(exog = exog,transform=True)

    ax.plot_surface(x_surf, y_surf,
                np.asarray(out).reshape(x_surf.shape),
                rstride=1,
                cstride=1,
                color='None',
                alpha = 0.2)

    ax.scatter(df['gender'], df['years'], df['salary'],
           c='blue',
           marker='o',
           alpha=1)

    ax.set_xlabel('Gender')
    ax.set_ylabel('Years')
    ax.set_zlabel('Salary')
    ax.azim = pos[i][0]
    ax.elev = pos[i][1]

plt.tight_layout()

Notice how when we rotate the 3D plot, we can make projections into 2 dimensional space. This is the effect of each predictor on salary controlling for the other.

We can also generate additional diagnostic plots.


In [ ]:
f = sm.graphics.plot_regress_exog(lm2, 'years')
f.set_size_inches(10,5)

Lastly, we can create a standard ANOVA table with F statistics, as in a traditional ANOVA. The F statistic tests whether at least one of the coefficients associated with a categorical variable is significantly different from zero.

There are different types of sums of squares (SSQ) to consider. We recommend using Type 3 by default. See here for a more in-depth discussion.


In [ ]:
# Type 3 SSQ: valid for models with significant interaction terms
print('ANOVA table for Type 3 SSQ')
print(stats.anova_lm(lm2,typ=3))
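
For comparison, a Type 2 table can be requested the same way; this is just an illustrative sketch (Type 2 is often preferred when there are no interaction terms):

In [ ]:
# Type 2 SSQ for comparison (appropriate when there are no interaction terms)
print('ANOVA table for Type 2 SSQ')
print(stats.anova_lm(lm2, typ=2))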

Additional Resources for Learning about Regression

Exercises

Use 'salary_exercise.csv' to apply the analyses we learned today. This dataset was adapted from material available here.

These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:

  • sx = Sex, coded 1 for female and 0 for male
  • rk = Rank, coded
    • 1 for assistant professor,
    • 2 for associate professor, and
    • 3 for full professor
  • yr = Number of years in current rank
  • dg = Highest degree, coded 1 if doctorate, 0 if masters
  • yd = Number of years since highest degree was earned
  • sl = Academic year salary, in dollars.

Reference: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.

Make sure you check the data for missing values or other errors before beginning your analysis.

Exercise 1

Run a simple linear regression model in which you predict salary from sex of the professors.
Is there a gender difference in salary?

Exercise 2

Experiment with different combinations of variables in your model to answer the following questions.
What is your best model to explain salary from the given data?
Is there a gender difference in salary?