Linear Regression in Python

There are two main packages you can use to run basic linear regression models in Python: statsmodels and scikit-learn. I'll focus here on statsmodels.

We'll also use the data management package Pandas. It allows the user to organize the data into tables (called "dataframes") that mesh well with the statsmodels package.

There are two different versions of statsmodels you can import:

a. statsmodels.formula.api
b. statsmodels.api

I'll focus on statsmodels.formula.api, which was designed to be similar in style to the regression programs in R. It also works nicely with Pandas.
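
For reference, here's what the two import conventions look like side by side (the aliases 'sm' and 'smf' are common conventions, not requirements):

   import statsmodels.api as sm              #Matrix/array-based interface
   import statsmodels.formula.api as smf     #R-style formula interface (what we'll use below)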

Design

I'll use some random panel dataset I found on Professor Greene's website. It has both cross-sectional (U.S. states) and time-series (years 1970-1986) dimensions. Here are the details he provides:

Statewide Capital Productivity Data, lower 48 states, 17 years
Variables in the file are
STATE = state name
ST_ABB = state abbreviation
YR = year, 1970,...,1986
P_CAP = public capital
HWY = highway capital
WATER = water utility capital
UTIL = utility capital
PC = private capital
GSP = gross state product
EMP = employment
UNEMP = unemployment rate


I'll haphazardly run the following regression:

$$\text{GSP}_{i,t} = \alpha + \beta_1 \text{UNEMP}_{i,t} + \beta_2 \text{P\_CAP}_{i,t} + \beta_3 \text{PC}_{i,t} + \epsilon_{i,t}, \quad i = 1, \dots, 48, \quad t = 1970, \dots, 1986$$

For more info on the basics of Python, definitely take a look at Professor Backus's Data Bootcamp book and course.


In [1]:
import pandas as pd                       #This is Pandas, we'll call it 'pd' for short
import statsmodels.formula.api as smf     #This is the formula-based linear regression interface

"""
The following reads in the .csv file and saves it as a dataframe we call 'df'. You can read in other files
besides .csv, too. For example, .xls can be read in using pd.read_excel and Stata files as pd.read_stata, etc.
"""

df = pd.read_csv('http://people.stern.nyu.edu/wgreene/Econometrics/productivity.csv') 
                                                                
df = df.sort_values('YR')          #Ignore this, I'm just re-sorting the dataframe by year
df.index = range(1,len(df) + 1)    #Re-number the rows 1, 2, 3, ... after the sort

# This displays the first 10 rows of the dataframe
df.head(10)


Out[1]:
STATE YR P_CAP HWY WATER UTIL PC GSP EMP UNEMP
1 ALABAMA 1970 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5 4.7
2 TEXAS 1970 53639.88 25941.91 8352.71 19345.25 213676.41 160922 3624.9 4.4
3 INDIANA 1970 20813.65 8904.96 2814.86 9093.83 63646.79 56769 1849.0 5.0
4 TENNESSE 1970 20949.51 9280.24 2787.57 8881.70 38417.96 34769 1327.6 4.8
5 MICHIGAN 1970 44684.82 17924.92 8380.48 18379.42 84755.95 102172 2999.0 6.7
6 SOUTH_DAKOTA 1970 4173.17 3034.09 258.85 880.23 9202.46 5864 175.4 3.3
7 CALIFORNIA 1970 128545.36 42961.31 17837.26 67746.79 172791.92 263933 6946.2 7.2
8 MINNESOTA 1970 21948.43 10284.33 3682.74 7981.36 43287.56 41734 1315.3 4.2
9 FLORIDA 1970 29696.86 12622.30 4174.89 12899.67 57178.05 69641 2152.1 4.4
10 ILLINOIS 1970 52197.49 23582.87 6361.94 22252.68 114860.83 145792 4345.6 4.1

Above is just a snippet of our dataframe, df. It organizes the data into column vectors corresponding to the variables in the dataset. We can also select different subsets of the dataframe, for example:

   df['P_CAP']

returns just the variable P_CAP, while

   df[df['YR']==1970]

returns only rows corresponding to the year 1970.
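
You can also chain these selections together. For example, with our dataframe 'df', a quick sanity check like the following works (the particular statistics here are just for illustration):

   df['P_CAP'].head()                     #First few values of the P_CAP column
   df[df['YR']==1970].head()              #First few rows for the year 1970
   df[df['YR']==1970]['P_CAP'].mean()     #Average public capital across states in 1970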

Simple OLS over all observations in our dataframe:

Here's the basic idea behind linear regression code in statsmodels:

model = smf.ols(formula = 'Y ~ Var1 + Var2 + Var3', data = dataframe).fit()

Key things to notice about the above:

  1. The 'data = ' argument tells statsmodels which Pandas dataframe to pull the variables Y, Var1, Var2, Var3 (etc.) from.

  2. The 'formula = ' argument specifies the linear regression model you are trying to run. Y in this case is the dependent variable and the independent variables are Var1, Var2, and Var3. Notice that the dependent and independent variables are separated by a '~' and not an equal sign.

The variables 'Y', 'Var1', 'Var2', and 'Var3' correspond to the column names of the variables in our dataframe.

So, in our case, we will have

formula = 'GSP ~ UNEMP + P_CAP + PC'

and

data = df

In [2]:
model = smf.ols(formula = 'GSP ~ UNEMP + P_CAP + PC', data = df).fit()     #Fits the model

print(model.summary())  #Prints off a nice summary


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    GSP   R-squared:                       0.979
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                 1.273e+04
Date:                Sun, 21 Feb 2016   Prob (F-statistic):               0.00
Time:                        16:31:29   Log-Likelihood:                -8680.8
No. Observations:                 816   AIC:                         1.737e+04
Df Residuals:                     812   BIC:                         1.739e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   1294.7758   1117.749      1.158      0.247      -899.242  3488.794
UNEMP      -1079.1009    161.514     -6.681      0.000     -1396.135  -762.066
P_CAP          1.6092      0.025     63.226      0.000         1.559     1.659
PC             0.4564      0.012     38.637      0.000         0.433     0.480
==============================================================================
Omnibus:                      382.883   Durbin-Watson:                   1.901
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6543.639
Skew:                           1.686   Prob(JB):                         0.00
Kurtosis:                      16.457   Cond. No.                     2.88e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
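
As an aside, the formula interface also lets you transform variables inline, a feature of the 'patsy' formula language that statsmodels uses under the hood. The log specification below is purely for illustration, not part of our analysis:

   import numpy as np

   log_model = smf.ols(formula = 'np.log(GSP) ~ UNEMP + np.log(P_CAP) + np.log(PC)', data = df).fit()

(patsy evaluates 'np.log' using the numpy import in your namespace, so make sure numpy is imported first.)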

Simple OLS over a subset of the rows in our dataframe:

We can run the regression only over the observations corresponding to the year 1970 (for example) by specifying

data = df[df['YR']==1970]

in the code above. 'df[df['YR']==1970]' simply takes our original dataframe 'df' and returns only the rows that correspond to the year 1970.

We could also do this by first specifying a whole new dataframe, i.e.

df1970 = df[df['YR']==1970]

and then calling this new dataframe in the regression code, i.e.

data = df1970

In [3]:
model1970 = smf.ols(formula='GSP ~ UNEMP + P_CAP + PC', data=df[df['YR']==1970]).fit()     #Fits the model

"""
Notice that we've isolated a subset of the original dataframe (in this case the obs. corresponding to 1970)
by specificying 'data = df[df['YR']==1970]]' in the regression forumula above. 
"""

print(model1970.summary())  #Prints off a nice summary


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    GSP   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     970.3
Date:                Sun, 21 Feb 2016   Prob (F-statistic):           3.41e-40
Time:                        16:31:31   Log-Likelihood:                -492.49
No. Observations:                  48   AIC:                             993.0
Df Residuals:                      44   BIC:                             1000.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   7735.7899   4730.311      1.635      0.109     -1797.525  1.73e+04
UNEMP      -2263.7360    942.129     -2.403      0.021     -4162.473  -364.999
P_CAP          1.7329      0.085     20.470      0.000         1.562     1.904
PC             0.3664      0.047      7.859      0.000         0.272     0.460
==============================================================================
Omnibus:                       13.791   Durbin-Watson:                   1.678
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               18.155
Skew:                           0.947   Prob(JB):                     0.000114
Kurtosis:                       5.343   Cond. No.                     3.18e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.18e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Calling only select elements of the model (coefficients, standard errors, etc.):

If you type in,

model.

and then hit the 'tab' key, a box will pop up showing all the different items you can call from your regression.

(Remember that 'model' is the name we saved our regression under, so it will differ depending on what name you give it. I.e., if we had saved our regression as 'regression_results' then we would type 'regression_results.' instead.)
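
If you're running Python outside of a notebook, where tab completion may not be available, the built-in dir() function gives you the same list of attributes:

   print([name for name in dir(model) if not name.startswith('_')])    #Public attributes/methods only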


In [4]:
"""
Below are a few examples of calling specific elements of our regression
"""
model.params          #Produces all coefficient estimates
model.params['P_CAP']  #Produces coefficient estimate for the regressor 'P_CAP'

model.bse               #Standard Errors for all regressors 
model.bse['P_CAP']      #Standard Errors for regressor 'P_CAP'

model.pvalues           #P-values for all regressors
model.pvalues['P_CAP']  #P-values for regressor 'P_CAP'

r_sqr = model.rsquared          #R-squared

print('The R^2 for our regression is', r_sqr)


The R^2 for our regression is 0.979182835329
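
A few other elements of the fitted model that I find handy (this list is far from exhaustive):

   model.fittedvalues      #Fitted values of GSP for each observation
   model.resid             #Residuals for each observation
   model.conf_int()        #95% confidence intervals for each coefficient
   model.nobs              #Number of observations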

Iterating over multiple regression models and storing the results:

Loops in Python are pretty standard if you're familiar with writing loops in other programming languages.

Below are two examples of different ways you can loop over multiple regression models.


In [5]:
# Method 1:  [action(x) for x in range(a,b)]
fitted_models = [smf.ols(formula='GSP ~ UNEMP + P_CAP + PC',
                         data=df[df['YR']==x]).fit() for x in range(1970,1987)]


# Method 2:  for item in range(a,b):
fitted_models = list()
for item in range(1970,1987):
    fitted_models.append(smf.ols(formula='GSP ~ UNEMP + P_CAP + PC',
                         data=df[df['YR']==item]).fit())

#We can then call/store the coefficients/standard errors/etc in the same way:
P_CAPcoefs = [fitted_models[x].params['P_CAP'] for x in range(17)]
P_CAPpval = [fitted_models[x].pvalues['P_CAP'] for x in range(17)]
P_CAPse = [fitted_models[x].bse['P_CAP'] for x in range(17)]

df_PCAPResults = pd.DataFrame({'P_Cap':P_CAPcoefs,'Std. Errors':P_CAPse}, index = list(range(1970,1987)))

df_PCAPResults


Out[5]:
P_Cap Std. Errors
1970 1.732911 0.084657
1971 1.660908 0.090573
1972 1.644309 0.086715
1973 1.623895 0.086172
1974 1.498429 0.078137
1975 1.408807 0.077810
1976 1.458184 0.073763
1977 1.497926 0.083528
1978 1.548694 0.098303
1979 1.583611 0.108554
1980 1.582065 0.117925
1981 1.560765 0.097244
1982 1.571359 0.098870
1983 1.711065 0.113624
1984 1.861790 0.130734
1985 1.914637 0.148073
1986 1.872685 0.153722
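
One last note: we stored the p-values in 'P_CAPpval' above but never put them in the table. Since df_PCAPResults is an ordinary dataframe, adding them as a column (or saving everything to disk) works like any other dataframe operation. For example (the filename below is arbitrary):

   df_PCAPResults['P-value'] = P_CAPpval        #Add the stored p-values as a new column
   df_PCAPResults.to_csv('PCAP_results.csv')    #Save the results table as a .csv file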