Linear Regression in Python

There are two main packages you can use to run basic linear regression models in Python: statsmodels and scikit-learn. I'll focus here on statsmodels.

We'll also use the data management package Pandas. It allows the user to organize the data into tables (called "dataframes") that mesh well with the statsmodels package.

There are two different versions of statsmodels you can import:

a. statsmodels.formula.api
b. statsmodels.api

I'll focus on statsmodels.formula.api, which was designed to be similar in style to the regression programs in R. It also works nicely with Pandas.
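
For reference, here's what the two import conventions look like side by side (the aliases 'sm' and 'smf' are common conventions, not requirements):

   import statsmodels.api as sm              #Matrix/array-based interface
   import statsmodels.formula.api as smf     #R-style formula interface (what we'll use below)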

Design

I'll use some random panel dataset I found on Professor Greene's website. It has both cross-sectional (U.S. states) and time-series (years 1970-1986) dimensions. Here are the details he provides:

Statewide Capital Productivity Data, lower 48 states, 17 years
Variables in the file are
STATE = state name
ST_ABB = state abbreviation
YR = year, 1970,...,1986
P_CAP = public capital
HWY = highway capital
WATER = water utility capital
UTIL = utility capital
PC = private capital
GSP = gross state product
EMP = employment
UNEMP = unemployment rate


I'll haphazardly run the following regression:

$$\text{GSP}_{i,t} = \alpha + \beta_1 \text{UNEMP}_{i,t} + \beta_2 \text{P\_CAP}_{i,t} + \beta_3 \text{PC}_{i,t} + \epsilon_{i,t}, \quad i = 1, \dots, 48, \quad t = 1970, \dots, 1986$$

For more info on the basics of Python, definitely take a look at Professor Backus's Data Bootcamp book and course.


In [1]:
import pandas as pd                       #This is Pandas, we'll call it 'pd' for short
import statsmodels.formula.api as smf     #This is the formula-based linear regression interface

"""
The following reads in the .csv file and saves it as a dataframe we call 'df'. You can read in other files
besides .csv, too. For example, .xls can be read in using pd.read_excel and Stata files as pd.read_stata, etc.
"""

df = pd.read_csv('http://people.stern.nyu.edu/wgreene/Econometrics/productivity.csv') 
                                                                
df = df.sort_values('YR')          #Ignore this, I'm just re-sorting the dataframe by year
df.index = range(1,len(df) + 1)    #Re-number the rows 1, 2, 3, ... after the sort

# This displays the first 10 rows of the dataframe
df.head(10)


Out[1]:
STATE YR P_CAP HWY WATER UTIL PC GSP EMP UNEMP
1 ALABAMA 1970 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5 4.7
2 TEXAS 1970 53639.88 25941.91 8352.71 19345.25 213676.41 160922 3624.9 4.4
3 INDIANA 1970 20813.65 8904.96 2814.86 9093.83 63646.79 56769 1849.0 5.0
4 TENNESSE 1970 20949.51 9280.24 2787.57 8881.70 38417.96 34769 1327.6 4.8
5 MICHIGAN 1970 44684.82 17924.92 8380.48 18379.42 84755.95 102172 2999.0 6.7
6 SOUTH_DAKOTA 1970 4173.17 3034.09 258.85 880.23 9202.46 5864 175.4 3.3
7 CALIFORNIA 1970 128545.36 42961.31 17837.26 67746.79 172791.92 263933 6946.2 7.2
8 MINNESOTA 1970 21948.43 10284.33 3682.74 7981.36 43287.56 41734 1315.3 4.2
9 FLORIDA 1970 29696.86 12622.30 4174.89 12899.67 57178.05 69641 2152.1 4.4
10 ILLINOIS 1970 52197.49 23582.87 6361.94 22252.68 114860.83 145792 4345.6 4.1

Above is just a snippet of our dataframe, df. It organizes the data into column vectors corresponding to the variables in the dataset. We can also select different subsets of the dataframe, for example:

   df['P_CAP']

returns just the variable P_CAP, while

   df[df['YR']==1970]

returns only rows corresponding to the year 1970.
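
You can also chain these selections together. For example, with our dataframe 'df', a quick sanity check like the following works (the particular statistics here are just for illustration):

   df['P_CAP'].head()                     #First few values of the P_CAP column
   df[df['YR']==1970].head()              #First few rows for the year 1970
   df[df['YR']==1970]['P_CAP'].mean()     #Average public capital across states in 1970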

Simple OLS over all observations in our dataframe:

Here's the basic idea behind linear regression code in statsmodels:

model = smf.ols(formula = 'Y ~ Var1 + Var2 + Var3', data = dataframe).fit()

Key things to notice about the above:

  1. The 'data = ' argument tells statsmodels which Pandas dataframe to pull the variables Y, Var1, Var2, Var3 (etc.) from.

  2. The 'formula = ' argument specifies the linear regression model you are trying to run. Y in this case is the dependent variable and the independent variables are Var1, Var2, and Var3. Notice that the dependent and independent variables are separated by a '~' and not an equal sign.

The variables 'Y', 'Var1', 'Var2', and 'Var3' correspond to the column names of the variables in our dataframe.

So, in our case, we will have

formula = 'GSP ~ UNEMP + P_CAP + PC'

and

data = df

In [2]:
model = smf.ols(formula = 'GSP ~ UNEMP + P_CAP + PC', data = df).fit()     #Fits the model

print(model.summary())  #Prints off a nice summary


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    GSP   R-squared:                       0.979
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                 1.273e+04
Date:                Sun, 21 Feb 2016   Prob (F-statistic):               0.00
Time:                        16:31:29   Log-Likelihood:                -8680.8
No. Observations:                 816   AIC:                         1.737e+04
Df Residuals:                     812   BIC:                         1.739e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   1294.7758   1117.749      1.158      0.247      -899.242  3488.794
UNEMP      -1079.1009    161.514     -6.681      0.000     -1396.135  -762.066
P_CAP          1.6092      0.025     63.226      0.000         1.559     1.659
PC             0.4564      0.012     38.637      0.000         0.433     0.480
==============================================================================
Omnibus:                      382.883   Durbin-Watson:                   1.901
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6543.639
Skew:                           1.686   Prob(JB):                         0.00
Kurtosis:                      16.457   Cond. No.                     2.88e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
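
As an aside, the formula interface also lets you transform variables inline, a feature of the 'patsy' formula language that statsmodels uses under the hood. The log specification below is purely for illustration, not part of our analysis:

   import numpy as np

   log_model = smf.ols(formula = 'np.log(GSP) ~ UNEMP + np.log(P_CAP) + np.log(PC)', data = df).fit()

(patsy evaluates 'np.log' using the numpy import in your namespace, so make sure numpy is imported first.)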

Simple OLS over a subset of the rows in our dataframe:

We can run the regression only over the observations corresponding to the year 1970 (for example) by specifying

data = df[df['YR']==1970]

in the code above. 'df[df['YR']==1970]' simply takes our original dataframe 'df' and returns only the rows that correspond to the year 1970.

We could also do this by first specifying a whole new dataframe, i.e.

df1970 = df[df['YR']==1970]

and then calling this new dataframe in the regression code, i.e.

data = df1970

In [3]:
model1970 = smf.ols(formula='GSP ~ UNEMP + P_CAP + PC', data=df[df['YR']==1970]).fit()     #Fits the model

"""
Notice that we've isolated a subset of the original dataframe (in this case the obs. corresponding to 1970)
by specificying 'data = df[df['YR']==1970]]' in the regression forumula above. 
"""

print(model1970.summary())  #Prints off a nice summary


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    GSP   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     970.3
Date:                Sun, 21 Feb 2016   Prob (F-statistic):           3.41e-40
Time:                        16:31:31   Log-Likelihood:                -492.49
No. Observations:                  48   AIC:                             993.0
Df Residuals:                      44   BIC:                             1000.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   7735.7899   4730.311      1.635      0.109     -1797.525  1.73e+04
UNEMP      -2263.7360    942.129     -2.403      0.021     -4162.473  -364.999
P_CAP          1.7329      0.085     20.470      0.000         1.562     1.904
PC             0.3664      0.047      7.859      0.000         0.272     0.460
==============================================================================
Omnibus:                       13.791   Durbin-Watson:                   1.678
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               18.155
Skew:                           0.947   Prob(JB):                     0.000114
Kurtosis:                       5.343   Cond. No.                     3.18e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.18e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Calling only select elements of the model (coefficients, standard errors, etc.):

If you type in,

model.

and then hit the 'tab' key, a box will pop up showing all the different items you can call from your regression.

(Remember that 'model' is the name we saved our regression under, so it will differ depending on what name you give it. I.e., if we had saved our regression as 'regression_results' then we would type 'regression_results.' instead.)
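
If you're running Python outside of a notebook, where tab completion may not be available, the built-in dir() function gives you the same list of attributes:

   print([name for name in dir(model) if not name.startswith('_')])    #Public attributes/methods only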


In [4]:
"""
Below are a few examples of calling specific elements of our regression
"""
model.params          #Produces all coefficient estimates
model.params['P_CAP']  #Produces coefficient estimate for the regressor 'P_CAP'

model.bse               #Standard Errors for all regressors 
model.bse['P_CAP']      #Standard Errors for regressor 'P_CAP'

model.pvalues           #P-values for all regressors
model.pvalues['P_CAP']  #P-values for regressor 'P_CAP'

r_sqr = model.rsquared          #R-squared

print('The R^2 for our regression is', r_sqr)


The R^2 for our regression is 0.979182835329
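
A few other elements of the fitted model that I find handy (this list is far from exhaustive):

   model.fittedvalues      #Fitted values of GSP for each observation
   model.resid             #Residuals for each observation
   model.conf_int()        #95% confidence intervals for each coefficient
   model.nobs              #Number of observations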

Iterating over multiple regression models and storing the results:

Loops in Python are pretty standard if you're familiar with writing loops in other programming languages.

Below are two examples of different ways you can loop over multiple regression models.


In [5]:
# Method 1:  [action(x) for x in range(a,b)]
fitted_models = [smf.ols(formula='GSP ~ UNEMP + P_CAP + PC',
                         data=df[df['YR']==x]).fit() for x in range(1970,1987)]


# Method 2:  for item in range(a,b):
fitted_models = list()
for item in range(1970,1987):
    fitted_models.append(smf.ols(formula='GSP ~ UNEMP + P_CAP + PC',
                         data=df[df['YR']==item]).fit())

#We can then call/store the coefficients/standard errors/etc in the same way:
P_CAPcoefs = [fitted_models[x].params['P_CAP'] for x in range(17)]
P_CAPpval = [fitted_models[x].pvalues['P_CAP'] for x in range(17)]
P_CAPse = [fitted_models[x].bse['P_CAP'] for x in range(17)]

df_PCAPResults = pd.DataFrame({'P_Cap':P_CAPcoefs,'Std. Errors':P_CAPse}, index = list(range(1970,1987)))

df_PCAPResults


Out[5]:
P_Cap Std. Errors
1970 1.732911 0.084657
1971 1.660908 0.090573
1972 1.644309 0.086715
1973 1.623895 0.086172
1974 1.498429 0.078137
1975 1.408807 0.077810
1976 1.458184 0.073763
1977 1.497926 0.083528
1978 1.548694 0.098303
1979 1.583611 0.108554
1980 1.582065 0.117925
1981 1.560765 0.097244
1982 1.571359 0.098870
1983 1.711065 0.113624
1984 1.861790 0.130734
1985 1.914637 0.148073
1986 1.872685 0.153722
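
One last note: we stored the p-values in 'P_CAPpval' above but never put them in the table. Since df_PCAPResults is an ordinary dataframe, adding them as a column (or saving everything to disk) works like any other dataframe operation. For example (the filename below is arbitrary):

   df_PCAPResults['P-value'] = P_CAPpval        #Add the stored p-values as a new column
   df_PCAPResults.to_csv('PCAP_results.csv')    #Save the results table as a .csv file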