There are two main packages you can use to run basic linear regression models in Python: statsmodels and scikit-learn. I'll focus here on statsmodels.
We'll also use the data management package Pandas. It allows the user to organize the data into matrices (called "dataframes") that mesh well with the statsmodels package.
There are two different versions of statsmodels you can import:
a. statsmodels.formula.api
b. statsmodels.api
I'll focus on statsmodels.formulas.api, which was designed to be similar in style to the regression programs in R. It also works nicely with Pandas.
I'll use some random panel dataset I found on Professor Greene's website. It has both cross-sectional (U.S. states) and time-series (years 1970-1986) dimensions. Here are the details he provides:
Statewide Capital Productivity Data, lower 48 states, 17 years
Variables in the file are
STATE = state name
ST_ABB = state abbreviation
YR = year, 1970,...,1986
P_CAP = public capital
HWY = highway capital
WATER = water utility capital
UTIL = utility capital
PC = private capital
GSP = gross state product
EMP = employment
UNEMP = unemployment rate
I'll haphazardly run the following regression:
$$\text{GSP}_{i,t} = \alpha + \beta_1 \text{UNEMP}_{i,t} + \beta_2 \text{P_CAP}_{i,t} +\beta_3 \text{PC}_{i,t} + \epsilon_{i,t}, \quad i = 1, ..., 48 \quad t = 1970, ..., 1986 $$For more info on the basics of Python, definitely take a look at Professor Backus's Data Bootcamp book and course
In [1]:
import pandas as pd #This is Pandas, we'll call it 'pd' for short
import statsmodels.formula.api as smf #This is the linear regression program
"""
The following reads in the .csv file and saves it as a dataframe we call 'df'. You can read in other files
besides .csv, too. For example, .xls can be read in using pd.read_excel and Stata files as pd.read_stata, etc.
"""
df = pd.read_csv('http://people.stern.nyu.edu/wgreene/Econometrics/productivity.csv')
df = df.sort_values('YR') #Ignore this, I'm just re-sorting the dataframe by year
df.index = range(1,len(df) + 1)
# This prints the first 10 rows of the matrix
df.head(10)
Out[1]:
Above is just a snippet of our dataframe, df. It organizes the data into column vectors corresponding to the variable in the dataset. We can also select different subsets of the matrix, for example:
df['P_CAP']
returns just the variable P_CAP, while
df[df['YR']==1970]
returns only rows corresponding to the year 1970.
Here's the basic idea behind linear regression code in statsmodels:
model = smf.ols(formula = 'Y ~ Var1 + Var2 + Var3', data = dataframe).fit()
Key things to notice about the above is that
The 'data = ' command tells statsmodels which Pandas dataframe to pull the variables Y, Var1, Var2, Var3 (etc.) from.
The 'formula = ' command specifies the linear regression model you are trying to run. Y in this case is the dependent variable and the dependent variables are Var1, Var2, and Var3. Notice that the dependent and independent variables are seperated by a '~' and not an equal sign.
The variables 'Y', 'Var1', 'Var2', and 'Var3' correspond to the column names of the variables in our dataframe
So, in our case, we will have
formula = 'GSP ~ UNEMP + P_CAP + PC'
and
data = df
In [2]:
model = smf.ols(formula = 'GSP ~ UNEMP + P_CAP + PC', data = df).fit() #Fits the model
print(model.summary()) #Prints off a nice summary
We can run the regression only over the observations corresponding to the year 1970 (for example) by specifying
data = df[df['YR']==1970]
in the code above. 'df[df['YR']==1970]' simply takes our original dataframe 'df' and returns only the rows that correspond to the year 1970.
We could also do this by first specifying a whole new dataframe, i.e.
df1970 = df[df['YR']==1970]
and then calling this new dataframe in the regression code, i.e.
data = df1970
In [3]:
model1970 = smf.ols(formula='GSP ~ UNEMP + P_CAP + PC', data=df[df['YR']==1970]).fit() #Fits the model
"""
Notice that we've isolated a subset of the original dataframe (in this case the obs. corresponding to 1970)
by specificying 'data = df[df['YR']==1970]]' in the regression forumula above.
"""
print(model1970.summary()) #Prints off a nice summary
If you type in,
model.
and then hit the 'tab' key, a box will pop up showing all the different items you can call from your regression.
(Remember that 'model' is what we saved our regression as, so it will different depending on what name you give it. i.e. if we saved our regression as 'regression_results' then we would type 'regression_results'.)
In [4]:
"""
Below are a few examples of calling speficic elements of our regression
"""
model.params #Produces all coefficient estimates
model.params['P_CAP'] #Produces coefficient estimate for the regressor 'P_CAP'
model.bse #Standard Errors for all regressors
model.bse['P_CAP'] #Standard Errors for regressor 'P_CAP'
model.pvalues #P-values for all regressors
model.pvalues['P_CAP'] #P-values for regressor 'P_CAP'
r_sqr = model.rsquared #R-squared
print('The R^2 for our regression is',r_sqr)
In [5]:
# Method 1: [action(x) for x in range(a,b)]
fitted_models = [smf.ols(formula='GSP ~ UNEMP + P_CAP + PC',
data=df[df['YR']==x]).fit() for x in range(1970,1987)]
# Method 2: for item in range(a,b):
fitted_models = list()
for item in range(1970,1987):
fitted_models.append(smf.ols(formula='GSP ~ UNEMP + P_CAP + PC',
data=df[df['YR']==item]).fit())
#We can then call/store the coefficients/standard errors/etc in the same way:
P_CAPcoefs = [fitted_models[x].params['P_CAP'] for x in range(17)]
P_CAPpval = [fitted_models[x].pvalues['P_CAP'] for x in range(17)]
P_CAPse = [fitted_models[x].bse['P_CAP'] for x in range(17)]
df_PCAPResults = pd.DataFrame({'P_Cap':P_CAPcoefs,'Std. Errors':P_CAPse}, index = list(range(1970,1987)))
df_PCAPResults
Out[5]: