Regression Modeling in Practice

Assignment: Test a Multiple Regression Model

Following is the Python program I wrote to fulfill the third assignment of the Regression Modeling in Practice online course.

I decided to use Jupyter Notebook as it is a pretty way to write code and present results.

Assignment research question

Using the Gapminder database, I would like to see if there is a relationship between the income per person (explanatory variable) and the residential consumption of electricity (response variable).

The following variables will be tested also to improve the prediction model and figure out potential confounders:

Employment rate (total employees age of 15+)
Oil consumption per person (tonnes per year per person)
Urban rate (Urban population in %)

Data management

For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in Gapminder database are replaced directly by NaN no special data treatment is needed.



In [2]:

    
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load a useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display



In [3]:

    
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')

General information on the Gapminder data



In [4]:

    
display(Markdown("Number of countries: {}".format(len(data))))









    




Number of countries: 213

Data managment

In order to carry out the regression analysis, we need to center the potential explanatory variables.



In [5]:

    
subdata = (data[['incomeperperson', 'relectricperperson', 'employrate', 'oilperperson', 'urbanrate']]
               .assign(income=lambda x: pd.to_numeric(x['incomeperperson'], errors='coerce'),
                       electricity=lambda x: pd.to_numeric(x['relectricperperson'], errors='coerce'),
                       employ=lambda x: pd.to_numeric(x['employrate'], errors='coerce'),
                       oil=lambda x: pd.to_numeric(x['oilperperson'], errors='coerce'),
                       urban=lambda x: pd.to_numeric(x['urbanrate'], errors='coerce'))
               .dropna())

display(Markdown("Number of countries after discarding countries with missing data: {}".format(len(subdata))))

centered_data = (subdata.assign(income_c=lambda x: x['income'] - subdata['income'].mean(),
                                employ_c=lambda x: x['employ'] - subdata['employ'].mean(),
                                oil_c=lambda x: x['oil'] - subdata['oil'].mean(),
                                urban_c=lambda x: x['urban'] - subdata['urban'].mean()))









    




Number of countries after discarding countries with missing data: 61

Let's check the variables are well centered



In [6]:

    
display(Markdown("Mean of income per person after centering: {:3g}".format(centered_data['income_c'].mean())))
display(Markdown("Mean of employment rate after centering: {:3g}".format(centered_data['employ_c'].mean())))
display(Markdown("Mean of oil consumption per person after centering: {:3g}".format(centered_data['oil_c'].mean())))
display(Markdown("Mean of urban rate after centering: {:3g}".format(centered_data['urban_c'].mean())))









    




Mean of income per person after centering: -1.77426e-12







    




Mean of employment rate after centering: -2.56261e-15







    




Mean of oil consumption per person after centering: 5.49651e-16







    




Mean of urban rate after centering: 1.9569e-14

Bivariate distribution

Before looking at the multiple regression analysis, an polynomial regression will be applied on the data to see if it fits better the results.



In [7]:

    
sns.regplot(x='income_c', y='electricity', data=centered_data)
sns.regplot(x='income_c', y='electricity', order=2, data=centered_data)
plt.xlabel('Centered income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the income and the residential electricity consumption');

OLS regression model

Test the linear regression model.



In [8]:

    
reg1 = smf.ols('electricity ~ income_c', data=centered_data).fit()
reg1.summary()









    Out[8]:





OLS Regression Results

  Dep. Variable:        electricity      R-squared:             0.402


  Model:                    OLS          Adj. R-squared:        0.392


  Method:              Least Squares     F-statistic:           39.69


  Date:              Sun, 17 Jul 2016    Prob (F-statistic):  4.08e-08


  Time:                  19:02:59        Log-Likelihood:      -530.88


  No. Observations:           61         AIC:                   1066.


  Df Residuals:               59         BIC:                   1070.


  Df Model:                    1                                     


  Covariance Type:       nonrobust                                   




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept   1626.2603    189.670      8.574   0.000   1246.731  2005.790


  income_c       0.0950      0.015      6.300   0.000      0.065     0.125




  Omnibus:        80.246    Durbin-Watson:         2.135 


  Prob(Omnibus):   0.000    Jarque-Bera (JB):   1091.497 


  Skew:            3.660    Prob(JB):           9.65e-238


  Kurtosis:       22.387    Cond. No.           1.26e+04

Test a second order polynomial regression model.



In [9]:

    
reg2 = smf.ols('electricity ~ income_c + I(income_c**2)', data=centered_data).fit()
reg2.summary()









    Out[9]:





OLS Regression Results

  Dep. Variable:        electricity      R-squared:             0.413


  Model:                    OLS          Adj. R-squared:        0.393


  Method:              Least Squares     F-statistic:           20.43


  Date:              Sun, 17 Jul 2016    Prob (F-statistic):  1.92e-07


  Time:                  19:02:59        Log-Likelihood:      -530.31


  No. Observations:           61         AIC:                   1067.


  Df Residuals:               58         BIC:                   1073.


  Df Model:                    2                                     


  Covariance Type:       nonrobust                                   




                      coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept          1904.3071    325.761      5.846   0.000   1252.225  2556.389


  income_c              0.1117      0.022      5.095   0.000      0.068     0.156


  I(income_c ** 2)  -1.758e-06   1.68e-06     -1.049   0.298  -5.11e-06   1.6e-06




  Omnibus:        76.908    Durbin-Watson:         2.098 


  Prob(Omnibus):   0.000    Jarque-Bera (JB):    912.307 


  Skew:            3.504    Prob(JB):           7.85e-199


  Kurtosis:       20.602    Cond. No.           3.92e+08

From the second OLS analysis, we can see that the coefficient corresponding to the second order term has a p-value of 0.3 > 0.05. Therefore we should keep only the linear term for our primary relation.

Multiple regression

The multiple regression will be now carried out.



In [10]:

    
reg3 = smf.ols('electricity ~ income_c + oil_c + employ_c + urban_c', data=centered_data).fit()
reg3.summary()









    Out[10]:





OLS Regression Results

  Dep. Variable:        electricity      R-squared:             0.452


  Model:                    OLS          Adj. R-squared:        0.412


  Method:              Least Squares     F-statistic:           11.53


  Date:              Sun, 17 Jul 2016    Prob (F-statistic):  6.76e-07


  Time:                  19:03:00        Log-Likelihood:      -528.25


  No. Observations:           61         AIC:                   1067.


  Df Residuals:               56         BIC:                   1077.


  Df Model:                    4                                     


  Covariance Type:       nonrobust                                   




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept   1626.2603    186.470      8.721   0.000   1252.716  1999.804


  income_c       0.0758      0.020      3.883   0.000      0.037     0.115


  oil_c        123.7307    135.564      0.913   0.365   -147.837   395.298


  employ_c      48.6765     25.901      1.879   0.065     -3.209   100.562


  urban_c        2.3571     13.864      0.170   0.866    -25.415    30.129




  Omnibus:        62.530    Durbin-Watson:         2.146 


  Prob(Omnibus):   0.000    Jarque-Bera (JB):    537.653 


  Skew:            2.704    Prob(JB):           1.78e-117


  Kurtosis:       16.502    Cond. No.           1.26e+04

Unexpectedly all explanatory variables except the primary one (income per person) should be discarded as their coefficients have p-values higher than 0.05.

Finally we will look at the diagnostic graphs

Q-Q plot for normality



In [11]:

    
fig = sm.qqplot(reg1.resid, line='r')

The residuals do not follow correctly the line. Especially on the second half of the data. As our model is a single linear regression between residential electricity consumption and income per person, this means that the model is a poor predicator for country having larger income.

Residuals plot



In [12]:

    
stdres = pd.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o')
plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number');

The previous plot highlights only one clear extreme outlier. So this confirm that the model is fine and could be improve.

Partial regression plots



In [13]:

    
fig = plt.figure(figsize=(12,8))
sm.graphics.plot_regress_exog(reg3, 'urban_c', fig);



In [19]:

    
fig = plt.figure(figsize=(12,8))
sm.graphics.plot_regress_exog(reg3, 'oil_c', fig);



In [20]:

    
fig = plt.figure(figsize=(12,8))
sm.graphics.plot_regress_exog(reg3, 'employ_c', fig);

The partial regression plots above are shown for the sake of the assignement as all variables but the income per person are non-significant in the multiple regression model. They all show that some extreme outliers will be present.

And the partial plot against the urban rate has a horizontal partial regression line. This confirms that urban rate cannot improve the model.

Leverage plot



In [14]:

    
sm.graphics.influence_plot(reg1);

The leverage plot above shows that our extreme outlier United Arab Emirates does not have a strong influence on the estimation of the model coefficient. On the contrary Norway, the second border highest residual, has a important influence on the model estimation.

Analysis of trouble case

To conclude this assignment, I would to take the same question but taking the oil consumption as primary explanatory variables.

Let's first see if a second order fits better than the linear regression model.



In [15]:

    
sns.regplot(x='oil_c', y='electricity', data=centered_data);
sns.regplot(x='oil_c', y='electricity', order=2, data=centered_data)
plt.xlabel('Centered oil consumption per person (tonnes)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the oil and the residential electricity consumption');



In [16]:

    
reg_oil1 = smf.ols('electricity ~ oil_c', data=centered_data).fit()
reg_oil1.summary()









    Out[16]:





OLS Regression Results

  Dep. Variable:        electricity      R-squared:             0.199


  Model:                    OLS          Adj. R-squared:        0.185


  Method:              Least Squares     F-statistic:           14.64


  Date:              Sun, 17 Jul 2016    Prob (F-statistic):  0.000317


  Time:                  19:03:39        Log-Likelihood:      -539.82


  No. Observations:           61         AIC:                   1084.


  Df Residuals:               59         BIC:                   1088.


  Df Model:                    1                                     


  Covariance Type:       nonrobust                                   




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept   1626.2603    219.583      7.406   0.000   1186.877  2065.644


  oil_c        487.8025    127.505      3.826   0.000    232.666   742.939




  Omnibus:        47.149    Durbin-Watson:         1.867


  Prob(Omnibus):   0.000    Jarque-Bera (JB):    273.869


  Skew:            1.975    Prob(JB):           3.39e-60


  Kurtosis:       12.599    Cond. No.               1.72



In [17]:

    
reg_oil2 = smf.ols('electricity ~ oil_c + I(oil_c**2)', data=centered_data).fit()
reg_oil2.summary()









    Out[17]:





OLS Regression Results

  Dep. Variable:        electricity      R-squared:             0.595


  Model:                    OLS          Adj. R-squared:        0.581


  Method:              Least Squares     F-statistic:           42.58


  Date:              Sun, 17 Jul 2016    Prob (F-statistic):  4.18e-12


  Time:                  19:03:39        Log-Likelihood:      -519.02


  No. Observations:           61         AIC:                   1044.


  Df Residuals:               58         BIC:                   1050.


  Df Model:                    2                                     


  Covariance Type:       nonrobust                                   




                   coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept       2079.9567    168.620     12.335   0.000   1742.428  2417.486


  oil_c           1617.3225    175.683      9.206   0.000   1265.655  1968.990


  I(oil_c ** 2)   -152.9758     20.316     -7.530   0.000   -193.643  -112.309




  Omnibus:        46.176    Durbin-Watson:         1.686


  Prob(Omnibus):   0.000    Jarque-Bera (JB):    230.695


  Skew:            2.004    Prob(JB):           8.04e-51


  Kurtosis:       11.643    Cond. No.               19.1

From the OLS analysis, the second order regression fits better the results. But the outlier far on the right seems to deteriorate the accuracy of the linear regression coefficient.

This is confirm by the leverage plot below. Singapore has a strong influence on the regression model. This country is a singularity having a huge oil consumption but a reasonable residential electricity consumption.



In [18]:

    
sm.graphics.influence_plot(reg_oil1);

Anyway the multiple regression shows that oil consumption is not a significant explanatory variable of the residential electricity consumption. Indeed in this case income per person is a cofounder; that is the real explanatory variable.

Conclusion

In this assignment we have seen that only our primary explanatory variable (income per person) is a good to build a regression model on. However the R-Squared value being 0.4, there is still 60% of the electricity consumption variations not explain with the model; in particular the model performs poorly for country with higher income.

In the latter part, we use the tools described in this week to the trouble than can be raised by outlier data.

Dep. Variable:	electricity	R-squared:	0.402
Model:	OLS	Adj. R-squared:	0.392
Method:	Least Squares	F-statistic:	39.69
Date:	Sun, 17 Jul 2016	Prob (F-statistic):	4.08e-08
Time:	19:02:59	Log-Likelihood:	-530.88
No. Observations:	61	AIC:	1066.
Df Residuals:	59	BIC:	1070.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	1626.2603	189.670	8.574	0.000	1246.731 2005.790
income_c	0.0950	0.015	6.300	0.000	0.065 0.125

Omnibus:	80.246	Durbin-Watson:	2.135
Prob(Omnibus):	0.000	Jarque-Bera (JB):	1091.497
Skew:	3.660	Prob(JB):	9.65e-238
Kurtosis:	22.387	Cond. No.	1.26e+04

Omnibus:	76.908	Durbin-Watson:	2.098
Prob(Omnibus):	0.000	Jarque-Bera (JB):	912.307
Skew:	3.504	Prob(JB):	7.85e-199
Kurtosis:	20.602	Cond. No.	3.92e+08

Omnibus:	62.530	Durbin-Watson:	2.146
Prob(Omnibus):	0.000	Jarque-Bera (JB):	537.653
Skew:	2.704	Prob(JB):	1.78e-117
Kurtosis:	16.502	Cond. No.	1.26e+04

Omnibus:	47.149	Durbin-Watson:	1.867
Prob(Omnibus):	0.000	Jarque-Bera (JB):	273.869
Skew:	1.975	Prob(JB):	3.39e-60
Kurtosis:	12.599	Cond. No.	1.72

Omnibus:	46.176	Durbin-Watson:	1.686
Prob(Omnibus):	0.000	Jarque-Bera (JB):	230.695
Skew:	2.004	Prob(JB):	8.04e-51
Kurtosis:	11.643	Cond. No.	19.1