Following is the Python program I wrote to fulfill the third assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a pretty way to write code and present results.
Using the Gapminder database, I would like to see if there is a relationship between the income per person (explanatory variable) and the residential consumption of electricity (response variable).
The following variables will be tested also to improve the prediction model and figure out potential confounders:
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in Gapminder database are replaced directly by NaN
no special data treatment is needed.
In [2]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load a useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
In [3]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [4]:
display(Markdown("Number of countries: {}".format(len(data))))
In [5]:
subdata = (data[['incomeperperson', 'relectricperperson', 'employrate', 'oilperperson', 'urbanrate']]
.assign(income=lambda x: pd.to_numeric(x['incomeperperson'], errors='coerce'),
electricity=lambda x: pd.to_numeric(x['relectricperperson'], errors='coerce'),
employ=lambda x: pd.to_numeric(x['employrate'], errors='coerce'),
oil=lambda x: pd.to_numeric(x['oilperperson'], errors='coerce'),
urban=lambda x: pd.to_numeric(x['urbanrate'], errors='coerce'))
.dropna())
display(Markdown("Number of countries after discarding countries with missing data: {}".format(len(subdata))))
centered_data = (subdata.assign(income_c=lambda x: x['income'] - subdata['income'].mean(),
employ_c=lambda x: x['employ'] - subdata['employ'].mean(),
oil_c=lambda x: x['oil'] - subdata['oil'].mean(),
urban_c=lambda x: x['urban'] - subdata['urban'].mean()))
Let's check the variables are well centered
In [6]:
display(Markdown("Mean of income per person after centering: {:3g}".format(centered_data['income_c'].mean())))
display(Markdown("Mean of employment rate after centering: {:3g}".format(centered_data['employ_c'].mean())))
display(Markdown("Mean of oil consumption per person after centering: {:3g}".format(centered_data['oil_c'].mean())))
display(Markdown("Mean of urban rate after centering: {:3g}".format(centered_data['urban_c'].mean())))
In [7]:
sns.regplot(x='income_c', y='electricity', data=centered_data)
sns.regplot(x='income_c', y='electricity', order=2, data=centered_data)
plt.xlabel('Centered income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the income and the residential electricity consumption');
In [8]:
reg1 = smf.ols('electricity ~ income_c', data=centered_data).fit()
reg1.summary()
Out[8]:
Test a second order polynomial regression model.
In [9]:
reg2 = smf.ols('electricity ~ income_c + I(income_c**2)', data=centered_data).fit()
reg2.summary()
Out[9]:
From the second OLS analysis, we can see that the coefficient corresponding to the second order term has a p-value of 0.3 > 0.05. Therefore we should keep only the linear term for our primary relation.
In [10]:
reg3 = smf.ols('electricity ~ income_c + oil_c + employ_c + urban_c', data=centered_data).fit()
reg3.summary()
Out[10]:
In [11]:
fig = sm.qqplot(reg1.resid, line='r')
The residuals do not follow correctly the line. Especially on the second half of the data. As our model is a single linear regression between residential electricity consumption and income per person, this means that the model is a poor predicator for country having larger income.
In [12]:
stdres = pd.DataFrame(reg1.resid_pearson)
plt.plot(stdres, 'o')
plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number');
The previous plot highlights only one clear extreme outlier. So this confirm that the model is fine and could be improve.
In [13]:
fig = plt.figure(figsize=(12,8))
sm.graphics.plot_regress_exog(reg3, 'urban_c', fig);
In [19]:
fig = plt.figure(figsize=(12,8))
sm.graphics.plot_regress_exog(reg3, 'oil_c', fig);
In [20]:
fig = plt.figure(figsize=(12,8))
sm.graphics.plot_regress_exog(reg3, 'employ_c', fig);
The partial regression plots above are shown for the sake of the assignement as all variables but the income per person are non-significant in the multiple regression model. They all show that some extreme outliers will be present.
And the partial plot against the urban rate has a horizontal partial regression line. This confirms that urban rate cannot improve the model.
In [14]:
sm.graphics.influence_plot(reg1);
The leverage plot above shows that our extreme outlier United Arab Emirates does not have a strong influence on the estimation of the model coefficient. On the contrary Norway, the second border highest residual, has a important influence on the model estimation.
In [15]:
sns.regplot(x='oil_c', y='electricity', data=centered_data);
sns.regplot(x='oil_c', y='electricity', order=2, data=centered_data)
plt.xlabel('Centered oil consumption per person (tonnes)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the oil and the residential electricity consumption');
In [16]:
reg_oil1 = smf.ols('electricity ~ oil_c', data=centered_data).fit()
reg_oil1.summary()
Out[16]:
In [17]:
reg_oil2 = smf.ols('electricity ~ oil_c + I(oil_c**2)', data=centered_data).fit()
reg_oil2.summary()
Out[17]:
From the OLS analysis, the second order regression fits better the results. But the outlier far on the right seems to deteriorate the accuracy of the linear regression coefficient.
This is confirm by the leverage plot below. Singapore has a strong influence on the regression model. This country is a singularity having a huge oil consumption but a reasonable residential electricity consumption.
In [18]:
sm.graphics.influence_plot(reg_oil1);
Anyway the multiple regression shows that oil consumption is not a significant explanatory variable of the residential electricity consumption. Indeed in this case income per person is a cofounder; that is the real explanatory variable.
In this assignment we have seen that only our primary explanatory variable (income per person) is a good to build a regression model on. However the R-Squared value being 0.4, there is still 60% of the electricity consumption variations not explain with the model; in particular the model performs poorly for country with higher income.
In the latter part, we use the tools described in this week to the trouble than can be raised by outlier data.