Regression Modeling in Practice

Assignment: Test a Basic Linear Regression Model

Below is the Python program I wrote to fulfill the second assignment of the Regression Modeling in Practice online course.

I decided to use a Jupyter Notebook as it is a convenient way to write code and present results together.

Assignment research question

Using the Gapminder database, I would like to see if there is a linear relationship between the income per person (explanatory variable) and the residential consumption of electricity (response variable).

Data management

For the question I'm interested in, countries with missing data will be discarded. As missing values in the Gapminder database are replaced directly by NaN, no special data treatment is needed.


In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')

General information on the Gapminder data


In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))


Number of countries: 213

Number of variables: 15

Variable distributions

Before computing the linear regression between income per person and residential electricity consumption, let's have a look at the distributions of the variables.


In [4]:
subdata2 = (data[['incomeperperson', 'relectricperperson']]
                .assign(income=lambda x: pd.to_numeric(data['incomeperperson'], errors='coerce'),
                        electricity=lambda x: pd.to_numeric(data['relectricperperson'], errors='coerce'))
                .dropna())

Income per person


In [5]:
sns.distplot(subdata2['income'], kde=False)
plt.xlabel('Income per person (2000 US$)');


From the distribution graph, we can see that the distribution is right-skewed and bimodal. There is also one outlier with a very high income per person. That country is:


In [13]:
subdata2.loc[subdata2['income'] > 45000]


Out[13]:
             incomeperperson   relectricperperson    electricity        income
country
Luxembourg  52301.5871789984     1566.10613860734    1566.106139  52301.587179

In [15]:
subdata2['income'].describe()


Out[15]:
count      130.000000
mean      8784.531620
std      11420.725637
min        103.775857
25%       1156.551493
50%       3172.679153
75%      11606.578377
max      52301.587179
Name: income, dtype: float64

For this assignment, the explanatory variable (income per person) will have to be centered, as its mean is 8784.5 rather than 0.
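One payoff of centering worth noting: with a centered explanatory variable, the OLS intercept equals the mean of the response, which makes it directly interpretable. A minimal sketch on synthetic numbers (the variable names and values here are illustrative only, not the Gapminder data):

```python
import numpy as np

# Synthetic predictor and response (illustrative values)
rng = np.random.default_rng(42)
x = rng.normal(5000.0, 2000.0, 100)
y = 200.0 + 0.1 * x + rng.normal(0.0, 50.0, 100)

# Center the predictor by subtracting its mean
x_centered = x - x.mean()

# Fit a straight line by ordinary least squares
slope, intercept = np.polyfit(x_centered, y, 1)

# With a centered predictor, the fitted intercept equals the mean of y
print(abs(intercept - y.mean()) < 1e-6)  # True
```

This is visible in the notebook's own results below: the fitted intercept (1148.6157) matches the mean of the electricity variable (1148.615676).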

Residential electricity consumption


In [6]:
sns.distplot(subdata2['electricity'], kde=False)
plt.xlabel('Residential electricity consumption (kWh)');



In [16]:
subdata2['electricity'].describe()


Out[16]:
count      130.000000
mean      1148.615676
std       1584.959430
min          0.000000
25%        226.318460
50%        609.335172
75%       1484.702878
max      11154.755033
Name: electricity, dtype: float64

The residential electricity consumption is also right-skewed, and a couple of countries present unusually high values. Those countries are:


In [19]:
subdata2.loc[subdata2['electricity'] > 6000]


Out[19]:
                       incomeperperson   relectricperperson    electricity        income
country
Bahrain               12505.2125447354      7314.3556367342    7314.355637  12505.212545
Norway                39972.3527684608     7432.13085189506    7432.130852  39972.352768
United Arab Emirates  21087.3941247583     11154.7550328078   11154.755033  21087.394125

Bivariate distribution

Looking now at the joint distribution of the two variables in a scatter plot, we can see that the outlying countries identified above show up as outlier points. Nothing special was done about them, as they are not the result of data-management errors and there is no reason to discard them. Moreover, the spread of the data around the regression line grows with income, which suggests that the homoscedasticity assumption is not well met.


In [7]:
sns.regplot(x='income', y='electricity', data=subdata2)
plt.xlabel('Income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the income and the residential electricity consumption');


Centering the explanatory variable

For the assignment, the explanatory variable (here the income per person) has to be centered by subtracting its mean.


In [20]:
subdata3 = subdata2.assign(income_centered=lambda x: x['income']-subdata2['income'].mean())

display(Markdown("Income mean after centering: {:3g}".format(subdata3['income_centered'].mean())))

sns.regplot(x='income_centered', y='electricity', data=subdata3)

plt.xlabel('Centered income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the income and the residential electricity consumption');


Income mean after centering: -1.88895e-12

OLS regression model


In [9]:
reg1 = smf.ols('electricity ~ income_centered', data=subdata3).fit()
reg1.summary()


Out[9]:
                            OLS Regression Results
==============================================================================
Dep. Variable:            electricity   R-squared:                       0.425
Model:                            OLS   Adj. R-squared:                  0.420
Method:                 Least Squares   F-statistic:                     94.47
Date:                Tue, 12 Jul 2016   Prob (F-statistic):           4.63e-17
Time:                        13:37:53   Log-Likelihood:                -1105.9
No. Observations:                 130   AIC:                             2216.
Df Residuals:                     128   BIC:                             2222.
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept        1148.6157    105.855     10.851      0.000       939.164   1358.067
income_centered     0.0904      0.009      9.719      0.000         0.072      0.109
==============================================================================
Omnibus:                      148.000   Durbin-Watson:                   2.123
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4079.319
Skew:                           4.030   Prob(JB):                         0.00
Kurtosis:                      29.232   Cond. No.                     1.14e+04
==============================================================================
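Since the scatter plot suggested the homoscedasticity assumption may be violated, a residuals-versus-fitted check is a natural follow-up. In the notebook one would plot `reg1.fittedvalues` against `reg1.resid`; the sketch below refits a simple OLS line on simulated stand-in data (the names and values are illustrative, not the Gapminder data) to show the residual computation:

```python
import numpy as np

# Simulated stand-ins for the notebook's variables (illustrative only)
rng = np.random.default_rng(0)
income_centered = rng.normal(0.0, 10000.0, 130)
electricity = 1148.6 + 0.09 * income_centered + rng.normal(0.0, 500.0, 130)

# Fit a straight line by ordinary least squares and compute residuals
slope, intercept = np.polyfit(income_centered, electricity, 1)
fitted = intercept + slope * income_centered
residuals = electricity - fitted

# OLS residuals are mean-zero by construction; a funnel shape in a
# residuals-vs-fitted scatter would indicate heteroscedasticity.
print(abs(residuals.mean()) < 1e-6)  # True
```

A scatter of `residuals` against `fitted` (e.g. with `plt.scatter(fitted, residuals)`) would make any widening spread visible directly.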

Conclusion

From the OLS analysis above, there is a significant positive linear relationship between income per person and residential electricity consumption: the slope coefficient is 0.0904 (95% CI: 0.072 to 0.109), and the p-value (4.63e-17) is far below 0.05.
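To put the slope in concrete terms, a quick back-of-the-envelope calculation using the coefficient reported above:

```python
# Fitted slope from the OLS summary: kWh of residential electricity
# per 2000-US$ of income per person
slope = 0.0904

# Predicted change in residential consumption for a 1000 US$
# increase in income per person
print(round(slope * 1000, 1))  # 90.4 kWh
```

In other words, each additional 1000 US$ of income per person is associated with roughly 90 kWh of additional residential electricity consumption, on average.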