Following is the Python program I wrote to fulfill the second assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a pretty way to write code and present results.
Using the Gapminder database, I would like to see if there is a linear relationship between the income per person (explanatory variable) and the residential consumption of electricity (response variable).
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in Gapminder database are replace directly by NaN
no special data treatment is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load a useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))
In [4]:
subdata2 = (data[['incomeperperson', 'relectricperperson']]
.assign(income=lambda x: pd.to_numeric(data['incomeperperson'], errors='coerce'),
electricity=lambda x: pd.to_numeric(data['relectricperperson'], errors='coerce'))
.dropna())
In [5]:
sns.distplot(subdata2['income'], kde=False)
plt.xlabel('Income per person (2000 US$)');
From the distribution graph, we can see that the distribution is skewed-right and bimodal. There is also a singular case with high income per capita. That country is :
In [13]:
subdata2.loc[subdata2['income'] > 45000]
Out[13]:
In [15]:
subdata2['income'].describe()
Out[15]:
For this assignment the explanatory variable income per person will have to be centered as its mean is 8784.5.
In [6]:
sns.distplot(subdata2['electricity'], kde=False)
plt.xlabel('Residential electricity consumption (kWh)');
In [16]:
subdata2['electricity'].describe()
Out[16]:
The residential electricity consumption is also skewed-right. And there are also a couple of countries presenting unusual higher values. Those countries are :
In [19]:
subdata2.loc[subdata2['electricity'] > 6000]
Out[19]:
If we look now at the distribution of the two variables in a scatter plot. We can see that our singular countries for the two variables of study result in outlier points. Nothing special was done about them as there are no managment error on them and no reason to discard them. Moreover, the distribution of the data along the regression line shows that the homoscedasticity hypothesis is not well met as the spread of the data along the line is bigger for higher values of income.
In [7]:
sns.regplot(x='income', y='electricity', data=subdata2)
plt.xlabel('Income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the income and the residential electricity consumption');
In [20]:
subdata3 = subdata2.assign(income_centered=lambda x: x['income']-subdata2['income'].mean())
display(Markdown("Income mean after centereing : {:3g}".format(subdata3['income_centered'].mean())))
sns.regplot(x='income_centered', y='electricity', data=subdata3)
plt.xlabel('Centered income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
plt.title('Scatterplot for the association between the income and the residential electricity consumption');
In [9]:
reg1 = smf.ols('electricity ~ income_centered', data=subdata3).fit()
reg1.summary()
Out[9]: