In this assignment I've chosen the Gapminder dataset. Looking through its codebook we've decided to study the relationship of the numeric variables incomeperperson
and lifeexpectancy
taking into account the numeric variable urbanrate
as a potential moderator:
2010 Gross Domestic Product per capita in constant 2000 US$. The World Bank Work Development inflation but not the differences in the cost of living between countries Indicators has been taken into account.
2011 life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same.
2008 urban population (% of total). Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects).
In [3]:
# Import all ploting and scientific library,
# and embed figures in this file.
%pylab inline
# Package to manipulate dataframes.
import pandas as pd
# Nice looking plot functions.
import seaborn as sn
# The Pearson correlation function.
from scipy.stats import pearsonr
# Read the dataset.
df = pd.read_csv('data/gapminder.csv')
# Set the country name as the index of the dataframe.
df.index = df.country
# This column is no longer needed.
del df['country']
# Select only the variables we're interested.
df = df[['lifeexpectancy','incomeperperson', 'urbanrate']]
# Convert the types.
df.lifeexpectancy = pd.to_numeric(df.lifeexpectancy, errors='coerce')
df.incomeperperson = pd.to_numeric(df.incomeperperson, errors='coerce')
df.urbanrate = pd.to_numeric(df.urbanrate, errors='coerce')
# Remove missing values.
df = df.dropna()
In order to verifify whether the moderator variabel, urbanrate
, plays a role into the interaction between incomeperperon
and lifeexpectancy
, we'll subset our dataset into two groups: onde group for countries below 50% of urbanrate population, and the other group with countries equal or above 50% of urbanrate population.
In [6]:
# Dataset with low urban rate.
df_low = df[df.urbanrate < 50]
# Dataset with high urban rate.
df_high = df[df.urbanrate >= 50]
For each subset, we'll conduct the Pearson correlation analysis and verify the results.
In [7]:
r_low = pearsonr(df_low.incomeperperson, df_low.lifeexpectancy)
r_high = pearsonr(df_high.incomeperperson, df_high.lifeexpectancy)
In [10]:
print('Correlation in LOW urban rate: {}'.format(r_low))
In [11]:
print('Correlation in HIGH urban rate: {}'.format(r_high))
In [13]:
print('Percentage of variability LOW urban rate: {:2}%'.
format(round(r_low[0]**2*100,2)))
In [15]:
print('Percentage of variability HIGH urban rate: {:2}%'.
format(round(r_high[0]**2*100,2)))
In [34]:
# Silent matplotlib warning.
import warnings
warnings.filterwarnings('ignore',category=FutureWarning)
# Setting an apropriate size for the graph.
f,a = subplots(1, 2)
f.set_size_inches(12,6)
# Plot the graph.
sn.regplot(df_low.incomeperperson, df_low.lifeexpectancy, ax=a[0]);
a[0].set_title('Countries with LOW urbanrate', fontweight='bold');
sn.regplot(df_high.incomeperperson, df_high.lifeexpectancy, ax=a[1]);
a[1].set_title('Countries with HIGH urbanrate', fontweight='bold');
As we can see above, the correlation in countries with high urban rate, urbanrate
$\ge 50\%$, is higher than in countries with low urban rate, and in both cases the $pvalue$ is significan. So, we can say the variable urbanrate
moderates the relationship between lifeexpectancy
and incomeperperson
.
End of assignment.