In this assignment I've chosen the Gapminder dataset. Looking through its codebook we've decided to study two variables, incomeperperson and lifeexpectancy:
2010 Gross Domestic Product per capita in constant 2000 US$. The World Bank Work Development inflation but not the differences in the cost of living between countries Indicators has been taken into account.
2011 life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same.
We'll be studying the relationship between the income and life expectancy. To fulfil the assignment requirements, we'll transformate the numeric variable incomeperperson
into categorical using the US poverty threshold as a guideline to divide this variable up in three values: countries wich income per capita is less than 40% of this threshold (not included) will be classified as low income, those between 40% (included) and the thresold (not included) will be classified as medium income, and those equal or above will be high income.
The null hypothesis $H_o$ is that the life expectancy between the countries with low, medium and high income are the same, or in other words, income is not a drive for life expectancy. The alternative $H_a$ hypothesis is that life expectancy between the classified countries are not the same.
In [87]:
# Import all ploting and scientific library,
# and embed figures in this file.
%pylab inline
# Package to manipulate dataframes.
import pandas as pd
# Nice looking plot functions.
import seaborn as sn
# Read the dataset.
df = pd.read_csv('data/gapminder.csv')
# Set the country name as the index of the dataframe.
df.index = df.country
# This column is no longer needed.
#del df['country']
# Select only the variables we're interested.
df = df[['incomeperperson','lifeexpectancy']]
# Convert the types properly.
df = df.convert_objects(convert_numeric=True)
# Remove missing values.
df = df.dropna()
In [88]:
df.describe()
Out[88]:
In [89]:
# http://www.irp.wisc.edu/faqs/faq1.htm
income_threshold= 11720
income_level = pd.cut(df.incomeperperson,
[0, income_threshold*0.4, income_threshold, 110000 ],
labels=['Low income', 'Medium income', 'High income'])
Looking at the distribution of countries by the new categorical variable income_level
in the graph below, the majority of them take low income per capita, followed by high income and medium income.
In [120]:
il = income_level.value_counts()
f, a = subplots()
f.set_size_inches(6,3)
sn.barplot(il.values, il.index.tolist(), ax=a);
a.set_title('Number of countries by income_level', fontsize=14);
yticks(fontsize=12),xticks(fontsize=12);
Let's save the income_level
variable in our data frame. We must explicitly convert it to object
because of an actual misunderstanding betwen pandas
and stastmodels
packages.
In [121]:
import numpy as np
df['income_level'] = income_level.astype(np.object)
Let's take a look at the population means by the income_level
categories. The table and graph below show that the life expectancy means are diferent among the countries income levels, as our alternative hyphothesis $H_a$ states.
In [137]:
g = df.groupby('income_level')
income_mean = g.mean()
income_mean
Out[137]:
In [138]:
sn.boxplot(df.income_level, df.lifeexpectancy);
title('Life expectancy by income level groups', fontsize=14, fontweight='bold');
xticks(fontsize=12)
Out[138]:
In the next session, we'll see whether $H_o$ can be rejected or not.
In [139]:
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
model = smf.ols('lifeexpectancy ~ C(income_level)', df)
result = model.fit()
result.summary()
Out[139]:
In [140]:
res = multi.pairwise_tukeyhsd(df.lifeexpectancy, df.income_level)
print(res.summary())
By the F statistic and post hoc test above, we can say that the incompe per capita is related to life expectancy. Remembering that this income variable is the ratio of Gross Domestic Prodcut by population, the more productive and economic developed a country is, the more its citizens can live.