In this assignment I've chosen the Gapminder dataset. Looking through its codebook we've decided to study two variables, incomeperperson and lifeexpectancy relationship:
2010 Gross Domestic Product per capita in constant 2000 US$. The World Bank Work Development inflation but not the differences in the cost of living between countries Indicators has been taken into account.
2011 life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same.
In ordr to met the assignment requirements, we'll transform these numeric variables into categorical. The incomeperperson
will be categorized using the US poverty threshold as a guideline to divide this variable as those countries below the US threshold, those near and those equal or above. The lifeexpectancy
variable will be categorized in the same way: those countries below USA life expectancy and those equal or above. We'll take USA out of the dataset as it was used as the baseline comparison.
The null hypothesis $H_o$ is that the life expectancy is independent of the income per capita. The alternative hypothesis $H_a$ is that life expectancy is related to income.
In [1]:
# Import all ploting and scientific library,
# and embed figures in this file.
%pylab inline
# Package to manipulate dataframes.
import pandas as pd
# Nice looking plot functions.
import seaborn as sn
# The Chi-squared test function.
from scipy.stats import chi2_contingency
# Read the dataset.
df = pd.read_csv('data/gapminder.csv')
# Set the country name as the index of the dataframe.
df.index = df.country
# This column is no longer needed.
#del df['country']
# Select only the variables we're interested.
df = df[['lifeexpectancy','incomeperperson']]
# Convert the types properly.
#df.incomeperperson = pd.to_numeric(df.incomeperperson, errors='coerce')
#df.lifeexpectancy = pd.to_numeric(df.lifeexpectancy, errors='coerce')
df = df.convert_objects(convert_numeric=True)
# Remove missing values.
df = df.dropna()
# Save the life expectancy threshold for later use.
lifeexpectancy_threshold = df.loc['United States', 'lifeexpectancy']
# Finally, remove the USA from the dataset.
df.drop('United States', inplace=True)
Let's take a look at the variables.
In [2]:
df.describe()
Out[2]:
We'll create the categorical variable income_level
based on USA poverty threshold and lifeexpectancy_level
based on USA life expectancy.
This variable describe the level of income per capita perceived by a country. We'll categorize it in three values: "Low income", "Medium income" and "High income". "Low income" if the income per person is below that 70% of USA threshold of poverty, "Medium income" if between the later and the threshold (not included), and "High income" if equal of above the trheshold.
In [3]:
# http://www.irp.wisc.edu/faqs/faq1.htm
income_threshold= 11720
income_level = pd.cut(df.incomeperperson,
[0,income_threshold*0.7, income_threshold, df.incomeperperson.max() ],
labels=['Low income', 'Medium income', 'High income'])
In [4]:
# Ghaph the new variable.
il = income_level.value_counts()
f, a = subplots()
f.set_size_inches(6,3)
sn.barplot(il.values, il.index.tolist(), ax=a);
a.set_title('Number of countries by income_level', fontsize=14);
yticks(fontsize=12),xticks(fontsize=12);
This variable describe the level of life exepctancy in a country. It can have two values: "Low life expct" and "High life expct". "Low life expct" if the life expectancy is belo the USA life expectancy presened in the same dataset, and "High life expct" otherwise.
In [5]:
lifeexpectancy_level = pd.cut(df.lifeexpectancy,
[0, lifeexpectancy_threshold, df.lifeexpectancy.max()],
labels=['Low life expct','High life expct'])
In [6]:
ll = lifeexpectancy_level.value_counts()
f, a = subplots()
f.set_size_inches(6,3)
sn.barplot(ll.values, ll.index.tolist(), ax=a);
a.set_title('Number of countries by lifeexpectancy_level', fontsize=14);
yticks(fontsize=12),xticks(fontsize=12);
Let's save the income_level
and lifeexpectancy_level
variables in our data frame. We must explicitly convert them to object
because of an actual misunderstanding betwen pandas
and stastmodels
packages.
In [7]:
import numpy as np
df['income_level'] = income_level.astype(np.object)
df['lifeexpectancy_level'] = lifeexpectancy_level.astype(np.object)
Below we have the contigency table of the variables in this study.
In [8]:
# Numeric cross tabulation.
xtab = pd.crosstab(df.lifeexpectancy_level, df.income_level)
# Sort manually the columns.
xtab = xtab[['High income', 'Medium income', 'Low income']]
# Print to output.
xtab
Out[8]:
In [9]:
# Percentage cross tabulation
xtabsum = xtab.sum(axis=0)
xtabpct = xtab/xtabsum
xtabpct
Out[9]:
In the next session, we'll see whether $H_o$ can be rejected or not.
The $Xˆ2$ test will tell us whether the two variables are independent or not. For a 2x2 comparison, the $X^2$ value is large, $123.00$, and the p-value
is really small, $1.95*10^{-27}$, thus, the life expectancy level and income level aren't independent, they're related.
In [10]:
x2 = chi2_contingency(xtab)
x2
Out[10]:
In [11]:
print("Chi-squared: {:.3}\nP-value: {:.3}".format(x2[0], x2[1]))
Our explanatory variable has 3 possible levels, and the $X^2$ does'nt give us insight into why the $H_o$ can be rejected. We do know the life expectancy varies across the income levels but to understand how, we'll have to conduct a post hoc test.
We'll use the Bonferroni Adjustemnt to control the family wise error rate. As our explanatory variable has 3 levels, we must adjust the p-value dividing the alpha significance $0.05$ level by $3$:
In [12]:
# The p-value adjusted.
pvalue_adj = 0.05/3.0
print("The p-value adjusted is: {:.3}".format(pvalue_adj))
Now we have to perform paiwise $X^2$ tests.
In [13]:
recode = {'Low income': 'Low income', 'Medium income':'Medium income'}
df['income_LvM'] = df.income_level.map(recode)
In [14]:
# Crosstab.
xtab_LvM = pd.crosstab(df.lifeexpectancy_level, df.income_LvM)
xtab_LvM
Out[14]:
In [15]:
# Percentage crosstab.
xtab_LvM/xtab_LvM.sum(axis=0)
Out[15]:
In [16]:
# Run the chi-square test.
x2_LvM = chi2_contingency(xtab_LvM)
# Print the results.
print("Chi-squared: {:.3}\nP-value: {:.3}\nP-value ajd: {:.3}".format(x2_LvM[0], x2_LvM[1], pvalue_adj))
As you can see above, the $X^2$ is not large, and the p-value
is greater than the adjusted p-value
. So, we can't reject $H_o$ and state that life expectancy doesn't differ significantly between "Low income" and "Medium income" countries.
In [17]:
recode = {'Low income': 'Low income', 'High income':'High income'}
df['income_LvH'] = df.income_level.map(recode)
In [18]:
# Crosstab.
xtab_LvH = pd.crosstab(df.lifeexpectancy_level, df.income_LvH)
xtab_LvH
Out[18]:
In [19]:
# Percentage crosstab.
xtab_LvH/xtab_LvH.sum(axis=0)
Out[19]:
In [20]:
# Run the chi-square test.
x2_LvH = chi2_contingency(xtab_LvH)
# Print the results.
print("Chi-squared: {:.3}\nP-value: {:.3}\nP-value ajd: {:.3}".format(x2_LvH[0], x2_LvH[1], pvalue_adj))
As you can see above, the $X^2$ is large, and the p-value
is below than the adjusted p-value
. So, we can reject $H_o$ and state that life expectancy does differ significantly between "Low income" and "High income" countries.
In [21]:
recode = {'Medium income': 'Medium income', 'High income':'High income'}
df['income_MvH'] = df.income_level.map(recode)
In [22]:
# Crosstab.
xtab_MvH = pd.crosstab(df.lifeexpectancy_level, df.income_MvH)
xtab_MvH
Out[22]:
In [23]:
# Percentage crosstab.
xtab_MvH/xtab_MvH.sum(axis=0)
Out[23]:
In [24]:
# Run the chi-square test.
x2_MvH = chi2_contingency(xtab_MvH)
# Print the results.
print("Chi-squared: {:.3}\nP-value: {:.3}\nP-value ajd: {:.3}".format(x2_MvH[0], x2_MvH[1], pvalue_adj))
As you can see above, the $X^2$ is large, and the p-value
is below than the adjusted p-value
. So, we can reject $H_o$ and state that life expectancy does differ significantly between "Medium income" and "High income" countries.
By the $X^2$ tests conducted above, the life expectancy is dependent of the income per person. Countries with high income per person benefit from a longer life expectancy when compared to countries with low and medium income per person. There is no difference in life expectancy between countries with low and medium income per person.
End of assignment.