Week 1 Assignment: Running an analysis of variance

In this assignment I've chosen the Gapminder dataset. Looking through its codebook we've decided to study two variables, incomeperperson and lifeexpectancy:

incomeperperson

2010 Gross Domestic Product per capita in constant 2000 US$. The World Bank Work Development inflation but not the differences in the cost of living between countries Indicators has been taken into account.

lifeexpectancy

2011 life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same.

We'll be studying the relationship between the income and life expectancy. To fulfil the assignment requirements, we'll transformate the numeric variable incomeperperson into categorical using the US poverty threshold as a guideline to divide this variable up in three values: countries wich income per capita is less than 40% of this threshold (not included) will be classified as low income, those between 40% (included) and the thresold (not included) will be classified as medium income, and those equal or above will be high income.

The null hypothesis $H_o$ is that the life expectancy between the countries with low, medium and high income are the same, or in other words, income is not a drive for life expectancy. The alternative $H_a$ hypothesis is that life expectancy between the classified countries are not the same.



In [87]:

    
# Import all ploting and scientific library,
# and embed figures in this file.
%pylab inline

# Package to manipulate dataframes.
import pandas as pd

# Nice looking plot functions.
import seaborn as sn

# Read the dataset.
df = pd.read_csv('data/gapminder.csv')

# Set the country name as the index of the dataframe.
df.index = df.country

# This column is no longer needed.
#del df['country']

# Select only the variables we're interested.
df = df[['incomeperperson','lifeexpectancy']]

# Convert the types properly.
df = df.convert_objects(convert_numeric=True)

# Remove missing values.
df = df.dropna()









    



Populating the interactive namespace from numpy and matplotlib






    



/Users/sergio/anaconda3/lib/python3.4/site-packages/IPython/kernel/__main__.py:24: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

Descriptive analysis

Let's take a look at the variables.



In [88]:

    
df.describe()









    Out[88]:






  
    
      
      incomeperperson
      lifeexpectancy
    
  
  
    
      count
      176.000000
      176.000000
    
    
      mean
      7327.444414
      69.654733
    
    
      std
      10567.304022
      9.729521
    
    
      min
      103.775857
      47.794000
    
    
      25%
      702.366463
      63.041500
    
    
      50%
      2385.184105
      73.126500
    
    
      75%
      8497.779228
      76.569500
    
    
      max
      52301.587179
      83.394000

Create the categorical variable

We'll create the categorical variable income_level based on the incomeperperson and the USA poverty threshold.



In [89]:

    
# http://www.irp.wisc.edu/faqs/faq1.htm
income_threshold= 11720
income_level = pd.cut(df.incomeperperson, 
    [0, income_threshold*0.4, income_threshold, 110000 ], 
    labels=['Low income', 'Medium income', 'High income'])

# http://www.irp.wisc.edu/faqs/faq1.htm income_threshold= 11720 income_level = pd.cut(df.incomeperperson, [0, income_threshold, 110000 ], labels=['Low income', 'High income'])

Looking at the distribution of countries by the new categorical variable income_level in the graph below, the majority of them take low income per capita, followed by high income and medium income.



In [120]:

    
il = income_level.value_counts()
f, a = subplots()
f.set_size_inches(6,3)
sn.barplot(il.values, il.index.tolist(), ax=a);

a.set_title('Number of countries by income_level', fontsize=14);
yticks(fontsize=12),xticks(fontsize=12);

Let's save the income_level variable in our data frame. We must explicitly convert it to object because of an actual misunderstanding betwen pandas and stastmodels packages.



In [121]:

    
import numpy as np
df['income_level'] = income_level.astype(np.object)

Let's take a look at the population means by the income_level categories. The table and graph below show that the life expectancy means are diferent among the countries income levels, as our alternative hyphothesis $H_a$ states.



In [137]:

    
g = df.groupby('income_level')
income_mean = g.mean()
income_mean









    Out[137]:






  
    
      
      incomeperperson
      lifeexpectancy
    
    
      income_level
      
      
    
  
  
    
      High income
      25730.523512
      80.185000
    
    
      Low income
      1389.885849
      64.937373
    
    
      Medium income
      7014.797566
      74.315400



In [138]:

    
sn.boxplot(df.income_level, df.lifeexpectancy);
title('Life expectancy by income level groups', fontsize=14, fontweight='bold');
xticks(fontsize=12)









    Out[138]:





(array([0, 1, 2]), <a list of 3 Text xticklabel objects>)

In the next session, we'll see whether $H_o$ can be rejected or not.

Calculate the F statistic

The F value calculated below shows the variability between the three groups is 64 times greater than the variability within groups. The p-value $1.02*10^{-21}$ is far below 0.05, thus we can reject $H_o$ and select $H_a$: life expectancy varies across income per capita.



In [139]:

    
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
model = smf.ols('lifeexpectancy ~ C(income_level)', df)
result = model.fit()
result.summary()









    Out[139]:





OLS Regression Results

  Dep. Variable:      lifeexpectancy     R-squared:             0.428


  Model:                    OLS          Adj. R-squared:        0.421


  Method:              Least Squares     F-statistic:           64.74


  Date:              Sun, 08 Nov 2015    Prob (F-statistic):  1.02e-21


  Time:                  09:47:51        Log-Likelihood:      -600.49


  No. Observations:          176         AIC:                   1207.


  Df Residuals:              173         BIC:                   1216.


  Df Model:                    2                                     


  Covariance Type:       nonrobust                                   




                                      coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept                            80.1850      1.233     65.011   0.000     77.751    82.619


  C(income_level)[T.Low income]       -15.2476      1.421    -10.730   0.000    -18.052   -12.443


  C(income_level)[T.Medium income]     -5.8696      1.829     -3.208   0.002     -9.481    -2.259




  Omnibus:        13.620    Durbin-Watson:         1.982


  Prob(Omnibus):   0.001    Jarque-Bera (JB):     14.847


  Skew:           -0.708    Prob(JB):           0.000597


  Kurtosis:        3.135    Cond. No.               5.10

Post hoc test

Finally, we'll conduct a post hoc test to see wich group of income level is diferent from the others. By the table below we can reject the equality between all pairwise comparison. So, the life expectancy mean are diferent among income_level groups.



In [140]:

    
res = multi.pairwise_tukeyhsd(df.lifeexpectancy, df.income_level)
print(res.summary())









    



     Multiple Comparison of Means - Tukey HSD,FWER=0.05    
===========================================================
   group1       group2    meandiff  lower    upper   reject
-----------------------------------------------------------
High income   Low income  -15.2476 -18.6072 -11.8881  True 
High income Medium income -5.8696  -10.1948 -1.5444   True 
 Low income Medium income  9.378    5.7743  12.9818   True 
-----------------------------------------------------------

Conclusion

By the F statistic and post hoc test above, we can say that the incompe per capita is related to life expectancy. Remembering that this income variable is the ratio of Gross Domestic Prodcut by population, the more productive and economic developed a country is, the more its citizens can live.

	incomeperperson	lifeexpectancy
count	176.000000	176.000000
mean	7327.444414	69.654733
std	10567.304022	9.729521
min	103.775857	47.794000
25%	702.366463	63.041500
50%	2385.184105	73.126500
75%	8497.779228	76.569500
max	52301.587179	83.394000

	incomeperperson	lifeexpectancy
income_level
High income	25730.523512	80.185000
Low income	1389.885849	64.937373
Medium income	7014.797566	74.315400

Dep. Variable:	lifeexpectancy	R-squared:	0.428
Model:	OLS	Adj. R-squared:	0.421
Method:	Least Squares	F-statistic:	64.74
Date:	Sun, 08 Nov 2015	Prob (F-statistic):	1.02e-21
Time:	09:47:51	Log-Likelihood:	-600.49
No. Observations:	176	AIC:	1207.
Df Residuals:	173	BIC:	1216.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	80.1850	1.233	65.011	0.000	77.751 82.619
C(income_level)[T.Low income]	-15.2476	1.421	-10.730	0.000	-18.052 -12.443
C(income_level)[T.Medium income]	-5.8696	1.829	-3.208	0.002	-9.481 -2.259

Omnibus:	13.620	Durbin-Watson:	1.982
Prob(Omnibus):	0.001	Jarque-Bera (JB):	14.847
Skew:	-0.708	Prob(JB):	0.000597
Kurtosis:	3.135	Cond. No.	5.10