Week 3 Assignment: Generating a Correlation Coefficient

For this assignment I chose the Gapminder dataset. After looking through its codebook, I decided to study the relationship between two numeric variables, incomeperperson and lifeexpectancy:

  • incomeperperson

2010 Gross Domestic Product per capita in constant 2000 US$. Inflation, but not the differences in the cost of living between countries, has been taken into account. Source: World Bank World Development Indicators.

  • lifeexpectancy

2011 life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same.


In [1]:
# Import all plotting and scientific libraries,
# and embed figures in this file.
%pylab inline

# Package to manipulate dataframes.
import pandas as pd

# Nice looking plot functions.
import seaborn as sn

# The Pearson correlation function.
from scipy.stats import pearsonr

# Read the dataset.
df = pd.read_csv('data/gapminder.csv')

# Set the country name as the index of the dataframe.
df.index = df.country

# This column is no longer needed.
del df['country']

# Select only the variables we're interested in.
df = df[['lifeexpectancy','incomeperperson']]

# Convert the types.
df.lifeexpectancy = pd.to_numeric(df.lifeexpectancy, errors='coerce')
df.incomeperperson = pd.to_numeric(df.incomeperperson, errors='coerce')

# Remove missing values.
df = df.dropna()


Populating the interactive namespace from numpy and matplotlib
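
To make the conversion and cleanup steps above concrete, here is a minimal sketch on a tiny hand-made dataframe. The `toy` data and its values are made up for illustration and are not taken from the Gapminder file. With `errors='coerce'`, entries that can't be parsed as numbers become NaN, and `dropna` then removes every row that contains one.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as strings, with some unparseable entries.
toy = pd.DataFrame(
    {'lifeexpectancy': ['75.2', 'n/a', '68.9'],
     'incomeperperson': ['1200.5', '830.1', 'missing']},
    index=['A', 'B', 'C'])

# Unparseable entries are turned into NaN instead of raising an error.
toy_numeric = toy.apply(pd.to_numeric, errors='coerce')

# Rows with at least one NaN are dropped, leaving only complete rows.
toy_clean = toy_numeric.dropna()

print(toy_numeric)
print(toy_clean)
```

Only row A survives here, because rows B and C each have one value that could not be converted.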

Pearson correlation $r$

This is straightforward: scipy's pearsonr returns the correlation coefficient together with its p-value.


In [2]:
r = pearsonr(df.incomeperperson, df.lifeexpectancy)

In [3]:
print('Correlation between incomeperperson and lifeexpectancy: {}'.format(r))


Correlation between incomeperperson and lifeexpectancy: (0.60151634019643963, 1.0653418935026235e-18)
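
For reference, the first number in that pair is the Pearson coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$, and the second is its p-value. The sketch below computes $r$ by hand with numpy and compares it with `pearsonr`. The helper `pearson_by_hand` and the five sample points are my own illustration, not part of the assignment data.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_by_hand(x, y):
    """Pearson r from its definition (illustrative helper, not a library function)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Deviations of each observation from its mean.
    dx = x - x.mean()
    dy = y - y.mean()
    # Sum of cross-products, normalised by both sums of squares.
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up sample points, only for checking the formula.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r_scipy, p_value = pearsonr(x, y)
print(pearson_by_hand(x, y), r_scipy)
```

Both values should agree up to floating point rounding, so the hand computation is just a way to see what `pearsonr` measures.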

In [4]:
print('Percentage of variability in the response variable explained by the explanatory variable is {:.2f}%'.
      format(r[0]**2 * 100))


Percentage of variability in the response variable explained by the explanatory variable is 36.18%
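
The reason $r^2$ can be read as a share of variability is that, for a straight-line least-squares fit, it equals one minus the ratio of residual to total sum of squares. The following sketch checks this on a few made-up points (not the Gapminder data), using `np.polyfit` for the line fit.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up sample points, only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Least-squares straight line y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Share of the total variation in y that the line accounts for.
ss_res = (residuals ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
explained = 1 - ss_res / ss_tot

r = pearsonr(x, y)[0]
print(explained, r ** 2)
```

The two printed numbers should match up to rounding error.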

As we can see above, $r = 0.60$ with a p-value of $1.07 \times 10^{-18}$, which indicates a moderately strong, statistically significant positive correlation between life expectancy and income per person. Let's look at the scatter plot to see what shape this relationship takes.


In [5]:
# Set an appropriate size for the graph.
factor = 1.3
figsize(6*factor, 4*factor)

# Plot the graph (keyword arguments, as newer seaborn versions require).
sn.regplot(x=df.incomeperperson, y=df.lifeexpectancy);



Conclusion

The scatter plot clearly shows a positive relationship between incomeperperson and lifeexpectancy but, as you can see, the association isn't as linear as expected. Many countries are clustered below an income of about 5000 US$. Above that level, life expectancy stays near its highest values. This shows that lifeexpectancy rises toward a limit of around 80 years as incomeperperson rises.
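
One common way to handle a relationship that flattens out like this is to correlate the response with the logarithm of the explanatory variable. This is my own suggestion rather than part of the assignment, and the sketch below uses synthetic data generated for illustration, not the Gapminder values. It only shows that Pearson's $r$ understates an association that is linear in log income.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data (NOT Gapminder): incomes spread over several orders of
# magnitude, with life expectancy growing with the logarithm of income.
rng = np.random.default_rng(0)
log_income = rng.uniform(np.log(100), np.log(50000), size=200)
income = np.exp(log_income)
life = 30 + 5 * log_income + rng.normal(0, 2, size=200)

# Pearson r on the raw scale versus the log scale.
r_raw = pearsonr(income, life)[0]
r_log = pearsonr(np.log(income), life)[0]

print('raw income: r = {:.2f}'.format(r_raw))
print('log income: r = {:.2f}'.format(r_log))
```

On the real dataframe the equivalent call would be `pearsonr(np.log(df.incomeperperson), df.lifeexpectancy)`. Whether it gives a noticeably higher coefficient there is something to check rather than assume.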

End of assignment.