Following is the Python program I wrote to fulfill the third assignment of the Data Analysis Tools online course.
I decided to use Jupyter Notebook as it is a pretty way to write code and present results.
Using the Gapminder database, I would like to see whether increasing Internet usage is associated with an increasing suicide rate. Studies suggest that other factors, such as unemployment, could also have a substantial impact.
So for this assignment, the following three variables will be analyzed: the Internet use rate (internetuserate), the suicide rate per 100,000 people (suicideper100th), and the employment rate (employrate).
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are replaced directly by NaN, no special data treatment is needed.
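As a small illustration of this coercion behaviour (on a tiny made-up frame, not the real gapminder.csv), non-numeric entries become NaN and the corresponding rows can then simply be dropped:

```python
import pandas as pd

# Hypothetical toy frame mimicking Gapminder-style columns
df = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'internetuserate': ['10.5', '', '42.0'],
}).set_index('country')

# Non-numeric entries (here the empty string) are coerced to NaN
df['internetuserate'] = pd.to_numeric(df['internetuserate'], errors='coerce')
print(df['internetuserate'].isnull().sum())  # 1 missing value

# Discard the countries with missing data
clean = df.dropna()
print(len(clean))  # 2 countries remain
```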
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))
In [4]:
# Convert interesting variables in numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')
But the unemployment rate is not provided directly. In the database, the employment rate (% of the population) is available. So the unemployment rate will be computed as 100 - employment rate:
In [5]:
data['unemployrate'] = 100. - data['employrate']
The last records of the data restricted to the three analyzed variables are:
In [6]:
subdata = data[['internetuserate', 'suicideper100th', 'unemployrate']]
subdata.tail(10)
Out[6]:
The distributions of the three variables have been analyzed previously.
As all variables are quantitative, the Pearson correlation test is the one to apply.
Let's first focus on the primary research question:
From the scatter plot, a slightly positive slope is seen. But will the Pearson test confirm that it is significant?
In [7]:
sns.regplot(x='internetuserate', y='suicideper100th', data=subdata)
plt.xlabel('Internet use rate (%)')
plt.ylabel('Suicide per 100 000 people (-)')
_ = plt.title('Scatterplot for the association between the Internet use rate and suicide per 100,000 people')
In [8]:
data_clean = subdata.dropna()
correlation, pvalue = stats.pearsonr(data_clean['internetuserate'], data_clean['suicideper100th'])
display(Markdown("The correlation coefficient is {:.3g} and the associated p-value is {:.3g}.".format(correlation, pvalue)))
The correlation coefficient is 0.0735, confirming the small positive correlation. But the Pearson test tells us that the null hypothesis cannot be rejected, as the p-value is 0.351 >> 0.05.
This confirms the conclusion found when grouping the Internet use rate into quartiles and applying an ANOVA test.
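The quartile-grouping-plus-ANOVA approach mentioned above can be sketched as follows. Note this runs on synthetic stand-in data, not the Gapminder values, since the earlier assignment's code is not reproduced here:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in data (illustrative only, not the real dataset)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'internetuserate': rng.uniform(0, 100, 200),
    'suicideper100th': rng.normal(10, 3, 200),
})

# Bin the explanatory variable into quartiles
df['internet_q'] = pd.qcut(df['internetuserate'], 4,
                           labels=['Q1', 'Q2', 'Q3', 'Q4'])

# One-way ANOVA comparing the response across the four quartile groups
groups = [g['suicideper100th'].values
          for _, g in df.groupby('internet_q', observed=True)]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)
```

A p-value above 0.05 would mean the mean suicide rate does not differ significantly across the Internet-use quartiles, matching the conclusion of the earlier assignment.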
If we now look at the relationship between unemployment and suicide, the scatterplot below suggests that there is no relationship.
In [9]:
sns.regplot(x='unemployrate', y='suicideper100th', data=subdata)
plt.xlabel('Unemployment rate (%)')
plt.ylabel('Suicide per 100 000 people (-)')
_ = plt.title('Scatterplot for the association between the unemployment rate and suicide per 100,000 people')
Does the Pearson test confirm that conclusion?
In [10]:
correlation, pvalue = stats.pearsonr(data_clean['unemployrate'], data_clean['suicideper100th'])
display(Markdown("The correlation coefficient is {:.3g} and the associated p-value is {:.3g}.".format(correlation, pvalue)))
The correlation coefficient is negative but very small, and the p-value is large. So we can safely conclude that there is no relationship between the unemployment rate and the suicide rate per 100,000 people.
In [11]:
subdata2 = (data[['incomeperperson', 'relectricperperson']]
            .assign(income=lambda x: pd.to_numeric(x['incomeperperson'], errors='coerce'),
                    electricity=lambda x: pd.to_numeric(x['relectricperperson'], errors='coerce'))
            .dropna())
sns.regplot(x='income', y='electricity', data=subdata2)
plt.xlabel('Income per person (2000 US$)')
plt.ylabel('Residential electricity consumption (kWh)')
_ = plt.title('Scatterplot for the association between the income and the residential electricity consumption')
In [12]:
correlation, pvalue = stats.pearsonr(subdata2['income'], subdata2['electricity'])
display(Markdown("The correlation coefficient is {:.3g} and the associated p-value is {:.3g}.".format(correlation, pvalue)))
display(Markdown("And the coefficient of determination is {:.3g}.".format(correlation**2)))
The Pearson test indicates a significant positive relationship between income per person and residential electricity consumption, as the p-value is below 0.05.
Moreover, the square of the correlation coefficient, i.e. the coefficient of determination, is 0.425. This means that about 42.5% of the variability in residential electricity consumption can be predicted from the income per person.
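The link between the correlation coefficient r and the coefficient of determination r² can be illustrated on made-up data (not the Gapminder values): y is built as a linear function of x plus noise, and r² then measures the share of the variance in y captured by that linear relationship.

```python
import numpy as np
import scipy.stats as stats

# Synthetic example: y partially determined by x (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 50, 300)
y = 2.0 * x + rng.normal(0, 20, 300)

r, p = stats.pearsonr(x, y)
r_squared = r ** 2

# r**2 is the fraction of the variance in y explained by a linear fit on x
print(round(r_squared, 3))
```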
And this concludes this third assignment.
If you are interested in data science, follow me on Tumblr.