Following is the Python program I wrote to fulfill the first assignment of the Data Analysis Tools online course.
I decided to use Jupyter Notebook as it is a pretty way to write code and present results.
Using the Gapminder database, I would like to see whether increasing Internet usage is associated with an increasing suicide rate. A study shows that other factors such as unemployment could have a great impact.
So for this assignment, the three following variables will be analyzed:
For the question I'm interested in, the countries for which data are missing will be discarded. As missing data in the Gapminder database are replaced directly by NaN, no special data treatment is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
In [2]:
# Read the data
data_filename = r'gapminder.csv'
data = pd.read_csv(data_filename, low_memory=False)
data = data.set_index('country')
General information on the Gapminder data
In [3]:
display(Markdown("Number of countries: {}".format(len(data))))
display(Markdown("Number of variables: {}".format(len(data.columns))))
In [4]:
# Convert the variables of interest to numeric format
for variable in ('internetuserate', 'suicideper100th', 'employrate'):
    data[variable] = pd.to_numeric(data[variable], errors='coerce')
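To illustrate what errors='coerce' does in the cell above, here is a minimal sketch on a made-up column. It assumes that missing entries show up as blank strings in the raw CSV; the values and the name raw are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, blanks where data are missing
raw = pd.Series(['81.0', ' ', '12.5', ''], name='internetuserate')

# Anything that cannot be parsed as a number becomes NaN instead of raising an error
clean = pd.to_numeric(raw, errors='coerce')

# Count the entries that ended up missing
n_missing = int(clean.isna().sum())
print(clean)
print("Missing values:", n_missing)
```

Once the blanks are NaN, pandas methods such as dropna() can discard the corresponding countries.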
But the unemployment rate is not provided directly. In the database, the employment rate (% of the population) is available, so the unemployment rate will be approximated as 100 - employment rate:
In [5]:
data['unemployrate'] = 100. - data['employrate']
The first records of the data restricted to the three analyzed variables are:
In [6]:
subdata = data[['internetuserate', 'suicideper100th', 'unemployrate']]
subdata.head(10)
Out[6]:
The distributions of the three variables have been analyzed previously.
Now that the univariate distributions have been plotted and described, the bivariate graphics will be plotted in order to test our research hypothesis.
Let's first focus on the primary research question;
The scatter plot showed a slightly positive slope. As most of the countries have no or very low internet use rate, an effect may be visible only in the countries with the highest internet use rates.
In [7]:
subdata2 = subdata.assign(internet_grp4=pd.qcut(subdata.internetuserate, 4,
                                                labels=["1=25th%tile", "2=50th%tile",
                                                        "3=75th%tile", "4=100th%tile"]))
# sns.factorplot was renamed to sns.catplot in seaborn 0.9
sns.catplot(x='internet_grp4', y='suicideper100th', data=subdata2,
            kind="bar", ci=None)
plt.xlabel('Internet use rate (%)')
plt.ylabel('Suicide per 100 000 people (-)')
_ = plt.title('Average suicide per 100,000 people per internet use rate quartile')
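The quartile grouping used in the cell above can be checked on a small example. The following sketch uses made-up rates and shorter, hypothetical labels (Q1 to Q4) to show that pd.qcut builds groups holding roughly the same number of observations.

```python
import pandas as pd

# Made-up internet use rates for 12 hypothetical countries
rates = pd.Series([1, 2, 3, 4, 10, 20, 30, 40, 50, 60, 70, 80], dtype=float)

# Split into 4 quantile-based groups (labels are illustrative only)
groups = pd.qcut(rates, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Number of observations per group, in label order
counts = groups.value_counts().sort_index()
print(counts)
```

Because the cut points are quantiles, the groups are balanced in size even when the values themselves are strongly skewed, as is the case for the internet use rate.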
This case falls under the Categorical to Quantitative case of interest for this assignment, so an ANOVA can be performed here.
In [8]:
model1 = smf.ols(formula='suicideper100th ~ C(internet_grp4)', data=subdata2).fit()
model1.summary()
Out[8]:
The p-value found is 0.143 > 0.05. Therefore the null hypothesis cannot be rejected: the data show no significant relationship between the internet use rate and the suicide rate.
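To make the test behind this p-value more concrete, here is a minimal sketch of how a one-way ANOVA F-statistic is built, computed by hand with NumPy on made-up groups. The numbers are purely illustrative and are not taken from the Gapminder data.

```python
import numpy as np

# Three made-up groups of observations (illustrative values only)
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([6.0, 7.0, 8.0]),
          np.array([8.0, 9.0, 10.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)       # number of groups
n = all_values.size   # total number of observations

# Variability of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print("F-statistic:", f_stat)
```

The p-value is the probability of observing an F at least this large under an F distribution with (k - 1, n - k) degrees of freedom. A large p-value, as found here, means the differences between the group means are small compared to the spread inside the groups.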
In [9]:
nesarc = pd.read_csv('nesarc_pds.csv', low_memory=False)
In [10]:
races = {1: 'White',
         2: 'Black',
         3: 'American India/Alaska',
         4: 'Asian/Native Hawaiian/Pacific',
         5: 'Hispanic or Latino'}
subnesarc = (nesarc[['S3BQ4', 'ETHRACE2A']]
             .assign(ethnicity=lambda x: pd.Categorical(x['ETHRACE2A'].map(races)),
                     nb_joints_day=lambda x: (pd.to_numeric(x['S3BQ4'], errors='coerce')
                                              .replace(99, np.nan)))
             .dropna())
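The cleaning chain above can be followed step by step on a tiny made-up table. This sketch assumes that blank answers and the code 99 both stand for missing information, which is how the cell above treats them; the rows and the shortened labels dictionary are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the two NESARC columns used above
toy = pd.DataFrame({'S3BQ4': ['2', ' ', '99', '5'],
                    'ETHRACE2A': [1, 2, 5, 5]})
labels = {1: 'White', 2: 'Black', 5: 'Hispanic or Latino'}

cleaned = (toy
           .assign(ethnicity=lambda x: pd.Categorical(x['ETHRACE2A'].map(labels)),
                   # blanks become NaN through coerce, 99 is then replaced by NaN
                   nb_joints_day=lambda x: (pd.to_numeric(x['S3BQ4'], errors='coerce')
                                            .replace(99, np.nan)))
           # drop every row that has a missing value
           .dropna())
print(cleaned)
```

Only the rows with a usable number of joints survive, so the ANOVA is run on complete observations.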
In [11]:
# sns.factorplot was renamed to sns.catplot in seaborn 0.9
g = sns.catplot(x='ethnicity', y='nb_joints_day', data=subnesarc,
                kind="bar", ci=None)
g.set_xticklabels(rotation=90)
plt.ylabel('Number of cannabis joints per day')
_ = plt.title('Average number of cannabis joints smoked per day depending on the ethnicity')
The null hypothesis is: there is no relationship between the number of joints smoked per day and the ethnicity.
The alternate hypothesis is: there is a relationship between the number of joints smoked per day and the ethnicity.
In [12]:
model2 = smf.ols(formula='nb_joints_day ~ C(ethnicity)', data=subnesarc).fit()
model2.summary()
Out[12]:
The p-value is much smaller than 5%. Therefore the null hypothesis is rejected. We can now look at which groups really differ from one another.
In [13]:
import statsmodels.stats.multicomp as multi
multi1 = multi.MultiComparison(subnesarc['nb_joints_day'], subnesarc['ethnicity'])
result1 = multi1.tukeyhsd()
result1.summary()
Out[13]:
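The pairs flagged by the Tukey test can also be extracted programmatically rather than read from the summary table. Here is a minimal sketch on made-up groups, using pairwise_tukeyhsd from statsmodels. The data and the group names a, b and c are hypothetical, and the sketch relies on the pairs being reported in the order of the upper triangle of the sorted group labels.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up measurements: groups a and b are alike, group c is clearly higher
values = np.array([1.0, 1.2, 0.8, 1.1, 0.9,
                   1.05, 1.15, 0.85, 1.0, 0.95,
                   5.0, 5.2, 4.8, 5.1, 4.9])
groups = np.array(['a'] * 5 + ['b'] * 5 + ['c'] * 5)

result = pairwise_tukeyhsd(values, groups, alpha=0.05)

# Pairs are ordered like the upper triangle of the sorted unique labels
names = result.groupsunique
first, second = np.triu_indices(len(names), 1)
significant_pairs = [(str(names[i]), str(names[j]))
                     for i, j, rejected in zip(first, second, result.reject)
                     if rejected]
print(significant_pairs)
```

The reject attribute holds one boolean per pair, True when the difference between the two group means is significant at the chosen alpha.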
From Tukey's Honestly Significant Difference test, we can conclude there are 3 relationships presenting a real difference:
Using the ANOVA test on the research question "do countries with a high internet use rate have a higher number of suicides?" brought me to the conclusion that there is no significant relationship between the internet use rate and the number of suicides.
So in order to fulfill this assignment, I switched to the NESARC database. My interest focused on a possible relationship between ethnicity and the number of cannabis joints smoked per day. After verifying that there is a significant relationship, I applied the Tukey HSD method to figure out which groups really differ from one another.
There are 3 relationships presenting a real difference:
If you are interested in data science, follow me on Tumblr.