The following is the Python program I wrote to fulfill the second assignment of the Data Analysis Tools online course.
I decided to use Jupyter Notebook, as it is a convenient way to write code and present results.
As the previous assignment allowed me to answer my initial research question, I will now look at a possible relationship between ethnicity (explanatory variable) and use of cannabis (response variable) in the NESARC database. As both variables are categorical, the Chi-Square Test of Independence is the appropriate method.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
In [2]:
nesarc = pd.read_csv('nesarc_pds.csv', low_memory=False)
In [3]:
races = {1: 'White',
         2: 'Black',
         3: 'American Indian \n Alaska',
         4: 'Asian \n Native Hawaiian \n Pacific',
         5: 'Hispanic or Latino'}
# Recode S3BQ1A5 (ever used cannabis): 2 becomes 0 (no), 9 is treated as missing
subnesarc = (nesarc[['S3BQ1A5', 'ETHRACE2A']]
             .assign(S3BQ1A5=lambda x: pd.to_numeric(x['S3BQ1A5'].replace({2: 0, 9: np.nan}), errors='coerce'))
             .assign(ethnicity=lambda x: pd.Categorical(x['ETHRACE2A'].map(races)),
                     use_cannabis=lambda x: pd.Categorical(x['S3BQ1A5']))
             .dropna())
# Recent pandas versions no longer accept inplace=True here, so reassign the result
subnesarc['use_cannabis'] = subnesarc['use_cannabis'].cat.rename_categories(['No', 'Yes'])
First, the distribution of both the use of cannabis and the ethnicity will be shown.
In [4]:
g = sns.countplot(x='ethnicity', data=subnesarc)
_ = plt.title('Distribution of ethnicity')
In [5]:
g = sns.countplot(x='use_cannabis', data=subnesarc)
_ = plt.title('Distribution of cannabis use (ever used)')
Now that the univariate distributions have been plotted, the bivariate graph will be plotted in order to explore our research hypothesis.
From the bivariate graph below, it seems that there are some differences between groups. For example, the American Indian and Asian groups seem quite different.
In [6]:
# factorplot was renamed catplot in seaborn 0.9; with seaborn >= 0.12, pass errorbar=None instead of ci=None
g = sns.catplot(x='ethnicity', y='S3BQ1A5', data=subnesarc,
                kind="bar", ci=None)
g.set_xticklabels(rotation=90)
plt.ylabel('Ever used cannabis')
_ = plt.title('Proportion of cannabis users by ethnicity')
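The bar heights in this plot are group means of the 0/1 coded variable `S3BQ1A5`, which amounts to the proportion of respondents in each group who have ever used cannabis. A minimal sketch on made-up data (the `toy` frame and its group labels are illustrative assumptions, not NESARC values) shows the equivalence:

```python
import pandas as pd

# Hypothetical toy data: 1 = has used cannabis, 0 = has not
toy = pd.DataFrame({
    'ethnicity': ['A', 'A', 'A', 'A', 'B', 'B'],
    'S3BQ1A5':   [1,   0,   0,   0,   1,   1],
})

# The mean of a 0/1 column is the share of ones within each group
proportion_users = toy.groupby('ethnicity')['S3BQ1A5'].mean()
print(proportion_users)
```

Group A has one user out of four respondents and group B two out of two, so the means are the corresponding proportions.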
In [7]:
ct1 = pd.crosstab(subnesarc.use_cannabis, subnesarc.ethnicity)
display(Markdown("Contingency table of observed counts"))
ct1
Out[7]:
In [8]:
# Note: normalize keyword is available starting from pandas version 0.18.1
ct2 = ct1/ct1.sum(axis=0)
display(Markdown("Contingency table of observed counts normalized over each column"))
ct2
Out[8]:
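As the note in the cell above mentions, recent pandas versions can normalize a crosstab directly. Here is a minimal sketch on made-up data (the `toy` frame is an illustrative assumption, not NESARC data) suggesting that `normalize='columns'` gives the same result as the manual division:

```python
import pandas as pd

# Hypothetical toy data, not taken from NESARC
toy = pd.DataFrame({
    'use_cannabis': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes'],
    'ethnicity':    ['A',  'A',   'A',  'B',  'B',   'B'],
})

counts = pd.crosstab(toy['use_cannabis'], toy['ethnicity'])

# Manual normalization: divide each column by its total
manual = counts / counts.sum(axis=0)

# Built-in normalization (pandas >= 0.18.1)
builtin = pd.crosstab(toy['use_cannabis'], toy['ethnicity'], normalize='columns')
```

Either way, each column sums to 1, so the cells read as the share of users and non-users within each group.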
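The Chi-Square test will be applied to the whole dataset to test the following hypotheses:
H0: the use of cannabis is independent of ethnicity.
Ha: the use of cannabis and ethnicity are related.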
In [9]:
stats.chi2_contingency(ct1)
Out[9]:
The p-value of 3.7e-91 confirms that the null hypothesis can be safely rejected.
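To make the values returned by `stats.chi2_contingency` less opaque, here is a minimal sketch on a made-up 2x3 table (the counts are illustrative assumptions, not the NESARC results). It recomputes the expected counts, the statistic and the degrees of freedom by hand and keeps them next to what SciPy returns:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical observed counts: rows = cannabis use (No/Yes), columns = three groups
observed = np.array([[90, 60, 30],
                     [10, 40, 70]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Expected count under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_by_hand = (observed.shape[0] - 1) * (observed.shape[1] - 1)
```

Note that for 2x2 tables, where the degrees of freedom equal 1, SciPy applies Yates' continuity correction by default (`correction=True`), so a by-hand Pearson statistic would differ slightly in that case.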
The next obvious question is which ethnic groups differ significantly from one another in their use of cannabis. To answer it, the Chi-Square test will be performed on each pair of groups with the following code.
In [10]:
list_races = list(races.keys())
p_values = dict()
for i in range(len(list_races)):
    for j in range(i+1, len(list_races)):
        race1 = races[list_races[i]]
        race2 = races[list_races[j]]
        # Keep only the two groups being compared; all other codes map to NaN
        subethnicity = subnesarc.ETHRACE2A.map(dict(((list_races[i], race1),(list_races[j], race2))))
        comparison = pd.crosstab(subnesarc.use_cannabis, subethnicity)
        display(Markdown("Crosstable to compare {} and {}".format(race1, race2)))
        display(comparison)
        display(comparison/comparison.sum(axis=0))
        chi_square, p, _, expected_counts = stats.chi2_contingency(comparison)
        p_values[(race1, race2)] = p
If we put all the p-values together and test them against our threshold of 0.005, we get the table below.
The threshold is the standard 0.05 threshold divided by the number of pairwise comparisons among the categories of the explanatory variable (here 10).
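The adjustment described above is the Bonferroni correction. Here is a small sketch of the arithmetic, using only the five group codes from the `races` dictionary (the names `group_codes`, `adjusted_alpha` and `is_significant` are illustrative, not part of the notebook):

```python
from itertools import combinations

group_codes = [1, 2, 3, 4, 5]  # the five ETHRACE2A categories

# Every unordered pair of groups is compared once
pairs = list(combinations(group_codes, 2))
n_comparisons = len(pairs)  # 5 * 4 / 2 = 10

alpha = 0.05
adjusted_alpha = alpha / n_comparisons  # Bonferroni-adjusted threshold


def is_significant(p_value, threshold=adjusted_alpha):
    """Return True when a pairwise p-value is below the adjusted threshold."""
    return p_value < threshold
```

With five groups there are ten pairs, so a pairwise p-value of 0.01 would pass the usual 0.05 threshold but not the adjusted one.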
In [11]:
df = pd.DataFrame(p_values, index=['p-value', ])
(df.stack(level=[0, 1])['p-value']
   .rename('p-value')
   .to_frame()
   .assign(Ha=lambda x: x['p-value'] < 0.05 / len(p_values)))
Out[11]:
In this particular case, we can conclude that every pair of ethnic groups differs significantly in the use of cannabis.
If you are interested in data science, follow me on Tumblr.