Following is the Python program I wrote to fulfill the fourth assignment of the Regression Modeling in Practice online course.
I decided to use Jupyter Notebook as it is a pretty way to write code and present results.
For this assignment, I decided to use the NESARC database with the following question : Are people from white ethnicity more likely to have ever used cannabis?
The potential other explanatory variables will be:
The data will be managed to get cannabis usage recoded from 0 (never used cannabis) and 1 (used cannabis). The non-answering recordings (reported as 9) will be discarded.
The response variable having 2 categories, categories grouping is not needed.
The other categorical variable (sex) will be recoded such that 0 means female and 1 equals male. And the two quantitative explanatory variables (age and family income) will be centered.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load a useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
In [2]:
nesarc = pd.read_csv('nesarc_pds.csv')
In [3]:
canabis_usage = {1 : 1, 2 : 0, 9 : 9}
sex_shift = {1 : 1, 2 : 0}
white_race = {1 : 1, 2 : 0}
subnesarc = (nesarc[['AGE', 'SEX', 'S1Q1D5', 'S1Q7D', 'S3BQ1A5', 'S1Q11A']]
.assign(sex=lambda x: pd.to_numeric(x['SEX'].map(sex_shift)),
white_ethnicity=lambda x: pd.to_numeric(x['S1Q1D5'].map(white_race)),
used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
.map(canabis_usage)
.replace(9, np.nan)),
family_income=lambda x: (pd.to_numeric(x['S1Q11A'], errors='coerce')))
.dropna())
centered_nesarc = subnesarc.assign(age_c=subnesarc['AGE']-subnesarc['AGE'].mean(),
family_income_c=subnesarc['family_income']-subnesarc['family_income'].mean())
In [4]:
display(Markdown("Mean age : {:.0f}".format(centered_nesarc['AGE'].mean())))
display(Markdown("Mean family income last year: {:.0f}$".format(centered_nesarc['family_income'].mean())))
Let's check that the quantitative variable are effectively centered.
In [5]:
print("Centered age")
print(centered_nesarc['age_c'].describe())
print("\nCentered family income")
print(centered_nesarc['family_income_c'].describe())
In [6]:
g = sns.factorplot(x='white_ethnicity', y='used_canabis', data=centered_nesarc,
kind="bar", ci=None)
g.set_xticklabels(['Non White', 'White'])
plt.xlabel('White ethnicity')
plt.ylabel('Ever used cannabis')
plt.title('Ever used cannabis dependance on the white ethnicity');
In [7]:
g = sns.factorplot(x='sex', y='used_canabis', data=centered_nesarc,
kind="bar", ci=None)
g.set_xticklabels(['Female', 'Male'])
plt.ylabel('Ever used cannabis')
plt.title('Ever used cannabis dependance on the sex');
In [8]:
g = sns.boxplot(x='used_canabis', y='family_income', data=centered_nesarc)
g.set_yscale('log')
g.set_xticklabels(('No', 'Yes'))
plt.xlabel('Ever used cannabis')
plt.ylabel('Family income ($)');
In [9]:
g = sns.boxplot(x='used_canabis', y='AGE', data=centered_nesarc)
g.set_xticklabels(('No', 'Yes'))
plt.xlabel('Ever used cannabis')
plt.ylabel('Age');
The four plots above show the following trends:
The plots showed the direction of a potential relationship. But a rigorous statistical test has to be carried out to confirm the four previous hypothesis.
The following code will test a logistic regression model on our hypothesis.
In [10]:
model = smf.logit(formula='used_canabis ~ family_income_c + age_c + sex + white_ethnicity', data=centered_nesarc).fit()
model.summary()
Out[10]:
In [11]:
params = model.params
conf = model.conf_int()
conf['Odds Ratios'] = params
conf.columns = ['Lower Conf. Int.', 'Upper Conf. Int.', 'Odds Ratios']
np.exp(conf)
Out[11]:
As all four variables coefficient have significant p-value (<< 0.05), no confounders are present in this model.
But as the pseudo R-Square has a really low value, the model does not really explain well the response variable. And so there is maybe a confounder variable that I have not test for.
From the oods ratios results, we can conclude that:
So the results support the hypothesis between our primary explanatory variable (white ethnicity) and the reponse variable (ever used cannabis)
Regarding the last explanatory variable (family income), I don't if I can really conclude. Indeed from the strict resuts, people coming from richer family are more likely to have ever used cannabis (OR=1.000002, 95% CI=[1.000002, 1.000003], p<.0005). But the odds ratio is so close to 1.0 than I don't know if the difference is significant.