In [1]:
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np
import neiss
import plotly.offline
plotly.offline
In [2]:
#loading in data and preparations
raw = pd.read_csv('/home/datauser/cpsc/data/processed/neiss/neiss-2015.csv')
cleaned = neiss.cleaner(raw)
data = neiss.query(cleaned.processed_data, cleaned.crosstab)
This analysis was done by DataKind DC on behalf of the Consumer Product Safety Commission. This serves as a preliminary study of the NEISS dataset. We have been been contact with the CPSC and figuring out what questions of importance that we can offer insight to. The questions that were analyzed were:
To answer this question, I approached it two ways. One way is to tabulate the total number of producted queried by hospitals and another is to look at the top items reported by each item.
The top ten producted reported by hospitals are listed below. It appears that 1842 and 1807 are the top products that most hospital report.
In [3]:
data.data['product'].value_counts()[0:9]
Out[3]:
Looking further, I examine what hospitals report this the most, so we can examine hospitals that report these products the most.
In [4]:
data.get_hospitals_by_product('product_1842')
Out[4]:
In [5]:
data.get_hospitals_by_product('product_1807')
Out[5]:
We can also view these as plots and compare the incident rates of these products through different hospitals
In [6]:
data.plot_product('product_1842')
Out[6]:
In [7]:
data.plot_product('product_1807')
Out[7]:
Looking at these, it appears that there are some overlap between the hospitals. Hospital 17, 21, 42, and 95 are the 4 common hospital that are in the top ten of both these products. We will turn to a hospital examination down the road.
In [8]:
set(data.get_hospitals_by_product('product_1842').index.tolist()) & set(data.get_hospitals_by_product('product_1807').index.tolist())
Out[8]:
Another way of examining product harm would not only to count the total numbers of products but also to see what is the top product that is reported for each hosptial. Here we can look at not only the sheer number which could be due to over reporting or awareness but also to see if there are geographic differences for product harm. However, after examining this, we see that 70 out of the 82 hospitals surveyed have product 1842 and 1807 as the top product.
However an interesting finding is that product_1267, product_3299, and product_3283 are in the top ten list of top products by hospital but not in the top ten overall. However, the number is small as it only affects 5 hospitals and 14,844 reported cases. It would be interesting to see where these five hospital are at and why these products are the top of their product harm.
In [9]:
data.top_product_for_hospital()
Out[9]:
Another way of approaching would be to fit a Negative Binomial Regression to see if there are any meaningful differences between the sizes of the hospitals. I use a negative binomial regression rather than a poisson regression because there is strong evidence of overdispersion, that is, the variance of the data is much higher than the mean, as shown below. This also occurs across all stratum (only shown for small, medium, and large).
In [10]:
counts = data.data.ix[data.data['product'] == 'product_1842',:]['hospital'].value_counts()
print('variance of product 1842 counts:', np.var(counts.values))
print('mean of product 1842 counts:', np.mean(counts.values))
In [11]:
data.plot_stratum_dist('product_1842', 'S')
Out[11]:
In [12]:
data.plot_stratum_dist('product_1842', 'M')
Out[12]:
In [13]:
data.plot_stratum_dist('product_1842', 'L')
Out[13]:
In [14]:
df = data.prepare_stratum_modeling('product_1842')
df.head()
Out[14]:
In [15]:
model = smf.glm("counts ~ stratum", data=df,
family=sm.families.NegativeBinomial()).fit()
model.summary()
Out[15]:
From the model, we see that there are only significant differences between Medium and Small hospital. Given the coefficients, the log count difference between Medium and Small hospitals is -1.55. Other than that there doesn't seem to be any other signficant differences between hospital sizes for Product 1842.
We can do the same to examine the 2nd most reported product, Product 1807. Below I check the assumption to fit a negative binomial regression, that the variance is far greater than the mean. In this case we see that it is the case.
In [16]:
data.plot_stratum_dist('product_1807', 'S')
Out[16]:
In [17]:
data.plot_stratum_dist('product_1807', 'M')
Out[17]:
In [18]:
data.plot_stratum_dist('product_1807', 'L')
Out[18]:
The assumptions have been met and after building the model, we see very similar results as the previous model, that there are only significant differences between the small and large hospitals. For future research, we can use similar techniques to see significant differences between hospital sizes for all products.
In [19]:
df2 = data.prepare_stratum_modeling('product_1807')
model = smf.glm("counts ~ stratum", data=df,
family=sm.families.NegativeBinomial()).fit()
model.summary()
Out[19]:
From the top items, we don't see any meaningful differences between the top ten items for people who have race reported and race not reported. Even among the data where we do have race reported, there doesn't seem to be much variation when it comes to the top ten products causes most harm.
In [23]:
data.retrieve_query('race_reported', 'reported', 'product')
Out[23]:
In [24]:
data.retrieve_query('race_reported', 'not reported', 'product')
Out[24]:
In [25]:
races = ['white', 'black', 'hispanic', 'other']
for race in races:
print(race)
print(data.retrieve_query('new_race', race, 'product'))
The analysis here is still preliminary and exploratory. Most of the analysis revolved around two product, Product 1842 and 1807, because they vastly outnumbered all the other reported products. Future analysis could include running more negative binomial or Poisson (if the mean and variance are similar) regression and more standard hypothesis tests to see evaluate statistical significant differences. One question that I could not answer is to figure out any regional differences because we do not know the exact location of the hospital.
We have also attached a document that conducts a much more comprehensive break down of product harm by various segments. This document serves as a starting point for all the analysis done here and will be a value reference for any future research on the NEISS dataset.