In [1]:
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np

import neiss
import plotly.offline

plotly.offline


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-190991a72cdd> in <module>()
      1 import pandas as pd
----> 2 import statsmodels.formula.api as smf
      3 import statsmodels.api as sm
      4 import numpy as np
      5 

ImportError: No module named 'statsmodels'

In [2]:
#loading in data and preparations
raw = pd.read_csv('/home/datauser/cpsc/data/processed/neiss/neiss-2015.csv')
cleaned = neiss.cleaner(raw)
data = neiss.query(cleaned.processed_data, cleaned.crosstab)

This analysis was done by DataKind DC on behalf of the Consumer Product Safety Commission. This serves as a preliminary study of the NEISS dataset. We have been been contact with the CPSC and figuring out what questions of importance that we can offer insight to. The questions that were analyzed were:

  • Are there products we should be aware of?
  • Are there differences between the sizes of hospitals?
  • Are there differences where race was reported or between different races?

Are there products we should be aware of?

To answer this question, I approached it two ways. One way is to tabulate the total number of producted queried by hospitals and another is to look at the top items reported by each item.

The top ten producted reported by hospitals are listed below. It appears that 1842 and 1807 are the top products that most hospital report.


In [3]:
data.data['product'].value_counts()[0:9]


Out[3]:
product_1842    28712
product_1807    28351
product_4076    16784
product_1205    14147
product_5040    12787
product_1211    11664
product_4074     8271
product_1884     7783
product_1893     7723
Name: product, dtype: int64

Looking further, I examine what hospitals report this the most, so we can examine hospitals that report these products the most.


In [4]:
data.get_hospitals_by_product('product_1842')


Out[4]:
hosp_21    2132
hosp_95    1762
hosp_38    1171
hosp_3     1104
hosp_51     920
hosp_61     920
hosp_31     914
hosp_17     777
hosp_42     721
Name: hospital, dtype: int64

In [5]:
data.get_hospitals_by_product('product_1807')


Out[5]:
hosp_21    2281
hosp_95    2043
hosp_17    1679
hosp_89    1598
hosp_14    1495
hosp_63    1109
hosp_73    1071
hosp_2      946
hosp_42     785
Name: hospital, dtype: int64

We can also view these as plots and compare the incident rates of these products through different hospitals


In [6]:
data.plot_product('product_1842')


Out[6]:

In [7]:
data.plot_product('product_1807')


Out[7]:

Looking at these, it appears that there are some overlap between the hospitals. Hospital 17, 21, 42, and 95 are the 4 common hospital that are in the top ten of both these products. We will turn to a hospital examination down the road.


In [8]:
set(data.get_hospitals_by_product('product_1842').index.tolist()) & set(data.get_hospitals_by_product('product_1807').index.tolist())


Out[8]:
{'hosp_17', 'hosp_21', 'hosp_42', 'hosp_95'}

Could be useful to compare stratum types - Do large hospitals see different rates of injury than small hospitals?

Another way of examining product harm would not only to count the total numbers of products but also to see what is the top product that is reported for each hosptial. Here we can look at not only the sheer number which could be due to over reporting or awareness but also to see if there are geographic differences for product harm. However, after examining this, we see that 70 out of the 82 hospitals surveyed have product 1842 and 1807 as the top product.

However an interesting finding is that product_1267, product_3299, and product_3283 are in the top ten list of top products by hospital but not in the top ten overall. However, the number is small as it only affects 5 hospitals and 14,844 reported cases. It would be interesting to see where these five hospital are at and why these products are the top of their product harm.


In [9]:
data.top_product_for_hospital()


Out[9]:
product_1842    42
product_1807    28
product_4076     3
product_1267     2
product_1211     2
product_5040     2
product_3299     2
product_3283     1
dtype: int64

Another way of approaching would be to fit a Negative Binomial Regression to see if there are any meaningful differences between the sizes of the hospitals. I use a negative binomial regression rather than a poisson regression because there is strong evidence of overdispersion, that is, the variance of the data is much higher than the mean, as shown below. This also occurs across all stratum (only shown for small, medium, and large).


In [10]:
counts = data.data.ix[data.data['product'] == 'product_1842',:]['hospital'].value_counts()
print('variance of product 1842 counts:', np.var(counts.values))
print('mean of product 1842 counts:', np.mean(counts.values))


variance of product 1842 counts: 139451.173706
mean of product 1842 counts: 350.146341463

In [11]:
data.plot_stratum_dist('product_1842', 'S')


Variance: 5805.36033951
Mean: 90.0277777778
Out[11]:

In [12]:
data.plot_stratum_dist('product_1842', 'M')


Variance: 18568.6666667
Mean: 423.666666667
Out[12]:

In [13]:
data.plot_stratum_dist('product_1842', 'L')


Variance: 68488.9795918
Mean: 645.142857143
Out[13]:

In [14]:
df = data.prepare_stratum_modeling('product_1842')
df.head()


Out[14]:
counts hospital stratum
38 2132 hosp_21 V
13 1762 hosp_95 V
0 1171 hosp_38 V
42 1104 hosp_3 L
2 920 hosp_51 L

In [15]:
model = smf.glm("counts ~ stratum", data=df,
                family=sm.families.NegativeBinomial()).fit()
model.summary()


Out[15]:
Generalized Linear Model Regression Results
Dep. Variable: counts No. Observations: 82
Model: GLM Df Residuals: 77
Model Family: NegativeBinomial Df Model: 4
Link Function: log Scale: 0.572368368102
Method: IRLS Log-Likelihood: -534.00
Date: Sat, 22 Oct 2016 Deviance: 43.333
Time: 17:49:42 Pearson chi2: 44.1
No. Iterations: 7
coef std err z P>|z| [95.0% Conf. Int.]
Intercept 6.0488 0.268 22.587 0.000 5.524 6.574
stratum[T.L] 0.4206 0.392 1.073 0.283 -0.348 1.189
stratum[T.M] 9.835e-05 0.368 0.000 1.000 -0.721 0.721
stratum[T.S] -1.5487 0.296 -5.227 0.000 -2.129 -0.968
stratum[T.V] 0.3891 0.313 1.244 0.213 -0.224 1.002

From the model, we see that there are only significant differences between Medium and Small hospital. Given the coefficients, the log count difference between Medium and Small hospitals is -1.55. Other than that there doesn't seem to be any other signficant differences between hospital sizes for Product 1842.

We can do the same to examine the 2nd most reported product, Product 1807. Below I check the assumption to fit a negative binomial regression, that the variance is far greater than the mean. In this case we see that it is the case.


In [16]:
data.plot_stratum_dist('product_1807', 'S')


Variance: 36427.9183673
Mean: 112.285714286
Out[16]:

In [17]:
data.plot_stratum_dist('product_1807', 'M')


Variance: 193238.469136
Mean: 522.555555556
Out[17]:

In [18]:
data.plot_stratum_dist('product_1807', 'L')


Variance: 177492.77551
Mean: 658.714285714
Out[18]:

The assumptions have been met and after building the model, we see very similar results as the previous model, that there are only significant differences between the small and large hospitals. For future research, we can use similar techniques to see significant differences between hospital sizes for all products.


In [19]:
df2 = data.prepare_stratum_modeling('product_1807')
model = smf.glm("counts ~ stratum", data=df,
                family=sm.families.NegativeBinomial()).fit()
model.summary()


Out[19]:
Generalized Linear Model Regression Results
Dep. Variable: counts No. Observations: 82
Model: GLM Df Residuals: 77
Model Family: NegativeBinomial Df Model: 4
Link Function: log Scale: 0.572368368102
Method: IRLS Log-Likelihood: -534.00
Date: Sat, 22 Oct 2016 Deviance: 43.333
Time: 17:49:46 Pearson chi2: 44.1
No. Iterations: 7
coef std err z P>|z| [95.0% Conf. Int.]
Intercept 6.0488 0.268 22.587 0.000 5.524 6.574
stratum[T.L] 0.4206 0.392 1.073 0.283 -0.348 1.189
stratum[T.M] 9.835e-05 0.368 0.000 1.000 -0.721 0.721
stratum[T.S] -1.5487 0.296 -5.227 0.000 -2.129 -0.968
stratum[T.V] 0.3891 0.313 1.244 0.213 -0.224 1.002

From the top items, we don't see any meaningful differences between the top ten items for people who have race reported and race not reported. Even among the data where we do have race reported, there doesn't seem to be much variation when it comes to the top ten products causes most harm.


In [23]:
data.retrieve_query('race_reported', 'reported', 'product')


Out[23]:
product_1842    17393
product_1807    15691
product_4076    10108
product_1205     9108
product_5040     7939
product_1211     7872
product_4074     5142
product_1884     4877
product_1893     4874
Name: product, dtype: int64

In [24]:
data.retrieve_query('race_reported', 'not reported', 'product')


Out[24]:
product_1807    12660
product_1842    11319
product_4076     6676
product_1205     5039
product_5040     4848
product_1211     3792
product_3299     3715
product_4074     3129
product_611      3018
Name: product, dtype: int64

In [25]:
races = ['white', 'black', 'hispanic', 'other']
for race in races:
    print(race)
    print(data.retrieve_query('new_race', race, 'product'))


white
product_1807    11719
product_1842    11388
product_4076     6523
product_5040     5026
product_1211     4125
product_1205     3867
product_4074     3560
product_464      3104
product_4057     3008
Name: product, dtype: int64
black
product_1842    4389
product_1205    4267
product_1211    3048
product_1807    2615
product_4076    2302
product_5040    1882
product_1893    1462
product_1884    1332
product_4057    1039
Name: product, dtype: int64
hispanic
product_1842    936
product_1267    798
product_4076    730
product_1807    639
product_5040    555
product_1205    484
product_1211    389
product_1884    352
product_4074    310
Name: product, dtype: int64
other
product_1807    13378
product_1842    11999
product_4076     7229
product_1205     5529
product_5040     5324
product_1211     4102
product_3299     3887
product_4074     3381
product_611      3199
Name: product, dtype: int64

Conclusion

The analysis here is still preliminary and exploratory. Most of the analysis revolved around two product, Product 1842 and 1807, because they vastly outnumbered all the other reported products. Future analysis could include running more negative binomial or Poisson (if the mean and variance are similar) regression and more standard hypothesis tests to see evaluate statistical significant differences. One question that I could not answer is to figure out any regional differences because we do not know the exact location of the hospital.

We have also attached a document that conducts a much more comprehensive break down of product harm by various segments. This document serves as a starting point for all the analysis done here and will be a value reference for any future research on the NEISS dataset.