This notebook serves as a reporting tool for the CPSC. It lays out the questions the CPSC is interested in answering with data from its SaferProducts API. A few questions are presented, each with a findings section giving a quick summary; Section 4 provides further detail on how the findings were produced.
Because the API was down at the time of this report, I obtained the data from Ana Carolina Areias via a Dropbox link. I cleaned up the raw JSON and converted it into a dataframe (the cleaning code can be found in exploratory.ipynb in the /notebook directory). I then saved the data with pickle so it can easily be loaded for analysis.
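The cleaning step just described can be sketched as follows. This is a minimal illustration, not the actual cleaning code (which lives in exploratory.ipynb); the sample records and the output filename here are hypothetical stand-ins for the real SaferProducts JSON.

```python
import pickle

import pandas as pd

# Tiny stand-in records mimicking the nested SaferProducts JSON
# (field names are illustrative).
records = [
    {"GenderDescription": "Female", "VictimAgeInMonths": 420,
     "Product": {"CategoryPublicName": "Footwear"}},
    {"GenderDescription": "Male", "VictimAgeInMonths": 96,
     "Product": {"CategoryPublicName": "Electric Ranges or Ovens"}},
]

# Flatten the nested JSON records into a tabular DataFrame.
data = pd.json_normalize(records)

# Persist the cleaned frame so later notebooks can reload it quickly.
with open('cleaned_api_data.pkl', 'wb') as f:
    pickle.dump(data, f)
```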
The questions answered here are the result of a conversation between DKDC and the CPSC regarding their priorities and what information is available from the data.
The main takeaways from this analysis are:
In [104]:
import pickle
import operator
import numpy as np
import pandas as pd
import gensim.models
In [3]:
data = pickle.load(open('/home/datauser/cpsc/data/processed/cleaned_api_data', 'rb'))
data.head()
Out[3]:
In [67]:
pd.crosstab(data['GenderDescription'], data['age_range'])
Out[67]:
From the data, there does not seem to be much underrepresentation by gender: there are only around a thousand fewer males than females in a dataset of 28,000. Age appears to be a bigger issue, with a clear lack of representation among older people using the API. Older people may be less likely to self-report, or, if they wanted to, may not be comfortable enough with a web interface to do so. My assumption is that people over 70 probably experience product harm at a higher rate but are not reporting it.
To construct this, I removed any incidents that did not cause bodily harm and took the top ten categories. There were several levels of severity, so we can drop complaints that do not involve any physical harm. After removing these complaints, it is striking that "Footwear" was the top product category for harm.
In [80]:
#removing minor harm incidents
no_injuries = ['Incident, No Injury', 'Unspecified', 'Level of care not known',
               'No Incident, No Injury', 'No First Aid or Medical Attention Received']
damage = data.loc[~data['SeverityTypePublicName'].isin(no_injuries), :]
damage.ProductCategoryPublicName.value_counts()[0:10]
Out[80]:
This is actually perplexing, so I investigated further by analyzing the complaints filed under the "Footwear" category. To do this, I trained a Word2Vec model, a shallow neural network that maps each word, together with the linguistic contexts it appears in, to a vector so that similarity between words can be computed. The purpose is to find words that relate to each other: rather than doing a simple cross tab of product categories, I can ingest the complaints and map out the relationships among the words used. For instance, using the complaints that resulted in bodily harm, I found that footwear was associated with pain and walking. There appear to be injuries related specifically to Skechers sneakers, since it was the only brand that showed up often enough to be included in the model's vocabulary. In fact, there was a lawsuit regarding Skechers and their toning shoes.
Looking below, we see that a vast majority are incidents without any bodily harm: over 60% of all complaints were categorized as "Incident, No Injury".
In [81]:
data.SeverityTypeDescription.value_counts()
Out[81]:
Although a complaint is labeled as having no injury, that does not necessarily mean we cannot take precautions. I took the same approach as with the previous model: I subsetted the data to only complaints with "no injury" and ran a model to examine the words used. From the analysis, we see that "to", "was", and "it" were the top three words. At first glance these words may seem meaningless, but if we examine the words most similar to each of them, a connection starts to emerge.
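The subsetting and word-counting step just described can be sketched as below. The toy frame and its values are illustrative; the real analysis uses the full API data and the severity labels shown earlier.

```python
import pandas as pd

# Toy complaints frame mirroring the fields used in this notebook.
df = pd.DataFrame({
    'SeverityTypePublicName': ['Incident, No Injury', 'First Aid Received',
                               'Incident, No Injury'],
    'Description': ['tried to turn it off', 'burned my hand',
                    'it was smoking when i tried to unplug it'],
})

# Subset to the "no injury" complaints.
no_injury = df.loc[df['SeverityTypePublicName'] == 'Incident, No Injury', :]

# Count word frequencies across the retained narratives.
words = no_injury['Description'].str.lower().str.split().explode()
print(words.value_counts().head(3))
```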
For instance, the words most closely related to "to" were "unable" and "trying", which convey a sense of urgency in attempting to turn something on or off. Examining "unable", I saw it was related to words such as "attempted" and "disconnect". Further investigation led me to find these complaints dealt with a switch or a plug, possibly an electrical item.
A similar picture emerges when examining the word "was". The words that felt out of place were "emitting", "extinguish", and "smelled". It is no surprise that, after a few more queries along these lines, words like "sparks" and "smoke" started popping up. This leads me to believe that these complaints describe encounters closely related to fire.
So while these complaints may only describe near-encounters with danger, it may be worthwhile to review them further with an eye out for fire-related injuries or products that could cause fires.
In [121]:
model.most_similar('was')
Out[121]:
This question is difficult to answer because of a lack of data on the reporter. From the cross tabulation in Section 3.1, we see that the majority of respondents are female and the largest age group is 40-60. That is probably the best guess of who is using the API.
This is meant to serve as a starting point on examining the API data. The main findings were that:
While text analysis is helpful, it is often not sufficient. What would really help the analysis would be to include more information from the user. The following information would be helpful to collect in order to produce more actionable insights.
A great next step would be a deeper text analysis on shoes. It may be possible to train the model on smaller batches of words, i.e. a narrower context window, so we can capture local phrasing better. Other steps I would take with more time would be to fix the Unicode issues in some of the complaints (special characters prevented some complaints from being converted into strings). I would also look further into the category with the most overall complaints, "Electric Ranges and Stoves", and see what those complaints were.
If we could address these challenges, there is no doubt we could gain valuable insights into products that are harming Americans. This report serves as the first step. I would like to thank the CPSC for this data set and DKDC for the opportunity to conduct this analysis.
Question 2.1
The data we worked with had limited information on the victim's demographics besides age and gender. However, that was enough to draw some basic inferences. Below we can get counts by gender, of which a plurality is female.
Age is a bit trickier: we have the victim's age in months. I converted it into years and broke it down into ten-year age ranges so we can better examine the data.
In [4]:
data.GenderDescription.value_counts()
Out[4]:
In [46]:
data['age'] = data['VictimAgeInMonths'] / 12
labels = ['under 10', '10-20', '20-30', '30-40', '40-50', '50-60',
'60-70','70-80', '80-90', '90-100', 'over 100']
data['age_range'] = pd.cut(data['age'], bins=np.arange(0,120,10), labels=labels)
data.loc[data['age'] > 100, 'age_range'] = 'over 100'
In [66]:
counts = data['age_range'].value_counts().sort_values()
counts
Out[66]:
However, after doing this we still have around 13,000 people with an age of zero. Whether they did not fill in an age or the incidents involve infants is unknown, but comparing the distribution of products affecting people with an age of 0 against the overall dataset, it appears that null values in the age range represent people who did not fill out an age when reporting.
In [58]:
#top products affecting people with an age of 0
data.loc[data['age_range'].isnull(), 'ProductCategoryPublicName'].value_counts()[0:9]
Out[58]:
In [59]:
#top products that affect people overall
data.ProductCategoryPublicName.value_counts()[0:9]
Out[59]:
Question 2.2
At first glance, we can look at the products that were reported, as below, and see that Electric Ranges or Ovens tops the list in terms of harm. However, there are levels of severity within the API that need to be filtered out before we can assess which products cause the most harm.
In [70]:
#overall products listed
data.ProductCategoryPublicName.value_counts()[0:9]
Out[70]:
In [73]:
#removing minor harm incidents
no_injuries = ['Incident, No Injury', 'Unspecified', 'Level of care not known',
               'No Incident, No Injury', 'No First Aid or Medical Attention Received']
damage = data.loc[~data['SeverityTypePublicName'].isin(no_injuries), :]
damage.ProductCategoryPublicName.value_counts()[0:10]
Out[73]:
This shows that among incidents where there were actual injuries and medical attention was given, footwear led, which was odd. To explore this, I created a Word2Vec model that maps out how certain words relate to each other, training it on the complaint narratives from the API. This helps us identify words that are similar: for instance, if you query "foot", you get "left" and "right", the words most closely related to it. After some digging, I found that the word "walking" was associated with "painful". I have some reason to believe there are orthopedic injuries associated with shoes, and that people have been experiencing pain while walking in Skechers that were supposed to tone their bodies, along with some instability or balance issues.
In [115]:
model = gensim.models.Word2Vec.load('/home/datauser/cpsc/models/footwear')
model.most_similar('walking')
Out[115]:
In [84]:
model.most_similar('injury')
Out[84]:
In [94]:
model.most_similar('instability')
Out[94]:
Question 2.3
In [122]:
model = gensim.models.Word2Vec.load('/home/datauser/cpsc/models/severity')
#count how often each vocabulary word appears, most frequent first
items_dict = {}
for word, vocab_obj in model.vocab.items():
    items_dict[word] = vocab_obj.count
sorted_dict = sorted(items_dict.items(), key=operator.itemgetter(1), reverse=True)
sorted_dict[0:5]
Out[122]: