Team Colonials

Final Project Presentation

Awilda Lopez, Bala Venkatesan, Faith Bradley, Rebecca Bilbro

Cohort 2, Fall 2014

The Problem:

In the first three decades after OSHA was created in 1971, workplace fatalities dropped more than 65%, even as US employment doubled. Now worker deaths are down from about 38 a day in 1970 to 12 a day in 2013. But we've hit a plateau. OSHA's budget is small and not likely to increase. The agency can only do so much outreach, training, enforcement, and standard-setting. How can the agency make the most of its limited resources?

The Goal:

Use OSHA data to build a tool that will help the agency develop targeting schemas for outreach, training, enforcement, and regulation.

The Hypotheses:

1. Variation by state or area - Maybe fewer workers are being killed in State X while more workers are being killed in State Y? Or certain field offices are more overwhelmed than others?
2. Impact of the recession - Are there fewer fatalities in construction because of the housing market collapse? Has the recession depressed the number of worker fatalities, artificially diminishing the impact of hazardous workplaces?
3. Time of year - Are many of the most hazardous jobs seasonal?
4. Industries are changing - As hazards decrease in some traditional industries (e.g. traditional manufacturing), are more workers being killed in new industries (e.g. green energy)?

The Data:

The primary data sources are the OSHA website (http://www.osha.gov) and the Department of Labor Enforcement Data Catalog (http://ogesdw.dol.gov/).

Understanding Workplace Fatality/Catastrophe Data


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from sklearn.linear_model import LinearRegression
from scipy import stats
%matplotlib inline
What does the dataset look like?

In [2]:
oshadata = pd.read_csv("incidents_by_year.csv", index_col="event_year")
oshadata[:5]


/Library/Python/2.7/site-packages/pandas/io/parsers.py:1154: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
Out[2]:
area_office state fatality industry_code description keywords summary_nr
event_year
1972 854910 UTAH NaN 7359 Employee suffers partial thumb amputation in t... SLIP,TABLE SAW,AMPUTATED,THUMB,GUARD,BLADE,SAW 170683817
1973 524700 OHIO True 3444 Employee fatally crushed by falling automobile CRUSHED,STRUCK BY,FALLING OBJECT,AUTOMOBILE,BO... 14514202
1973 316400 WEST VIRGINIA True 3334 Employee killed while dismantling crane gearbox DISMANTLING,CRANE,STRUCK BY,FALL,CRUSHED,UNTRA... 14481709
1974 627400 TEXAS True 1381 One employee killed, two injured when struck b... OIL WELL DRILLING,TONGS,STRUCK BY,HAND TOOL,FL... 14413066
1974 627400 TEXAS True 1381 Employee killed in trailer house fire FIRE,BURN,SMOKING,FLAMMABLE VAPORS,OIL WELL DR... 14413074
Yeah, but what does it LOOK like? Let's check out a visualization of these incidents as a time series.

In [3]:
# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the equivalent call
df = pd.read_csv('incident_totals_year.csv', index_col=0)
df.osha_incidents.plot(color='r',lw=1.3)
plt.title('Workplace Fatalities and Catastrophes by Year')
plt.xlabel("1983-2012")
plt.ylabel("Fatal/Catastrophic Incidents")
plt.show()


It turns out that, because of irregularities in the way that these reports were logged before 1990, we should ignore what appears to be a trend in volume from 1983-1990. We know from the Bureau of Labor Statistics that there were actually more like 10,000 fatalities per year during that period. The graph also suggests that fatal events trail off after 2009, but it turns out that this is due to data maturation issues. BLS reports show that the number of fatal workplace incidents has been fairly steady at around 4,500 per year since 2010.
So let's use a bar chart to see the frequency of events by year, focusing on just 1990-2010.

In [4]:
subset = df.loc[1990:2010]
subset.plot(kind='bar', stacked=False)
plt.title("Frequency of fatalities and catastrophic events by year")
plt.xlabel("1990-2010")
plt.ylabel("Frequency")
plt.show()


The Analysis:

Hypothesis One (Rebecca): Understanding Regional Variations in Workplace Fatality/Catastrophe Data
One of our hypotheses about the data is that there may be significant regional variation in the incidence of workplace fatalities and catastrophes. By getting a sense of where the most workers are dying by location, we could assist OSHA in directing more staff and resources to the most dangerous parts of the country.
To test this hypothesis, we can analyze our data in several ways. We can look at the frequency of events by state, or at a more granular level, by area office.
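As a rough sketch of those two views, the snippet below counts incidents per state and per (state, area office) pair with `groupby`. The column names `state` and `area_office` come from the incident table shown earlier. The handful of rows are copied from that table purely for illustration and are not the full dataset.

```python
import pandas as pd

# Illustrative rows copied from the first few records shown in Out[2];
# this is NOT the full dataset.
sample = pd.DataFrame({
    "event_year": [1972, 1973, 1973, 1974, 1974],
    "area_office": [854910, 524700, 316400, 627400, 627400],
    "state": ["UTAH", "OHIO", "WEST VIRGINIA", "TEXAS", "TEXAS"],
})

# Frequency of events by state, largest first
by_state = sample.groupby("state").size().sort_values(ascending=False)

# Frequency of events at the more granular area-office level
by_office = sample.groupby(["state", "area_office"]).size()

print(by_state)
print(by_office)
```

Run against the full incident file, the same two `groupby` calls would give the per-state and per-office counts that the rest of this section explores.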
Let's start by looking at variations by state.

In [5]:
statedata = pd.read_csv("incidents_state_totals.csv")
statedata


Out[5]:
state totals
0 ALABAMA 1115
1 ALASKA 446
2 ARIZONA 1307
3 ARKANSAS 722
4 CALIFORNIA 38477
5 COLORADO 848
6 CONNECTICUT 1065
7 DELAWARE 125
8 DISTRICT OF COLUMBIA 33
9 FLORIDA 3225
10 GEORGIA 1828
11 HAWAII 395
12 IDAHO 417
13 ILLINOIS 2553
14 INDIANA 1724
15 IOWA 1779
16 KANSAS 1396
17 KENTUCKY 1029
18 LOUISIANA 1257
19 MAINE 274
20 MARYLAND 3908
21 MASSACHUSETTS 1491
22 MICHIGAN 1485
23 MINNESOTA 885
24 MISSISSIPPI 690
25 MISSOURI 879
26 MONTANA 273
27 NEBRASKA 615
28 NEVADA 1042
29 NEW HAMPSHIRE 386
30 NEW JERSEY 1754
31 NEW MEXICO 350
32 NEW YORK 2928
33 NORTH CAROLINA 2833
34 NORTH DAKOTA 421
35 OHIO 2630
36 OKLAHOMA 768
37 OREGON 4052
38 PENNSYLVANIA 1961
39 RHODE ISLAND 156
40 SOUTH CAROLINA 1591
41 SOUTH DAKOTA 0
42 TENNESSEE 1268
43 TEXAS 4728
44 UTAH 1580
45 VERMONT 159
46 VIRGINIA 2402
47 WASHINGTON 2173
48 WEST VIRGINIA 440
49 WISCONSIN 1167
50 WYOMING 276

In [6]:
# Use the state names as x-axis labels instead of the default integer index
statedata.plot(x='state', y='totals', kind='bar', legend=False)
plt.title("Frequency of fatalities and catastrophic events by state")
plt.xlabel("United States")
plt.ylabel("Frequency")
plt.show()


So California is the clear standout in terms of sheer volume. Why is that? California is a State Plan state, meaning that it is not under the jurisdiction of Federal OSHA, but that alone doesn't really explain why it would have so many more fatalities (or reports of fatalities).
And hmmm...something strange is definitely going on in South Dakota. Let's try to figure out why those reports are missing from this dataset.
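One quick way to surface gaps like this is to sort the state totals and list any states that report zero incidents. The sketch below uses a few rows copied from the table above rather than the full `incidents_state_totals.csv` file, so the ranking is only illustrative.

```python
import pandas as pd

# A few rows copied from the state totals table above (not the full file)
excerpt = pd.DataFrame({
    "state": ["CALIFORNIA", "MARYLAND", "OREGON", "SOUTH DAKOTA", "TEXAS"],
    "totals": [38477, 3908, 4052, 0, 4728],
})

# Rank states from most to fewest reported incidents
ranked = excerpt.sort_values("totals", ascending=False).reset_index(drop=True)

# A total of zero may indicate missing reports rather than a true absence of incidents
missing = ranked.loc[ranked["totals"] == 0, "state"].tolist()

print(ranked)
print("States with zero reports:", missing)
```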
In any case, let's look a little deeper into the areas within the states where the most workers are dying on the job.
Hypothesis Two (Bala): Examining the Relationship Between the Housing Market Collapse and Workplace Fatality/Catastrophe Data
Hypothesis Three (Awilda): Investigating the Seasonality of Workplace Fatalities and Catastrophes
Hypothesis Four (Faith): Exploring the Changing Industrial Landscape in Relation to Workplace Fatalities and Catastrophes

The Interpretations:

(to be added later)

The Conclusions:

(to be added later)
